[llvm-branch-commits] [clang] [flang] [llvm] [mlir] [LifetimeSafety] Improve Origin information in debug output (PR #153951)
Utkarsh Saxena via llvm-branch-commits
llvm-branch-commits at lists.llvm.org
Tue Aug 19 04:37:49 PDT 2025
In-Reply-To: <llvm.org/llvm/llvm-project/pull/153951 at github.com>
https://github.com/usx95 updated https://github.com/llvm/llvm-project/pull/153951
From c6fe567064847ed3c8821422a4fc81eefc7f4291 Mon Sep 17 00:00:00 2001
From: Jonathan Cohen <joncoh at apple.com>
Date: Mon, 18 Aug 2025 15:10:59 +0300
Subject: [PATCH 001/112] [AArch64][MachineCombiner] Combine sequences of
gather patterns (#152979)
Reland of #142941
Squashed with fixes for #150004, #149585
This combine matches gather-like sequences in which values are
loaded lane by lane into a NEON register, and replaces each
sequence with loads into two separate registers, which
are then combined with a zip instruction. This shortens
the critical path and improves memory-level
parallelism.
rdar://151851094
---
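As a rough illustration of the rewrite described in the commit message, the
sketch below models the two forms on plain arrays. It is a toy model only and
not the code in this patch: the names Vec, gatherSerial and gatherSplit are
hypothetical, every lane is stored as a uint32_t regardless of the real lane
width, and "memory" is simply the list of values the per-lane pointers would
load. It shows that filling the low half of two independent registers and then
zipping their 64-bit low halves yields the same lane order as the single serial
chain.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a 128-bit vector register: one entry per lane.
using Vec = std::vector<uint32_t>;

// Original form: lane 0 is loaded first, then each further lane is inserted
// into the result of the previous step, forming one serial dependency chain.
Vec gatherSerial(const Vec &Values) {
  Vec Reg(Values.size(), 0);
  for (std::size_t Lane = 0; Lane < Values.size(); ++Lane)
    Reg[Lane] = Values[Lane];
  return Reg;
}

// Rewritten form: two independent chains, each filling only the low half of
// its own register, followed by a ZIP1 on 64-bit elements.
Vec gatherSplit(const Vec &Values) {
  const std::size_t NumLanes = Values.size();
  const std::size_t Half = NumLanes / 2;
  Vec Reg0(NumLanes, 0), Reg1(NumLanes, 0);
  for (std::size_t I = 0; I < Half; ++I) {
    Reg0[I] = Values[I];        // original lanes 0 .. Half-1
    Reg1[I] = Values[Half + I]; // original lanes Half .. NumLanes-1
  }
  // ZIP1 on 64-bit elements: result = {low 64 bits of Reg0, low 64 bits of
  // Reg1}. In this model the low 64 bits are the first Half entries.
  Vec Result(NumLanes, 0);
  for (std::size_t I = 0; I < Half; ++I) {
    Result[I] = Reg0[I];
    Result[Half + I] = Reg1[I];
  }
  return Result;
}
```

In this model the two loops over Reg0 and Reg1 have no data dependency on each
other, which is the property the combine relies on to shorten the critical
path; the extra zip is the price paid for it.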
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp | 334 +++++++++++++++
llvm/lib/Target/AArch64/AArch64InstrInfo.h | 4 +
...arch64-combine-gather-lanes-limit-size.mir | 33 ++
...aarch64-combine-gather-lanes-with-call.mir | 45 ++
.../AArch64/aarch64-combine-gather-lanes.mir | 400 ++++++++++++++++++
.../complex-deinterleaving-uniform-cases.ll | 134 +++---
llvm/test/CodeGen/AArch64/concat-vector.ll | 5 +-
.../AArch64/fp-maximumnum-minimumnum.ll | 50 +--
llvm/test/CodeGen/AArch64/fsh.ll | 113 ++---
llvm/test/CodeGen/AArch64/llvm.frexp.ll | 14 +-
llvm/test/CodeGen/AArch64/neon-dotreduce.ll | 345 +++++++--------
llvm/test/CodeGen/AArch64/nontemporal.ll | 48 ++-
12 files changed, 1179 insertions(+), 346 deletions(-)
create mode 100644 llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes-limit-size.mir
create mode 100644 llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes-with-call.mir
create mode 100644 llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes.mir
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
index a55f103bff385..6a8e7a472bf51 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
@@ -20,7 +20,9 @@
#include "Utils/AArch64BaseInfo.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/SmallVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/CodeGen/CFIInstBuilder.h"
#include "llvm/CodeGen/LivePhysRegs.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
@@ -83,6 +85,11 @@ static cl::opt<unsigned>
BDisplacementBits("aarch64-b-offset-bits", cl::Hidden, cl::init(26),
cl::desc("Restrict range of B instructions (DEBUG)"));
+static cl::opt<unsigned> GatherOptSearchLimit(
+ "aarch64-search-limit", cl::Hidden, cl::init(2048),
+ cl::desc("Restrict range of instructions to search for the "
+ "machine-combiner gather pattern optimization"));
+
AArch64InstrInfo::AArch64InstrInfo(const AArch64Subtarget &STI)
: AArch64GenInstrInfo(AArch64::ADJCALLSTACKDOWN, AArch64::ADJCALLSTACKUP,
AArch64::CATCHRET),
@@ -7412,11 +7419,319 @@ static bool getMiscPatterns(MachineInstr &Root,
return false;
}
+/// Check if the given instruction forms a gather load pattern that can be
+/// optimized for better Memory-Level Parallelism (MLP). This function
+/// identifies chains of NEON lane load instructions that load data from
+/// different memory addresses into individual lanes of a 128-bit vector
+/// register, then attempts to split the pattern into parallel loads to break
+/// the serial dependency between instructions.
+///
+/// Pattern Matched:
+/// Initial scalar load -> SUBREG_TO_REG (lane 0) -> LD1i* (lane 1) ->
+/// LD1i* (lane 2) -> ... -> LD1i* (lane N-1, Root)
+///
+/// Transformed Into:
+/// Two parallel vector loads using fewer lanes each, followed by ZIP1v2i64
+/// to combine the results, enabling better memory-level parallelism.
+///
+/// Supported Element Types:
+/// - 32-bit elements (LD1i32, 4 lanes total)
+/// - 16-bit elements (LD1i16, 8 lanes total)
+/// - 8-bit elements (LD1i8, 16 lanes total)
+static bool getGatherLanePattern(MachineInstr &Root,
+ SmallVectorImpl<unsigned> &Patterns,
+ unsigned LoadLaneOpCode, unsigned NumLanes) {
+ const MachineFunction *MF = Root.getMF();
+
+ // Early exit if optimizing for size.
+ if (MF->getFunction().hasMinSize())
+ return false;
+
+ const MachineRegisterInfo &MRI = MF->getRegInfo();
+ const TargetRegisterInfo *TRI = MF->getSubtarget().getRegisterInfo();
+
+ // The root of the pattern must load into the last lane of the vector.
+ if (Root.getOperand(2).getImm() != NumLanes - 1)
+ return false;
+
+ // Check that we have a load into every lane except lane 0.
+ // For each load we also want to check that:
+ // 1. It has a single non-debug use (since we will be replacing the virtual
+ // register)
+ // 2. That the addressing mode only uses a single pointer operand
+ auto *CurrInstr = MRI.getUniqueVRegDef(Root.getOperand(1).getReg());
+ auto Range = llvm::seq<unsigned>(1, NumLanes - 1);
+ SmallSet<unsigned, 16> RemainingLanes(Range.begin(), Range.end());
+ SmallVector<const MachineInstr *, 16> LoadInstrs;
+ while (!RemainingLanes.empty() && CurrInstr &&
+ CurrInstr->getOpcode() == LoadLaneOpCode &&
+ MRI.hasOneNonDBGUse(CurrInstr->getOperand(0).getReg()) &&
+ CurrInstr->getNumOperands() == 4) {
+ RemainingLanes.erase(CurrInstr->getOperand(2).getImm());
+ LoadInstrs.push_back(CurrInstr);
+ CurrInstr = MRI.getUniqueVRegDef(CurrInstr->getOperand(1).getReg());
+ }
+
+ // Check that we have found a match for lanes N-1.. 1.
+ if (!RemainingLanes.empty())
+ return false;
+
+ // Match the SUBREG_TO_REG sequence.
+ if (!CurrInstr || CurrInstr->getOpcode() != TargetOpcode::SUBREG_TO_REG)
+ return false;
+
+ // Verify that the subreg to reg loads an integer into the first lane.
+ auto Lane0LoadReg = CurrInstr->getOperand(2).getReg();
+ unsigned SingleLaneSizeInBits = 128 / NumLanes;
+ if (TRI->getRegSizeInBits(Lane0LoadReg, MRI) != SingleLaneSizeInBits)
+ return false;
+
+ // Verify that it also has a single non-debug use.
+ if (!MRI.hasOneNonDBGUse(Lane0LoadReg))
+ return false;
+
+ LoadInstrs.push_back(MRI.getUniqueVRegDef(Lane0LoadReg));
+
+ // If there is any chance of aliasing, do not apply the pattern.
+ // Walk backward through the MBB starting from Root.
+ // Exit early if we've encountered all load instructions or hit the search
+ // limit.
+ auto MBBItr = Root.getIterator();
+ unsigned RemainingSteps = GatherOptSearchLimit;
+ SmallSet<const MachineInstr *, 16> RemainingLoadInstrs;
+ RemainingLoadInstrs.insert(LoadInstrs.begin(), LoadInstrs.end());
+ const MachineBasicBlock *MBB = Root.getParent();
+
+ for (; MBBItr != MBB->begin() && RemainingSteps > 0 &&
+ !RemainingLoadInstrs.empty();
+ --MBBItr, --RemainingSteps) {
+ const MachineInstr &CurrInstr = *MBBItr;
+
+ // Remove this instruction from remaining loads if it's one we're tracking.
+ RemainingLoadInstrs.erase(&CurrInstr);
+
+ // Check for potential aliasing with any of the load instructions to
+ // optimize.
+ if (CurrInstr.isLoadFoldBarrier())
+ return false;
+ }
+
+ // If we hit the search limit without finding all load instructions,
+ // don't match the pattern.
+ if (RemainingSteps == 0 && !RemainingLoadInstrs.empty())
+ return false;
+
+ switch (NumLanes) {
+ case 4:
+ Patterns.push_back(AArch64MachineCombinerPattern::GATHER_LANE_i32);
+ break;
+ case 8:
+ Patterns.push_back(AArch64MachineCombinerPattern::GATHER_LANE_i16);
+ break;
+ case 16:
+ Patterns.push_back(AArch64MachineCombinerPattern::GATHER_LANE_i8);
+ break;
+ default:
+ llvm_unreachable("Got bad number of lanes for gather pattern.");
+ }
+
+ return true;
+}
+
+/// Search for patterns of LD instructions we can optimize.
+static bool getLoadPatterns(MachineInstr &Root,
+ SmallVectorImpl<unsigned> &Patterns) {
+
+ // The pattern searches for loads into single lanes.
+ switch (Root.getOpcode()) {
+ case AArch64::LD1i32:
+ return getGatherLanePattern(Root, Patterns, Root.getOpcode(), 4);
+ case AArch64::LD1i16:
+ return getGatherLanePattern(Root, Patterns, Root.getOpcode(), 8);
+ case AArch64::LD1i8:
+ return getGatherLanePattern(Root, Patterns, Root.getOpcode(), 16);
+ default:
+ return false;
+ }
+}
+
+/// Generate optimized instruction sequence for gather load patterns to improve
+/// Memory-Level Parallelism (MLP). This function transforms a chain of
+/// sequential NEON lane loads into parallel vector loads that can execute
+/// concurrently.
+static void
+generateGatherLanePattern(MachineInstr &Root,
+ SmallVectorImpl<MachineInstr *> &InsInstrs,
+ SmallVectorImpl<MachineInstr *> &DelInstrs,
+ DenseMap<Register, unsigned> &InstrIdxForVirtReg,
+ unsigned Pattern, unsigned NumLanes) {
+ MachineFunction &MF = *Root.getParent()->getParent();
+ MachineRegisterInfo &MRI = MF.getRegInfo();
+ const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
+
+ // Gather the initial load instructions to build the pattern.
+ SmallVector<MachineInstr *, 16> LoadToLaneInstrs;
+ MachineInstr *CurrInstr = &Root;
+ for (unsigned i = 0; i < NumLanes - 1; ++i) {
+ LoadToLaneInstrs.push_back(CurrInstr);
+ CurrInstr = MRI.getUniqueVRegDef(CurrInstr->getOperand(1).getReg());
+ }
+
+ // Sort the load instructions according to the lane.
+ llvm::sort(LoadToLaneInstrs,
+ [](const MachineInstr *A, const MachineInstr *B) {
+ return A->getOperand(2).getImm() > B->getOperand(2).getImm();
+ });
+
+ MachineInstr *SubregToReg = CurrInstr;
+ LoadToLaneInstrs.push_back(
+ MRI.getUniqueVRegDef(SubregToReg->getOperand(2).getReg()));
+ auto LoadToLaneInstrsAscending = llvm::reverse(LoadToLaneInstrs);
+
+ const TargetRegisterClass *FPR128RegClass =
+ MRI.getRegClass(Root.getOperand(0).getReg());
+
+ // Helper lambda to create a LD1 instruction.
+ auto CreateLD1Instruction = [&](MachineInstr *OriginalInstr,
+ Register SrcRegister, unsigned Lane,
+ Register OffsetRegister,
+ bool OffsetRegisterKillState) {
+ auto NewRegister = MRI.createVirtualRegister(FPR128RegClass);
+ MachineInstrBuilder LoadIndexIntoRegister =
+ BuildMI(MF, MIMetadata(*OriginalInstr), TII->get(Root.getOpcode()),
+ NewRegister)
+ .addReg(SrcRegister)
+ .addImm(Lane)
+ .addReg(OffsetRegister, getKillRegState(OffsetRegisterKillState));
+ InstrIdxForVirtReg.insert(std::make_pair(NewRegister, InsInstrs.size()));
+ InsInstrs.push_back(LoadIndexIntoRegister);
+ return NewRegister;
+ };
+
+ // Helper to create load instruction based on the NumLanes in the NEON
+ // register we are rewriting.
+ auto CreateLDRInstruction = [&](unsigned NumLanes, Register DestReg,
+ Register OffsetReg,
+ bool KillState) -> MachineInstrBuilder {
+ unsigned Opcode;
+ switch (NumLanes) {
+ case 4:
+ Opcode = AArch64::LDRSui;
+ break;
+ case 8:
+ Opcode = AArch64::LDRHui;
+ break;
+ case 16:
+ Opcode = AArch64::LDRBui;
+ break;
+ default:
+ llvm_unreachable(
+ "Got unsupported number of lanes in machine-combiner gather pattern");
+ }
+ // Immediate offset load
+ return BuildMI(MF, MIMetadata(Root), TII->get(Opcode), DestReg)
+ .addReg(OffsetReg)
+ .addImm(0);
+ };
+
+ // Load the remaining lanes into register 0.
+ auto LanesToLoadToReg0 =
+ llvm::make_range(LoadToLaneInstrsAscending.begin() + 1,
+ LoadToLaneInstrsAscending.begin() + NumLanes / 2);
+ Register PrevReg = SubregToReg->getOperand(0).getReg();
+ for (auto [Index, LoadInstr] : llvm::enumerate(LanesToLoadToReg0)) {
+ const MachineOperand &OffsetRegOperand = LoadInstr->getOperand(3);
+ PrevReg = CreateLD1Instruction(LoadInstr, PrevReg, Index + 1,
+ OffsetRegOperand.getReg(),
+ OffsetRegOperand.isKill());
+ DelInstrs.push_back(LoadInstr);
+ }
+ Register LastLoadReg0 = PrevReg;
+
+ // First load into register 1. Perform an integer load to zero out the upper
+ // lanes in a single instruction.
+ MachineInstr *Lane0Load = *LoadToLaneInstrsAscending.begin();
+ MachineInstr *OriginalSplitLoad =
+ *std::next(LoadToLaneInstrsAscending.begin(), NumLanes / 2);
+ Register DestRegForMiddleIndex = MRI.createVirtualRegister(
+ MRI.getRegClass(Lane0Load->getOperand(0).getReg()));
+
+ const MachineOperand &OriginalSplitToLoadOffsetOperand =
+ OriginalSplitLoad->getOperand(3);
+ MachineInstrBuilder MiddleIndexLoadInstr =
+ CreateLDRInstruction(NumLanes, DestRegForMiddleIndex,
+ OriginalSplitToLoadOffsetOperand.getReg(),
+ OriginalSplitToLoadOffsetOperand.isKill());
+
+ InstrIdxForVirtReg.insert(
+ std::make_pair(DestRegForMiddleIndex, InsInstrs.size()));
+ InsInstrs.push_back(MiddleIndexLoadInstr);
+ DelInstrs.push_back(OriginalSplitLoad);
+
+ // Subreg To Reg instruction for register 1.
+ Register DestRegForSubregToReg = MRI.createVirtualRegister(FPR128RegClass);
+ unsigned SubregType;
+ switch (NumLanes) {
+ case 4:
+ SubregType = AArch64::ssub;
+ break;
+ case 8:
+ SubregType = AArch64::hsub;
+ break;
+ case 16:
+ SubregType = AArch64::bsub;
+ break;
+ default:
+ llvm_unreachable(
+ "Got invalid NumLanes for machine-combiner gather pattern");
+ }
+
+ auto SubRegToRegInstr =
+ BuildMI(MF, MIMetadata(Root), TII->get(SubregToReg->getOpcode()),
+ DestRegForSubregToReg)
+ .addImm(0)
+ .addReg(DestRegForMiddleIndex, getKillRegState(true))
+ .addImm(SubregType);
+ InstrIdxForVirtReg.insert(
+ std::make_pair(DestRegForSubregToReg, InsInstrs.size()));
+ InsInstrs.push_back(SubRegToRegInstr);
+
+ // Load remaining lanes into register 1.
+ auto LanesToLoadToReg1 =
+ llvm::make_range(LoadToLaneInstrsAscending.begin() + NumLanes / 2 + 1,
+ LoadToLaneInstrsAscending.end());
+ PrevReg = SubRegToRegInstr->getOperand(0).getReg();
+ for (auto [Index, LoadInstr] : llvm::enumerate(LanesToLoadToReg1)) {
+ const MachineOperand &OffsetRegOperand = LoadInstr->getOperand(3);
+ PrevReg = CreateLD1Instruction(LoadInstr, PrevReg, Index + 1,
+ OffsetRegOperand.getReg(),
+ OffsetRegOperand.isKill());
+
+ // Do not add the last reg to DelInstrs - it will be removed later.
+ if (Index == NumLanes / 2 - 2) {
+ break;
+ }
+ DelInstrs.push_back(LoadInstr);
+ }
+ Register LastLoadReg1 = PrevReg;
+
+ // Create the final zip instruction to combine the results.
+ MachineInstrBuilder ZipInstr =
+ BuildMI(MF, MIMetadata(Root), TII->get(AArch64::ZIP1v2i64),
+ Root.getOperand(0).getReg())
+ .addReg(LastLoadReg0)
+ .addReg(LastLoadReg1);
+ InsInstrs.push_back(ZipInstr);
+}
+
CombinerObjective
AArch64InstrInfo::getCombinerObjective(unsigned Pattern) const {
switch (Pattern) {
case AArch64MachineCombinerPattern::SUBADD_OP1:
case AArch64MachineCombinerPattern::SUBADD_OP2:
+ case AArch64MachineCombinerPattern::GATHER_LANE_i32:
+ case AArch64MachineCombinerPattern::GATHER_LANE_i16:
+ case AArch64MachineCombinerPattern::GATHER_LANE_i8:
return CombinerObjective::MustReduceDepth;
default:
return TargetInstrInfo::getCombinerObjective(Pattern);
@@ -7446,6 +7761,10 @@ bool AArch64InstrInfo::getMachineCombinerPatterns(
if (getMiscPatterns(Root, Patterns))
return true;
+ // Load patterns
+ if (getLoadPatterns(Root, Patterns))
+ return true;
+
return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns,
DoRegPressureReduce);
}
@@ -8701,6 +9020,21 @@ void AArch64InstrInfo::genAlternativeCodeSequence(
MUL = genFNegatedMAD(MF, MRI, TII, Root, InsInstrs);
break;
}
+ case AArch64MachineCombinerPattern::GATHER_LANE_i32: {
+ generateGatherLanePattern(Root, InsInstrs, DelInstrs, InstrIdxForVirtReg,
+ Pattern, 4);
+ break;
+ }
+ case AArch64MachineCombinerPattern::GATHER_LANE_i16: {
+ generateGatherLanePattern(Root, InsInstrs, DelInstrs, InstrIdxForVirtReg,
+ Pattern, 8);
+ break;
+ }
+ case AArch64MachineCombinerPattern::GATHER_LANE_i8: {
+ generateGatherLanePattern(Root, InsInstrs, DelInstrs, InstrIdxForVirtReg,
+ Pattern, 16);
+ break;
+ }
} // end switch (Pattern)
// Record MUL and ADD/SUB for deletion
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.h b/llvm/lib/Target/AArch64/AArch64InstrInfo.h
index b903cd90c1e73..70c814a3a48c9 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.h
@@ -172,6 +172,10 @@ enum AArch64MachineCombinerPattern : unsigned {
FMULv8i16_indexed_OP2,
FNMADD,
+
+ GATHER_LANE_i32,
+ GATHER_LANE_i16,
+ GATHER_LANE_i8
};
class AArch64InstrInfo final : public AArch64GenInstrInfo {
const AArch64RegisterInfo RI;
diff --git a/llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes-limit-size.mir b/llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes-limit-size.mir
new file mode 100644
index 0000000000000..17c15124e787e
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes-limit-size.mir
@@ -0,0 +1,33 @@
+# RUN: llc -run-pass=machine-combiner -aarch64-search-limit=2 -mcpu=neoverse-n2 -mtriple=aarch64-none-linux-gnu -verify-machineinstrs %s -o - | FileCheck %s
+
+---
+name: negative_pattern_mbb_too_large
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3, $x4
+
+ ; CHECK-LABEL: name: negative_pattern_mbb_too_large
+ ; CHECK: [[COPY:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gpr64common = COPY $x4
+ ; CHECK-NEXT: [[LD_i32:%[0-9]+]]:fpr32 = LDRSroX [[COPY]], killed [[COPY1]], 0, 1
+ ; CHECK-NEXT: [[FIRST_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD_i32]], %subreg.ssub
+ ; CHECK-NEXT: [[LD_LANE_1:%[0-9]+]]:fpr128 = LD1i32 [[FIRST_REG]], 1, killed [[COPY2]]
+ ; CHECK-NEXT: [[LD_LANE_2:%[0-9]+]]:fpr128 = LD1i32 [[LD_LANE_1]], 2, killed [[COPY3]]
+ ; CHECK-NEXT: [[LD_LANE_3:%[0-9]+]]:fpr128 = LD1i32 [[LD_LANE_2]], 3, killed [[COPY4]]
+ ; CHECK-NEXT: $q0 = COPY [[LD_LANE_3]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %4:gpr64common = COPY $x4
+ %5:fpr32 = LDRSroX %0, killed %1, 0, 1
+ %6:fpr128 = SUBREG_TO_REG 0, killed %5, %subreg.ssub
+ %7:fpr128 = LD1i32 %6, 1, killed %2
+ %8:fpr128 = LD1i32 %7, 2, killed %3
+ %9:fpr128 = LD1i32 %8, 3, killed %4
+ $q0 = COPY %9
+ RET_ReallyLR implicit $q0
\ No newline at end of file
diff --git a/llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes-with-call.mir b/llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes-with-call.mir
new file mode 100644
index 0000000000000..6b338d98afb53
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes-with-call.mir
@@ -0,0 +1,45 @@
+# RUN: llc -run-pass=machine-combiner -mcpu=neoverse-n2 -mtriple=aarch64-none-linux-gnu -verify-machineinstrs %s -o - | FileCheck %s
+
+
+--- |
+ @external_func = external global i32
+ define void @negative_pattern_offset_reg_copied_to_physical(i64 %arg0, i64 %arg1, i64 %arg2, i64 %arg3, i64 %arg4) {
+ entry:
+ ret void
+ }
+...
+---
+name: negative_pattern_offset_reg_copied_to_physical
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3
+
+ ; CHECK-LABEL: name: negative_pattern_offset_reg_copied_to_physical
+ ; CHECK: [[BASE_REG:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[PTR_1:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[PTR_2:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[PTR_3:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[LD_i32:%[0-9]+]]:fpr32 = LDRSroX [[BASE_REG]], killed [[PTR_1]], 0, 1
+ ; CHECK-NEXT: [[LD_LANE_0:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD_i32]], %subreg.ssub
+ ; CHECK-NEXT: [[LD_LANE_1:%[0-9]+]]:fpr128 = LD1i32 [[LD_LANE_0]], 1, [[PTR_2]]
+ ; CHECK-NEXT: $x0 = COPY [[PTR_2]]
+ ; CHECK-NEXT: BL @external_func, csr_aarch64_aapcs, implicit-def $lr, implicit $x0, implicit-def $x0
+ ; CHECK-NEXT: [[LD_LANE_2:%[0-9]+]]:fpr128 = LD1i32 [[LD_LANE_1]], 2, killed [[PTR_2]]
+ ; CHECK-NEXT: [[LD_LANE_3:%[0-9]+]]:fpr128 = LD1i32 [[LD_LANE_2]], 3, killed [[PTR_3]]
+ ; CHECK-NEXT: [[RESULT:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: $q0 = COPY [[LD_LANE_3]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %5:fpr32 = LDRSroX %0, killed %1, 0, 1
+ %6:fpr128 = SUBREG_TO_REG 0, killed %5, %subreg.ssub
+ %7:fpr128 = LD1i32 %6, 1, %2
+ $x0 = COPY %2
+ BL @external_func, csr_aarch64_aapcs, implicit-def $lr, implicit $x0, implicit-def $x0
+ %8:fpr128 = LD1i32 %7, 2, killed %2
+ %9:fpr128 = LD1i32 %8, 3, killed %3
+ %10:gpr64common = COPY $x0
+ $q0 = COPY %9
+ RET_ReallyLR implicit $q0
\ No newline at end of file
diff --git a/llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes.mir b/llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes.mir
new file mode 100644
index 0000000000000..a7570d2293f8a
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/aarch64-combine-gather-lanes.mir
@@ -0,0 +1,400 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -run-pass=machine-combiner -mcpu=neoverse-n2 -mtriple=aarch64-none-linux-gnu -verify-machineinstrs %s -o - | FileCheck %s
+
+---
+name: split_loads_to_fpr128
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3, $x4
+
+ ; CHECK-LABEL: name: split_loads_to_fpr128
+ ; CHECK: [[COPY:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gpr64common = COPY $x4
+ ; CHECK-NEXT: [[LD_i32:%[0-9]+]]:fpr32 = LDRSroX [[COPY]], [[COPY1]], 0, 1
+ ; CHECK-NEXT: [[FIRST_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, [[LD_i32]], %subreg.ssub
+ ; CHECK-NEXT: [[LD0_1:%[0-9]+]]:fpr128 = LD1i32 [[FIRST_REG]], 1, [[COPY2]]
+ ; CHECK-NEXT: [[LD1_0:%[0-9]+]]:fpr32 = LDRSui [[COPY3]], 0
+ ; CHECK-NEXT: [[SECOND_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD1_0]], %subreg.ssub
+ ; CHECK-NEXT: [[LD1_1:%[0-9]+]]:fpr128 = LD1i32 [[SECOND_REG]], 1, [[COPY4]]
+ ; CHECK-NEXT: [[ZIP:%[0-9]+]]:fpr128 = ZIP1v2i64 [[LD0_1]], [[LD1_1]]
+ ; CHECK-NEXT: $q0 = COPY [[ZIP]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %4:gpr64common = COPY $x4
+ %5:fpr32 = LDRSroX %0, %1, 0, 1
+ %6:fpr128 = SUBREG_TO_REG 0, %5, %subreg.ssub
+ %7:fpr128 = LD1i32 %6, 1, %2
+ %8:fpr128 = LD1i32 %7, 2, %3
+ %9:fpr128 = LD1i32 %8, 3, %4
+ $q0 = COPY %9
+ RET_ReallyLR implicit $q0
+
+---
+name: split_loads_to_fpr128_ui
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3, $x4
+
+ ; CHECK-LABEL: name: split_loads_to_fpr128_ui
+ ; CHECK: [[COPY:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gpr64common = COPY $x4
+ ; CHECK-NEXT: [[LD_i32:%[0-9]+]]:fpr32 = LDRSui [[COPY]], 0
+ ; CHECK-NEXT: [[FIRST_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD_i32]], %subreg.ssub
+ ; CHECK-NEXT: [[LD0_1:%[0-9]+]]:fpr128 = LD1i32 [[FIRST_REG]], 1, killed [[COPY1]]
+ ; CHECK-NEXT: [[LD1_0:%[0-9]+]]:fpr32 = LDRSui [[COPY2]], 0
+ ; CHECK-NEXT: [[SECOND_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD1_0]], %subreg.ssub
+ ; CHECK-NEXT: [[LD1_1:%[0-9]+]]:fpr128 = LD1i32 [[SECOND_REG]], 1, killed [[COPY3]]
+ ; CHECK-NEXT: [[ZIP:%[0-9]+]]:fpr128 = ZIP1v2i64 [[LD0_1]], [[LD1_1]]
+ ; CHECK-NEXT: $q0 = COPY [[ZIP]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %4:gpr64common = COPY $x4
+ %5:fpr32 = LDRSui %0, 0
+ %6:fpr128 = SUBREG_TO_REG 0, killed %5, %subreg.ssub
+ %7:fpr128 = LD1i32 %6, 1, killed %1
+ %8:fpr128 = LD1i32 %7, 2, killed %2
+ %9:fpr128 = LD1i32 %8, 3, killed %3
+ $q0 = COPY %9
+ RET_ReallyLR implicit $q0
+
+---
+name: split_loads_to_fpr128_i16
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3, $x4, $x5, $x6, $x7, $x8
+
+ ; CHECK-LABEL: name: split_loads_to_fpr128_i16
+ ; CHECK: [[COPY:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gpr64common = COPY $x4
+ ; CHECK-NEXT: [[COPY5:%[0-9]+]]:gpr64common = COPY $x5
+ ; CHECK-NEXT: [[COPY6:%[0-9]+]]:gpr64common = COPY $x6
+ ; CHECK-NEXT: [[COPY7:%[0-9]+]]:gpr64common = COPY $x7
+ ; CHECK-NEXT: [[COPY8:%[0-9]+]]:gpr64common = COPY $x8
+ ; CHECK-NEXT: [[LD_i16:%[0-9]+]]:fpr16 = LDRHroX [[COPY]], killed [[COPY1]], 0, 1
+ ; CHECK-NEXT: [[FIRST_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD_i16]], %subreg.hsub
+ ; CHECK-NEXT: [[LD0_1:%[0-9]+]]:fpr128 = LD1i16 [[FIRST_REG]], 1, killed [[COPY2]]
+ ; CHECK-NEXT: [[LD0_2:%[0-9]+]]:fpr128 = LD1i16 [[LD0_1]], 2, killed [[COPY3]]
+ ; CHECK-NEXT: [[LD0_3:%[0-9]+]]:fpr128 = LD1i16 [[LD0_2]], 3, killed [[COPY4]]
+ ; CHECK-NEXT: [[LD1_0:%[0-9]+]]:fpr16 = LDRHui [[COPY5]], 0
+ ; CHECK-NEXT: [[SECOND_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD1_0]], %subreg.hsub
+ ; CHECK-NEXT: [[LD1_1:%[0-9]+]]:fpr128 = LD1i16 [[SECOND_REG]], 1, killed [[COPY6]]
+ ; CHECK-NEXT: [[LD1_2:%[0-9]+]]:fpr128 = LD1i16 [[LD1_1]], 2, killed [[COPY7]]
+ ; CHECK-NEXT: [[LD1_3:%[0-9]+]]:fpr128 = LD1i16 [[LD1_2]], 3, killed [[COPY8]]
+ ; CHECK-NEXT: [[ZIP:%[0-9]+]]:fpr128 = ZIP1v2i64 [[LD0_3]], [[LD1_3]]
+ ; CHECK-NEXT: $q0 = COPY [[ZIP]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %4:gpr64common = COPY $x4
+ %5:gpr64common = COPY $x5
+ %6:gpr64common = COPY $x6
+ %7:gpr64common = COPY $x7
+ %8:gpr64common = COPY $x8
+ %9:fpr16 = LDRHroX %0, killed %1, 0, 1
+ %10:fpr128 = SUBREG_TO_REG 0, killed %9, %subreg.hsub
+ %11:fpr128 = LD1i16 %10, 1, killed %2
+ %12:fpr128 = LD1i16 %11, 2, killed %3
+ %13:fpr128 = LD1i16 %12, 3, killed %4
+ %14:fpr128 = LD1i16 %13, 4, killed %5
+ %15:fpr128 = LD1i16 %14, 5, killed %6
+ %16:fpr128 = LD1i16 %15, 6, killed %7
+ %17:fpr128 = LD1i16 %16, 7, killed %8
+ $q0 = COPY %17
+ RET_ReallyLR implicit $q0
+
+---
+name: split_loads_to_fpr128_i16_ui
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3, $x4, $x5, $x6, $x7, $x8
+
+ ; CHECK-LABEL: name: split_loads_to_fpr128_i16_ui
+ ; CHECK: [[COPY:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gpr64common = COPY $x4
+ ; CHECK-NEXT: [[COPY5:%[0-9]+]]:gpr64common = COPY $x5
+ ; CHECK-NEXT: [[COPY6:%[0-9]+]]:gpr64common = COPY $x6
+ ; CHECK-NEXT: [[COPY7:%[0-9]+]]:gpr64common = COPY $x7
+ ; CHECK-NEXT: [[COPY8:%[0-9]+]]:gpr64common = COPY $x8
+ ; CHECK-NEXT: [[LD_i16:%[0-9]+]]:fpr16 = LDRHui [[COPY]], 0
+ ; CHECK-NEXT: [[FIRST_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD_i16]], %subreg.hsub
+ ; CHECK-NEXT: [[LD0_1:%[0-9]+]]:fpr128 = LD1i16 [[FIRST_REG]], 1, killed [[COPY1]]
+ ; CHECK-NEXT: [[LD0_2:%[0-9]+]]:fpr128 = LD1i16 [[LD0_1]], 2, killed [[COPY2]]
+ ; CHECK-NEXT: [[LD0_3:%[0-9]+]]:fpr128 = LD1i16 [[LD0_2]], 3, killed [[COPY3]]
+ ; CHECK-NEXT: [[LD1_0:%[0-9]+]]:fpr16 = LDRHui [[COPY4]], 0
+ ; CHECK-NEXT: [[SECOND_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD1_0]], %subreg.hsub
+ ; CHECK-NEXT: [[LD1_1:%[0-9]+]]:fpr128 = LD1i16 [[SECOND_REG]], 1, killed [[COPY5]]
+ ; CHECK-NEXT: [[LD1_2:%[0-9]+]]:fpr128 = LD1i16 [[LD1_1]], 2, killed [[COPY6]]
+ ; CHECK-NEXT: [[LD1_3:%[0-9]+]]:fpr128 = LD1i16 [[LD1_2]], 3, killed [[COPY7]]
+ ; CHECK-NEXT: [[ZIP:%[0-9]+]]:fpr128 = ZIP1v2i64 [[LD0_3]], [[LD1_3]]
+ ; CHECK-NEXT: $q0 = COPY [[ZIP]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %4:gpr64common = COPY $x4
+ %5:gpr64common = COPY $x5
+ %6:gpr64common = COPY $x6
+ %7:gpr64common = COPY $x7
+ %8:gpr64common = COPY $x8
+ %9:fpr16 = LDRHui %0, 0
+ %10:fpr128 = SUBREG_TO_REG 0, killed %9, %subreg.hsub
+ %11:fpr128 = LD1i16 %10, 1, killed %1
+ %12:fpr128 = LD1i16 %11, 2, killed %2
+ %13:fpr128 = LD1i16 %12, 3, killed %3
+ %14:fpr128 = LD1i16 %13, 4, killed %4
+ %15:fpr128 = LD1i16 %14, 5, killed %5
+ %16:fpr128 = LD1i16 %15, 6, killed %6
+ %17:fpr128 = LD1i16 %16, 7, killed %7
+ $q0 = COPY %17
+ RET_ReallyLR implicit $q0
+
+---
+name: split_loads_to_fpr128_i8
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3, $x4, $x5, $x6, $x7, $x8, $x9, $x10, $x11, $x12, $x13, $x14, $x15, $x16
+
+ ; CHECK-LABEL: name: split_loads_to_fpr128_i8
+ ; CHECK: [[COPY:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gpr64common = COPY $x4
+ ; CHECK-NEXT: [[COPY5:%[0-9]+]]:gpr64common = COPY $x5
+ ; CHECK-NEXT: [[COPY6:%[0-9]+]]:gpr64common = COPY $x6
+ ; CHECK-NEXT: [[COPY7:%[0-9]+]]:gpr64common = COPY $x7
+ ; CHECK-NEXT: [[COPY8:%[0-9]+]]:gpr64common = COPY $x8
+ ; CHECK-NEXT: [[COPY9:%[0-9]+]]:gpr64common = COPY $x9
+ ; CHECK-NEXT: [[COPY10:%[0-9]+]]:gpr64common = COPY $x10
+ ; CHECK-NEXT: [[COPY11:%[0-9]+]]:gpr64common = COPY $x11
+ ; CHECK-NEXT: [[COPY12:%[0-9]+]]:gpr64common = COPY $x12
+ ; CHECK-NEXT: [[COPY13:%[0-9]+]]:gpr64common = COPY $x13
+ ; CHECK-NEXT: [[COPY14:%[0-9]+]]:gpr64common = COPY $x14
+ ; CHECK-NEXT: [[COPY15:%[0-9]+]]:gpr64common = COPY $x15
+ ; CHECK-NEXT: [[COPY16:%[0-9]+]]:gpr64common = COPY $x16
+ ; CHECK-NEXT: [[LD_i8:%[0-9]+]]:fpr8 = LDRBroX [[COPY]], killed [[COPY1]], 0, 0
+ ; CHECK-NEXT: [[FIRST_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD_i8]], %subreg.bsub
+ ; CHECK-NEXT: [[LD0_1:%[0-9]+]]:fpr128 = LD1i8 [[FIRST_REG]], 1, killed [[COPY2]]
+ ; CHECK-NEXT: [[LD0_2:%[0-9]+]]:fpr128 = LD1i8 [[LD0_1]], 2, killed [[COPY3]]
+ ; CHECK-NEXT: [[LD0_3:%[0-9]+]]:fpr128 = LD1i8 [[LD0_2]], 3, killed [[COPY4]]
+ ; CHECK-NEXT: [[LD0_4:%[0-9]+]]:fpr128 = LD1i8 [[LD0_3]], 4, killed [[COPY5]]
+ ; CHECK-NEXT: [[LD0_5:%[0-9]+]]:fpr128 = LD1i8 [[LD0_4]], 5, killed [[COPY6]]
+ ; CHECK-NEXT: [[LD0_6:%[0-9]+]]:fpr128 = LD1i8 [[LD0_5]], 6, killed [[COPY7]]
+ ; CHECK-NEXT: [[LD0_7:%[0-9]+]]:fpr128 = LD1i8 [[LD0_6]], 7, killed [[COPY8]]
+ ; CHECK-NEXT: [[LD1_0:%[0-9]+]]:fpr8 = LDRBui [[COPY9]], 0
+ ; CHECK-NEXT: [[SECOND_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD1_0]], %subreg.bsub
+ ; CHECK-NEXT: [[LD1_1:%[0-9]+]]:fpr128 = LD1i8 [[SECOND_REG]], 1, killed [[COPY10]]
+ ; CHECK-NEXT: [[LD1_2:%[0-9]+]]:fpr128 = LD1i8 [[LD1_1]], 2, killed [[COPY11]]
+ ; CHECK-NEXT: [[LD1_3:%[0-9]+]]:fpr128 = LD1i8 [[LD1_2]], 3, killed [[COPY12]]
+ ; CHECK-NEXT: [[LD1_4:%[0-9]+]]:fpr128 = LD1i8 [[LD1_3]], 4, killed [[COPY13]]
+ ; CHECK-NEXT: [[LD1_5:%[0-9]+]]:fpr128 = LD1i8 [[LD1_4]], 5, killed [[COPY14]]
+ ; CHECK-NEXT: [[LD1_6:%[0-9]+]]:fpr128 = LD1i8 [[LD1_5]], 6, killed [[COPY15]]
+ ; CHECK-NEXT: [[LD1_7:%[0-9]+]]:fpr128 = LD1i8 [[LD1_6]], 7, killed [[COPY16]]
+ ; CHECK-NEXT: [[ZIP:%[0-9]+]]:fpr128 = ZIP1v2i64 [[LD0_7]], [[LD1_7]]
+ ; CHECK-NEXT: $q0 = COPY [[ZIP]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %4:gpr64common = COPY $x4
+ %5:gpr64common = COPY $x5
+ %6:gpr64common = COPY $x6
+ %7:gpr64common = COPY $x7
+ %8:gpr64common = COPY $x8
+ %9:gpr64common = COPY $x9
+ %10:gpr64common = COPY $x10
+ %11:gpr64common = COPY $x11
+ %12:gpr64common = COPY $x12
+ %13:gpr64common = COPY $x13
+ %14:gpr64common = COPY $x14
+ %15:gpr64common = COPY $x15
+ %16:gpr64common = COPY $x16
+ %17:fpr8 = LDRBroX %0, killed %1, 0, 0
+ %18:fpr128 = SUBREG_TO_REG 0, killed %17, %subreg.bsub
+ %19:fpr128 = LD1i8 %18, 1, killed %2
+ %20:fpr128 = LD1i8 %19, 2, killed %3
+ %21:fpr128 = LD1i8 %20, 3, killed %4
+ %22:fpr128 = LD1i8 %21, 4, killed %5
+ %23:fpr128 = LD1i8 %22, 5, killed %6
+ %24:fpr128 = LD1i8 %23, 6, killed %7
+ %25:fpr128 = LD1i8 %24, 7, killed %8
+ %26:fpr128 = LD1i8 %25, 8, killed %9
+ %27:fpr128 = LD1i8 %26, 9, killed %10
+ %28:fpr128 = LD1i8 %27, 10, killed %11
+ %29:fpr128 = LD1i8 %28, 11, killed %12
+ %30:fpr128 = LD1i8 %29, 12, killed %13
+ %31:fpr128 = LD1i8 %30, 13, killed %14
+ %32:fpr128 = LD1i8 %31, 14, killed %15
+ %33:fpr128 = LD1i8 %32, 15, killed %16
+ $q0 = COPY %33
+ RET_ReallyLR implicit $q0
+
+---
+name: negative_pattern_missing_lanes
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1
+
+ ; CHECK-LABEL: name: negative_pattern_missing_lanes
+ ; CHECK: [[LD1:%.*]]:fpr128 = LDRQui $x1, 0
+ ; CHECK-NEXT: [[LD2:%.*]]:fpr128 = LD1i32 [[LD1]]
+
+ %0:gpr64common = COPY $x0
+ %1:fpr128 = LDRQui $x1, 0
+ %2:fpr128 = LD1i32 %1, 3, %0
+ $q0 = COPY %2
+ RET_ReallyLR implicit $q0
+
+---
+name: out_of_order_lanes
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3, $x4
+
+ ; CHECK-LABEL: name: out_of_order_lanes
+ ; CHECK: [[COPY:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gpr64common = COPY $x4
+ ; CHECK-NEXT: [[LD_i32:%[0-9]+]]:fpr32 = LDRSroX [[COPY]], killed [[COPY1]], 0, 1
+ ; CHECK-NEXT: [[FIRST_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD_i32]], %subreg.ssub
+ ; CHECK-NEXT: [[LD0_1:%[0-9]+]]:fpr128 = LD1i32 [[FIRST_REG]], 1, killed [[COPY3]]
+ ; CHECK-NEXT: [[LD1_0:%[0-9]+]]:fpr32 = LDRSui [[COPY2]], 0
+ ; CHECK-NEXT: [[SECOND_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD1_0]], %subreg.ssub
+ ; CHECK-NEXT: [[LD1_1:%[0-9]+]]:fpr128 = LD1i32 [[SECOND_REG]], 1, killed [[COPY4]]
+ ; CHECK-NEXT: [[ZIP:%[0-9]+]]:fpr128 = ZIP1v2i64 [[LD0_1]], [[LD1_1]]
+ ; CHECK-NEXT: $q0 = COPY [[ZIP]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %4:gpr64common = COPY $x4
+ %5:fpr32 = LDRSroX %0, killed %1, 0, 1
+ %6:fpr128 = SUBREG_TO_REG 0, killed %5, %subreg.ssub
+ %7:fpr128 = LD1i32 %6, 2, killed %2
+ %8:fpr128 = LD1i32 %7, 1, killed %3
+ %9:fpr128 = LD1i32 %8, 3, killed %4
+ $q0 = COPY %9
+ RET_ReallyLR implicit $q0
+
+---
+name: negative_pattern_no_subreg_to_reg
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3
+
+ ; CHECK-LABEL: name: negative_pattern_no_subreg_to_reg
+ ; CHECK: [[COPY:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[INITIAL_VEC:%[0-9]+]]:fpr128 = LDRQui [[COPY]], 0
+ ; CHECK-NEXT: [[LD_LANE_1:%[0-9]+]]:fpr128 = LD1i32 [[INITIAL_VEC]], 1, killed [[COPY1]]
+ ; CHECK-NEXT: [[LD_LANE_2:%[0-9]+]]:fpr128 = LD1i32 [[LD_LANE_1]], 2, killed [[COPY2]]
+ ; CHECK-NEXT: [[LD_LANE_3:%[0-9]+]]:fpr128 = LD1i32 [[LD_LANE_2]], 3, killed [[COPY3]]
+ ; CHECK-NEXT: $q0 = COPY [[LD_LANE_3]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %4:fpr128 = LDRQui %0, 0
+ %5:fpr128 = LD1i32 %4, 1, killed %1
+ %6:fpr128 = LD1i32 %5, 2, killed %2
+ %7:fpr128 = LD1i32 %6, 3, killed %3
+ $q0 = COPY %7
+ RET_ReallyLR implicit $q0
+
+---
+name: negative_pattern_multiple_users
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3, $x4
+
+ ; CHECK-LABEL: name: negative_pattern_multiple_users
+ ; CHECK: [[COPY:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gpr64common = COPY $x4
+ ; CHECK-NEXT: [[LD_i32:%[0-9]+]]:fpr32 = LDRSroX [[COPY]], killed [[COPY1]], 0, 1
+ ; CHECK-NEXT: [[FIRST_REG:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LD_i32]], %subreg.ssub
+ ; CHECK-NEXT: [[LD_LANE_1:%[0-9]+]]:fpr128 = LD1i32 [[FIRST_REG]], 1, killed [[COPY2]]
+ ; CHECK-NEXT: [[LD_LANE_2:%[0-9]+]]:fpr128 = LD1i32 [[LD_LANE_1]], 2, killed [[COPY3]]
+ ; CHECK-NEXT: [[LD_LANE_3:%[0-9]+]]:fpr128 = LD1i32 [[LD_LANE_2]], 3, killed [[COPY4]]
+ ; CHECK-NEXT: $q0 = COPY [[LD_LANE_3]]
+ ; CHECK-NEXT: $q1 = COPY [[LD_LANE_2]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0, implicit $q1
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %4:gpr64common = COPY $x4
+ %5:fpr32 = LDRSroX %0, killed %1, 0, 1
+ %6:fpr128 = SUBREG_TO_REG 0, killed %5, %subreg.ssub
+ %7:fpr128 = LD1i32 %6, 1, killed %2
+ %8:fpr128 = LD1i32 %7, 2, killed %3
+ %9:fpr128 = LD1i32 %8, 3, killed %4
+ $q0 = COPY %9
+ $q1 = COPY %8
+ RET_ReallyLR implicit $q0, implicit $q1
+
+---
+name: aliasing_store_between_vector_loads
+alignment: 4
+tracksRegLiveness: true
+body: |
+ bb.0.entry:
+ liveins: $x0, $x1, $x2, $x3
+
+ ; CHECK-LABEL: name: aliasing_store_between_vector_loads
+ ; CHECK: [[BASE_PTR:%[0-9]+]]:gpr64common = COPY $x0
+ ; CHECK-NEXT: [[OFFSET_PTR:%[0-9]+]]:gpr64common = COPY $x1
+ ; CHECK-NEXT: [[ALIAS_ADDR:%[0-9]+]]:gpr64common = COPY $x2
+ ; CHECK-NEXT: [[OTHER_ADDR:%[0-9]+]]:gpr64common = COPY $x3
+ ; CHECK-NEXT: [[LOAD0:%[0-9]+]]:fpr32 = LDRSroX [[BASE_PTR]], killed [[OFFSET_PTR]], 0, 1
+ ; CHECK-NEXT: [[VEC0:%[0-9]+]]:fpr128 = SUBREG_TO_REG 0, killed [[LOAD0]], %subreg.ssub
+ ; CHECK-NEXT: [[VEC1:%[0-9]+]]:fpr128 = LD1i32 [[VEC0]], 1, [[ALIAS_ADDR]]
+ ; CHECK-NEXT: [[CONST:%[0-9]+]]:gpr32 = MOVi32imm 99
+ ; CHECK-NEXT: STRWui [[CONST]], [[ALIAS_ADDR]], 0
+ ; CHECK-NEXT: [[VEC2:%[0-9]+]]:fpr128 = LD1i32 [[VEC1]], 2, killed [[ALIAS_ADDR]]
+ ; CHECK-NEXT: [[VEC3:%[0-9]+]]:fpr128 = LD1i32 [[VEC2]], 3, killed [[OTHER_ADDR]]
+ ; CHECK-NEXT: $q0 = COPY [[VEC3]]
+ ; CHECK-NEXT: RET_ReallyLR implicit $q0
+ %0:gpr64common = COPY $x0
+ %1:gpr64common = COPY $x1
+ %2:gpr64common = COPY $x2
+ %3:gpr64common = COPY $x3
+ %5:fpr32 = LDRSroX %0, killed %1, 0, 1
+ %6:fpr128 = SUBREG_TO_REG 0, killed %5, %subreg.ssub
+ %7:fpr128 = LD1i32 %6, 1, %2
+ %10:gpr32 = MOVi32imm 99
+ STRWui %10, %2, 0
+ %8:fpr128 = LD1i32 %7, 2, killed %2
+ %9:fpr128 = LD1i32 %8, 3, killed %3
+ $q0 = COPY %9
+ RET_ReallyLR implicit $q0
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-uniform-cases.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-uniform-cases.ll
index 7686740aec302..13434fabefa78 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-uniform-cases.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-uniform-cases.ll
@@ -203,89 +203,93 @@ define <12 x float> @abp90c12(<12 x float> %a, <12 x float> %b, <12 x float> %c)
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: // kill: def $s1 killed $s1 def $q1
; CHECK-NEXT: // kill: def $s3 killed $s3 def $q3
-; CHECK-NEXT: ldr s17, [sp, #40]
-; CHECK-NEXT: add x10, sp, #56
; CHECK-NEXT: // kill: def $s0 killed $s0 def $q0
+; CHECK-NEXT: // kill: def $s2 killed $s2 def $q2
+; CHECK-NEXT: ldr s17, [sp, #32]
+; CHECK-NEXT: // kill: def $s5 killed $s5 def $q5
; CHECK-NEXT: add x9, sp, #48
+; CHECK-NEXT: add x10, sp, #64
; CHECK-NEXT: mov v1.s[1], v3.s[0]
-; CHECK-NEXT: ldr s3, [sp, #32]
-; CHECK-NEXT: // kill: def $s2 killed $s2 def $q2
; CHECK-NEXT: mov v0.s[1], v2.s[0]
-; CHECK-NEXT: ld1 { v17.s }[1], [x10]
-; CHECK-NEXT: // kill: def $s5 killed $s5 def $q5
-; CHECK-NEXT: ldr s16, [sp, #8]
; CHECK-NEXT: // kill: def $s4 killed $s4 def $q4
-; CHECK-NEXT: add x10, sp, #24
-; CHECK-NEXT: ld1 { v3.s }[1], [x9]
-; CHECK-NEXT: add x9, sp, #72
-; CHECK-NEXT: // kill: def $s7 killed $s7 def $q7
+; CHECK-NEXT: add x11, sp, #72
+; CHECK-NEXT: ld1 { v17.s }[1], [x9]
+; CHECK-NEXT: ldr s18, [x10]
+; CHECK-NEXT: add x9, sp, #80
+; CHECK-NEXT: add x10, sp, #56
; CHECK-NEXT: // kill: def $s6 killed $s6 def $q6
+; CHECK-NEXT: // kill: def $s7 killed $s7 def $q7
+; CHECK-NEXT: ldr s16, [sp, #8]
+; CHECK-NEXT: ldr s3, [sp, #96]
+; CHECK-NEXT: ld1 { v18.s }[1], [x9]
+; CHECK-NEXT: add x9, sp, #88
; CHECK-NEXT: ldr s2, [sp]
-; CHECK-NEXT: ld1 { v16.s }[1], [x10]
-; CHECK-NEXT: add x10, sp, #112
-; CHECK-NEXT: ldr s20, [sp, #136]
; CHECK-NEXT: mov v1.s[2], v5.s[0]
-; CHECK-NEXT: ld1 { v17.s }[2], [x9]
-; CHECK-NEXT: add x9, sp, #64
-; CHECK-NEXT: ldr s5, [sp, #96]
-; CHECK-NEXT: ld1 { v3.s }[2], [x9]
+; CHECK-NEXT: ldr s5, [sp, #40]
; CHECK-NEXT: mov v0.s[2], v4.s[0]
-; CHECK-NEXT: add x9, sp, #88
-; CHECK-NEXT: ldr s4, [sp, #104]
-; CHECK-NEXT: ldr s19, [sp, #192]
; CHECK-NEXT: ld1 { v5.s }[1], [x10]
-; CHECK-NEXT: add x10, sp, #80
-; CHECK-NEXT: ld1 { v17.s }[3], [x9]
-; CHECK-NEXT: mov v1.s[3], v7.s[0]
-; CHECK-NEXT: add x9, sp, #120
-; CHECK-NEXT: ld1 { v3.s }[3], [x10]
-; CHECK-NEXT: ld1 { v4.s }[1], [x9]
-; CHECK-NEXT: ldr s7, [sp, #128]
+; CHECK-NEXT: ldr s19, [x11]
; CHECK-NEXT: add x10, sp, #144
+; CHECK-NEXT: zip1 v4.2d, v17.2d, v18.2d
+; CHECK-NEXT: add x11, sp, #160
+; CHECK-NEXT: ldr s18, [sp, #136]
+; CHECK-NEXT: ld1 { v19.s }[1], [x9]
; CHECK-NEXT: mov v0.s[3], v6.s[0]
-; CHECK-NEXT: add x9, sp, #16
+; CHECK-NEXT: ldr s6, [sp, #128]
+; CHECK-NEXT: mov v1.s[3], v7.s[0]
+; CHECK-NEXT: add x9, sp, #24
+; CHECK-NEXT: ldr s7, [sp, #104]
+; CHECK-NEXT: ld1 { v16.s }[1], [x9]
+; CHECK-NEXT: add x9, sp, #112
+; CHECK-NEXT: ld1 { v6.s }[1], [x10]
+; CHECK-NEXT: zip1 v5.2d, v5.2d, v19.2d
+; CHECK-NEXT: add x10, sp, #120
+; CHECK-NEXT: ld1 { v3.s }[1], [x9]
; CHECK-NEXT: ld1 { v7.s }[1], [x10]
-; CHECK-NEXT: ld1 { v2.s }[1], [x9]
-; CHECK-NEXT: add x9, sp, #160
-; CHECK-NEXT: fmul v6.4s, v17.4s, v1.4s
-; CHECK-NEXT: fmul v18.4s, v4.4s, v16.4s
-; CHECK-NEXT: fmul v16.4s, v5.4s, v16.4s
-; CHECK-NEXT: fmul v1.4s, v3.4s, v1.4s
-; CHECK-NEXT: add x10, sp, #208
-; CHECK-NEXT: ld1 { v7.s }[2], [x9]
-; CHECK-NEXT: add x9, sp, #152
-; CHECK-NEXT: ld1 { v19.s }[1], [x10]
-; CHECK-NEXT: ld1 { v20.s }[1], [x9]
+; CHECK-NEXT: ldr s17, [x11]
; CHECK-NEXT: add x9, sp, #176
-; CHECK-NEXT: add x10, sp, #184
-; CHECK-NEXT: fneg v6.4s, v6.4s
-; CHECK-NEXT: fneg v18.4s, v18.4s
-; CHECK-NEXT: fmla v16.4s, v2.4s, v4.4s
-; CHECK-NEXT: fmla v1.4s, v0.4s, v17.4s
-; CHECK-NEXT: ld1 { v7.s }[3], [x9]
-; CHECK-NEXT: add x9, sp, #168
-; CHECK-NEXT: ld1 { v20.s }[2], [x9]
-; CHECK-NEXT: ldr s4, [sp, #200]
+; CHECK-NEXT: add x10, sp, #16
+; CHECK-NEXT: add x11, sp, #168
+; CHECK-NEXT: ld1 { v17.s }[1], [x9]
+; CHECK-NEXT: ld1 { v2.s }[1], [x10]
+; CHECK-NEXT: add x9, sp, #152
+; CHECK-NEXT: fmul v19.4s, v5.4s, v1.4s
+; CHECK-NEXT: fmul v20.4s, v7.4s, v16.4s
+; CHECK-NEXT: fmul v16.4s, v3.4s, v16.4s
+; CHECK-NEXT: fmul v1.4s, v4.4s, v1.4s
+; CHECK-NEXT: ld1 { v18.s }[1], [x9]
+; CHECK-NEXT: ldr s21, [x11]
+; CHECK-NEXT: zip1 v6.2d, v6.2d, v17.2d
+; CHECK-NEXT: ldr s17, [sp, #192]
+; CHECK-NEXT: add x9, sp, #184
+; CHECK-NEXT: add x10, sp, #208
+; CHECK-NEXT: ld1 { v21.s }[1], [x9]
; CHECK-NEXT: add x9, sp, #216
-; CHECK-NEXT: fmla v6.4s, v0.4s, v3.4s
-; CHECK-NEXT: fmla v18.4s, v2.4s, v5.4s
-; CHECK-NEXT: ld1 { v4.s }[1], [x9]
-; CHECK-NEXT: fsub v0.4s, v7.4s, v1.4s
-; CHECK-NEXT: fsub v1.4s, v19.4s, v16.4s
-; CHECK-NEXT: ld1 { v20.s }[3], [x10]
-; CHECK-NEXT: fadd v2.4s, v4.4s, v18.4s
-; CHECK-NEXT: fadd v3.4s, v20.4s, v6.4s
+; CHECK-NEXT: fneg v19.4s, v19.4s
+; CHECK-NEXT: fneg v20.4s, v20.4s
+; CHECK-NEXT: fmla v16.4s, v2.4s, v7.4s
+; CHECK-NEXT: fmla v1.4s, v0.4s, v5.4s
+; CHECK-NEXT: ld1 { v17.s }[1], [x10]
+; CHECK-NEXT: ldr s5, [sp, #200]
+; CHECK-NEXT: zip1 v7.2d, v18.2d, v21.2d
+; CHECK-NEXT: ld1 { v5.s }[1], [x9]
+; CHECK-NEXT: fmla v19.4s, v0.4s, v4.4s
+; CHECK-NEXT: fmla v20.4s, v2.4s, v3.4s
+; CHECK-NEXT: fsub v0.4s, v6.4s, v1.4s
+; CHECK-NEXT: fsub v1.4s, v17.4s, v16.4s
+; CHECK-NEXT: fadd v2.4s, v7.4s, v19.4s
+; CHECK-NEXT: fadd v3.4s, v5.4s, v20.4s
; CHECK-NEXT: ext v4.16b, v0.16b, v1.16b, #12
-; CHECK-NEXT: ext v5.16b, v3.16b, v2.16b, #12
-; CHECK-NEXT: trn2 v1.4s, v1.4s, v2.4s
+; CHECK-NEXT: ext v5.16b, v2.16b, v3.16b, #12
+; CHECK-NEXT: trn2 v1.4s, v1.4s, v3.4s
; CHECK-NEXT: ext v4.16b, v0.16b, v4.16b, #12
-; CHECK-NEXT: ext v5.16b, v3.16b, v5.16b, #8
+; CHECK-NEXT: ext v5.16b, v2.16b, v5.16b, #8
; CHECK-NEXT: rev64 v4.4s, v4.4s
-; CHECK-NEXT: trn2 v2.4s, v4.4s, v5.4s
-; CHECK-NEXT: zip2 v4.4s, v0.4s, v3.4s
-; CHECK-NEXT: zip1 v0.4s, v0.4s, v3.4s
-; CHECK-NEXT: ext v1.16b, v2.16b, v1.16b, #8
-; CHECK-NEXT: mov v4.d[1], v2.d[0]
+; CHECK-NEXT: trn2 v3.4s, v4.4s, v5.4s
+; CHECK-NEXT: zip2 v4.4s, v0.4s, v2.4s
+; CHECK-NEXT: zip1 v0.4s, v0.4s, v2.4s
+; CHECK-NEXT: ext v1.16b, v3.16b, v1.16b, #8
+; CHECK-NEXT: mov v4.d[1], v3.d[0]
; CHECK-NEXT: str q0, [x8]
; CHECK-NEXT: stp q4, q1, [x8, #16]
; CHECK-NEXT: ret
diff --git a/llvm/test/CodeGen/AArch64/concat-vector.ll b/llvm/test/CodeGen/AArch64/concat-vector.ll
index acf15f1bd1178..e6f27b95d92c8 100644
--- a/llvm/test/CodeGen/AArch64/concat-vector.ll
+++ b/llvm/test/CodeGen/AArch64/concat-vector.ll
@@ -186,8 +186,9 @@ define <16 x i8> @concat_v16s8_v4s8_load(ptr %ptrA, ptr %ptrB, ptr %ptrC, ptr %p
; CHECK: // %bb.0:
; CHECK-NEXT: ldr s0, [x0]
; CHECK-NEXT: ld1 { v0.s }[1], [x1]
-; CHECK-NEXT: ld1 { v0.s }[2], [x2]
-; CHECK-NEXT: ld1 { v0.s }[3], [x3]
+; CHECK-NEXT: ldr s1, [x2]
+; CHECK-NEXT: ld1 { v1.s }[1], [x3]
+; CHECK-NEXT: zip1 v0.2d, v0.2d, v1.2d
; CHECK-NEXT: ret
%A = load <4 x i8>, ptr %ptrA
%B = load <4 x i8>, ptr %ptrB
diff --git a/llvm/test/CodeGen/AArch64/fp-maximumnum-minimumnum.ll b/llvm/test/CodeGen/AArch64/fp-maximumnum-minimumnum.ll
index c6b8e41f9bdfd..4906e2e15e51c 100644
--- a/llvm/test/CodeGen/AArch64/fp-maximumnum-minimumnum.ll
+++ b/llvm/test/CodeGen/AArch64/fp-maximumnum-minimumnum.ll
@@ -1431,6 +1431,7 @@ define <9 x half> @max_v9f16(<9 x half> %a, <9 x half> %b) {
; FULLFP16-NEXT: add x9, sp, #16
; FULLFP16-NEXT: // kill: def $h3 killed $h3 def $q3
; FULLFP16-NEXT: // kill: def $h4 killed $h4 def $q4
+; FULLFP16-NEXT: add x10, sp, #40
; FULLFP16-NEXT: // kill: def $h5 killed $h5 def $q5
; FULLFP16-NEXT: // kill: def $h6 killed $h6 def $q6
; FULLFP16-NEXT: // kill: def $h7 killed $h7 def $q7
@@ -1439,30 +1440,30 @@ define <9 x half> @max_v9f16(<9 x half> %a, <9 x half> %b) {
; FULLFP16-NEXT: ld1 { v1.h }[1], [x9]
; FULLFP16-NEXT: add x9, sp, #24
; FULLFP16-NEXT: mov v0.h[2], v2.h[0]
-; FULLFP16-NEXT: ldr h2, [sp]
; FULLFP16-NEXT: ld1 { v1.h }[2], [x9]
; FULLFP16-NEXT: add x9, sp, #32
-; FULLFP16-NEXT: fminnm v2.8h, v2.8h, v2.8h
; FULLFP16-NEXT: mov v0.h[3], v3.h[0]
; FULLFP16-NEXT: ld1 { v1.h }[3], [x9]
-; FULLFP16-NEXT: add x9, sp, #40
-; FULLFP16-NEXT: ldr h3, [sp, #72]
-; FULLFP16-NEXT: ld1 { v1.h }[4], [x9]
+; FULLFP16-NEXT: ldr h2, [x10]
; FULLFP16-NEXT: add x9, sp, #48
+; FULLFP16-NEXT: ldr h3, [sp, #72]
+; FULLFP16-NEXT: ld1 { v2.h }[1], [x9]
+; FULLFP16-NEXT: add x9, sp, #56
; FULLFP16-NEXT: fminnm v3.8h, v3.8h, v3.8h
; FULLFP16-NEXT: mov v0.h[4], v4.h[0]
-; FULLFP16-NEXT: ld1 { v1.h }[5], [x9]
-; FULLFP16-NEXT: add x9, sp, #56
-; FULLFP16-NEXT: fmaxnm v2.8h, v2.8h, v3.8h
-; FULLFP16-NEXT: mov v0.h[5], v5.h[0]
-; FULLFP16-NEXT: ld1 { v1.h }[6], [x9]
+; FULLFP16-NEXT: ld1 { v2.h }[2], [x9]
; FULLFP16-NEXT: add x9, sp, #64
-; FULLFP16-NEXT: str h2, [x8, #16]
+; FULLFP16-NEXT: mov v0.h[5], v5.h[0]
+; FULLFP16-NEXT: ld1 { v2.h }[3], [x9]
+; FULLFP16-NEXT: zip1 v1.2d, v1.2d, v2.2d
+; FULLFP16-NEXT: ldr h2, [sp]
; FULLFP16-NEXT: mov v0.h[6], v6.h[0]
-; FULLFP16-NEXT: ld1 { v1.h }[7], [x9]
+; FULLFP16-NEXT: fminnm v2.8h, v2.8h, v2.8h
; FULLFP16-NEXT: fminnm v1.8h, v1.8h, v1.8h
; FULLFP16-NEXT: mov v0.h[7], v7.h[0]
+; FULLFP16-NEXT: fmaxnm v2.8h, v2.8h, v3.8h
; FULLFP16-NEXT: fminnm v0.8h, v0.8h, v0.8h
+; FULLFP16-NEXT: str h2, [x8, #16]
; FULLFP16-NEXT: fmaxnm v0.8h, v0.8h, v1.8h
; FULLFP16-NEXT: str q0, [x8]
; FULLFP16-NEXT: ret
@@ -2012,6 +2013,7 @@ define <9 x half> @min_v9f16(<9 x half> %a, <9 x half> %b) {
; FULLFP16-NEXT: add x9, sp, #16
; FULLFP16-NEXT: // kill: def $h3 killed $h3 def $q3
; FULLFP16-NEXT: // kill: def $h4 killed $h4 def $q4
+; FULLFP16-NEXT: add x10, sp, #40
; FULLFP16-NEXT: // kill: def $h5 killed $h5 def $q5
; FULLFP16-NEXT: // kill: def $h6 killed $h6 def $q6
; FULLFP16-NEXT: // kill: def $h7 killed $h7 def $q7
@@ -2020,30 +2022,30 @@ define <9 x half> @min_v9f16(<9 x half> %a, <9 x half> %b) {
; FULLFP16-NEXT: ld1 { v1.h }[1], [x9]
; FULLFP16-NEXT: add x9, sp, #24
; FULLFP16-NEXT: mov v0.h[2], v2.h[0]
-; FULLFP16-NEXT: ldr h2, [sp]
; FULLFP16-NEXT: ld1 { v1.h }[2], [x9]
; FULLFP16-NEXT: add x9, sp, #32
-; FULLFP16-NEXT: fminnm v2.8h, v2.8h, v2.8h
; FULLFP16-NEXT: mov v0.h[3], v3.h[0]
; FULLFP16-NEXT: ld1 { v1.h }[3], [x9]
-; FULLFP16-NEXT: add x9, sp, #40
-; FULLFP16-NEXT: ldr h3, [sp, #72]
-; FULLFP16-NEXT: ld1 { v1.h }[4], [x9]
+; FULLFP16-NEXT: ldr h2, [x10]
; FULLFP16-NEXT: add x9, sp, #48
+; FULLFP16-NEXT: ldr h3, [sp, #72]
+; FULLFP16-NEXT: ld1 { v2.h }[1], [x9]
+; FULLFP16-NEXT: add x9, sp, #56
; FULLFP16-NEXT: fminnm v3.8h, v3.8h, v3.8h
; FULLFP16-NEXT: mov v0.h[4], v4.h[0]
-; FULLFP16-NEXT: ld1 { v1.h }[5], [x9]
-; FULLFP16-NEXT: add x9, sp, #56
-; FULLFP16-NEXT: fminnm v2.8h, v2.8h, v3.8h
-; FULLFP16-NEXT: mov v0.h[5], v5.h[0]
-; FULLFP16-NEXT: ld1 { v1.h }[6], [x9]
+; FULLFP16-NEXT: ld1 { v2.h }[2], [x9]
; FULLFP16-NEXT: add x9, sp, #64
-; FULLFP16-NEXT: str h2, [x8, #16]
+; FULLFP16-NEXT: mov v0.h[5], v5.h[0]
+; FULLFP16-NEXT: ld1 { v2.h }[3], [x9]
+; FULLFP16-NEXT: zip1 v1.2d, v1.2d, v2.2d
+; FULLFP16-NEXT: ldr h2, [sp]
; FULLFP16-NEXT: mov v0.h[6], v6.h[0]
-; FULLFP16-NEXT: ld1 { v1.h }[7], [x9]
+; FULLFP16-NEXT: fminnm v2.8h, v2.8h, v2.8h
; FULLFP16-NEXT: fminnm v1.8h, v1.8h, v1.8h
; FULLFP16-NEXT: mov v0.h[7], v7.h[0]
+; FULLFP16-NEXT: fminnm v2.8h, v2.8h, v3.8h
; FULLFP16-NEXT: fminnm v0.8h, v0.8h, v0.8h
+; FULLFP16-NEXT: str h2, [x8, #16]
; FULLFP16-NEXT: fminnm v0.8h, v0.8h, v1.8h
; FULLFP16-NEXT: str q0, [x8]
; FULLFP16-NEXT: ret
diff --git a/llvm/test/CodeGen/AArch64/fsh.ll b/llvm/test/CodeGen/AArch64/fsh.ll
index 4c28c90824028..ae2ef2649102e 100644
--- a/llvm/test/CodeGen/AArch64/fsh.ll
+++ b/llvm/test/CodeGen/AArch64/fsh.ll
@@ -2509,87 +2509,88 @@ define <7 x i32> @fshl_v7i32(<7 x i32> %a, <7 x i32> %b, <7 x i32> %c) {
;
; CHECK-GI-LABEL: fshl_v7i32:
; CHECK-GI: // %bb.0: // %entry
-; CHECK-GI-NEXT: ldr s3, [sp, #48]
-; CHECK-GI-NEXT: ldr s20, [sp, #56]
-; CHECK-GI-NEXT: add x9, sp, #56
+; CHECK-GI-NEXT: ldr s17, [sp, #48]
+; CHECK-GI-NEXT: add x8, sp, #56
+; CHECK-GI-NEXT: add x9, sp, #64
; CHECK-GI-NEXT: ldr s4, [sp, #48]
-; CHECK-GI-NEXT: ldr s7, [sp, #80]
-; CHECK-GI-NEXT: mov w12, #-1 // =0xffffffff
-; CHECK-GI-NEXT: ldr s21, [sp, #88]
-; CHECK-GI-NEXT: mov v3.s[1], v20.s[0]
-; CHECK-GI-NEXT: fmov s20, w12
-; CHECK-GI-NEXT: ld1 { v4.s }[1], [x9]
-; CHECK-GI-NEXT: ldr s17, [sp]
-; CHECK-GI-NEXT: add x13, sp, #64
-; CHECK-GI-NEXT: mov v7.s[1], v21.s[0]
+; CHECK-GI-NEXT: ldr s21, [sp, #56]
+; CHECK-GI-NEXT: mov w10, #-1 // =0xffffffff
+; CHECK-GI-NEXT: ld1 { v17.s }[1], [x8]
+; CHECK-GI-NEXT: ldr s20, [x9]
+; CHECK-GI-NEXT: add x8, sp, #72
+; CHECK-GI-NEXT: mov v4.s[1], v21.s[0]
; CHECK-GI-NEXT: fmov s21, w7
+; CHECK-GI-NEXT: ldr s6, [sp]
+; CHECK-GI-NEXT: ld1 { v20.s }[1], [x8]
; CHECK-GI-NEXT: ldr s19, [sp, #64]
-; CHECK-GI-NEXT: mov w11, #31 // =0x1f
-; CHECK-GI-NEXT: mov v20.s[1], w12
+; CHECK-GI-NEXT: ldr s7, [sp, #80]
+; CHECK-GI-NEXT: ldr s22, [sp, #88]
+; CHECK-GI-NEXT: mov w9, #31 // =0x1f
+; CHECK-GI-NEXT: mov w11, #1 // =0x1
+; CHECK-GI-NEXT: mov v21.s[1], v6.s[0]
+; CHECK-GI-NEXT: fmov s6, w9
; CHECK-GI-NEXT: ldr s18, [sp, #96]
-; CHECK-GI-NEXT: ld1 { v4.s }[2], [x13]
-; CHECK-GI-NEXT: mov w13, #1 // =0x1
-; CHECK-GI-NEXT: mov v3.s[2], v19.s[0]
-; CHECK-GI-NEXT: mov v21.s[1], v17.s[0]
-; CHECK-GI-NEXT: fmov s17, w11
-; CHECK-GI-NEXT: fmov s19, w13
+; CHECK-GI-NEXT: zip1 v17.2d, v17.2d, v20.2d
+; CHECK-GI-NEXT: fmov s20, w10
+; CHECK-GI-NEXT: mov v7.s[1], v22.s[0]
+; CHECK-GI-NEXT: mov v4.s[2], v19.s[0]
+; CHECK-GI-NEXT: fmov s19, w11
; CHECK-GI-NEXT: fmov s23, w0
-; CHECK-GI-NEXT: fmov s24, w11
-; CHECK-GI-NEXT: ldr s6, [sp, #8]
+; CHECK-GI-NEXT: mov v6.s[1], w9
+; CHECK-GI-NEXT: fmov s24, w9
+; CHECK-GI-NEXT: ldr s2, [sp, #8]
+; CHECK-GI-NEXT: mov v20.s[1], w10
; CHECK-GI-NEXT: ldr s0, [sp, #24]
; CHECK-GI-NEXT: ldr s5, [sp, #32]
+; CHECK-GI-NEXT: mov v19.s[1], w11
; CHECK-GI-NEXT: mov v7.s[2], v18.s[0]
-; CHECK-GI-NEXT: mov v17.s[1], w11
-; CHECK-GI-NEXT: mov v19.s[1], w13
-; CHECK-GI-NEXT: mov v20.s[2], w12
; CHECK-GI-NEXT: ldr s16, [sp, #72]
; CHECK-GI-NEXT: mov v23.s[1], w1
; CHECK-GI-NEXT: ldr s18, [sp, #80]
-; CHECK-GI-NEXT: mov v21.s[2], v6.s[0]
-; CHECK-GI-NEXT: mov v24.s[1], w11
+; CHECK-GI-NEXT: mov v21.s[2], v2.s[0]
+; CHECK-GI-NEXT: mov v24.s[1], w9
; CHECK-GI-NEXT: mov v0.s[1], v5.s[0]
-; CHECK-GI-NEXT: fmov s6, w4
-; CHECK-GI-NEXT: add x10, sp, #88
+; CHECK-GI-NEXT: fmov s5, w4
+; CHECK-GI-NEXT: mov v20.s[2], w10
+; CHECK-GI-NEXT: add x8, sp, #88
; CHECK-GI-NEXT: movi v22.4s, #31
-; CHECK-GI-NEXT: mov v3.s[3], v16.s[0]
-; CHECK-GI-NEXT: mov v17.s[2], w11
-; CHECK-GI-NEXT: mov v19.s[2], w13
-; CHECK-GI-NEXT: ldr s2, [sp, #16]
-; CHECK-GI-NEXT: ldr s1, [sp, #40]
-; CHECK-GI-NEXT: ld1 { v18.s }[1], [x10]
-; CHECK-GI-NEXT: eor v5.16b, v7.16b, v20.16b
+; CHECK-GI-NEXT: mov v4.s[3], v16.s[0]
+; CHECK-GI-NEXT: mov v6.s[2], w9
+; CHECK-GI-NEXT: mov v19.s[2], w11
+; CHECK-GI-NEXT: ldr s1, [sp, #16]
+; CHECK-GI-NEXT: ldr s3, [sp, #40]
+; CHECK-GI-NEXT: ld1 { v18.s }[1], [x8]
; CHECK-GI-NEXT: mov v23.s[2], w2
-; CHECK-GI-NEXT: mov v6.s[1], w5
-; CHECK-GI-NEXT: add x8, sp, #72
-; CHECK-GI-NEXT: add x9, sp, #96
-; CHECK-GI-NEXT: mov v21.s[3], v2.s[0]
-; CHECK-GI-NEXT: mov v24.s[2], w11
-; CHECK-GI-NEXT: mov v0.s[2], v1.s[0]
-; CHECK-GI-NEXT: ld1 { v4.s }[3], [x8]
-; CHECK-GI-NEXT: bic v2.16b, v22.16b, v3.16b
-; CHECK-GI-NEXT: ld1 { v18.s }[2], [x9]
-; CHECK-GI-NEXT: and v1.16b, v5.16b, v17.16b
+; CHECK-GI-NEXT: mov v5.s[1], w5
+; CHECK-GI-NEXT: add x8, sp, #96
+; CHECK-GI-NEXT: eor v2.16b, v7.16b, v20.16b
+; CHECK-GI-NEXT: mov v21.s[3], v1.s[0]
+; CHECK-GI-NEXT: mov v24.s[2], w9
+; CHECK-GI-NEXT: mov v0.s[2], v3.s[0]
+; CHECK-GI-NEXT: bic v1.16b, v22.16b, v4.16b
+; CHECK-GI-NEXT: ld1 { v18.s }[2], [x8]
; CHECK-GI-NEXT: neg v3.4s, v19.4s
+; CHECK-GI-NEXT: and v4.16b, v17.16b, v22.16b
+; CHECK-GI-NEXT: and v2.16b, v2.16b, v6.16b
; CHECK-GI-NEXT: mov v23.s[3], w3
-; CHECK-GI-NEXT: mov v6.s[2], w6
-; CHECK-GI-NEXT: and v4.16b, v4.16b, v22.16b
-; CHECK-GI-NEXT: ushr v5.4s, v21.4s, #1
-; CHECK-GI-NEXT: neg v2.4s, v2.4s
-; CHECK-GI-NEXT: and v7.16b, v18.16b, v24.16b
+; CHECK-GI-NEXT: mov v5.s[2], w6
+; CHECK-GI-NEXT: ushr v6.4s, v21.4s, #1
; CHECK-GI-NEXT: neg v1.4s, v1.4s
+; CHECK-GI-NEXT: and v7.16b, v18.16b, v24.16b
; CHECK-GI-NEXT: ushl v0.4s, v0.4s, v3.4s
+; CHECK-GI-NEXT: neg v2.4s, v2.4s
; CHECK-GI-NEXT: ushl v3.4s, v23.4s, v4.4s
-; CHECK-GI-NEXT: ushl v2.4s, v5.4s, v2.4s
-; CHECK-GI-NEXT: ushl v4.4s, v6.4s, v7.4s
-; CHECK-GI-NEXT: ushl v0.4s, v0.4s, v1.4s
-; CHECK-GI-NEXT: orr v1.16b, v3.16b, v2.16b
+; CHECK-GI-NEXT: ushl v1.4s, v6.4s, v1.4s
+; CHECK-GI-NEXT: ushl v4.4s, v5.4s, v7.4s
+; CHECK-GI-NEXT: ushl v0.4s, v0.4s, v2.4s
+; CHECK-GI-NEXT: orr v1.16b, v3.16b, v1.16b
; CHECK-GI-NEXT: orr v0.16b, v4.16b, v0.16b
; CHECK-GI-NEXT: mov s2, v1.s[1]
; CHECK-GI-NEXT: mov s3, v1.s[2]
; CHECK-GI-NEXT: mov s4, v1.s[3]
+; CHECK-GI-NEXT: fmov w0, s1
; CHECK-GI-NEXT: mov s5, v0.s[1]
; CHECK-GI-NEXT: mov s6, v0.s[2]
-; CHECK-GI-NEXT: fmov w0, s1
; CHECK-GI-NEXT: fmov w4, s0
; CHECK-GI-NEXT: fmov w1, s2
; CHECK-GI-NEXT: fmov w2, s3
diff --git a/llvm/test/CodeGen/AArch64/llvm.frexp.ll b/llvm/test/CodeGen/AArch64/llvm.frexp.ll
index 2213aa1429dbd..4e1876db772ed 100644
--- a/llvm/test/CodeGen/AArch64/llvm.frexp.ll
+++ b/llvm/test/CodeGen/AArch64/llvm.frexp.ll
@@ -700,13 +700,14 @@ define { <4 x float>, <4 x i32> } @test_frexp_v4f32_v4i32(<4 x float> %a) nounwi
; CHECK-NEXT: ldr s1, [sp, #44]
; CHECK-NEXT: ldr q2, [sp] // 16-byte Folded Reload
; CHECK-NEXT: // kill: def $s0 killed $s0 def $q0
-; CHECK-NEXT: ld1 { v1.s }[1], [x19]
; CHECK-NEXT: mov v2.s[3], v0.s[0]
-; CHECK-NEXT: ld1 { v1.s }[2], [x20]
+; CHECK-NEXT: ld1 { v1.s }[1], [x19]
+; CHECK-NEXT: ldr s0, [x20]
+; CHECK-NEXT: ld1 { v0.s }[1], [x21]
; CHECK-NEXT: ldp x20, x19, [sp, #64] // 16-byte Folded Reload
-; CHECK-NEXT: mov v0.16b, v2.16b
-; CHECK-NEXT: ld1 { v1.s }[3], [x21]
; CHECK-NEXT: ldp x30, x21, [sp, #48] // 16-byte Folded Reload
+; CHECK-NEXT: zip1 v1.2d, v1.2d, v0.2d
+; CHECK-NEXT: mov v0.16b, v2.16b
; CHECK-NEXT: add sp, sp, #80
; CHECK-NEXT: ret
;
@@ -872,10 +873,11 @@ define <4 x i32> @test_frexp_v4f32_v4i32_only_use_exp(<4 x float> %a) nounwind {
; CHECK-NEXT: bl frexpf
; CHECK-NEXT: ldr s0, [sp, #28]
; CHECK-NEXT: ld1 { v0.s }[1], [x19]
-; CHECK-NEXT: ld1 { v0.s }[2], [x20]
+; CHECK-NEXT: ldr s1, [x20]
+; CHECK-NEXT: ld1 { v1.s }[1], [x21]
; CHECK-NEXT: ldp x20, x19, [sp, #48] // 16-byte Folded Reload
-; CHECK-NEXT: ld1 { v0.s }[3], [x21]
; CHECK-NEXT: ldp x30, x21, [sp, #32] // 16-byte Folded Reload
+; CHECK-NEXT: zip1 v0.2d, v0.2d, v1.2d
; CHECK-NEXT: add sp, sp, #64
; CHECK-NEXT: ret
;
diff --git a/llvm/test/CodeGen/AArch64/neon-dotreduce.ll b/llvm/test/CodeGen/AArch64/neon-dotreduce.ll
index 048e988b6c497..88b6f6c40baca 100644
--- a/llvm/test/CodeGen/AArch64/neon-dotreduce.ll
+++ b/llvm/test/CodeGen/AArch64/neon-dotreduce.ll
@@ -8062,195 +8062,200 @@ define i32 @test_sdot_v48i8_double_nomla(<48 x i8> %a, <48 x i8> %b, <48 x i8> %
; CHECK-SD-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
; CHECK-SD-NEXT: .cfi_def_cfa_offset 16
; CHECK-SD-NEXT: .cfi_offset w29, -16
-; CHECK-SD-NEXT: ldr b5, [sp, #208]
+; CHECK-SD-NEXT: ldr b0, [sp, #208]
; CHECK-SD-NEXT: add x8, sp, #216
-; CHECK-SD-NEXT: fmov s0, w0
+; CHECK-SD-NEXT: add x9, sp, #272
+; CHECK-SD-NEXT: ldr b2, [sp, #80]
; CHECK-SD-NEXT: ldr b4, [sp, #976]
-; CHECK-SD-NEXT: add x9, sp, #984
-; CHECK-SD-NEXT: add x12, sp, #328
-; CHECK-SD-NEXT: ld1 { v5.b }[1], [x8]
-; CHECK-SD-NEXT: add x8, sp, #224
-; CHECK-SD-NEXT: movi v1.16b, #1
-; CHECK-SD-NEXT: mov v0.b[1], w1
-; CHECK-SD-NEXT: ld1 { v4.b }[1], [x9]
-; CHECK-SD-NEXT: movi v3.2d, #0000000000000000
-; CHECK-SD-NEXT: add x11, sp, #992
; CHECK-SD-NEXT: ldr b6, [sp, #720]
-; CHECK-SD-NEXT: ldr b7, [sp, #80]
-; CHECK-SD-NEXT: ld1 { v5.b }[2], [x8]
+; CHECK-SD-NEXT: ld1 { v0.b }[1], [x8]
+; CHECK-SD-NEXT: add x8, sp, #224
+; CHECK-SD-NEXT: fmov s16, w0
+; CHECK-SD-NEXT: ldr b17, [sp, #848]
+; CHECK-SD-NEXT: add x10, sp, #24
+; CHECK-SD-NEXT: movi v19.2d, #0000000000000000
+; CHECK-SD-NEXT: ld1 { v0.b }[2], [x8]
; CHECK-SD-NEXT: add x8, sp, #232
-; CHECK-SD-NEXT: add x13, sp, #88
-; CHECK-SD-NEXT: ld1 { v4.b }[2], [x11]
-; CHECK-SD-NEXT: ld1 { v7.b }[1], [x13]
-; CHECK-SD-NEXT: add x13, sp, #856
-; CHECK-SD-NEXT: mov v0.b[2], w2
-; CHECK-SD-NEXT: add x14, sp, #1008
-; CHECK-SD-NEXT: add x15, sp, #872
-; CHECK-SD-NEXT: ld1 { v5.b }[3], [x8]
+; CHECK-SD-NEXT: mov v16.b[1], w1
+; CHECK-SD-NEXT: ld1 { v0.b }[3], [x8]
; CHECK-SD-NEXT: add x8, sp, #240
-; CHECK-SD-NEXT: add x16, sp, #888
-; CHECK-SD-NEXT: add x10, sp, #16
-; CHECK-SD-NEXT: add x9, sp, #24
-; CHECK-SD-NEXT: add x11, sp, #40
-; CHECK-SD-NEXT: movi v2.2d, #0000000000000000
-; CHECK-SD-NEXT: ld1 { v5.b }[4], [x8]
+; CHECK-SD-NEXT: mov v16.b[2], w2
+; CHECK-SD-NEXT: ld1 { v0.b }[4], [x8]
; CHECK-SD-NEXT: add x8, sp, #248
-; CHECK-SD-NEXT: mov v0.b[3], w3
-; CHECK-SD-NEXT: ld1 { v5.b }[5], [x8]
+; CHECK-SD-NEXT: mov v16.b[3], w3
+; CHECK-SD-NEXT: ld1 { v0.b }[5], [x8]
; CHECK-SD-NEXT: add x8, sp, #256
-; CHECK-SD-NEXT: mov v0.b[4], w4
-; CHECK-SD-NEXT: ld1 { v5.b }[6], [x8]
+; CHECK-SD-NEXT: ld1 { v0.b }[6], [x8]
; CHECK-SD-NEXT: add x8, sp, #264
-; CHECK-SD-NEXT: mov v0.b[5], w5
-; CHECK-SD-NEXT: ld1 { v5.b }[7], [x8]
-; CHECK-SD-NEXT: add x8, sp, #272
-; CHECK-SD-NEXT: ld1 { v5.b }[8], [x8]
+; CHECK-SD-NEXT: mov v16.b[4], w4
+; CHECK-SD-NEXT: ld1 { v0.b }[7], [x8]
+; CHECK-SD-NEXT: ldr b1, [x9]
; CHECK-SD-NEXT: add x8, sp, #280
-; CHECK-SD-NEXT: mov v0.b[6], w6
-; CHECK-SD-NEXT: ld1 { v5.b }[9], [x8]
+; CHECK-SD-NEXT: add x9, sp, #88
+; CHECK-SD-NEXT: mov v16.b[5], w5
+; CHECK-SD-NEXT: ld1 { v1.b }[1], [x8]
; CHECK-SD-NEXT: add x8, sp, #288
-; CHECK-SD-NEXT: mov v0.b[7], w7
-; CHECK-SD-NEXT: ld1 { v5.b }[10], [x8]
+; CHECK-SD-NEXT: ld1 { v1.b }[2], [x8]
; CHECK-SD-NEXT: add x8, sp, #296
-; CHECK-SD-NEXT: ld1 { v0.b }[8], [x10]
-; CHECK-SD-NEXT: add x10, sp, #128
-; CHECK-SD-NEXT: ld1 { v5.b }[11], [x8]
+; CHECK-SD-NEXT: mov v16.b[6], w6
+; CHECK-SD-NEXT: ld1 { v1.b }[3], [x8]
; CHECK-SD-NEXT: add x8, sp, #304
-; CHECK-SD-NEXT: ld1 { v0.b }[9], [x9]
-; CHECK-SD-NEXT: add x9, sp, #136
-; CHECK-SD-NEXT: ld1 { v5.b }[12], [x8]
+; CHECK-SD-NEXT: mov v16.b[7], w7
+; CHECK-SD-NEXT: ld1 { v1.b }[4], [x8]
; CHECK-SD-NEXT: add x8, sp, #312
-; CHECK-SD-NEXT: ld1 { v5.b }[13], [x8]
+; CHECK-SD-NEXT: ld1 { v1.b }[5], [x8]
; CHECK-SD-NEXT: add x8, sp, #320
-; CHECK-SD-NEXT: ld1 { v5.b }[14], [x8]
-; CHECK-SD-NEXT: add x8, sp, #32
-; CHECK-SD-NEXT: ld1 { v0.b }[10], [x8]
-; CHECK-SD-NEXT: add x8, sp, #144
-; CHECK-SD-NEXT: ld1 { v5.b }[15], [x12]
-; CHECK-SD-NEXT: add x12, sp, #728
-; CHECK-SD-NEXT: ld1 { v6.b }[1], [x12]
-; CHECK-SD-NEXT: add x12, sp, #1000
-; CHECK-SD-NEXT: ld1 { v0.b }[11], [x11]
-; CHECK-SD-NEXT: ld1 { v4.b }[3], [x12]
-; CHECK-SD-NEXT: add x12, sp, #736
-; CHECK-SD-NEXT: add x11, sp, #920
-; CHECK-SD-NEXT: sdot v3.4s, v5.16b, v1.16b
-; CHECK-SD-NEXT: ldr b5, [sp, #848]
-; CHECK-SD-NEXT: ld1 { v6.b }[2], [x12]
-; CHECK-SD-NEXT: add x12, sp, #48
-; CHECK-SD-NEXT: ld1 { v5.b }[1], [x13]
-; CHECK-SD-NEXT: add x13, sp, #744
-; CHECK-SD-NEXT: ld1 { v4.b }[4], [x14]
-; CHECK-SD-NEXT: add x14, sp, #96
-; CHECK-SD-NEXT: ld1 { v0.b }[12], [x12]
-; CHECK-SD-NEXT: ld1 { v6.b }[3], [x13]
-; CHECK-SD-NEXT: add x13, sp, #864
-; CHECK-SD-NEXT: ld1 { v7.b }[2], [x14]
-; CHECK-SD-NEXT: add x14, sp, #1016
-; CHECK-SD-NEXT: ld1 { v5.b }[2], [x13]
-; CHECK-SD-NEXT: add x13, sp, #752
-; CHECK-SD-NEXT: ld1 { v4.b }[5], [x14]
-; CHECK-SD-NEXT: add x14, sp, #104
-; CHECK-SD-NEXT: ld1 { v6.b }[4], [x13]
-; CHECK-SD-NEXT: add x13, sp, #1024
-; CHECK-SD-NEXT: ld1 { v7.b }[3], [x14]
-; CHECK-SD-NEXT: ld1 { v5.b }[3], [x15]
-; CHECK-SD-NEXT: add x15, sp, #760
-; CHECK-SD-NEXT: add x14, sp, #112
-; CHECK-SD-NEXT: ld1 { v4.b }[6], [x13]
-; CHECK-SD-NEXT: add x13, sp, #880
-; CHECK-SD-NEXT: ld1 { v6.b }[5], [x15]
-; CHECK-SD-NEXT: add x15, sp, #1032
-; CHECK-SD-NEXT: ld1 { v7.b }[4], [x14]
-; CHECK-SD-NEXT: ld1 { v5.b }[4], [x13]
-; CHECK-SD-NEXT: add x14, sp, #768
-; CHECK-SD-NEXT: add x13, sp, #120
-; CHECK-SD-NEXT: ld1 { v4.b }[7], [x15]
-; CHECK-SD-NEXT: add x15, sp, #1040
-; CHECK-SD-NEXT: ld1 { v6.b }[6], [x14]
-; CHECK-SD-NEXT: ld1 { v7.b }[5], [x13]
-; CHECK-SD-NEXT: add x13, sp, #776
-; CHECK-SD-NEXT: ld1 { v5.b }[5], [x16]
-; CHECK-SD-NEXT: add x14, sp, #1048
-; CHECK-SD-NEXT: ld1 { v4.b }[8], [x15]
-; CHECK-SD-NEXT: add x15, sp, #896
-; CHECK-SD-NEXT: ld1 { v6.b }[7], [x13]
-; CHECK-SD-NEXT: ld1 { v7.b }[6], [x10]
-; CHECK-SD-NEXT: add x10, sp, #784
-; CHECK-SD-NEXT: ld1 { v5.b }[6], [x15]
-; CHECK-SD-NEXT: add x13, sp, #1056
-; CHECK-SD-NEXT: ld1 { v4.b }[9], [x14]
-; CHECK-SD-NEXT: add x14, sp, #904
-; CHECK-SD-NEXT: ld1 { v6.b }[8], [x10]
-; CHECK-SD-NEXT: ld1 { v7.b }[7], [x9]
-; CHECK-SD-NEXT: add x9, sp, #792
-; CHECK-SD-NEXT: ld1 { v5.b }[7], [x14]
-; CHECK-SD-NEXT: add x10, sp, #1064
-; CHECK-SD-NEXT: ld1 { v4.b }[10], [x13]
-; CHECK-SD-NEXT: add x13, sp, #912
-; CHECK-SD-NEXT: ld1 { v6.b }[9], [x9]
-; CHECK-SD-NEXT: ld1 { v7.b }[8], [x8]
-; CHECK-SD-NEXT: add x9, sp, #800
-; CHECK-SD-NEXT: ld1 { v5.b }[8], [x13]
+; CHECK-SD-NEXT: ld1 { v1.b }[6], [x8]
+; CHECK-SD-NEXT: add x8, sp, #328
+; CHECK-SD-NEXT: ld1 { v1.b }[7], [x8]
+; CHECK-SD-NEXT: ld1 { v2.b }[1], [x9]
+; CHECK-SD-NEXT: add x8, sp, #96
+; CHECK-SD-NEXT: add x9, sp, #144
+; CHECK-SD-NEXT: ld1 { v2.b }[2], [x8]
+; CHECK-SD-NEXT: add x8, sp, #104
+; CHECK-SD-NEXT: zip1 v0.2d, v0.2d, v1.2d
+; CHECK-SD-NEXT: movi v1.16b, #1
+; CHECK-SD-NEXT: ld1 { v2.b }[3], [x8]
+; CHECK-SD-NEXT: add x8, sp, #112
+; CHECK-SD-NEXT: ld1 { v2.b }[4], [x8]
+; CHECK-SD-NEXT: add x8, sp, #120
+; CHECK-SD-NEXT: ld1 { v2.b }[5], [x8]
+; CHECK-SD-NEXT: add x8, sp, #128
+; CHECK-SD-NEXT: ld1 { v2.b }[6], [x8]
+; CHECK-SD-NEXT: add x8, sp, #136
+; CHECK-SD-NEXT: ld1 { v2.b }[7], [x8]
+; CHECK-SD-NEXT: ldr b3, [x9]
; CHECK-SD-NEXT: add x8, sp, #152
-; CHECK-SD-NEXT: ld1 { v4.b }[11], [x10]
-; CHECK-SD-NEXT: add x10, sp, #1072
-; CHECK-SD-NEXT: ld1 { v6.b }[10], [x9]
-; CHECK-SD-NEXT: ld1 { v7.b }[9], [x8]
-; CHECK-SD-NEXT: add x9, sp, #808
-; CHECK-SD-NEXT: ld1 { v5.b }[9], [x11]
-; CHECK-SD-NEXT: add x8, sp, #56
-; CHECK-SD-NEXT: ld1 { v4.b }[12], [x10]
-; CHECK-SD-NEXT: add x10, sp, #160
-; CHECK-SD-NEXT: ld1 { v0.b }[13], [x8]
-; CHECK-SD-NEXT: ld1 { v6.b }[11], [x9]
-; CHECK-SD-NEXT: add x9, sp, #928
-; CHECK-SD-NEXT: ld1 { v7.b }[10], [x10]
-; CHECK-SD-NEXT: add x10, sp, #1080
-; CHECK-SD-NEXT: ld1 { v5.b }[10], [x9]
+; CHECK-SD-NEXT: add x9, sp, #984
+; CHECK-SD-NEXT: ld1 { v3.b }[1], [x8]
+; CHECK-SD-NEXT: add x8, sp, #160
+; CHECK-SD-NEXT: ld1 { v3.b }[2], [x8]
+; CHECK-SD-NEXT: add x8, sp, #168
+; CHECK-SD-NEXT: ld1 { v3.b }[3], [x8]
+; CHECK-SD-NEXT: add x8, sp, #176
+; CHECK-SD-NEXT: ld1 { v3.b }[4], [x8]
+; CHECK-SD-NEXT: add x8, sp, #184
+; CHECK-SD-NEXT: ld1 { v3.b }[5], [x8]
+; CHECK-SD-NEXT: add x8, sp, #192
+; CHECK-SD-NEXT: ld1 { v3.b }[6], [x8]
+; CHECK-SD-NEXT: add x8, sp, #200
+; CHECK-SD-NEXT: ld1 { v3.b }[7], [x8]
+; CHECK-SD-NEXT: ld1 { v4.b }[1], [x9]
+; CHECK-SD-NEXT: add x8, sp, #992
+; CHECK-SD-NEXT: add x9, sp, #1040
+; CHECK-SD-NEXT: ld1 { v4.b }[2], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1000
+; CHECK-SD-NEXT: zip1 v2.2d, v2.2d, v3.2d
+; CHECK-SD-NEXT: ld1 { v4.b }[3], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1008
+; CHECK-SD-NEXT: ld1 { v4.b }[4], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1016
+; CHECK-SD-NEXT: ld1 { v4.b }[5], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1024
+; CHECK-SD-NEXT: ld1 { v4.b }[6], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1032
+; CHECK-SD-NEXT: ld1 { v4.b }[7], [x8]
+; CHECK-SD-NEXT: ldr b5, [x9]
+; CHECK-SD-NEXT: add x8, sp, #1048
+; CHECK-SD-NEXT: add x9, sp, #728
+; CHECK-SD-NEXT: ld1 { v5.b }[1], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1056
+; CHECK-SD-NEXT: ld1 { v5.b }[2], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1064
+; CHECK-SD-NEXT: ld1 { v5.b }[3], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1072
+; CHECK-SD-NEXT: ld1 { v5.b }[4], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1080
+; CHECK-SD-NEXT: ld1 { v5.b }[5], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1088
+; CHECK-SD-NEXT: ld1 { v5.b }[6], [x8]
+; CHECK-SD-NEXT: add x8, sp, #1096
+; CHECK-SD-NEXT: ld1 { v5.b }[7], [x8]
+; CHECK-SD-NEXT: ld1 { v6.b }[1], [x9]
+; CHECK-SD-NEXT: add x8, sp, #736
+; CHECK-SD-NEXT: add x9, sp, #784
+; CHECK-SD-NEXT: ld1 { v6.b }[2], [x8]
+; CHECK-SD-NEXT: add x8, sp, #744
+; CHECK-SD-NEXT: zip1 v4.2d, v4.2d, v5.2d
+; CHECK-SD-NEXT: movi v5.2d, #0000000000000000
+; CHECK-SD-NEXT: ld1 { v6.b }[3], [x8]
+; CHECK-SD-NEXT: add x8, sp, #752
+; CHECK-SD-NEXT: sdot v19.4s, v4.16b, v1.16b
+; CHECK-SD-NEXT: sdot v5.4s, v0.16b, v1.16b
+; CHECK-SD-NEXT: ld1 { v6.b }[4], [x8]
+; CHECK-SD-NEXT: add x8, sp, #760
+; CHECK-SD-NEXT: ld1 { v6.b }[5], [x8]
+; CHECK-SD-NEXT: add x8, sp, #768
+; CHECK-SD-NEXT: ld1 { v6.b }[6], [x8]
+; CHECK-SD-NEXT: add x8, sp, #776
+; CHECK-SD-NEXT: ld1 { v6.b }[7], [x8]
+; CHECK-SD-NEXT: ldr b7, [x9]
+; CHECK-SD-NEXT: add x8, sp, #792
+; CHECK-SD-NEXT: add x9, sp, #856
+; CHECK-SD-NEXT: ld1 { v7.b }[1], [x8]
+; CHECK-SD-NEXT: add x8, sp, #800
+; CHECK-SD-NEXT: ld1 { v7.b }[2], [x8]
+; CHECK-SD-NEXT: add x8, sp, #808
+; CHECK-SD-NEXT: ld1 { v7.b }[3], [x8]
; CHECK-SD-NEXT: add x8, sp, #816
-; CHECK-SD-NEXT: ld1 { v4.b }[13], [x10]
-; CHECK-SD-NEXT: add x9, sp, #168
-; CHECK-SD-NEXT: add x10, sp, #176
-; CHECK-SD-NEXT: ld1 { v6.b }[12], [x8]
-; CHECK-SD-NEXT: add x8, sp, #936
-; CHECK-SD-NEXT: ld1 { v7.b }[11], [x9]
-; CHECK-SD-NEXT: add x9, sp, #1088
-; CHECK-SD-NEXT: ld1 { v5.b }[11], [x8]
-; CHECK-SD-NEXT: add x8, sp, #64
-; CHECK-SD-NEXT: ld1 { v4.b }[14], [x9]
-; CHECK-SD-NEXT: add x9, sp, #824
-; CHECK-SD-NEXT: ld1 { v0.b }[14], [x8]
-; CHECK-SD-NEXT: ld1 { v6.b }[13], [x9]
-; CHECK-SD-NEXT: add x9, sp, #944
-; CHECK-SD-NEXT: ld1 { v7.b }[12], [x10]
-; CHECK-SD-NEXT: add x10, sp, #1096
-; CHECK-SD-NEXT: ld1 { v5.b }[12], [x9]
+; CHECK-SD-NEXT: ld1 { v7.b }[4], [x8]
+; CHECK-SD-NEXT: add x8, sp, #824
+; CHECK-SD-NEXT: ld1 { v7.b }[5], [x8]
; CHECK-SD-NEXT: add x8, sp, #832
-; CHECK-SD-NEXT: ld1 { v4.b }[15], [x10]
-; CHECK-SD-NEXT: add x9, sp, #184
-; CHECK-SD-NEXT: add x10, sp, #72
-; CHECK-SD-NEXT: ld1 { v6.b }[14], [x8]
-; CHECK-SD-NEXT: add x8, sp, #952
-; CHECK-SD-NEXT: ld1 { v7.b }[13], [x9]
-; CHECK-SD-NEXT: ld1 { v5.b }[13], [x8]
+; CHECK-SD-NEXT: ld1 { v7.b }[6], [x8]
; CHECK-SD-NEXT: add x8, sp, #840
-; CHECK-SD-NEXT: ld1 { v0.b }[15], [x10]
-; CHECK-SD-NEXT: sdot v2.4s, v4.16b, v1.16b
-; CHECK-SD-NEXT: add x9, sp, #192
-; CHECK-SD-NEXT: ld1 { v6.b }[15], [x8]
+; CHECK-SD-NEXT: ld1 { v7.b }[7], [x8]
+; CHECK-SD-NEXT: ld1 { v17.b }[1], [x9]
+; CHECK-SD-NEXT: add x8, sp, #864
+; CHECK-SD-NEXT: add x9, sp, #16
+; CHECK-SD-NEXT: ld1 { v16.b }[8], [x9]
+; CHECK-SD-NEXT: add x9, sp, #912
+; CHECK-SD-NEXT: ld1 { v17.b }[2], [x8]
+; CHECK-SD-NEXT: add x8, sp, #872
+; CHECK-SD-NEXT: zip1 v0.2d, v6.2d, v7.2d
+; CHECK-SD-NEXT: ld1 { v16.b }[9], [x10]
+; CHECK-SD-NEXT: ld1 { v17.b }[3], [x8]
+; CHECK-SD-NEXT: add x8, sp, #880
+; CHECK-SD-NEXT: sdot v19.4s, v0.16b, v1.16b
+; CHECK-SD-NEXT: ld1 { v17.b }[4], [x8]
+; CHECK-SD-NEXT: add x8, sp, #888
+; CHECK-SD-NEXT: ld1 { v17.b }[5], [x8]
+; CHECK-SD-NEXT: add x8, sp, #896
+; CHECK-SD-NEXT: ld1 { v17.b }[6], [x8]
+; CHECK-SD-NEXT: add x8, sp, #904
+; CHECK-SD-NEXT: ld1 { v17.b }[7], [x8]
+; CHECK-SD-NEXT: ldr b18, [x9]
+; CHECK-SD-NEXT: add x8, sp, #920
+; CHECK-SD-NEXT: ld1 { v18.b }[1], [x8]
+; CHECK-SD-NEXT: add x8, sp, #32
+; CHECK-SD-NEXT: ld1 { v16.b }[10], [x8]
+; CHECK-SD-NEXT: add x8, sp, #928
+; CHECK-SD-NEXT: ld1 { v18.b }[2], [x8]
+; CHECK-SD-NEXT: add x8, sp, #40
+; CHECK-SD-NEXT: ld1 { v16.b }[11], [x8]
+; CHECK-SD-NEXT: add x8, sp, #936
+; CHECK-SD-NEXT: ld1 { v18.b }[3], [x8]
+; CHECK-SD-NEXT: add x8, sp, #48
+; CHECK-SD-NEXT: ld1 { v16.b }[12], [x8]
+; CHECK-SD-NEXT: add x8, sp, #944
+; CHECK-SD-NEXT: ld1 { v18.b }[4], [x8]
+; CHECK-SD-NEXT: add x8, sp, #56
+; CHECK-SD-NEXT: ld1 { v16.b }[13], [x8]
+; CHECK-SD-NEXT: add x8, sp, #952
+; CHECK-SD-NEXT: ld1 { v18.b }[5], [x8]
+; CHECK-SD-NEXT: add x8, sp, #64
+; CHECK-SD-NEXT: ld1 { v16.b }[14], [x8]
; CHECK-SD-NEXT: add x8, sp, #960
-; CHECK-SD-NEXT: ld1 { v7.b }[14], [x9]
-; CHECK-SD-NEXT: ld1 { v5.b }[14], [x8]
-; CHECK-SD-NEXT: sdot v3.4s, v0.16b, v1.16b
-; CHECK-SD-NEXT: add x8, sp, #200
-; CHECK-SD-NEXT: add x9, sp, #968
-; CHECK-SD-NEXT: sdot v2.4s, v6.16b, v1.16b
-; CHECK-SD-NEXT: ld1 { v7.b }[15], [x8]
-; CHECK-SD-NEXT: ld1 { v5.b }[15], [x9]
-; CHECK-SD-NEXT: sdot v3.4s, v7.16b, v1.16b
-; CHECK-SD-NEXT: sdot v2.4s, v5.16b, v1.16b
-; CHECK-SD-NEXT: add v0.4s, v3.4s, v2.4s
+; CHECK-SD-NEXT: ld1 { v18.b }[6], [x8]
+; CHECK-SD-NEXT: add x8, sp, #72
+; CHECK-SD-NEXT: ld1 { v16.b }[15], [x8]
+; CHECK-SD-NEXT: add x8, sp, #968
+; CHECK-SD-NEXT: ld1 { v18.b }[7], [x8]
+; CHECK-SD-NEXT: sdot v5.4s, v16.16b, v1.16b
+; CHECK-SD-NEXT: zip1 v0.2d, v17.2d, v18.2d
+; CHECK-SD-NEXT: sdot v5.4s, v2.16b, v1.16b
+; CHECK-SD-NEXT: sdot v19.4s, v0.16b, v1.16b
+; CHECK-SD-NEXT: add v0.4s, v5.4s, v19.4s
; CHECK-SD-NEXT: addv s0, v0.4s
; CHECK-SD-NEXT: fmov w0, s0
; CHECK-SD-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
diff --git a/llvm/test/CodeGen/AArch64/nontemporal.ll b/llvm/test/CodeGen/AArch64/nontemporal.ll
index f8ba150a0405f..f7a87ae340a73 100644
--- a/llvm/test/CodeGen/AArch64/nontemporal.ll
+++ b/llvm/test/CodeGen/AArch64/nontemporal.ll
@@ -683,41 +683,43 @@ define void @test_stnp_v17f32(<17 x float> %v, ptr %ptr) {
;
; CHECK-BE-LABEL: test_stnp_v17f32:
; CHECK-BE: // %bb.0: // %entry
-; CHECK-BE-NEXT: // kill: def $s4 killed $s4 def $q4
+; CHECK-BE-NEXT: // kill: def $s1 killed $s1 def $q1
; CHECK-BE-NEXT: // kill: def $s0 killed $s0 def $q0
-; CHECK-BE-NEXT: ldr s16, [sp, #36]
+; CHECK-BE-NEXT: // kill: def $s4 killed $s4 def $q4
; CHECK-BE-NEXT: // kill: def $s5 killed $s5 def $q5
-; CHECK-BE-NEXT: // kill: def $s1 killed $s1 def $q1
-; CHECK-BE-NEXT: ldr s17, [sp, #4]
-; CHECK-BE-NEXT: add x8, sp, #44
-; CHECK-BE-NEXT: mov v4.s[1], v5.s[0]
+; CHECK-BE-NEXT: add x8, sp, #12
+; CHECK-BE-NEXT: add x9, sp, #20
+; CHECK-BE-NEXT: ldr s16, [sp, #36]
; CHECK-BE-NEXT: mov v0.s[1], v1.s[0]
+; CHECK-BE-NEXT: ldr s1, [sp, #4]
+; CHECK-BE-NEXT: mov v4.s[1], v5.s[0]
+; CHECK-BE-NEXT: add x10, sp, #52
; CHECK-BE-NEXT: // kill: def $s6 killed $s6 def $q6
; CHECK-BE-NEXT: // kill: def $s2 killed $s2 def $q2
; CHECK-BE-NEXT: // kill: def $s7 killed $s7 def $q7
; CHECK-BE-NEXT: // kill: def $s3 killed $s3 def $q3
-; CHECK-BE-NEXT: ldr s1, [sp, #68]
-; CHECK-BE-NEXT: ld1 { v16.s }[1], [x8]
-; CHECK-BE-NEXT: add x8, sp, #12
-; CHECK-BE-NEXT: ld1 { v17.s }[1], [x8]
-; CHECK-BE-NEXT: add x8, sp, #52
-; CHECK-BE-NEXT: str s1, [x0, #64]
-; CHECK-BE-NEXT: ld1 { v16.s }[2], [x8]
-; CHECK-BE-NEXT: add x8, sp, #20
+; CHECK-BE-NEXT: ld1 { v1.s }[1], [x8]
+; CHECK-BE-NEXT: ldr s5, [x9]
+; CHECK-BE-NEXT: add x8, sp, #28
+; CHECK-BE-NEXT: add x9, sp, #44
+; CHECK-BE-NEXT: ld1 { v5.s }[1], [x8]
+; CHECK-BE-NEXT: ld1 { v16.s }[1], [x9]
+; CHECK-BE-NEXT: ldr s17, [x10]
+; CHECK-BE-NEXT: add x8, sp, #60
; CHECK-BE-NEXT: mov v4.s[2], v6.s[0]
; CHECK-BE-NEXT: mov v0.s[2], v2.s[0]
-; CHECK-BE-NEXT: ld1 { v17.s }[2], [x8]
-; CHECK-BE-NEXT: add x8, sp, #60
-; CHECK-BE-NEXT: ld1 { v16.s }[3], [x8]
-; CHECK-BE-NEXT: add x8, sp, #28
-; CHECK-BE-NEXT: ld1 { v17.s }[3], [x8]
+; CHECK-BE-NEXT: ld1 { v17.s }[1], [x8]
+; CHECK-BE-NEXT: ldr s2, [sp, #68]
+; CHECK-BE-NEXT: add x8, x0, #32
+; CHECK-BE-NEXT: zip1 v1.2d, v1.2d, v5.2d
+; CHECK-BE-NEXT: add x9, x0, #48
+; CHECK-BE-NEXT: str s2, [x0, #64]
+; CHECK-BE-NEXT: zip1 v5.2d, v16.2d, v17.2d
; CHECK-BE-NEXT: mov v4.s[3], v7.s[0]
-; CHECK-BE-NEXT: add x8, x0, #48
; CHECK-BE-NEXT: mov v0.s[3], v3.s[0]
-; CHECK-BE-NEXT: st1 { v16.4s }, [x8]
-; CHECK-BE-NEXT: add x8, x0, #32
-; CHECK-BE-NEXT: st1 { v17.4s }, [x8]
+; CHECK-BE-NEXT: st1 { v1.4s }, [x8]
; CHECK-BE-NEXT: add x8, x0, #16
+; CHECK-BE-NEXT: st1 { v5.4s }, [x9]
; CHECK-BE-NEXT: st1 { v4.4s }, [x8]
; CHECK-BE-NEXT: st1 { v0.4s }, [x0]
; CHECK-BE-NEXT: ret
>From 8f671a675f52a7bbf33df5d4c8545bab31d28689 Mon Sep 17 00:00:00 2001
From: ZhaoQi <zhaoqi01 at loongson.cn>
Date: Mon, 18 Aug 2025 20:15:49 +0800
Subject: [PATCH 002/112] [LoongArch] Always emit symbol-based relocations
regardless of relaxation (#153943)
This commit makes every relocation symbol-based rather than section-based.
Without this commit, errors may occur in some cases, such as when using
`llc/lto+relax`, or when combining relaxed and non-relaxed object files
using `ld -r`.
Some tests are updated accordingly.
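The effect on emitted relocations can be sketched with a small standalone
model. This is an illustration under stated assumptions, not LLVM code:
`Reloc`, `pickRelocTarget`, `needsSymbolBefore` and `needsSymbolAfter` are
hypothetical names, and the real decision is made through
`needsRelocateWithSymbol` in the ELF object writer. The sample values mirror
the `relax-attr.s` update in this patch, where `.text 0xC` becomes `.L1 0x0`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for one relocation record.
struct Reloc {
  std::string Target; // symbol or section name the relocation refers to
  int64_t Addend;     // constant added to the target's address
};

// Models the choice made for a reference to a local symbol `Sym` that
// lives at `OffsetInSection` inside `Section`.
Reloc pickRelocTarget(bool RelocateWithSymbol, const std::string &Sym,
                      const std::string &Section, int64_t OffsetInSection) {
  if (RelocateWithSymbol)
    return {Sym, 0};                 // symbol-based: keep the symbol itself
  return {Section, OffsetInSection}; // section-based: fold offset into addend
}

// Before this patch the hook returned the relax feature bit.
bool needsSymbolBefore(bool EnableRelax) { return EnableRelax; }

// After this patch the answer no longer depends on relaxation.
bool needsSymbolAfter(bool /*EnableRelax*/) { return true; }
```

In this model, with relaxation disabled, the old hook yields `.text` with
addend `0xC`, while the new one yields `.L1` with addend `0`. Keeping the
symbol presumably lets later steps that change code size still resolve the
reference correctly.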
---
.../MCTargetDesc/LoongArchAsmBackend.cpp | 3 +-
.../MCTargetDesc/LoongArchELFObjectWriter.cpp | 15 +--
.../MCTargetDesc/LoongArchMCTargetDesc.h | 2 +-
.../CodeGen/LoongArch/linker-relaxation.ll | 14 +--
.../xray-attribute-instrumentation.ll | 12 +-
.../LoongArch/dwarf-loongarch-relocs.ll | 11 +-
llvm/test/MC/LoongArch/Misc/cfi-advance.s | 2 +-
.../test/MC/LoongArch/Relocations/fde-reloc.s | 9 +-
.../MC/LoongArch/Relocations/relax-addsub.s | 12 +-
.../MC/LoongArch/Relocations/relax-attr.s | 2 +-
.../Relocations/relocation-specifier.s | 4 +-
llvm/test/MC/LoongArch/Relocations/sub-expr.s | 107 +++++++-----------
12 files changed, 79 insertions(+), 114 deletions(-)
diff --git a/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchAsmBackend.cpp b/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchAsmBackend.cpp
index 994e8577e496e..338134ffcde61 100644
--- a/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchAsmBackend.cpp
+++ b/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchAsmBackend.cpp
@@ -496,8 +496,7 @@ bool LoongArchAsmBackend::addReloc(const MCFragment &F, const MCFixup &Fixup,
std::unique_ptr<MCObjectTargetWriter>
LoongArchAsmBackend::createObjectTargetWriter() const {
- return createLoongArchELFObjectWriter(
- OSABI, Is64Bit, STI.hasFeature(LoongArch::FeatureRelax));
+ return createLoongArchELFObjectWriter(OSABI, Is64Bit);
}
MCAsmBackend *llvm::createLoongArchAsmBackend(const Target &T,
diff --git a/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchELFObjectWriter.cpp b/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchELFObjectWriter.cpp
index 7e021e486836a..7d5456555045b 100644
--- a/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchELFObjectWriter.cpp
+++ b/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchELFObjectWriter.cpp
@@ -21,26 +21,23 @@ using namespace llvm;
namespace {
class LoongArchELFObjectWriter : public MCELFObjectTargetWriter {
public:
- LoongArchELFObjectWriter(uint8_t OSABI, bool Is64Bit, bool EnableRelax);
+ LoongArchELFObjectWriter(uint8_t OSABI, bool Is64Bit);
~LoongArchELFObjectWriter() override;
bool needsRelocateWithSymbol(const MCValue &, unsigned Type) const override {
- return EnableRelax;
+ return true;
}
protected:
unsigned getRelocType(const MCFixup &, const MCValue &,
bool IsPCRel) const override;
- bool EnableRelax;
};
} // end namespace
-LoongArchELFObjectWriter::LoongArchELFObjectWriter(uint8_t OSABI, bool Is64Bit,
- bool EnableRelax)
+LoongArchELFObjectWriter::LoongArchELFObjectWriter(uint8_t OSABI, bool Is64Bit)
: MCELFObjectTargetWriter(Is64Bit, OSABI, ELF::EM_LOONGARCH,
- /*HasRelocationAddend=*/true),
- EnableRelax(EnableRelax) {}
+ /*HasRelocationAddend=*/true) {}
LoongArchELFObjectWriter::~LoongArchELFObjectWriter() {}
@@ -103,6 +100,6 @@ unsigned LoongArchELFObjectWriter::getRelocType(const MCFixup &Fixup,
}
std::unique_ptr<MCObjectTargetWriter>
-llvm::createLoongArchELFObjectWriter(uint8_t OSABI, bool Is64Bit, bool Relax) {
- return std::make_unique<LoongArchELFObjectWriter>(OSABI, Is64Bit, Relax);
+llvm::createLoongArchELFObjectWriter(uint8_t OSABI, bool Is64Bit) {
+ return std::make_unique<LoongArchELFObjectWriter>(OSABI, Is64Bit);
}
diff --git a/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchMCTargetDesc.h b/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchMCTargetDesc.h
index bb05baa9b717c..ab35a0096c8a2 100644
--- a/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchMCTargetDesc.h
+++ b/llvm/lib/Target/LoongArch/MCTargetDesc/LoongArchMCTargetDesc.h
@@ -36,7 +36,7 @@ MCAsmBackend *createLoongArchAsmBackend(const Target &T,
const MCTargetOptions &Options);
std::unique_ptr<MCObjectTargetWriter>
-createLoongArchELFObjectWriter(uint8_t OSABI, bool Is64Bit, bool Relax);
+createLoongArchELFObjectWriter(uint8_t OSABI, bool Is64Bit);
} // end namespace llvm
diff --git a/llvm/test/CodeGen/LoongArch/linker-relaxation.ll b/llvm/test/CodeGen/LoongArch/linker-relaxation.ll
index 3bb83193ce7ac..6b197bc578919 100644
--- a/llvm/test/CodeGen/LoongArch/linker-relaxation.ll
+++ b/llvm/test/CodeGen/LoongArch/linker-relaxation.ll
@@ -1,6 +1,6 @@
; RUN: llc --mtriple=loongarch64 --filetype=obj -mattr=-relax \
; RUN: --relocation-model=pic --code-model=medium < %s \
-; RUN: | llvm-readobj -r - | FileCheck --check-prefixes=CHECK-RELOC,PCALA-RELOC %s
+; RUN: | llvm-readobj -r - | FileCheck --check-prefix=CHECK-RELOC %s
; RUN: llc --mtriple=loongarch64 --filetype=obj -mattr=+relax \
; RUN: --relocation-model=pic --code-model=medium < %s \
; RUN: | llvm-readobj -r - | FileCheck --check-prefixes=CHECK-RELOC,RELAX %s
@@ -33,10 +33,8 @@ declare dso_local void @callee3() nounwind
; RELAX: R_LARCH_RELAX - 0x0
; CHECK-RELOC-NEXT: R_LARCH_GOT_PC_LO12 g_e 0x0
; RELAX-NEXT: R_LARCH_RELAX - 0x0
-; PCALA-RELOC: R_LARCH_PCALA_HI20 .bss 0x0
-; RELAX-NEXT: R_LARCH_PCALA_HI20 g_i 0x0
-; PCALA-RELOC: R_LARCH_PCALA_LO12 .bss 0x0
-; RELAX-NEXT: R_LARCH_PCALA_LO12 g_i 0x0
+; CHECK-RELOC-NEXT: R_LARCH_PCALA_HI20 g_i 0x0
+; CHECK-RELOC-NEXT: R_LARCH_PCALA_LO12 g_i 0x0
; CHECK-RELOC: R_LARCH_TLS_GD_PC_HI20 t_un 0x0
; RELAX-NEXT: R_LARCH_RELAX - 0x0
; CHECK-RELOC-NEXT: R_LARCH_GOT_PC_LO12 t_un 0x0
@@ -75,11 +73,9 @@ declare dso_local void @callee3() nounwind
; RELAX-NEXT: R_LARCH_RELAX - 0x0
; CHECK-RELOC-NEXT: R_LARCH_TLS_LE_LO12_R t_le 0x0
; RELAX-NEXT: R_LARCH_RELAX - 0x0
-; PCALA-RELOC: R_LARCH_PCALA_HI20 .data 0x0
-; RELAX-NEXT: R_LARCH_PCALA_HI20 g_i1 0x0
+; CHECK-RELOC-NEXT: R_LARCH_PCALA_HI20 g_i1 0x0
; RELAX-NEXT: R_LARCH_RELAX - 0x0
-; PCALA-RELOC: R_LARCH_PCALA_LO12 .data 0x0
-; RELAX-NEXT: R_LARCH_PCALA_LO12 g_i1 0x0
+; CHECK-RELOC-NEXT: R_LARCH_PCALA_LO12 g_i1 0x0
; RELAX-NEXT: R_LARCH_RELAX - 0x0
; RELAX-NEXT: R_LARCH_ALIGN - 0x1C
; CHECK-RELOC-NEXT: R_LARCH_CALL36 callee1 0x0
diff --git a/llvm/test/CodeGen/LoongArch/xray-attribute-instrumentation.ll b/llvm/test/CodeGen/LoongArch/xray-attribute-instrumentation.ll
index 8999c20387003..7838bcea1025d 100644
--- a/llvm/test/CodeGen/LoongArch/xray-attribute-instrumentation.ll
+++ b/llvm/test/CodeGen/LoongArch/xray-attribute-instrumentation.ll
@@ -43,14 +43,14 @@ define i32 @foo() nounwind noinline uwtable "function-instrument"="xray-always"
; CHECK-NEXT: .dword 2
; RELOC: Section ([[#]]) .relaxray_instr_map {
-; RELOC-NEXT: 0x0 R_LARCH_64_PCREL .text 0x0
-; RELOC-NEXT: 0x8 R_LARCH_64_PCREL .text 0x0
-; RELOC-NEXT: 0x20 R_LARCH_64_PCREL .text 0x34
-; RELOC-NEXT: 0x28 R_LARCH_64_PCREL .text 0x0
+; RELOC-NEXT: 0x0 R_LARCH_64_PCREL .L{{.*}} 0x0
+; RELOC-NEXT: 0x8 R_LARCH_64_PCREL .L{{.*}} 0x0
+; RELOC-NEXT: 0x20 R_LARCH_64_PCREL .L{{.*}} 0x0
+; RELOC-NEXT: 0x28 R_LARCH_64_PCREL .L{{.*}} 0x0
; RELOC-NEXT: }
; RELOC-NEXT: Section ([[#]]) .relaxray_fn_idx {
-; RELOC-NEXT: 0x0 R_LARCH_64_PCREL xray_instr_map 0x0
+; RELOC-NEXT: 0x0 R_LARCH_64_PCREL .Lxray_sleds_start0 0x0
; RELOC-NEXT: }
; RELOC-NEXT: Section ([[#]]) .rela.eh_frame {
-; RELOC-NEXT: 0x1C R_LARCH_32_PCREL .text 0x0
+; RELOC-NEXT: 0x1C R_LARCH_32_PCREL .L{{.*}} 0x0
; RELOC-NEXT: }
diff --git a/llvm/test/DebugInfo/LoongArch/dwarf-loongarch-relocs.ll b/llvm/test/DebugInfo/LoongArch/dwarf-loongarch-relocs.ll
index d28836d560377..2f5cc373a68f5 100644
--- a/llvm/test/DebugInfo/LoongArch/dwarf-loongarch-relocs.ll
+++ b/llvm/test/DebugInfo/LoongArch/dwarf-loongarch-relocs.ll
@@ -1,5 +1,5 @@
; RUN: llc --filetype=obj --mtriple=loongarch64 --mattr=-relax %s -o %t.o
-; RUN: llvm-readobj -r %t.o | FileCheck --check-prefixes=RELOCS-BOTH,RELOCS-NORL %s
+; RUN: llvm-readobj -r %t.o | FileCheck --check-prefix=RELOCS-BOTH %s
; RUN: llvm-objdump --source %t.o | FileCheck --check-prefix=SOURCE %s
; RUN: llvm-dwarfdump --debug-info --debug-line %t.o | FileCheck --check-prefix=DWARF %s
@@ -16,10 +16,8 @@
; RELOCS-ENRL-NEXT: 0x18 R_LARCH_RELAX - 0x0
; RELOCS-BOTH-NEXT: }
; RELOCS-BOTH: Section ({{.*}}) .rela.debug_frame {
-; RELOCS-NORL-NEXT: 0x1C R_LARCH_32 .debug_frame 0x0
-; RELOCS-NORL-NEXT: 0x20 R_LARCH_64 .text 0x0
-; RELOCS-ENRL-NEXT: 0x1C R_LARCH_32 .L0 0x0
-; RELOCS-ENRL-NEXT: 0x20 R_LARCH_64 .L0 0x0
+; RELOCS-BOTH-NEXT: 0x1C R_LARCH_32 .L0 0x0
+; RELOCS-BOTH-NEXT: 0x20 R_LARCH_64 .L0 0x0
; RELOCS-ENRL-NEXT: 0x28 R_LARCH_ADD64 .L0 0x0
; RELOCS-ENRL-NEXT: 0x28 R_LARCH_SUB64 .L0 0x0
; RELOCS-ENRL-NEXT: 0x3F R_LARCH_ADD6 .L0 0x0
@@ -29,8 +27,7 @@
; RELOCS-BOTH-NEXT: 0x22 R_LARCH_32 .debug_line_str 0x0
; RELOCS-BOTH-NEXT: 0x31 R_LARCH_32 .debug_line_str 0x2
; RELOCS-BOTH-NEXT: 0x46 R_LARCH_32 .debug_line_str 0x1B
-; RELOCS-NORL-NEXT: 0x4F R_LARCH_64 .text 0x0
-; RELOCS-ENRL-NEXT: 0x4F R_LARCH_64 .L0 0x0
+; RELOCS-BOTH-NEXT: 0x4F R_LARCH_64 .L0 0x0
; RELOCS-ENRL-NEXT: 0x5F R_LARCH_ADD16 .L0 0x0
; RELOCS-ENRL-NEXT: 0x5F R_LARCH_SUB16 .L0 0x0
; RELOCS-BOTH-NEXT: }
diff --git a/llvm/test/MC/LoongArch/Misc/cfi-advance.s b/llvm/test/MC/LoongArch/Misc/cfi-advance.s
index 494b8af21064b..86b36a38c3f15 100644
--- a/llvm/test/MC/LoongArch/Misc/cfi-advance.s
+++ b/llvm/test/MC/LoongArch/Misc/cfi-advance.s
@@ -6,7 +6,7 @@
# RELOC: Relocations [
# RELOC: .rela.eh_frame {
-# RELOC-NEXT: 0x1C R_LARCH_32_PCREL .text 0x0
+# RELOC-NEXT: 0x1C R_LARCH_32_PCREL .L{{.*}} 0x0
# RELOC-NEXT: }
# RELOC-NEXT: ]
# DWARFDUMP: DW_CFA_advance_loc: 8
diff --git a/llvm/test/MC/LoongArch/Relocations/fde-reloc.s b/llvm/test/MC/LoongArch/Relocations/fde-reloc.s
index ab911d1853a87..3b9f4003950f8 100644
--- a/llvm/test/MC/LoongArch/Relocations/fde-reloc.s
+++ b/llvm/test/MC/LoongArch/Relocations/fde-reloc.s
@@ -1,7 +1,7 @@
# RUN: llvm-mc --filetype=obj --triple=loongarch64 --mattr=-relax < %s \
# RUN: | llvm-readobj -r - | FileCheck %s
# RUN: llvm-mc --filetype=obj --triple=loongarch64 --mattr=+relax < %s \
-# RUN: | llvm-readobj -r - | FileCheck %s --check-prefix=RELAX
+# RUN: | llvm-readobj -r - | FileCheck %s
## Ensure that the eh_frame records the symbolic difference with
## the R_LARCH_32_PCREL relocation.
@@ -11,9 +11,6 @@ func:
ret
.cfi_endproc
-# CHECK: Section (4) .rela.eh_frame {
-# CHECK-NEXT: 0x1C R_LARCH_32_PCREL .text 0x0
+# CHECK: Section ({{.*}}) .rela.eh_frame {
+# CHECK-NEXT: 0x1C R_LARCH_32_PCREL .L{{.*}} 0x0
# CHECK-NEXT: }
-# RELAX: Section ({{.*}}) .rela.eh_frame {
-# RELAX-NEXT: 0x1C R_LARCH_32_PCREL .L{{.*}} 0x0
-# RELAX-NEXT: }
diff --git a/llvm/test/MC/LoongArch/Relocations/relax-addsub.s b/llvm/test/MC/LoongArch/Relocations/relax-addsub.s
index da3f655e9a31e..67c643c076895 100644
--- a/llvm/test/MC/LoongArch/Relocations/relax-addsub.s
+++ b/llvm/test/MC/LoongArch/Relocations/relax-addsub.s
@@ -6,18 +6,18 @@
# NORELAX: Relocations [
# NORELAX-NEXT: Section ({{.*}}) .rela.text {
# NORELAX-NEXT: 0x0 R_LARCH_CALL36 foo 0x0
-# NORELAX-NEXT: 0x10 R_LARCH_PCALA_HI20 .text 0x8
-# NORELAX-NEXT: 0x14 R_LARCH_PCALA_LO12 .text 0x8
+# NORELAX-NEXT: 0x10 R_LARCH_PCALA_HI20 .L1 0x0
+# NORELAX-NEXT: 0x14 R_LARCH_PCALA_LO12 .L1 0x0
# NORELAX-NEXT: }
# NORELAX-NEXT: Section ({{.*}}) .rela.data {
# NORELAX-NEXT: 0x30 R_LARCH_ADD8 foo 0x0
-# NORELAX-NEXT: 0x30 R_LARCH_SUB8 .text 0x10
+# NORELAX-NEXT: 0x30 R_LARCH_SUB8 .L3 0x0
# NORELAX-NEXT: 0x31 R_LARCH_ADD16 foo 0x0
-# NORELAX-NEXT: 0x31 R_LARCH_SUB16 .text 0x10
+# NORELAX-NEXT: 0x31 R_LARCH_SUB16 .L3 0x0
# NORELAX-NEXT: 0x33 R_LARCH_ADD32 foo 0x0
-# NORELAX-NEXT: 0x33 R_LARCH_SUB32 .text 0x10
+# NORELAX-NEXT: 0x33 R_LARCH_SUB32 .L3 0x0
# NORELAX-NEXT: 0x37 R_LARCH_ADD64 foo 0x0
-# NORELAX-NEXT: 0x37 R_LARCH_SUB64 .text 0x10
+# NORELAX-NEXT: 0x37 R_LARCH_SUB64 .L3 0x0
# NORELAX-NEXT: }
# NORELAX-NEXT: ]
diff --git a/llvm/test/MC/LoongArch/Relocations/relax-attr.s b/llvm/test/MC/LoongArch/Relocations/relax-attr.s
index d94d32ebd7ab0..7cc8dda07e333 100644
--- a/llvm/test/MC/LoongArch/Relocations/relax-attr.s
+++ b/llvm/test/MC/LoongArch/Relocations/relax-attr.s
@@ -8,7 +8,7 @@
# CHECK-NEXT: 0x4 R_LARCH_CALL36 foo 0x0
# CHECK-NEXT: }
# CHECK-NEXT: Section ({{.*}}) .rela.data {
-# CHECK-NEXT: 0x0 R_LARCH_64 .text 0xC
+# CHECK-NEXT: 0x0 R_LARCH_64 .L1 0x0
# CHECK-NEXT: }
# CHECK-NEXT: ]
diff --git a/llvm/test/MC/LoongArch/Relocations/relocation-specifier.s b/llvm/test/MC/LoongArch/Relocations/relocation-specifier.s
index d0898aaab92fe..c2526a6ecd701 100644
--- a/llvm/test/MC/LoongArch/Relocations/relocation-specifier.s
+++ b/llvm/test/MC/LoongArch/Relocations/relocation-specifier.s
@@ -6,10 +6,10 @@
## This test is similar to test/MC/CSKY/relocation-specifier.s.
# RELOC32: '.rela.data'
-# RELOC32: R_LARCH_32 00000000 .data + 0
+# RELOC32: R_LARCH_32 00000000 local
# RELOC64: '.rela.data'
-# RELOC64: R_LARCH_32 0000000000000000 .data + 0
+# RELOC64: R_LARCH_32 0000000000000000 local
# CHECK: TLS GLOBAL DEFAULT UND gd
# CHECK: TLS GLOBAL DEFAULT UND ld
diff --git a/llvm/test/MC/LoongArch/Relocations/sub-expr.s b/llvm/test/MC/LoongArch/Relocations/sub-expr.s
index 2d439194eb932..4554101200818 100644
--- a/llvm/test/MC/LoongArch/Relocations/sub-expr.s
+++ b/llvm/test/MC/LoongArch/Relocations/sub-expr.s
@@ -1,78 +1,57 @@
# RUN: llvm-mc --filetype=obj --triple=loongarch64 --mattr=-relax %s \
-# RUN: | llvm-readobj -r - | FileCheck %s
+# RUN: | llvm-readobj -r - | FileCheck %s --check-prefixes=CHECK,NORELAX
# RUN: llvm-mc --filetype=obj --triple=loongarch64 --mattr=+relax %s \
-# RUN: | llvm-readobj -r - | FileCheck %s --check-prefix=RELAX
+# RUN: | llvm-readobj -r - | FileCheck %s --check-prefixes=CHECK,RELAX
## Check that subtraction expressions emit R_LARCH_32_PCREL and R_LARCH_64_PCREL relocations.
## TODO: 1- or 2-byte data relocations are not supported for now.
-# CHECK: Relocations [
-# CHECK-NEXT: Section ({{.*}}) .rela.sx {
-# CHECK-NEXT: 0x4 R_LARCH_PCALA_HI20 z 0x0
-# CHECK-NEXT: 0x8 R_LARCH_PCALA_LO12 z 0x0
-# CHECK-NEXT: 0xC R_LARCH_32_PCREL .sy 0x10
-# CHECK-NEXT: }
+# CHECK: Relocations [
+# NORELAX-NEXT: Section ({{.*}}) .rela.sx {
+# NORELAX-NEXT: 0x4 R_LARCH_PCALA_HI20 z 0x0
+# NORELAX-NEXT: 0x8 R_LARCH_PCALA_LO12 z 0x0
+# NORELAX-NEXT: 0xC R_LARCH_32_PCREL y 0x8
+# NORELAX-NEXT: }
+# RELAX-NEXT: Section ({{.*}}) .rela.sx {
+# RELAX-NEXT: 0x4 R_LARCH_PCALA_HI20 z 0x0
+# RELAX-NEXT: 0x4 R_LARCH_RELAX - 0x0
+# RELAX-NEXT: 0x8 R_LARCH_PCALA_LO12 z 0x0
+# RELAX-NEXT: 0x8 R_LARCH_RELAX - 0x0
+# RELAX-NEXT: 0xC R_LARCH_ADD32 y 0x0
+# RELAX-NEXT: 0xC R_LARCH_SUB32 x 0x0
+# RELAX-NEXT: }
# CHECK-NEXT: Section ({{.*}}) .rela.data {
-# CHECK-NEXT: 0x0 R_LARCH_64_PCREL .sx 0x4
-# CHECK-NEXT: 0x8 R_LARCH_64_PCREL .sy 0x8
-# CHECK-NEXT: 0x10 R_LARCH_32_PCREL .sx 0x4
-# CHECK-NEXT: 0x14 R_LARCH_32_PCREL .sy 0x8
-# CHECK-NEXT: 0x18 R_LARCH_ADD64 .sx 0x4
-# CHECK-NEXT: 0x18 R_LARCH_SUB64 .sy 0x8
-# CHECK-NEXT: 0x20 R_LARCH_ADD64 .sy 0x8
-# CHECK-NEXT: 0x20 R_LARCH_SUB64 .sx 0x4
-# CHECK-NEXT: 0x28 R_LARCH_ADD32 .sx 0x4
-# CHECK-NEXT: 0x28 R_LARCH_SUB32 .sy 0x8
-# CHECK-NEXT: 0x2C R_LARCH_ADD32 .sy 0x8
-# CHECK-NEXT: 0x2C R_LARCH_SUB32 .sx 0x4
-# CHECK-NEXT: 0x30 R_LARCH_ADD64 .data 0x30
-# CHECK-NEXT: 0x30 R_LARCH_SUB64 .sx 0x4
-# CHECK-NEXT: 0x38 R_LARCH_ADD32 .data 0x38
-# CHECK-NEXT: 0x38 R_LARCH_SUB32 .sy 0x8
-# CHECK-NEXT: }
-# CHECK-NEXT: Section ({{.*}}) .rela.sy {
-# CHECK-NEXT: 0x0 R_LARCH_CALL36 foo 0x0
-# CHECK-NEXT: 0x10 R_LARCH_32_PCREL .sx 0xC
+# CHECK-NEXT: 0x0 R_LARCH_64_PCREL x 0x0
+# CHECK-NEXT: 0x8 R_LARCH_64_PCREL y 0x0
+# CHECK-NEXT: 0x10 R_LARCH_32_PCREL x 0x0
+# CHECK-NEXT: 0x14 R_LARCH_32_PCREL y 0x0
+# CHECK-NEXT: 0x18 R_LARCH_ADD64 x 0x0
+# CHECK-NEXT: 0x18 R_LARCH_SUB64 y 0x0
+# CHECK-NEXT: 0x20 R_LARCH_ADD64 y 0x0
+# CHECK-NEXT: 0x20 R_LARCH_SUB64 x 0x0
+# CHECK-NEXT: 0x28 R_LARCH_ADD32 x 0x0
+# CHECK-NEXT: 0x28 R_LARCH_SUB32 y 0x0
+# CHECK-NEXT: 0x2C R_LARCH_ADD32 y 0x0
+# CHECK-NEXT: 0x2C R_LARCH_SUB32 x 0x0
+# CHECK-NEXT: 0x30 R_LARCH_ADD64 {{.*}} 0x0
+# CHECK-NEXT: 0x30 R_LARCH_SUB64 x 0x0
+# CHECK-NEXT: 0x38 R_LARCH_ADD32 {{.*}} 0x0
+# CHECK-NEXT: 0x38 R_LARCH_SUB32 y 0x0
# CHECK-NEXT: }
+# NORELAX-NEXT: Section ({{.*}}) .rela.sy {
+# NORELAX-NEXT: 0x0 R_LARCH_CALL36 foo 0x0
+# NORELAX-NEXT: 0x10 R_LARCH_32_PCREL x 0x8
+# NORELAX-NEXT: }
+# RELAX-NEXT: Section ({{.*}}) .rela.sy {
+# RELAX-NEXT: 0x0 R_LARCH_CALL36 foo 0x0
+# RELAX-NEXT: 0x0 R_LARCH_RELAX - 0x0
+# RELAX-NEXT: 0x8 R_LARCH_ALIGN - 0xC
+# RELAX-NEXT: 0x14 R_LARCH_ADD32 x 0x0
+# RELAX-NEXT: 0x14 R_LARCH_SUB32 y 0x0
+# RELAX-NEXT: }
# CHECK-NEXT: ]
-# RELAX: Relocations [
-# RELAX-NEXT: Section ({{.*}}) .rela.sx {
-# RELAX-NEXT: 0x4 R_LARCH_PCALA_HI20 z 0x0
-# RELAX-NEXT: 0x4 R_LARCH_RELAX - 0x0
-# RELAX-NEXT: 0x8 R_LARCH_PCALA_LO12 z 0x0
-# RELAX-NEXT: 0x8 R_LARCH_RELAX - 0x0
-# RELAX-NEXT: 0xC R_LARCH_ADD32 y 0x0
-# RELAX-NEXT: 0xC R_LARCH_SUB32 x 0x0
-# RELAX-NEXT: }
-# RELAX-NEXT: Section ({{.*}}) .rela.data {
-# RELAX-NEXT: 0x0 R_LARCH_64_PCREL x 0x0
-# RELAX-NEXT: 0x8 R_LARCH_64_PCREL y 0x0
-# RELAX-NEXT: 0x10 R_LARCH_32_PCREL x 0x0
-# RELAX-NEXT: 0x14 R_LARCH_32_PCREL y 0x0
-# RELAX-NEXT: 0x18 R_LARCH_ADD64 x 0x0
-# RELAX-NEXT: 0x18 R_LARCH_SUB64 y 0x0
-# RELAX-NEXT: 0x20 R_LARCH_ADD64 y 0x0
-# RELAX-NEXT: 0x20 R_LARCH_SUB64 x 0x0
-# RELAX-NEXT: 0x28 R_LARCH_ADD32 x 0x0
-# RELAX-NEXT: 0x28 R_LARCH_SUB32 y 0x0
-# RELAX-NEXT: 0x2C R_LARCH_ADD32 y 0x0
-# RELAX-NEXT: 0x2C R_LARCH_SUB32 x 0x0
-# RELAX-NEXT: 0x30 R_LARCH_ADD64 {{.*}} 0x0
-# RELAX-NEXT: 0x30 R_LARCH_SUB64 x 0x0
-# RELAX-NEXT: 0x38 R_LARCH_ADD32 {{.*}} 0x0
-# RELAX-NEXT: 0x38 R_LARCH_SUB32 y 0x0
-# RELAX-NEXT: }
-# RELAX-NEXT: Section ({{.*}}) .rela.sy {
-# RELAX-NEXT: 0x0 R_LARCH_CALL36 foo 0x0
-# RELAX-NEXT: 0x0 R_LARCH_RELAX - 0x0
-# RELAX-NEXT: 0x8 R_LARCH_ALIGN - 0xC
-# RELAX-NEXT: 0x14 R_LARCH_ADD32 x 0x0
-# RELAX-NEXT: 0x14 R_LARCH_SUB32 y 0x0
-# RELAX-NEXT: }
-# RELAX-NEXT: ]
-
.section .sx,"ax"
nop
x:
>From 6aafe6582dc2290b3f624128eb48186663473e87 Mon Sep 17 00:00:00 2001
From: Akash Banerjee <Akash.Banerjee at amd.com>
Date: Mon, 18 Aug 2025 13:29:23 +0100
Subject: [PATCH 003/112] Fix test added in
1fd1d634630754cc9b9c4b5526961d5856f64ff9
---
offload/test/offloading/fortran/declare-target-automap.f90 | 1 +
1 file changed, 1 insertion(+)
diff --git a/offload/test/offloading/fortran/declare-target-automap.f90 b/offload/test/offloading/fortran/declare-target-automap.f90
index 50e8c124c25fc..b9c2d34c834fa 100644
--- a/offload/test/offloading/fortran/declare-target-automap.f90
+++ b/offload/test/offloading/fortran/declare-target-automap.f90
@@ -1,6 +1,7 @@
!Offloading test for AUTOMAP modifier in declare target enter
! REQUIRES: flang, amdgpu
+! RUN: %libomptarget-compile-fortran-run-and-check-generic
program automap_program
use iso_c_binding, only: c_loc
use omp_lib, only: omp_get_default_device, omp_target_is_present
From e8e3e6e893a2c944c8ce1878f290aa62843323e0 Mon Sep 17 00:00:00 2001
From: jofrn <jofernau at amd.com>
Date: Mon, 18 Aug 2025 08:34:59 -0400
Subject: [PATCH 004/112] [LiveVariables] Mark use without implicit if defined
at instr (#119446)
LiveVariables marks instructions with their implicit subregister uses.
However, it also marks a register as an implicit use when the
instruction's own definition is a subregister of it, e.g. `$r3 = OP val,
implicit-def $r0_r1_r2_r3, ..., implicit $r2_r3`, even if that register is
otherwise unused. This defines $r3 in the same instruction that uses it.
This change ensures such uses are marked without implicit, i.e. `$r3 =
OP val, implicit-def $r0_r1_r2_r3, ..., $r2_r3`.
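As a rough illustration of the lookup this patch simplifies, the sketch below models `FindLastPartialDef` after the change: among the sub-registers of a register, it returns the defining instruction that occurs latest in the block and no longer collects the set of partially defined registers. The `Instr` type, the integer register ids, and the map-based `physRegDef` and `distance` tables are hypothetical stand-ins for illustration, not the LLVM data structures.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for MachineInstr; only identity matters here.
struct Instr {
  std::string name;
};

// Simplified model of the patched lookup: return the latest instruction that
// defines any sub-register in subRegs, or nullptr if none has a def.
// As in the patch, a def is only picked if its distance is greater than the
// initial value 0, so this sketch assumes distances start at 1.
const Instr *findLastPartialDef(
    const std::vector<int> &subRegs,
    const std::map<int, const Instr *> &physRegDef,
    const std::map<const Instr *, unsigned> &distance) {
  unsigned lastDefDist = 0;
  const Instr *lastDef = nullptr;
  for (int subReg : subRegs) {
    auto defIt = physRegDef.find(subReg);
    if (defIt == physRegDef.end() || !defIt->second)
      continue; // this sub-register has no recorded def
    auto distIt = distance.find(defIt->second);
    assert(distIt != distance.end() && "every def needs a recorded distance");
    if (distIt->second > lastDefDist) {
      lastDef = defIt->second;
      lastDefDist = distIt->second;
    }
  }
  return lastDef;
}
```

With two defs at distances 1 and 3, the function returns the one at distance 3. A register whose sub-registers have no def yields `nullptr`, which the caller in the patch treats as a live-in register.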
---------
Co-authored-by: Matt Arsenault <arsenm2 at gmail.com>
---
llvm/include/llvm/CodeGen/LiveVariables.h | 6 +-
llvm/lib/CodeGen/LiveVariables.cpp | 37 +-------
.../test/CodeGen/AArch64/ldrpre-ldr-merge.mir | 2 +-
.../test/CodeGen/AMDGPU/fncall-implicitdef.ll | 25 +++++
.../CodeGen/AMDGPU/livevars-implicitdef.mir | 91 +++++++++++++++++++
5 files changed, 123 insertions(+), 38 deletions(-)
create mode 100644 llvm/test/CodeGen/AMDGPU/fncall-implicitdef.ll
create mode 100644 llvm/test/CodeGen/AMDGPU/livevars-implicitdef.mir
diff --git a/llvm/include/llvm/CodeGen/LiveVariables.h b/llvm/include/llvm/CodeGen/LiveVariables.h
index 974bf9eaa0376..dbf736ad65a99 100644
--- a/llvm/include/llvm/CodeGen/LiveVariables.h
+++ b/llvm/include/llvm/CodeGen/LiveVariables.h
@@ -165,10 +165,8 @@ class LiveVariables {
MachineInstr *FindLastRefOrPartRef(Register Reg);
/// FindLastPartialDef - Return the last partial def of the specified
- /// register. Also returns the sub-registers that're defined by the
- /// instruction.
- MachineInstr *FindLastPartialDef(Register Reg,
- SmallSet<Register, 4> &PartDefRegs);
+ /// register.
+ MachineInstr *FindLastPartialDef(Register Reg);
/// analyzePHINodes - Gather information about the PHI nodes in here. In
/// particular, we want to map the variable information of a virtual
diff --git a/llvm/lib/CodeGen/LiveVariables.cpp b/llvm/lib/CodeGen/LiveVariables.cpp
index 1f23418642bc6..c5dfddaa21e66 100644
--- a/llvm/lib/CodeGen/LiveVariables.cpp
+++ b/llvm/lib/CodeGen/LiveVariables.cpp
@@ -213,11 +213,7 @@ void LiveVariables::HandleVirtRegDef(Register Reg, MachineInstr &MI) {
}
/// FindLastPartialDef - Return the last partial def of the specified register.
-/// Also returns the sub-registers that're defined by the instruction.
-MachineInstr *
-LiveVariables::FindLastPartialDef(Register Reg,
- SmallSet<Register, 4> &PartDefRegs) {
- Register LastDefReg = 0;
+MachineInstr *LiveVariables::FindLastPartialDef(Register Reg) {
unsigned LastDefDist = 0;
MachineInstr *LastDef = nullptr;
for (MCPhysReg SubReg : TRI->subregs(Reg)) {
@@ -226,7 +222,6 @@ LiveVariables::FindLastPartialDef(Register Reg,
continue;
unsigned Dist = DistanceMap[Def];
if (Dist > LastDefDist) {
- LastDefReg = SubReg;
LastDef = Def;
LastDefDist = Dist;
}
@@ -235,14 +230,6 @@ LiveVariables::FindLastPartialDef(Register Reg,
if (!LastDef)
return nullptr;
- PartDefRegs.insert(LastDefReg);
- for (MachineOperand &MO : LastDef->all_defs()) {
- if (MO.getReg() == 0)
- continue;
- Register DefReg = MO.getReg();
- if (TRI->isSubRegister(Reg, DefReg))
- PartDefRegs.insert_range(TRI->subregs_inclusive(DefReg));
- }
return LastDef;
}
@@ -261,27 +248,11 @@ void LiveVariables::HandlePhysRegUse(Register Reg, MachineInstr &MI) {
// ...
// = EAX
// All of the sub-registers must have been defined before the use of Reg!
- SmallSet<Register, 4> PartDefRegs;
- MachineInstr *LastPartialDef = FindLastPartialDef(Reg, PartDefRegs);
+ MachineInstr *LastPartialDef = FindLastPartialDef(Reg);
// If LastPartialDef is NULL, it must be using a livein register.
if (LastPartialDef) {
- LastPartialDef->addOperand(MachineOperand::CreateReg(Reg, true/*IsDef*/,
- true/*IsImp*/));
- PhysRegDef[Reg.id()] = LastPartialDef;
- SmallSet<MCPhysReg, 8> Processed;
- for (MCPhysReg SubReg : TRI->subregs(Reg)) {
- if (Processed.count(SubReg))
- continue;
- if (PartDefRegs.count(SubReg))
- continue;
- // This part of Reg was defined before the last partial def. It's killed
- // here.
- LastPartialDef->addOperand(MachineOperand::CreateReg(SubReg,
- false/*IsDef*/,
- true/*IsImp*/));
- PhysRegDef[SubReg] = LastPartialDef;
- Processed.insert_range(TRI->subregs(SubReg));
- }
+ LastPartialDef->addOperand(
+ MachineOperand::CreateReg(Reg, /*IsDef=*/true, /*IsImp=*/true));
}
} else if (LastDef && !PhysRegUse[Reg.id()] &&
!LastDef->findRegisterDefOperand(Reg, /*TRI=*/nullptr))
diff --git a/llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir b/llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir
index a10d7588cb442..8a5e0f6aa843a 100644
--- a/llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir
+++ b/llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir
@@ -756,7 +756,7 @@ body: |
; CHECK: liveins: $x0, $x1, $x2
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: early-clobber renamable $x1, renamable $x0 = LDRSWpre renamable $x1, 40, implicit $w1, implicit $w1_hi :: (load (s32))
- ; CHECK-NEXT: renamable $w2 = LDRWui renamable $x1, 1, implicit-def $x2, implicit $w2_hi :: (load (s32))
+ ; CHECK-NEXT: renamable $w2 = LDRWui renamable $x1, 1, implicit-def $x2 :: (load (s32))
; CHECK-NEXT: STPXi renamable $x0, renamable $x2, renamable $x1, 0 :: (store (s64))
; CHECK-NEXT: RET undef $lr
early-clobber renamable $x1, renamable $x0 = LDRSWpre killed renamable $x1, 40 :: (load (s32))
diff --git a/llvm/test/CodeGen/AMDGPU/fncall-implicitdef.ll b/llvm/test/CodeGen/AMDGPU/fncall-implicitdef.ll
new file mode 100644
index 0000000000000..66a8b424b5763
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/fncall-implicitdef.ll
@@ -0,0 +1,25 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx900 -O1 %s -o - | FileCheck %s
+
+define amdgpu_ps <4 x float> @caller(ptr %ptr) {
+; CHECK-LABEL: caller:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: flat_load_dword v1, v[0:1]
+; CHECK-NEXT: s_mov_b32 s0, 0
+; CHECK-NEXT: s_mov_b32 s1, 0
+; CHECK-NEXT: s_mov_b32 s2, 0
+; CHECK-NEXT: s_mov_b32 s5, fn at abs32@hi
+; CHECK-NEXT: s_mov_b32 s4, fn at abs32@lo
+; CHECK-NEXT: s_mov_b64 s[8:9], 0
+; CHECK-NEXT: v_mov_b32_e32 v0, 0
+; CHECK-NEXT: s_mov_b32 s3, 0
+; CHECK-NEXT: v_mov_b32_e32 v2, 0
+; CHECK-NEXT: s_mov_b32 s32, 0
+; CHECK-NEXT: s_swappc_b64 s[30:31], s[4:5]
+; CHECK-NEXT: ; return to shader part epilog
+ %L = load i32, ptr %ptr, align 4
+ %R = call <4 x float> @fn(<4 x i32> zeroinitializer, i32 0, i32 %L, i32 0)
+ ret <4 x float> %R
+}
+
+declare hidden <4 x float> @fn(<4 x i32> inreg, i32, i32, i32)
diff --git a/llvm/test/CodeGen/AMDGPU/livevars-implicitdef.mir b/llvm/test/CodeGen/AMDGPU/livevars-implicitdef.mir
new file mode 100644
index 0000000000000..18aeb2527b1a3
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/livevars-implicitdef.mir
@@ -0,0 +1,91 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn --run-pass=livevars -o - %s | FileCheck %s
+---
+# Check that super register is defined for an sgpr copy.
+name: sgpr_copy
+tracksRegLiveness: true
+body: |
+ bb.0:
+
+ ; CHECK-LABEL: name: sgpr_copy
+ ; CHECK: %sval:sreg_32 = S_MOV_B32 0
+ ; CHECK-NEXT: $sgpr0 = COPY %sval
+ ; CHECK-NEXT: $sgpr1 = COPY %sval
+ ; CHECK-NEXT: $sgpr2 = COPY %sval
+ ; CHECK-NEXT: $sgpr3 = COPY killed %sval
+ ; CHECK-NEXT: SI_RETURN implicit killed $sgpr0_sgpr1_sgpr2_sgpr3
+ %sval:sreg_32 = S_MOV_B32 0
+
+ $sgpr0 = COPY %sval
+ $sgpr1 = COPY %sval
+ $sgpr2 = COPY %sval
+ $sgpr3 = COPY %sval
+ SI_RETURN implicit $sgpr0_sgpr1_sgpr2_sgpr3
+
+...
+---
+# Check that super register is defined for a vgpr vector copy.
+name: vgpr_copy
+tracksRegLiveness: true
+body: |
+ bb.0:
+
+ ; CHECK-LABEL: name: vgpr_copy
+ ; CHECK: %vval:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+ ; CHECK-NEXT: $vgpr0 = COPY %vval
+ ; CHECK-NEXT: $vgpr1 = COPY %vval
+ ; CHECK-NEXT: $vgpr2 = COPY %vval
+ ; CHECK-NEXT: $vgpr3 = COPY killed %vval
+ ; CHECK-NEXT: dead [[COPY:%[0-9]+]]:vgpr_32 = COPY killed $vgpr0_vgpr1_vgpr2_vgpr3
+ %vval:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+
+ $vgpr0 = COPY %vval
+ $vgpr1 = COPY %vval
+ $vgpr2 = COPY %vval
+ $vgpr3 = COPY %vval
+ %0:vgpr_32 = COPY $vgpr0_vgpr1_vgpr2_vgpr3
+
+...
+---
+# Check that super register is defined when there is a hole.
+name: sgpr_copy_hole
+tracksRegLiveness: true
+body: |
+ bb.0:
+ ; CHECK-LABEL: name: sgpr_copy_hole
+ ; CHECK: %sval:sreg_32 = S_MOV_B32 0
+ ; CHECK-NEXT: $sgpr0 = COPY %sval
+ ; CHECK-NEXT: $sgpr2 = COPY %sval
+ ; CHECK-NEXT: $sgpr3 = COPY killed %sval
+ ; CHECK-NEXT: SI_RETURN implicit killed $sgpr0_sgpr1_sgpr2_sgpr3
+ %sval:sreg_32 = S_MOV_B32 0
+
+ $sgpr0 = COPY %sval
+ $sgpr2 = COPY %sval
+ $sgpr3 = COPY %sval
+ SI_RETURN implicit $sgpr0_sgpr1_sgpr2_sgpr3
+
+...
+---
+# Check that super register is defined when a pair interrupts the sequence.
+name: vgpr_copy_pair
+tracksRegLiveness: true
+body: |
+ bb.0:
+ ; CHECK-LABEL: name: vgpr_copy_pair
+ ; CHECK: %vval:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+ ; CHECK-NEXT: $vgpr0 = COPY %vval
+ ; CHECK-NEXT: $vgpr1 = COPY %vval
+ ; CHECK-NEXT: $vgpr2 = COPY %vval
+ ; CHECK-NEXT: $vgpr3 = COPY killed %vval
+ ; CHECK-NEXT: dead [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr1_vgpr2
+ ; CHECK-NEXT: dead [[COPY1:%[0-9]+]]:vgpr_32 = COPY killed $vgpr0_vgpr1_vgpr2_vgpr3
+ %vval:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+
+ $vgpr0 = COPY %vval
+ $vgpr1 = COPY %vval
+ $vgpr2 = COPY %vval
+ $vgpr3 = COPY %vval
+ %0:vgpr_32 = COPY $vgpr1_vgpr2
+ %1:vgpr_32 = COPY $vgpr0_vgpr1_vgpr2_vgpr3
+...
From 4a3bf27c69473e65a9176858ff57c8b55dfb184c Mon Sep 17 00:00:00 2001
From: Chaitanya <Krishna.Sankisa at amd.com>
Date: Mon, 18 Aug 2025 18:15:11 +0530
Subject: [PATCH 005/112] [OpenMP] Introduce omp.target_allocmem and
omp.target_freemem omp dialect ops. (#145464)
This PR introduces two new ops in the omp dialect, omp.target_allocmem and
omp.target_freemem.
omp.target_allocmem: Allocates heap memory on the device. It will be lowered
to an omp_target_alloc call in LLVM.
omp.target_freemem: Deallocates heap memory on the device. It will be lowered
to an omp_target_free call in LLVM.
Example:
%1 = omp.target_allocmem %device : i32, i64
omp.target_freemem %device, %1 : i32, i64
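To make the allocation size concrete, here is a minimal sketch of the arithmetic that the flang conversion in this patch appears to perform before the op is lowered: the element size is rounded up to its ABI alignment, multiplied by the constant extents encoded in the array type, and then by each dynamic extent operand. The function names and the plain-integer interface are assumptions for illustration; the real code builds LLVM dialect ops rather than computing integers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Round size up to the next multiple of alignment.
std::int64_t alignTo(std::int64_t size, std::int64_t alignment) {
  assert(alignment > 0 && "alignment must be positive");
  return (size + alignment - 1) / alignment * alignment;
}

// Hypothetical helper mirroring the size computation:
//   bytes = alignTo(elemSize, elemAlign) * constScale * product(dynamicExtents)
// constScale models the product of constant extents from the array type
// (1 when there are none); dynamicExtents models the runtime shape operands.
std::int64_t allocationSizeInBytes(
    std::int64_t elemSize, std::int64_t elemAlign, std::int64_t constScale,
    const std::vector<std::int64_t> &dynamicExtents) {
  std::int64_t size = alignTo(elemSize, elemAlign);
  size *= constScale;
  for (std::int64_t extent : dynamicExtents)
    size *= extent;
  return size;
}
```

For a 12-byte element with 8-byte alignment, a constant scale of 3, and dynamic extents 4 and 5, this gives 16 * 3 * 4 * 5 = 960 bytes.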
The work in this PR is copied from/inspired by @ivanradanov's commits from
the coexecute implementation:
[Add fir omp target alloc and free
ops](https://github.com/ivanradanov/llvm-project/commit/be860ac8baf24b8405e6f396c75d7f0d26375de5)
[Lower omp_target_{alloc,free} to
llvm](https://github.com/ivanradanov/llvm-project/commit/6e2d584dc93ff99bb89adc28c7afbc2b21c46d39)
---
flang/include/flang/Optimizer/Support/Utils.h | 33 ++
flang/lib/Optimizer/CodeGen/CodeGen.cpp | 114 ++-----
flang/lib/Optimizer/CodeGen/CodeGenOpenMP.cpp | 49 +++
flang/lib/Optimizer/Dialect/FIROps.cpp | 1 -
flang/lib/Optimizer/Support/Utils.cpp | 71 +++++
.../test/Fir/omp_target_allocmem_freemem.fir | 294 ++++++++++++++++++
mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td | 94 ++++++
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | 101 ++++++
.../OpenMP/OpenMPToLLVMIRTranslation.cpp | 89 ++++++
.../ompenmp-target-allocmem-freemem.mlir | 42 +++
10 files changed, 804 insertions(+), 84 deletions(-)
create mode 100644 flang/test/Fir/omp_target_allocmem_freemem.fir
create mode 100644 mlir/test/Target/LLVMIR/ompenmp-target-allocmem-freemem.mlir
diff --git a/flang/include/flang/Optimizer/Support/Utils.h b/flang/include/flang/Optimizer/Support/Utils.h
index 83c936b7dcada..0b31cfea0430a 100644
--- a/flang/include/flang/Optimizer/Support/Utils.h
+++ b/flang/include/flang/Optimizer/Support/Utils.h
@@ -27,6 +27,8 @@
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/StringRef.h"
+#include "flang/Optimizer/CodeGen/TypeConverter.h"
+
namespace fir {
/// Return the integer value of a arith::ConstantOp.
inline std::int64_t toInt(mlir::arith::ConstantOp cop) {
@@ -198,6 +200,37 @@ std::optional<llvm::ArrayRef<int64_t>> getComponentLowerBoundsIfNonDefault(
fir::RecordType recordType, llvm::StringRef component,
mlir::ModuleOp module, const mlir::SymbolTable *symbolTable = nullptr);
+/// Generate a LLVM constant value of type `ity`, using the provided offset.
+mlir::LLVM::ConstantOp
+genConstantIndex(mlir::Location loc, mlir::Type ity,
+ mlir::ConversionPatternRewriter &rewriter,
+ std::int64_t offset);
+
+/// Helper function for generating the LLVM IR that computes the distance
+/// in bytes between adjacent elements pointed to by a pointer
+/// of type \p ptrTy. The result is returned as a value of \p idxTy integer
+/// type.
+mlir::Value computeElementDistance(mlir::Location loc,
+ mlir::Type llvmObjectType, mlir::Type idxTy,
+ mlir::ConversionPatternRewriter &rewriter,
+ const mlir::DataLayout &dataLayout);
+
+// Compute the alloc scale size (constant factors encoded in the array type).
+// We do this for arrays without a constant interior or arrays of character with
+// dynamic length arrays, since those are the only ones that get decayed to a
+// pointer to the element type.
+mlir::Value genAllocationScaleSize(mlir::Location loc, mlir::Type dataTy,
+ mlir::Type ity,
+ mlir::ConversionPatternRewriter &rewriter);
+
+/// Perform an extension or truncation as needed on an integer value. Lowering
+/// to the specific target may involve some sign-extending or truncation of
+/// values, particularly to fit them from abstract box types to the
+/// appropriate reified structures.
+mlir::Value integerCast(const fir::LLVMTypeConverter &converter,
+ mlir::Location loc,
+ mlir::ConversionPatternRewriter &rewriter,
+ mlir::Type ty, mlir::Value val, bool fold = false);
} // namespace fir
#endif // FORTRAN_OPTIMIZER_SUPPORT_UTILS_H
diff --git a/flang/lib/Optimizer/CodeGen/CodeGen.cpp b/flang/lib/Optimizer/CodeGen/CodeGen.cpp
index ba5fef97c83ed..76f3cbd421cb9 100644
--- a/flang/lib/Optimizer/CodeGen/CodeGen.cpp
+++ b/flang/lib/Optimizer/CodeGen/CodeGen.cpp
@@ -87,14 +87,6 @@ static inline mlir::Type getI8Type(mlir::MLIRContext *context) {
return mlir::IntegerType::get(context, 8);
}
-static mlir::LLVM::ConstantOp
-genConstantIndex(mlir::Location loc, mlir::Type ity,
- mlir::ConversionPatternRewriter &rewriter,
- std::int64_t offset) {
- auto cattr = rewriter.getI64IntegerAttr(offset);
- return mlir::LLVM::ConstantOp::create(rewriter, loc, ity, cattr);
-}
-
static mlir::Block *createBlock(mlir::ConversionPatternRewriter &rewriter,
mlir::Block *insertBefore) {
assert(insertBefore && "expected valid insertion block");
@@ -208,39 +200,6 @@ getDependentTypeMemSizeFn(fir::RecordType recTy, fir::AllocaOp op,
TODO(op.getLoc(), "did not find allocation function");
}
-// Compute the alloc scale size (constant factors encoded in the array type).
-// We do this for arrays without a constant interior or arrays of character with
-// dynamic length arrays, since those are the only ones that get decayed to a
-// pointer to the element type.
-template <typename OP>
-static mlir::Value
-genAllocationScaleSize(OP op, mlir::Type ity,
- mlir::ConversionPatternRewriter &rewriter) {
- mlir::Location loc = op.getLoc();
- mlir::Type dataTy = op.getInType();
- auto seqTy = mlir::dyn_cast<fir::SequenceType>(dataTy);
- fir::SequenceType::Extent constSize = 1;
- if (seqTy) {
- int constRows = seqTy.getConstantRows();
- const fir::SequenceType::ShapeRef &shape = seqTy.getShape();
- if (constRows != static_cast<int>(shape.size())) {
- for (auto extent : shape) {
- if (constRows-- > 0)
- continue;
- if (extent != fir::SequenceType::getUnknownExtent())
- constSize *= extent;
- }
- }
- }
-
- if (constSize != 1) {
- mlir::Value constVal{
- genConstantIndex(loc, ity, rewriter, constSize).getResult()};
- return constVal;
- }
- return nullptr;
-}
-
namespace {
struct DeclareOpConversion : public fir::FIROpConversion<fir::cg::XDeclareOp> {
public:
@@ -275,7 +234,7 @@ struct AllocaOpConversion : public fir::FIROpConversion<fir::AllocaOp> {
auto loc = alloc.getLoc();
mlir::Type ity = lowerTy().indexType();
unsigned i = 0;
- mlir::Value size = genConstantIndex(loc, ity, rewriter, 1).getResult();
+ mlir::Value size = fir::genConstantIndex(loc, ity, rewriter, 1).getResult();
mlir::Type firObjType = fir::unwrapRefType(alloc.getType());
mlir::Type llvmObjectType = convertObjectType(firObjType);
if (alloc.hasLenParams()) {
@@ -307,7 +266,8 @@ struct AllocaOpConversion : public fir::FIROpConversion<fir::AllocaOp> {
<< scalarType << " with type parameters";
}
}
- if (auto scaleSize = genAllocationScaleSize(alloc, ity, rewriter))
+ if (auto scaleSize = fir::genAllocationScaleSize(
+ alloc.getLoc(), alloc.getInType(), ity, rewriter))
size =
rewriter.createOrFold<mlir::LLVM::MulOp>(loc, ity, size, scaleSize);
if (alloc.hasShapeOperands()) {
@@ -484,7 +444,7 @@ struct BoxIsArrayOpConversion : public fir::FIROpConversion<fir::BoxIsArrayOp> {
auto loc = boxisarray.getLoc();
TypePair boxTyPair = getBoxTypePair(boxisarray.getVal().getType());
mlir::Value rank = getRankFromBox(loc, boxTyPair, a, rewriter);
- mlir::Value c0 = genConstantIndex(loc, rank.getType(), rewriter, 0);
+ mlir::Value c0 = fir::genConstantIndex(loc, rank.getType(), rewriter, 0);
rewriter.replaceOpWithNewOp<mlir::LLVM::ICmpOp>(
boxisarray, mlir::LLVM::ICmpPredicate::ne, rank, c0);
return mlir::success();
@@ -820,7 +780,7 @@ struct ConvertOpConversion : public fir::FIROpConversion<fir::ConvertOp> {
// Do folding for constant inputs.
if (auto constVal = fir::getIntIfConstant(op0)) {
mlir::Value normVal =
- genConstantIndex(loc, toTy, rewriter, *constVal ? 1 : 0);
+ fir::genConstantIndex(loc, toTy, rewriter, *constVal ? 1 : 0);
rewriter.replaceOp(convert, normVal);
return mlir::success();
}
@@ -833,7 +793,7 @@ struct ConvertOpConversion : public fir::FIROpConversion<fir::ConvertOp> {
}
// Compare the input with zero.
- mlir::Value zero = genConstantIndex(loc, fromTy, rewriter, 0);
+ mlir::Value zero = fir::genConstantIndex(loc, fromTy, rewriter, 0);
auto isTrue = mlir::LLVM::ICmpOp::create(
rewriter, loc, mlir::LLVM::ICmpPredicate::ne, op0, zero);
@@ -1082,21 +1042,6 @@ static mlir::SymbolRefAttr getMalloc(fir::AllocMemOp op,
return getMallocInModule(mod, op, rewriter, indexType);
}
-/// Helper function for generating the LLVM IR that computes the distance
-/// in bytes between adjacent elements pointed to by a pointer
-/// of type \p ptrTy. The result is returned as a value of \p idxTy integer
-/// type.
-static mlir::Value
-computeElementDistance(mlir::Location loc, mlir::Type llvmObjectType,
- mlir::Type idxTy,
- mlir::ConversionPatternRewriter &rewriter,
- const mlir::DataLayout &dataLayout) {
- llvm::TypeSize size = dataLayout.getTypeSize(llvmObjectType);
- unsigned short alignment = dataLayout.getTypeABIAlignment(llvmObjectType);
- std::int64_t distance = llvm::alignTo(size, alignment);
- return genConstantIndex(loc, idxTy, rewriter, distance);
-}
-
/// Return value of the stride in bytes between adjacent elements
/// of LLVM type \p llTy. The result is returned as a value of
/// \p idxTy integer type.
@@ -1105,7 +1050,7 @@ genTypeStrideInBytes(mlir::Location loc, mlir::Type idxTy,
mlir::ConversionPatternRewriter &rewriter, mlir::Type llTy,
const mlir::DataLayout &dataLayout) {
// Create a pointer type and use computeElementDistance().
- return computeElementDistance(loc, llTy, idxTy, rewriter, dataLayout);
+ return fir::computeElementDistance(loc, llTy, idxTy, rewriter, dataLayout);
}
namespace {
@@ -1124,8 +1069,9 @@ struct AllocMemOpConversion : public fir::FIROpConversion<fir::AllocMemOp> {
if (fir::isRecordWithTypeParameters(fir::unwrapSequenceType(dataTy)))
TODO(loc, "fir.allocmem codegen of derived type with length parameters");
mlir::Value size = genTypeSizeInBytes(loc, ity, rewriter, llvmObjectTy);
- if (auto scaleSize = genAllocationScaleSize(heap, ity, rewriter))
- size = mlir::LLVM::MulOp::create(rewriter, loc, ity, size, scaleSize);
+ if (auto scaleSize =
+ fir::genAllocationScaleSize(loc, heap.getInType(), ity, rewriter))
+ size = rewriter.create<mlir::LLVM::MulOp>(loc, ity, size, scaleSize);
for (mlir::Value opnd : adaptor.getOperands())
size = mlir::LLVM::MulOp::create(rewriter, loc, ity, size,
integerCast(loc, rewriter, ity, opnd));
@@ -1133,8 +1079,8 @@ struct AllocMemOpConversion : public fir::FIROpConversion<fir::AllocMemOp> {
// As the return value of malloc(0) is implementation defined, allocate one
// byte to ensure the allocation status being true. This behavior aligns to
// what the runtime has.
- mlir::Value zero = genConstantIndex(loc, ity, rewriter, 0);
- mlir::Value one = genConstantIndex(loc, ity, rewriter, 1);
+ mlir::Value zero = fir::genConstantIndex(loc, ity, rewriter, 0);
+ mlir::Value one = fir::genConstantIndex(loc, ity, rewriter, 1);
mlir::Value cmp = mlir::LLVM::ICmpOp::create(
rewriter, loc, mlir::LLVM::ICmpPredicate::sgt, size, zero);
size = mlir::LLVM::SelectOp::create(rewriter, loc, cmp, size, one);
@@ -1157,7 +1103,8 @@ struct AllocMemOpConversion : public fir::FIROpConversion<fir::AllocMemOp> {
mlir::Value genTypeSizeInBytes(mlir::Location loc, mlir::Type idxTy,
mlir::ConversionPatternRewriter &rewriter,
mlir::Type llTy) const {
- return computeElementDistance(loc, llTy, idxTy, rewriter, getDataLayout());
+ return fir::computeElementDistance(loc, llTy, idxTy, rewriter,
+ getDataLayout());
}
};
} // namespace
@@ -1344,7 +1291,7 @@ genCUFAllocDescriptor(mlir::Location loc,
mlir::Type structTy = typeConverter.convertBoxTypeAsStruct(boxTy);
std::size_t boxSize = dl->getTypeSizeInBits(structTy) / 8;
mlir::Value sizeInBytes =
- genConstantIndex(loc, llvmIntPtrType, rewriter, boxSize);
+ fir::genConstantIndex(loc, llvmIntPtrType, rewriter, boxSize);
llvm::SmallVector args = {sizeInBytes, sourceFile, sourceLine};
return mlir::LLVM::CallOp::create(rewriter, loc, fctTy,
RTNAME_STRING(CUFAllocDescriptor), args)
@@ -1599,7 +1546,7 @@ struct EmboxCommonConversion : public fir::FIROpConversion<OP> {
// representation of derived types with pointer/allocatable components.
// This has been seen in hashing algorithms using TRANSFER.
mlir::Value zero =
- genConstantIndex(loc, rewriter.getI64Type(), rewriter, 0);
+ fir::genConstantIndex(loc, rewriter.getI64Type(), rewriter, 0);
descriptor = insertField(rewriter, loc, descriptor,
{getLenParamFieldId(boxTy), 0}, zero);
}
@@ -1944,8 +1891,8 @@ struct XEmboxOpConversion : public EmboxCommonConversion<fir::cg::XEmboxOp> {
bool hasSlice = !xbox.getSlice().empty();
unsigned sliceOffset = xbox.getSliceOperandIndex();
mlir::Location loc = xbox.getLoc();
- mlir::Value zero = genConstantIndex(loc, i64Ty, rewriter, 0);
- mlir::Value one = genConstantIndex(loc, i64Ty, rewriter, 1);
+ mlir::Value zero = fir::genConstantIndex(loc, i64Ty, rewriter, 0);
+ mlir::Value one = fir::genConstantIndex(loc, i64Ty, rewriter, 1);
mlir::Value prevPtrOff = one;
mlir::Type eleTy = boxTy.getEleTy();
const unsigned rank = xbox.getRank();
@@ -1994,7 +1941,7 @@ struct XEmboxOpConversion : public EmboxCommonConversion<fir::cg::XEmboxOp> {
prevDimByteStride =
getCharacterByteSize(loc, rewriter, charTy, adaptor.getLenParams());
} else {
- prevDimByteStride = genConstantIndex(
+ prevDimByteStride = fir::genConstantIndex(
loc, i64Ty, rewriter,
charTy.getLen() * lowerTy().characterBitsize(charTy) / 8);
}
@@ -2152,7 +2099,7 @@ struct XReboxOpConversion : public EmboxCommonConversion<fir::cg::XReboxOp> {
if (auto charTy = mlir::dyn_cast<fir::CharacterType>(inputEleTy)) {
if (charTy.hasConstantLen()) {
mlir::Value len =
- genConstantIndex(loc, idxTy, rewriter, charTy.getLen());
+ fir::genConstantIndex(loc, idxTy, rewriter, charTy.getLen());
lenParams.emplace_back(len);
} else {
mlir::Value len = getElementSizeFromBox(loc, idxTy, inputBoxTyPair,
@@ -2161,7 +2108,7 @@ struct XReboxOpConversion : public EmboxCommonConversion<fir::cg::XReboxOp> {
assert(!isInGlobalOp(rewriter) &&
"character target in global op must have constant length");
mlir::Value width =
- genConstantIndex(loc, idxTy, rewriter, charTy.getFKind());
+ fir::genConstantIndex(loc, idxTy, rewriter, charTy.getFKind());
len = mlir::LLVM::SDivOp::create(rewriter, loc, idxTy, len, width);
}
lenParams.emplace_back(len);
@@ -2215,8 +2162,9 @@ struct XReboxOpConversion : public EmboxCommonConversion<fir::cg::XReboxOp> {
mlir::ConversionPatternRewriter &rewriter) const {
mlir::Location loc = rebox.getLoc();
mlir::Value zero =
- genConstantIndex(loc, lowerTy().indexType(), rewriter, 0);
- mlir::Value one = genConstantIndex(loc, lowerTy().indexType(), rewriter, 1);
+ fir::genConstantIndex(loc, lowerTy().indexType(), rewriter, 0);
+ mlir::Value one =
+ fir::genConstantIndex(loc, lowerTy().indexType(), rewriter, 1);
for (auto iter : llvm::enumerate(llvm::zip(extents, strides))) {
mlir::Value extent = std::get<0>(iter.value());
unsigned dim = iter.index();
@@ -2249,7 +2197,7 @@ struct XReboxOpConversion : public EmboxCommonConversion<fir::cg::XReboxOp> {
mlir::Location loc = rebox.getLoc();
mlir::Type byteTy = ::getI8Type(rebox.getContext());
mlir::Type idxTy = lowerTy().indexType();
- mlir::Value zero = genConstantIndex(loc, idxTy, rewriter, 0);
+ mlir::Value zero = fir::genConstantIndex(loc, idxTy, rewriter, 0);
// Apply subcomponent and substring shift on base address.
if (!rebox.getSubcomponent().empty() || !rebox.getSubstr().empty()) {
// Cast to inputEleTy* so that a GEP can be used.
@@ -2277,7 +2225,7 @@ struct XReboxOpConversion : public EmboxCommonConversion<fir::cg::XReboxOp> {
// and strides.
llvm::SmallVector<mlir::Value> slicedExtents;
llvm::SmallVector<mlir::Value> slicedStrides;
- mlir::Value one = genConstantIndex(loc, idxTy, rewriter, 1);
+ mlir::Value one = fir::genConstantIndex(loc, idxTy, rewriter, 1);
const bool sliceHasOrigins = !rebox.getShift().empty();
unsigned sliceOps = rebox.getSliceOperandIndex();
unsigned shiftOps = rebox.getShiftOperandIndex();
@@ -2350,7 +2298,7 @@ struct XReboxOpConversion : public EmboxCommonConversion<fir::cg::XReboxOp> {
// which may be OK if all new extents are ones, the stride does not
// matter, use one.
mlir::Value stride = inputStrides.empty()
- ? genConstantIndex(loc, idxTy, rewriter, 1)
+ ? fir::genConstantIndex(loc, idxTy, rewriter, 1)
: inputStrides[0];
for (unsigned i = 0; i < rebox.getShape().size(); ++i) {
mlir::Value rawExtent = operands[rebox.getShapeOperandIndex() + i];
@@ -2585,9 +2533,9 @@ struct XArrayCoorOpConversion
unsigned shiftOffset = coor.getShiftOperandIndex();
unsigned sliceOffset = coor.getSliceOperandIndex();
auto sliceOps = coor.getSlice().begin();
- mlir::Value one = genConstantIndex(loc, idxTy, rewriter, 1);
+ mlir::Value one = fir::genConstantIndex(loc, idxTy, rewriter, 1);
mlir::Value prevExt = one;
- mlir::Value offset = genConstantIndex(loc, idxTy, rewriter, 0);
+ mlir::Value offset = fir::genConstantIndex(loc, idxTy, rewriter, 0);
const bool isShifted = !coor.getShift().empty();
const bool isSliced = !coor.getSlice().empty();
const bool baseIsBoxed =
@@ -2918,7 +2866,7 @@ struct CoordinateOpConversion
// of lower bound aspects. This both accounts for dynamically sized
// types and non contiguous arrays.
auto idxTy = lowerTy().indexType();
- mlir::Value off = genConstantIndex(loc, idxTy, rewriter, 0);
+ mlir::Value off = fir::genConstantIndex(loc, idxTy, rewriter, 0);
unsigned arrayDim = arrTy.getDimension();
for (unsigned dim = 0; dim < arrayDim && it != end; ++dim, ++it) {
mlir::Value stride =
@@ -3846,7 +3794,7 @@ struct IsPresentOpConversion : public fir::FIROpConversion<fir::IsPresentOp> {
ptr = mlir::LLVM::ExtractValueOp::create(rewriter, loc, ptr, 0);
}
mlir::LLVM::ConstantOp c0 =
- genConstantIndex(isPresent.getLoc(), idxTy, rewriter, 0);
+ fir::genConstantIndex(isPresent.getLoc(), idxTy, rewriter, 0);
auto addr = mlir::LLVM::PtrToIntOp::create(rewriter, loc, idxTy, ptr);
rewriter.replaceOpWithNewOp<mlir::LLVM::ICmpOp>(
isPresent, mlir::LLVM::ICmpPredicate::ne, addr, c0);
diff --git a/flang/lib/Optimizer/CodeGen/CodeGenOpenMP.cpp b/flang/lib/Optimizer/CodeGen/CodeGenOpenMP.cpp
index 37f1c9f97e1ce..97912bda79b08 100644
--- a/flang/lib/Optimizer/CodeGen/CodeGenOpenMP.cpp
+++ b/flang/lib/Optimizer/CodeGen/CodeGenOpenMP.cpp
@@ -21,6 +21,7 @@
#include "flang/Optimizer/Dialect/Support/FIRContext.h"
#include "flang/Optimizer/Support/FatalError.h"
#include "flang/Optimizer/Support/InternalNames.h"
+#include "flang/Optimizer/Support/Utils.h"
#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"
#include "mlir/Conversion/LLVMCommon/Pattern.h"
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
@@ -125,10 +126,58 @@ struct PrivateClauseOpConversion
return mlir::success();
}
};
+
+// Convert FIR type to LLVM without turning fir.box<T> into memory
+// reference.
+static mlir::Type convertObjectType(const fir::LLVMTypeConverter &converter,
+ mlir::Type firType) {
+ if (auto boxTy = mlir::dyn_cast<fir::BaseBoxType>(firType))
+ return converter.convertBoxTypeAsStruct(boxTy);
+ return converter.convertType(firType);
+}
+
+// FIR Op specific conversion for TargetAllocMemOp
+struct TargetAllocMemOpConversion
+ : public OpenMPFIROpConversion<mlir::omp::TargetAllocMemOp> {
+ using OpenMPFIROpConversion::OpenMPFIROpConversion;
+
+ llvm::LogicalResult
+ matchAndRewrite(mlir::omp::TargetAllocMemOp allocmemOp, OpAdaptor adaptor,
+ mlir::ConversionPatternRewriter &rewriter) const override {
+ mlir::Type heapTy = allocmemOp.getAllocatedType();
+ mlir::Location loc = allocmemOp.getLoc();
+ auto ity = lowerTy().indexType();
+ mlir::Type dataTy = fir::unwrapRefType(heapTy);
+ mlir::Type llvmObjectTy = convertObjectType(lowerTy(), dataTy);
+ if (fir::isRecordWithTypeParameters(fir::unwrapSequenceType(dataTy)))
+ TODO(loc, "omp.target_allocmem codegen of derived type with length "
+ "parameters");
+ mlir::Value size = fir::computeElementDistance(
+ loc, llvmObjectTy, ity, rewriter, lowerTy().getDataLayout());
+ if (auto scaleSize = fir::genAllocationScaleSize(
+ loc, allocmemOp.getInType(), ity, rewriter))
+ size = rewriter.create<mlir::LLVM::MulOp>(loc, ity, size, scaleSize);
+ for (mlir::Value opnd : adaptor.getOperands().drop_front())
+ size = rewriter.create<mlir::LLVM::MulOp>(
+ loc, ity, size, integerCast(lowerTy(), loc, rewriter, ity, opnd));
+ auto mallocTyWidth = lowerTy().getIndexTypeBitwidth();
+ auto mallocTy =
+ mlir::IntegerType::get(rewriter.getContext(), mallocTyWidth);
+ if (mallocTyWidth != ity.getIntOrFloatBitWidth())
+ size = integerCast(lowerTy(), loc, rewriter, mallocTy, size);
+ rewriter.modifyOpInPlace(allocmemOp, [&]() {
+ allocmemOp.setInType(rewriter.getI8Type());
+ allocmemOp.getTypeparamsMutable().clear();
+ allocmemOp.getTypeparamsMutable().append(size);
+ });
+ return mlir::success();
+ }
+};
} // namespace
void fir::populateOpenMPFIRToLLVMConversionPatterns(
const LLVMTypeConverter &converter, mlir::RewritePatternSet &patterns) {
patterns.add<MapInfoOpConversion>(converter);
patterns.add<PrivateClauseOpConversion>(converter);
+ patterns.add<TargetAllocMemOpConversion>(converter);
}
diff --git a/flang/lib/Optimizer/Dialect/FIROps.cpp b/flang/lib/Optimizer/Dialect/FIROps.cpp
index 01975f357a8da..87f9899aa7879 100644
--- a/flang/lib/Optimizer/Dialect/FIROps.cpp
+++ b/flang/lib/Optimizer/Dialect/FIROps.cpp
@@ -107,7 +107,6 @@ static bool verifyTypeParamCount(mlir::Type inType, unsigned numParams) {
}
/// Parser shared by Alloca and Allocmem
-///
/// operation ::= %res = (`fir.alloca` | `fir.allocmem`) $in_type
/// ( `(` $typeparams `)` )? ( `,` $shape )?
/// attr-dict-without-keyword
diff --git a/flang/lib/Optimizer/Support/Utils.cpp b/flang/lib/Optimizer/Support/Utils.cpp
index 5d663e28336c0..c71642ce4e806 100644
--- a/flang/lib/Optimizer/Support/Utils.cpp
+++ b/flang/lib/Optimizer/Support/Utils.cpp
@@ -50,3 +50,74 @@ std::optional<llvm::ArrayRef<int64_t>> fir::getComponentLowerBoundsIfNonDefault(
return componentInfo.getLowerBounds();
return std::nullopt;
}
+
+mlir::LLVM::ConstantOp
+fir::genConstantIndex(mlir::Location loc, mlir::Type ity,
+ mlir::ConversionPatternRewriter &rewriter,
+ std::int64_t offset) {
+ auto cattr = rewriter.getI64IntegerAttr(offset);
+ return rewriter.create<mlir::LLVM::ConstantOp>(loc, ity, cattr);
+}
+
+mlir::Value
+fir::computeElementDistance(mlir::Location loc, mlir::Type llvmObjectType,
+ mlir::Type idxTy,
+ mlir::ConversionPatternRewriter &rewriter,
+ const mlir::DataLayout &dataLayout) {
+ llvm::TypeSize size = dataLayout.getTypeSize(llvmObjectType);
+ unsigned short alignment = dataLayout.getTypeABIAlignment(llvmObjectType);
+ std::int64_t distance = llvm::alignTo(size, alignment);
+ return fir::genConstantIndex(loc, idxTy, rewriter, distance);
+}
+
+mlir::Value
+fir::genAllocationScaleSize(mlir::Location loc, mlir::Type dataTy,
+ mlir::Type ity,
+ mlir::ConversionPatternRewriter &rewriter) {
+ auto seqTy = mlir::dyn_cast<fir::SequenceType>(dataTy);
+ fir::SequenceType::Extent constSize = 1;
+ if (seqTy) {
+ int constRows = seqTy.getConstantRows();
+ const fir::SequenceType::ShapeRef &shape = seqTy.getShape();
+ if (constRows != static_cast<int>(shape.size())) {
+ for (auto extent : shape) {
+ if (constRows-- > 0)
+ continue;
+ if (extent != fir::SequenceType::getUnknownExtent())
+ constSize *= extent;
+ }
+ }
+ }
+
+ if (constSize != 1) {
+ mlir::Value constVal{
+ fir::genConstantIndex(loc, ity, rewriter, constSize).getResult()};
+ return constVal;
+ }
+ return nullptr;
+}
+
+mlir::Value fir::integerCast(const fir::LLVMTypeConverter &converter,
+ mlir::Location loc,
+ mlir::ConversionPatternRewriter &rewriter,
+ mlir::Type ty, mlir::Value val, bool fold) {
+ auto valTy = val.getType();
+ // If the value was not yet lowered, lower its type so that it can
+ // be used in getPrimitiveTypeSizeInBits.
+ if (!mlir::isa<mlir::IntegerType>(valTy))
+ valTy = converter.convertType(valTy);
+ auto toSize = mlir::LLVM::getPrimitiveTypeSizeInBits(ty);
+ auto fromSize = mlir::LLVM::getPrimitiveTypeSizeInBits(valTy);
+ if (fold) {
+ if (toSize < fromSize)
+ return rewriter.createOrFold<mlir::LLVM::TruncOp>(loc, ty, val);
+ if (toSize > fromSize)
+ return rewriter.createOrFold<mlir::LLVM::SExtOp>(loc, ty, val);
+ } else {
+ if (toSize < fromSize)
+ return rewriter.create<mlir::LLVM::TruncOp>(loc, ty, val);
+ if (toSize > fromSize)
+ return rewriter.create<mlir::LLVM::SExtOp>(loc, ty, val);
+ }
+ return val;
+}
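Note on the size computation: the constant factor that `fir::genAllocationScaleSize` folds into the allocation can be modelled outside MLIR. The sketch below is a standalone approximation and not the patch's code. `kUnknownExtent` and `constantScale` are hypothetical names, and `constRows` is passed in by the caller instead of being queried from a `fir::SequenceType`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sentinel standing in for fir::SequenceType::getUnknownExtent().
constexpr std::int64_t kUnknownExtent = -1;

// Multiplies the constant extents that follow the leading `constRows`
// dimensions and skips unknown extents. Returns 1 when every dimension is a
// leading constant row; the patch returns a null Value whenever the product
// is 1, so callers there skip the multiplication entirely.
std::int64_t constantScale(const std::vector<std::int64_t> &shape,
                           int constRows) {
  std::int64_t scale = 1;
  if (constRows == static_cast<int>(shape.size()))
    return scale;
  for (std::int64_t extent : shape) {
    if (constRows-- > 0)
      continue; // assumed to be covered by the converted element type
    if (extent != kUnknownExtent)
      scale *= extent;
  }
  return scale;
}
```

For a shape of `3x?x4` with zero constant rows this gives 12, which is consistent with the constant 24 checked for `!fir.array<3x?x4x!fir.char<2,?>>` in the new test file if each character of kind 2 is assumed to occupy 2 bytes.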
diff --git a/flang/test/Fir/omp_target_allocmem_freemem.fir b/flang/test/Fir/omp_target_allocmem_freemem.fir
new file mode 100644
index 0000000000000..03eb94acb1ac7
--- /dev/null
+++ b/flang/test/Fir/omp_target_allocmem_freemem.fir
@@ -0,0 +1,294 @@
+// RUN: %flang_fc1 -emit-llvm %s -o - | FileCheck %s
+
+// UNSUPPORTED: system-windows
+// Disabled on 32-bit targets due to the additional `trunc` opcodes required
+// UNSUPPORTED: target-x86
+// UNSUPPORTED: target=sparc-{{.*}}
+// UNSUPPORTED: target=sparcel-{{.*}}
+
+// CHECK-LABEL: define void @omp_target_allocmem_scalar_nonchar() {
+// CHECK-NEXT: [[TMP1:%.*]] = call ptr @omp_target_alloc(i64 4, i32 0)
+// CHECK-NEXT: [[TMP2:%.*]] = ptrtoint ptr [[TMP1]] to i64
+// CHECK-NEXT: [[TMP3:%.*]] = inttoptr i64 [[TMP2]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP3]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_scalar_nonchar() -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, i32
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_scalars_nonchar() {
+// CHECK-NEXT: [[TMP1:%.*]] = call ptr @omp_target_alloc(i64 400, i32 0)
+// CHECK-NEXT: [[TMP2:%.*]] = ptrtoint ptr [[TMP1]] to i64
+// CHECK-NEXT: [[TMP3:%.*]] = inttoptr i64 [[TMP2]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP3]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_scalars_nonchar() -> () {
+ %device = arith.constant 0 : i32
+ %0 = arith.constant 100 : index
+ %1 = omp.target_allocmem %device : i32, i32, %0
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_scalar_char() {
+// CHECK-NEXT: [[TMP1:%.*]] = call ptr @omp_target_alloc(i64 10, i32 0)
+// CHECK-NEXT: [[TMP2:%.*]] = ptrtoint ptr [[TMP1]] to i64
+// CHECK-NEXT: [[TMP3:%.*]] = inttoptr i64 [[TMP2]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP3]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_scalar_char() -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.char<1,10>
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_scalar_char_kind() {
+// CHECK-NEXT: [[TMP1:%.*]] = call ptr @omp_target_alloc(i64 20, i32 0)
+// CHECK-NEXT: [[TMP2:%.*]] = ptrtoint ptr [[TMP1]] to i64
+// CHECK-NEXT: [[TMP3:%.*]] = inttoptr i64 [[TMP2]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP3]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_scalar_char_kind() -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.char<2,10>
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_scalar_dynchar(
+// CHECK-SAME: i32 [[TMP0:%.*]]) {
+// CHECK-NEXT: [[TMP2:%.*]] = sext i32 [[TMP0]] to i64
+// CHECK-NEXT: [[TMP3:%.*]] = mul i64 1, [[TMP2]]
+// CHECK-NEXT: [[TMP4:%.*]] = mul i64 1, [[TMP3]]
+// CHECK-NEXT: [[TMP5:%.*]] = call ptr @omp_target_alloc(i64 [[TMP4]], i32 0)
+// CHECK-NEXT: [[TMP6:%.*]] = ptrtoint ptr [[TMP5]] to i64
+// CHECK-NEXT: [[TMP7:%.*]] = inttoptr i64 [[TMP6]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP7]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_scalar_dynchar(%l : i32) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.char<1,?>(%l : i32)
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+
+// CHECK-LABEL: define void @omp_target_allocmem_scalar_dynchar_kind(
+// CHECK-SAME: i32 [[TMP0:%.*]]) {
+// CHECK-NEXT: [[TMP2:%.*]] = sext i32 [[TMP0]] to i64
+// CHECK-NEXT: [[TMP3:%.*]] = mul i64 2, [[TMP2]]
+// CHECK-NEXT: [[TMP4:%.*]] = mul i64 1, [[TMP3]]
+// CHECK-NEXT: [[TMP5:%.*]] = call ptr @omp_target_alloc(i64 [[TMP4]], i32 0)
+// CHECK-NEXT: [[TMP6:%.*]] = ptrtoint ptr [[TMP5]] to i64
+// CHECK-NEXT: [[TMP7:%.*]] = inttoptr i64 [[TMP6]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP7]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_scalar_dynchar_kind(%l : i32) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.char<2,?>(%l : i32)
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+
+// CHECK-LABEL: define void @omp_target_allocmem_array_of_nonchar() {
+// CHECK-NEXT: [[TMP1:%.*]] = call ptr @omp_target_alloc(i64 36, i32 0)
+// CHECK-NEXT: [[TMP2:%.*]] = ptrtoint ptr [[TMP1]] to i64
+// CHECK-NEXT: [[TMP3:%.*]] = inttoptr i64 [[TMP2]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP3]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_array_of_nonchar() -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<3x3xi32>
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_array_of_char() {
+// CHECK-NEXT: [[TMP1:%.*]] = call ptr @omp_target_alloc(i64 90, i32 0)
+// CHECK-NEXT: [[TMP2:%.*]] = ptrtoint ptr [[TMP1]] to i64
+// CHECK-NEXT: [[TMP3:%.*]] = inttoptr i64 [[TMP2]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP3]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_array_of_char() -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<3x3x!fir.char<1,10>>
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_array_of_dynchar(
+// CHECK-SAME: i32 [[TMP0:%.*]]) {
+// CHECK-NEXT: [[TMP2:%.*]] = sext i32 [[TMP0]] to i64
+// CHECK-NEXT: [[TMP3:%.*]] = mul i64 9, [[TMP2]]
+// CHECK-NEXT: [[TMP4:%.*]] = mul i64 1, [[TMP3]]
+// CHECK-NEXT: [[TMP5:%.*]] = call ptr @omp_target_alloc(i64 [[TMP4]], i32 0)
+// CHECK-NEXT: [[TMP6:%.*]] = ptrtoint ptr [[TMP5]] to i64
+// CHECK-NEXT: [[TMP7:%.*]] = inttoptr i64 [[TMP6]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP7]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_array_of_dynchar(%l: i32) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<3x3x!fir.char<1,?>>(%l : i32)
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+
+// CHECK-LABEL: define void @omp_target_allocmem_dynarray_of_nonchar(
+// CHECK-SAME: i64 [[TMP0:%.*]]) {
+// CHECK-NEXT: [[TMP2:%.*]] = mul i64 12, [[TMP0]]
+// CHECK-NEXT: [[TMP3:%.*]] = mul i64 1, [[TMP2]]
+// CHECK-NEXT: [[TMP4:%.*]] = call ptr @omp_target_alloc(i64 [[TMP3]], i32 0)
+// CHECK-NEXT: [[TMP5:%.*]] = ptrtoint ptr [[TMP4]] to i64
+// CHECK-NEXT: [[TMP6:%.*]] = inttoptr i64 [[TMP5]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP6]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_dynarray_of_nonchar(%e: index) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<3x?xi32>, %e
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_dynarray_of_nonchar2(
+// CHECK-SAME: i64 [[TMP0:%.*]]) {
+// CHECK-NEXT: [[TMP2:%.*]] = mul i64 4, [[TMP0]]
+// CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], [[TMP0]]
+// CHECK-NEXT: [[TMP4:%.*]] = mul i64 1, [[TMP3]]
+// CHECK-NEXT: [[TMP5:%.*]] = call ptr @omp_target_alloc(i64 [[TMP4]], i32 0)
+// CHECK-NEXT: [[TMP6:%.*]] = ptrtoint ptr [[TMP5]] to i64
+// CHECK-NEXT: [[TMP7:%.*]] = inttoptr i64 [[TMP6]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP7]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_dynarray_of_nonchar2(%e: index) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<?x?xi32>, %e, %e
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_dynarray_of_char(
+// CHECK-SAME: i64 [[TMP0:%.*]]) {
+// CHECK-NEXT: [[TMP2:%.*]] = mul i64 60, [[TMP0]]
+// CHECK-NEXT: [[TMP3:%.*]] = mul i64 1, [[TMP2]]
+// CHECK-NEXT: [[TMP4:%.*]] = call ptr @omp_target_alloc(i64 [[TMP3]], i32 0)
+// CHECK-NEXT: [[TMP5:%.*]] = ptrtoint ptr [[TMP4]] to i64
+// CHECK-NEXT: [[TMP6:%.*]] = inttoptr i64 [[TMP5]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP6]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_dynarray_of_char(%e : index) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<3x?x!fir.char<2,10>>, %e
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+
+// CHECK-LABEL: define void @omp_target_allocmem_dynarray_of_char2(
+// CHECK-SAME: i64 [[TMP0:%.*]]) {
+// CHECK-NEXT: [[TMP2:%.*]] = mul i64 20, [[TMP0]]
+// CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], [[TMP0]]
+// CHECK-NEXT: [[TMP4:%.*]] = mul i64 1, [[TMP3]]
+// CHECK-NEXT: [[TMP5:%.*]] = call ptr @omp_target_alloc(i64 [[TMP4]], i32 0)
+// CHECK-NEXT: [[TMP6:%.*]] = ptrtoint ptr [[TMP5]] to i64
+// CHECK-NEXT: [[TMP7:%.*]] = inttoptr i64 [[TMP6]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP7]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_dynarray_of_char2(%e : index) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<?x?x!fir.char<2,10>>, %e, %e
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_dynarray_of_dynchar(
+// CHECK-SAME: i32 [[TMP0:%.*]], i64 [[TMP1:%.*]]) {
+// CHECK-NEXT: [[TMP3:%.*]] = sext i32 [[TMP0]] to i64
+// CHECK-NEXT: [[TMP4:%.*]] = mul i64 6, [[TMP3]]
+// CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], [[TMP1]]
+// CHECK-NEXT: [[TMP6:%.*]] = mul i64 1, [[TMP5]]
+// CHECK-NEXT: [[TMP7:%.*]] = call ptr @omp_target_alloc(i64 [[TMP6]], i32 0)
+// CHECK-NEXT: [[TMP8:%.*]] = ptrtoint ptr [[TMP7]] to i64
+// CHECK-NEXT: [[TMP9:%.*]] = inttoptr i64 [[TMP8]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP9]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_dynarray_of_dynchar(%l: i32, %e : index) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<3x?x!fir.char<2,?>>(%l : i32), %e
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_dynarray_of_dynchar2(
+// CHECK-SAME: i32 [[TMP0:%.*]], i64 [[TMP1:%.*]]) {
+// CHECK-NEXT: [[TMP3:%.*]] = sext i32 [[TMP0]] to i64
+// CHECK-NEXT: [[TMP4:%.*]] = mul i64 2, [[TMP3]]
+// CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], [[TMP1]]
+// CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], [[TMP1]]
+// CHECK-NEXT: [[TMP7:%.*]] = mul i64 1, [[TMP6]]
+// CHECK-NEXT: [[TMP8:%.*]] = call ptr @omp_target_alloc(i64 [[TMP7]], i32 0)
+// CHECK-NEXT: [[TMP9:%.*]] = ptrtoint ptr [[TMP8]] to i64
+// CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP10]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_dynarray_of_dynchar2(%l: i32, %e : index) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<?x?x!fir.char<2,?>>(%l : i32), %e, %e
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_array_with_holes_nonchar(
+// CHECK-SAME: i64 [[TMP0:%.*]], i64 [[TMP1:%.*]]) {
+// CHECK-NEXT: [[TMP3:%.*]] = mul i64 240, [[TMP0]]
+// CHECK-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], [[TMP1]]
+// CHECK-NEXT: [[TMP5:%.*]] = mul i64 1, [[TMP4]]
+// CHECK-NEXT: [[TMP6:%.*]] = call ptr @omp_target_alloc(i64 [[TMP5]], i32 0)
+// CHECK-NEXT: [[TMP7:%.*]] = ptrtoint ptr [[TMP6]] to i64
+// CHECK-NEXT: [[TMP8:%.*]] = inttoptr i64 [[TMP7]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP8]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_array_with_holes_nonchar(%0 : index, %1 : index) -> () {
+ %device = arith.constant 0 : i32
+ %2 = omp.target_allocmem %device : i32, !fir.array<4x?x3x?x5xi32>, %0, %1
+ omp.target_freemem %device, %2 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_array_with_holes_char(
+// CHECK-SAME: i64 [[TMP0:%.*]]) {
+// CHECK-NEXT: [[TMP2:%.*]] = mul i64 240, [[TMP0]]
+// CHECK-NEXT: [[TMP3:%.*]] = mul i64 1, [[TMP2]]
+// CHECK-NEXT: [[TMP4:%.*]] = call ptr @omp_target_alloc(i64 [[TMP3]], i32 0)
+// CHECK-NEXT: [[TMP5:%.*]] = ptrtoint ptr [[TMP4]] to i64
+// CHECK-NEXT: [[TMP6:%.*]] = inttoptr i64 [[TMP5]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP6]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_array_with_holes_char(%e: index) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<3x?x4x!fir.char<2,10>>, %e
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
+
+// CHECK-LABEL: define void @omp_target_allocmem_array_with_holes_dynchar(
+// CHECK-SAME: i64 [[TMP0:%.*]], i64 [[TMP1:%.*]]) {
+// CHECK-NEXT: [[TMP3:%.*]] = mul i64 24, [[TMP0]]
+// CHECK-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], [[TMP1]]
+// CHECK-NEXT: [[TMP5:%.*]] = mul i64 1, [[TMP4]]
+// CHECK-NEXT: [[TMP6:%.*]] = call ptr @omp_target_alloc(i64 [[TMP5]], i32 0)
+// CHECK-NEXT: [[TMP7:%.*]] = ptrtoint ptr [[TMP6]] to i64
+// CHECK-NEXT: [[TMP8:%.*]] = inttoptr i64 [[TMP7]] to ptr
+// CHECK-NEXT: call void @omp_target_free(ptr [[TMP8]], i32 0)
+// CHECK-NEXT: ret void
+func.func @omp_target_allocmem_array_with_holes_dynchar(%arg0: index, %arg1: index) -> () {
+ %device = arith.constant 0 : i32
+ %1 = omp.target_allocmem %device : i32, !fir.array<3x?x4x!fir.char<2,?>>(%arg0 : index), %arg1
+ omp.target_freemem %device, %1 : i32, i64
+ return
+}
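Note on reading the CHECK lines above: each test expects a chain of `mul i64` instructions that starts from a constant byte count and multiplies in the LEN parameters and the runtime extents. The following standalone sketch mirrors that arithmetic under simplifying assumptions: `allocationBytes` is a hypothetical helper, overflow is ignored, and the trailing `mul i64 1, ...` is left out because it does not change the value.

```cpp
#include <cstdint>
#include <vector>

// Models the multiplication chain in the CHECK lines: start from the constant
// factor, multiply by each 32-bit LEN parameter after widening it to 64 bits
// (the counterpart of the `sext i32 ... to i64` in the tests), then multiply
// by each runtime extent.
std::int64_t allocationBytes(std::int64_t constantBytes,
                             const std::vector<std::int32_t> &lenParams,
                             const std::vector<std::int64_t> &extents) {
  std::int64_t size = constantBytes;
  for (std::int32_t len : lenParams)
    size *= static_cast<std::int64_t>(len); // signed widening, like sext
  for (std::int64_t extent : extents)
    size *= extent;
  return size;
}
```

For example, the `dynarray_of_dynchar` case starts from the constant 6, so a LEN of 10 and an extent of 7 would request 420 bytes from `omp_target_alloc`.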
diff --git a/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td b/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
index be114ea4fb631..c956d69781b3d 100644
--- a/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
+++ b/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
@@ -2115,4 +2115,98 @@ def AllocateDirOp : OpenMP_Op<"allocate_dir", clauses = [
let hasVerifier = 1;
}
+//===----------------------------------------------------------------------===//
+// TargetAllocMemOp
+//===----------------------------------------------------------------------===//
+
+def TargetAllocMemOp : OpenMP_Op<"target_allocmem",
+ [MemoryEffects<[MemAlloc<DefaultResource>]>, AttrSizedOperandSegments]> {
+ let summary = "allocate storage on an openmp device for an object of a given type";
+
+ let description = [{
+ Allocates memory on the specified OpenMP device for an object of the given type.
+ Returns an integer value representing the device pointer to the allocated memory.
+ The memory is uninitialized after allocation. Each allocation must be paired
+ with a matching `omp.target_freemem` to avoid memory leaks.
+
+ * `$device`: The integer ID of the OpenMP device where the memory will be allocated.
+ * `$in_type`: The type of the object for which memory is being allocated.
+ For arrays, this can be a static or dynamic array type.
+ * `$uniq_name`: An optional unique name for the allocated memory.
+ * `$bindc_name`: An optional name used for C interoperability.
+ * `$typeparams`: Runtime type parameters for polymorphic or parameterized types.
+ These are typically integer values that define aspects of a type not fixed at compile time.
+ * `$shape`: Runtime shape operands for dynamic arrays.
+ Each operand is an integer value representing the extent of a specific dimension.
+
+ ```mlir
+ // Allocate a static 3x3 integer vector on device 0
+ %device_0 = arith.constant 0 : i32
+ %ptr_static = omp.target_allocmem %device_0 : i32, vector<3x3xi32>
+ // ... use %ptr_static ...
+ omp.target_freemem %device_0, %ptr_static : i32, i64
+
+ // Allocate a dynamic 2D Fortran array (fir.array) on device 1
+ %device_1 = arith.constant 1 : i32
+ %rows = arith.constant 10 : index
+ %cols = arith.constant 20 : index
+ %ptr_dynamic = omp.target_allocmem %device_1 : i32, !fir.array<?x?xf32>, %rows, %cols
+ // ... use %ptr_dynamic ...
+ omp.target_freemem %device_1, %ptr_dynamic : i32, i64
+ ```
+ }];
+
+ let arguments = (ins
+ Arg<AnyInteger>:$device,
+ TypeAttr:$in_type,
+ OptionalAttr<StrAttr>:$uniq_name,
+ OptionalAttr<StrAttr>:$bindc_name,
+ Variadic<IntLikeType>:$typeparams,
+ Variadic<IntLikeType>:$shape
+ );
+ let results = (outs I64);
+
+ let hasCustomAssemblyFormat = 1;
+ let hasVerifier = 1;
+
+ let extraClassDeclaration = [{
+ mlir::Type getAllocatedType();
+ }];
+}
+
+//===----------------------------------------------------------------------===//
+// TargetFreeMemOp
+//===----------------------------------------------------------------------===//
+
+def TargetFreeMemOp : OpenMP_Op<"target_freemem",
+ [MemoryEffects<[MemFree]>]> {
+ let summary = "free memory on an openmp device";
+
+ let description = [{
+ Deallocates memory on the specified OpenMP device that was previously
+ allocated by an `omp.target_allocmem` operation. After this operation, the
+ deallocated memory is in an undefined state and should not be accessed.
+ It is crucial to ensure that all accesses to the memory region are completed
+ before `omp.target_freemem` is called to avoid undefined behavior.
+
+ * `$device`: The integer ID of the OpenMP device from which the memory will be freed.
+ * `$heapref`: The integer value representing the device pointer to the memory
+ to be deallocated, which was previously returned by `omp.target_allocmem`.
+
+ ```mlir
+ // Example of allocating and freeing memory on an OpenMP device
+ %device_id = arith.constant 0 : i32
+ %allocated_ptr = omp.target_allocmem %device_id : i32, vector<3x3xi32>
+ // ... operations using %allocated_ptr on the device ...
+ omp.target_freemem %device_id, %allocated_ptr : i32, i64
+ ```
+ }];
+
+ let arguments = (ins
+ Arg<AnyInteger, "", [MemFree]>:$device,
+ Arg<I64, "", [MemFree]>:$heapref
+ );
+ let assemblyFormat = "$device `,` $heapref attr-dict `:` type($device) `,` qualified(type($heapref))";
+}
+
#endif // OPENMP_OPS
diff --git a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
index c1c1767ef90b0..fa94219016c1e 100644
--- a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
+++ b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
@@ -3874,6 +3874,107 @@ LogicalResult AllocateDirOp::verify() {
return success();
}
+//===----------------------------------------------------------------------===//
+// TargetAllocMemOp
+//===----------------------------------------------------------------------===//
+
+mlir::Type omp::TargetAllocMemOp::getAllocatedType() {
+ return getInTypeAttr().getValue();
+}
+
+/// operation ::= %res = (`omp.target_allocmem`) $device : devicetype,
+/// $in_type ( `(` $typeparams `)` )? ( `,` $shape )?
+/// attr-dict-without-keyword
+static mlir::ParseResult parseTargetAllocMemOp(mlir::OpAsmParser &parser,
+ mlir::OperationState &result) {
+ auto &builder = parser.getBuilder();
+ bool hasOperands = false;
+ std::int32_t typeparamsSize = 0;
+
+ // Parse device number as a new operand
+ mlir::OpAsmParser::UnresolvedOperand deviceOperand;
+ mlir::Type deviceType;
+ if (parser.parseOperand(deviceOperand) || parser.parseColonType(deviceType))
+ return mlir::failure();
+ if (parser.resolveOperand(deviceOperand, deviceType, result.operands))
+ return mlir::failure();
+ if (parser.parseComma())
+ return mlir::failure();
+
+ mlir::Type intype;
+ if (parser.parseType(intype))
+ return mlir::failure();
+ result.addAttribute("in_type", mlir::TypeAttr::get(intype));
+ llvm::SmallVector<mlir::OpAsmParser::UnresolvedOperand> operands;
+ llvm::SmallVector<mlir::Type> typeVec;
+ if (!parser.parseOptionalLParen()) {
+ // parse the LEN params of the derived type. (<params> : <types>)
+ if (parser.parseOperandList(operands, mlir::OpAsmParser::Delimiter::None) ||
+ parser.parseColonTypeList(typeVec) || parser.parseRParen())
+ return mlir::failure();
+ typeparamsSize = operands.size();
+ hasOperands = true;
+ }
+ std::int32_t shapeSize = 0;
+ if (!parser.parseOptionalComma()) {
+ // parse size to scale by, vector of n dimensions of type index
+ if (parser.parseOperandList(operands, mlir::OpAsmParser::Delimiter::None))
+ return mlir::failure();
+ shapeSize = operands.size() - typeparamsSize;
+ auto idxTy = builder.getIndexType();
+ for (std::int32_t i = typeparamsSize, end = operands.size(); i != end; ++i)
+ typeVec.push_back(idxTy);
+ hasOperands = true;
+ }
+ if (hasOperands &&
+ parser.resolveOperands(operands, typeVec, parser.getNameLoc(),
+ result.operands))
+ return mlir::failure();
+
+ mlir::Type restype = builder.getIntegerType(64);
+ if (!restype) {
+ parser.emitError(parser.getNameLoc(), "invalid allocate type: ") << intype;
+ return mlir::failure();
+ }
+ llvm::SmallVector<std::int32_t> segmentSizes{1, typeparamsSize, shapeSize};
+ result.addAttribute("operandSegmentSizes",
+ builder.getDenseI32ArrayAttr(segmentSizes));
+ if (parser.parseOptionalAttrDict(result.attributes) ||
+ parser.addTypeToList(restype, result.types))
+ return mlir::failure();
+ return mlir::success();
+}
+
+mlir::ParseResult omp::TargetAllocMemOp::parse(mlir::OpAsmParser &parser,
+ mlir::OperationState &result) {
+ return parseTargetAllocMemOp(parser, result);
+}
+
+void omp::TargetAllocMemOp::print(mlir::OpAsmPrinter &p) {
+ p << " ";
+ p.printOperand(getDevice());
+ p << " : ";
+ p << getDevice().getType();
+ p << ", ";
+ p << getInType();
+ if (!getTypeparams().empty()) {
+ p << '(' << getTypeparams() << " : " << getTypeparams().getTypes() << ')';
+ }
+ for (auto sh : getShape()) {
+ p << ", ";
+ p.printOperand(sh);
+ }
+ p.printOptionalAttrDict((*this)->getAttrs(),
+ {"in_type", "operandSegmentSizes"});
+}
+
+llvm::LogicalResult omp::TargetAllocMemOp::verify() {
+ mlir::Type outType = getType();
+ if (!mlir::dyn_cast<IntegerType>(outType))
+ return emitOpError("must be an integer type");
+ return mlir::success();
+}
+
#define GET_ATTRDEF_CLASSES
#include "mlir/Dialect/OpenMP/OpenMPOpsAttributes.cpp.inc"
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index eb96cb211fdd5..6694de8383534 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -5867,6 +5867,10 @@ static bool isTargetDeviceOp(Operation *op) {
if (mlir::isa<omp::ThreadprivateOp>(op))
return true;
+ if (mlir::isa<omp::TargetAllocMemOp>(op) ||
+ mlir::isa<omp::TargetFreeMemOp>(op))
+ return true;
+
if (auto parentFn = op->getParentOfType<LLVM::LLVMFuncOp>())
if (auto declareTargetIface =
llvm::dyn_cast<mlir::omp::DeclareTargetInterface>(
@@ -5879,6 +5883,85 @@ static bool isTargetDeviceOp(Operation *op) {
return false;
}
+static llvm::Function *getOmpTargetAlloc(llvm::IRBuilderBase &builder,
+ llvm::Module *llvmModule) {
+ llvm::Type *i64Ty = builder.getInt64Ty();
+ llvm::Type *i32Ty = builder.getInt32Ty();
+ llvm::Type *returnType = builder.getPtrTy(0);
+ llvm::FunctionType *fnType =
+ llvm::FunctionType::get(returnType, {i64Ty, i32Ty}, false);
+ llvm::Function *func = cast<llvm::Function>(
+ llvmModule->getOrInsertFunction("omp_target_alloc", fnType).getCallee());
+ return func;
+}
+
+static LogicalResult
+convertTargetAllocMemOp(Operation &opInst, llvm::IRBuilderBase &builder,
+ LLVM::ModuleTranslation &moduleTranslation) {
+ auto allocMemOp = dyn_cast<omp::TargetAllocMemOp>(opInst);
+ if (!allocMemOp)
+ return failure();
+
+ // Get "omp_target_alloc" function
+ llvm::Module *llvmModule = moduleTranslation.getLLVMModule();
+ llvm::Function *ompTargetAllocFunc = getOmpTargetAlloc(builder, llvmModule);
+ // Get the corresponding device value in llvm
+ mlir::Value deviceNum = allocMemOp.getDevice();
+ llvm::Value *llvmDeviceNum = moduleTranslation.lookupValue(deviceNum);
+ // Get the allocation size.
+ llvm::DataLayout dataLayout = llvmModule->getDataLayout();
+ mlir::Type heapTy = allocMemOp.getAllocatedType();
+ llvm::Type *llvmHeapTy = moduleTranslation.convertType(heapTy);
+ llvm::TypeSize typeSize = dataLayout.getTypeStoreSize(llvmHeapTy);
+ llvm::Value *allocSize = builder.getInt64(typeSize.getFixedValue());
+ for (auto typeParam : allocMemOp.getTypeparams())
+ allocSize =
+ builder.CreateMul(allocSize, moduleTranslation.lookupValue(typeParam));
+ // Create call to "omp_target_alloc" with the args as translated llvm values.
+ llvm::CallInst *call =
+ builder.CreateCall(ompTargetAllocFunc, {allocSize, llvmDeviceNum});
+ llvm::Value *resultI64 = builder.CreatePtrToInt(call, builder.getInt64Ty());
+
+ // Map the result
+ moduleTranslation.mapValue(allocMemOp.getResult(), resultI64);
+ return success();
+}
+
+static llvm::Function *getOmpTargetFree(llvm::IRBuilderBase &builder,
+ llvm::Module *llvmModule) {
+ llvm::Type *ptrTy = builder.getPtrTy(0);
+ llvm::Type *i32Ty = builder.getInt32Ty();
+ llvm::Type *voidTy = builder.getVoidTy();
+ llvm::FunctionType *fnType =
+ llvm::FunctionType::get(voidTy, {ptrTy, i32Ty}, false);
+ llvm::Function *func = dyn_cast<llvm::Function>(
+ llvmModule->getOrInsertFunction("omp_target_free", fnType).getCallee());
+ return func;
+}
+
+static LogicalResult
+convertTargetFreeMemOp(Operation &opInst, llvm::IRBuilderBase &builder,
+ LLVM::ModuleTranslation &moduleTranslation) {
+ auto freeMemOp = dyn_cast<omp::TargetFreeMemOp>(opInst);
+ if (!freeMemOp)
+ return failure();
+
+ // Get "omp_target_free" function
+ llvm::Module *llvmModule = moduleTranslation.getLLVMModule();
+ llvm::Function *ompTargetFreeFunc = getOmpTargetFree(builder, llvmModule);
+ // Get the corresponding device value in llvm
+ mlir::Value deviceNum = freeMemOp.getDevice();
+ llvm::Value *llvmDeviceNum = moduleTranslation.lookupValue(deviceNum);
+ // Get the corresponding heapref value in llvm
+ mlir::Value heapref = freeMemOp.getHeapref();
+ llvm::Value *llvmHeapref = moduleTranslation.lookupValue(heapref);
+ // Convert heapref int to ptr and call "omp_target_free"
+ llvm::Value *intToPtr =
+ builder.CreateIntToPtr(llvmHeapref, builder.getPtrTy(0));
+ builder.CreateCall(ompTargetFreeFunc, {intToPtr, llvmDeviceNum});
+ return success();
+}
+
/// Given an OpenMP MLIR operation, create the corresponding LLVM IR (including
/// OpenMP runtime calls).
static LogicalResult
@@ -6053,6 +6136,12 @@ convertHostOrTargetOperation(Operation *op, llvm::IRBuilderBase &builder,
// the omp.canonical_loop.
return applyUnrollHeuristic(op, builder, moduleTranslation);
})
+ .Case([&](omp::TargetAllocMemOp) {
+ return convertTargetAllocMemOp(*op, builder, moduleTranslation);
+ })
+ .Case([&](omp::TargetFreeMemOp) {
+ return convertTargetFreeMemOp(*op, builder, moduleTranslation);
+ })
.Default([&](Operation *inst) {
return inst->emitError()
<< "not yet implemented: " << inst->getName();
diff --git a/mlir/test/Target/LLVMIR/ompenmp-target-allocmem-freemem.mlir b/mlir/test/Target/LLVMIR/ompenmp-target-allocmem-freemem.mlir
new file mode 100644
index 0000000000000..1bc97609ccff4
--- /dev/null
+++ b/mlir/test/Target/LLVMIR/ompenmp-target-allocmem-freemem.mlir
@@ -0,0 +1,42 @@
+// RUN: mlir-opt %s -convert-openmp-to-llvm | mlir-translate -mlir-to-llvmir | FileCheck %s
+
+// This file contains MLIR test cases for omp.target_allocmem and omp.target_freemem
+
+// CHECK-LABEL: test_alloc_free_i64
+// CHECK: %[[ALLOC:.*]] = call ptr @omp_target_alloc(i64 8, i32 0)
+// CHECK: %[[PTRTOINT:.*]] = ptrtoint ptr %[[ALLOC]] to i64
+// CHECK: %[[INTTOPTR:.*]] = inttoptr i64 %[[PTRTOINT]] to ptr
+// CHECK: call void @omp_target_free(ptr %[[INTTOPTR]], i32 0)
+// CHECK: ret void
+llvm.func @test_alloc_free_i64() -> () {
+ %device = llvm.mlir.constant(0 : i32) : i32
+ %1 = omp.target_allocmem %device : i32, i64
+ omp.target_freemem %device, %1 : i32, i64
+ llvm.return
+}
+
+// CHECK-LABEL: test_alloc_free_vector_1d_f32
+// CHECK: %[[ALLOC:.*]] = call ptr @omp_target_alloc(i64 64, i32 0)
+// CHECK: %[[PTRTOINT:.*]] = ptrtoint ptr %[[ALLOC]] to i64
+// CHECK: %[[INTTOPTR:.*]] = inttoptr i64 %[[PTRTOINT]] to ptr
+// CHECK: call void @omp_target_free(ptr %[[INTTOPTR]], i32 0)
+// CHECK: ret void
+llvm.func @test_alloc_free_vector_1d_f32() -> () {
+ %device = llvm.mlir.constant(0 : i32) : i32
+ %1 = omp.target_allocmem %device : i32, vector<16xf32>
+ omp.target_freemem %device, %1 : i32, i64
+ llvm.return
+}
+
+// CHECK-LABEL: test_alloc_free_vector_2d_f32
+// CHECK: %[[ALLOC:.*]] = call ptr @omp_target_alloc(i64 1024, i32 0)
+// CHECK: %[[PTRTOINT:.*]] = ptrtoint ptr %[[ALLOC]] to i64
+// CHECK: %[[INTTOPTR:.*]] = inttoptr i64 %[[PTRTOINT]] to ptr
+// CHECK: call void @omp_target_free(ptr %[[INTTOPTR]], i32 0)
+// CHECK: ret void
+llvm.func @test_alloc_free_vector_2d_f32() -> () {
+ %device = llvm.mlir.constant(0 : i32) : i32
+ %1 = omp.target_allocmem %device : i32, vector<16x16xf32>
+ omp.target_freemem %device, %1 : i32, i64
+ llvm.return
+}
From ba45ac61b6fe7a757a7ae27612261cd9ffdcb474 Mon Sep 17 00:00:00 2001
From: Nikita Popov <npopov at redhat.com>
Date: Mon, 18 Aug 2025 15:07:36 +0200
Subject: [PATCH 006/112] [CAS] Temporarily disable broken test
This test hangs forever if executed with fewer than three cores
available, see:
https://github.com/llvm/llvm-project/pull/114096#issuecomment-3196698403
---
llvm/unittests/CAS/ObjectStoreTest.cpp | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/llvm/unittests/CAS/ObjectStoreTest.cpp b/llvm/unittests/CAS/ObjectStoreTest.cpp
index b3c408758a007..e84e30374c9aa 100644
--- a/llvm/unittests/CAS/ObjectStoreTest.cpp
+++ b/llvm/unittests/CAS/ObjectStoreTest.cpp
@@ -269,7 +269,8 @@ TEST_P(CASTest, NodesBig) {
ASSERT_THAT_ERROR(CAS->validate(CAS->getID(ID)), Succeeded());
}
-#if LLVM_ENABLE_THREADS
+// FIXME: Re-enable the test.
+#if 0
/// Common test functionality for creating blobs in parallel. You can vary which
/// cas instances are the same or different, and the size of the created blobs.
static void testBlobsParallel(ObjectStore &Read1, ObjectStore &Read2,
>From f84aaa6eaa316bf0a1dc5f4c7524409a3c5bf800 Mon Sep 17 00:00:00 2001
From: Matthias Springer <me at m-sp.org>
Date: Mon, 18 Aug 2025 15:25:18 +0200
Subject: [PATCH 007/112] [mlir][Transforms] Dialect conversion: Add flag to
dump materialization kind (#119532)
Add a debugging flag to the dialect conversion to dump the
materialization kind. This flag is useful to find out whether a missing
materialization rule is for source or target materializations.
Also add missing test coverage for the `buildMaterializations` flag.
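As a rough illustration of the flag's effect, here is a minimal, self-contained
sketch. It is a toy model and does not use the MLIR API: `ToyConfig`,
`ToyCastOp` and `buildToyCast` are hypothetical names made up for this example.
Only the `__kind__` attribute name and the "source"/"target" strings are taken
from the patch.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-ins for illustration only; these are NOT MLIR classes.
enum class MaterializationKind { Source, Target };

struct ToyConfig {
  // Mirrors the new ConversionConfig field; off by default.
  bool attachDebugMaterializationKind = false;
};

struct ToyCastOp {
  // Attribute name -> string value, standing in for an op's attribute dictionary.
  std::map<std::string, std::string> attrs;
};

// Build a cast and, only if the debug flag is set, tag it with its kind.
ToyCastOp buildToyCast(const ToyConfig &config, MaterializationKind kind) {
  ToyCastOp op;
  if (config.attachDebugMaterializationKind)
    op.attrs["__kind__"] =
        kind == MaterializationKind::Source ? "source" : "target";
  return op;
}

// Small usage check: the attribute is attached only when the flag is on.
inline void toyUsage() {
  ToyConfig config;
  assert(buildToyCast(config, MaterializationKind::Target).attrs.empty());
  config.attachDebugMaterializationKind = true;
  assert(buildToyCast(config, MaterializationKind::Target).attrs.count("__kind__") == 1);
}
```

With the flag off, the cast carries no extra attribute, so existing output is
unchanged. With the flag on, a leftover cast shows which kind of
materialization rule is missing.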
---
.../mlir/Transforms/DialectConversion.h | 6 ++++
.../Transforms/Utils/DialectConversion.cpp | 5 +++
mlir/test/Transforms/test-legalizer.mlir | 13 ++++++--
mlir/test/lib/Dialect/Test/TestPatterns.cpp | 32 +++++++++++--------
4 files changed, 41 insertions(+), 15 deletions(-)
diff --git a/mlir/include/mlir/Transforms/DialectConversion.h b/mlir/include/mlir/Transforms/DialectConversion.h
index 220431e6ee2f1..536b23f5c33c1 100644
--- a/mlir/include/mlir/Transforms/DialectConversion.h
+++ b/mlir/include/mlir/Transforms/DialectConversion.h
@@ -1300,6 +1300,12 @@ struct ConversionConfig {
/// The folding mode to use during conversion.
DialectConversionFoldingMode foldingMode =
DialectConversionFoldingMode::BeforePatterns;
+
+ /// If set to "true", the materialization kind ("source" or "target") will be
+ /// attached to "builtin.unrealized_conversion_cast" ops. This flag is useful
+ /// for debugging, to find out what kind of materialization rule may be
+ /// missing.
+ bool attachDebugMaterializationKind = false;
};
//===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Transforms/Utils/DialectConversion.cpp b/mlir/lib/Transforms/Utils/DialectConversion.cpp
index ff34a58965763..e48cfca486808 100644
--- a/mlir/lib/Transforms/Utils/DialectConversion.cpp
+++ b/mlir/lib/Transforms/Utils/DialectConversion.cpp
@@ -1637,6 +1637,11 @@ ValueRange ConversionPatternRewriterImpl::buildUnresolvedMaterialization(
builder.setInsertionPoint(ip.getBlock(), ip.getPoint());
UnrealizedConversionCastOp convertOp =
UnrealizedConversionCastOp::create(builder, loc, outputTypes, inputs);
+ if (config.attachDebugMaterializationKind) {
+ StringRef kindStr =
+ kind == MaterializationKind::Source ? "source" : "target";
+ convertOp->setAttr("__kind__", builder.getStringAttr(kindStr));
+ }
if (isPureTypeConversion)
convertOp->setAttr(kPureTypeConversionMarker, builder.getUnitAttr());
diff --git a/mlir/test/Transforms/test-legalizer.mlir b/mlir/test/Transforms/test-legalizer.mlir
index 55d153db7f4bb..3fa42ff6b2757 100644
--- a/mlir/test/Transforms/test-legalizer.mlir
+++ b/mlir/test/Transforms/test-legalizer.mlir
@@ -1,6 +1,7 @@
// RUN: mlir-opt -allow-unregistered-dialect -split-input-file -test-legalize-patterns="allow-pattern-rollback=1" -verify-diagnostics %s | FileCheck %s
// RUN: mlir-opt -allow-unregistered-dialect -split-input-file -test-legalize-patterns="allow-pattern-rollback=1" -verify-diagnostics -profile-actions-to=- %s | FileCheck %s --check-prefix=CHECK-PROFILER
// RUN: mlir-opt -allow-unregistered-dialect -split-input-file -test-legalize-patterns="allow-pattern-rollback=0" -verify-diagnostics %s | FileCheck %s
+// RUN: mlir-opt -allow-unregistered-dialect -split-input-file -test-legalize-patterns="allow-pattern-rollback=0 build-materializations=0 attach-debug-materialization-kind=1" -verify-diagnostics %s | FileCheck %s --check-prefix=CHECK-KIND
// CHECK-PROFILER: "name": "pass-execution", "cat": "PERF", "ph": "B"
// CHECK-PROFILER: "name": "apply-conversion", "cat": "PERF", "ph": "B"
@@ -190,9 +191,12 @@ func.func @remap_drop_region() {
// -----
// CHECK-LABEL: func @dropped_input_in_use
+// CHECK-KIND-LABEL: func @dropped_input_in_use
func.func @dropped_input_in_use(%arg: i16, %arg2: i64) {
- // CHECK-NEXT: "test.cast"{{.*}} : () -> i16
- // CHECK-NEXT: "work"{{.*}} : (i16)
+ // CHECK-NEXT: %[[cast:.*]] = "test.cast"() : () -> i16
+ // CHECK-NEXT: "work"(%[[cast]]) : (i16)
+ // CHECK-KIND-NEXT: %[[cast:.*]] = builtin.unrealized_conversion_cast to i16 {__kind__ = "source"}
+ // CHECK-KIND-NEXT: "work"(%[[cast]]) : (i16)
// expected-remark at +1 {{op 'work' is not legalizable}}
"work"(%arg) : (i16) -> ()
}
@@ -430,6 +434,11 @@ func.func @test_multiple_1_to_n_replacement() {
// CHECK: %[[cast:.*]] = "test.cast"(%[[producer]]) : (i16) -> f64
// CHECK: "test.valid_consumer"(%[[cast]]) : (f64) -> ()
// CHECK: "test.valid_consumer"(%[[producer]]) : (i16) -> ()
+// CHECK-KIND-LABEL: func @test_lookup_without_converter
+// CHECK-KIND: %[[producer:.*]] = "test.valid_producer"() : () -> i16
+// CHECK-KIND: %[[cast:.*]] = builtin.unrealized_conversion_cast %[[producer]] : i16 to f64 {__kind__ = "target"}
+// CHECK-KIND: "test.valid_consumer"(%[[cast]]) : (f64) -> ()
+// CHECK-KIND: "test.valid_consumer"(%[[producer]]) : (i16) -> ()
func.func @test_lookup_without_converter() {
%0 = "test.replace_with_valid_producer"() {type = i16} : () -> (i64)
"test.replace_with_valid_consumer"(%0) {with_converter} : (i64) -> ()
diff --git a/mlir/test/lib/Dialect/Test/TestPatterns.cpp b/mlir/test/lib/Dialect/Test/TestPatterns.cpp
index 6300c5b0ca21c..b6f16ac1b5c48 100644
--- a/mlir/test/lib/Dialect/Test/TestPatterns.cpp
+++ b/mlir/test/lib/Dialect/Test/TestPatterns.cpp
@@ -1574,15 +1574,19 @@ struct TestLegalizePatternDriver
target.addDynamicallyLegalOp<ConvertBlockArgsOp>(
[](ConvertBlockArgsOp op) { return op.getIsLegal(); });
+ // Set up configuration.
+ ConversionConfig config;
+ config.allowPatternRollback = allowPatternRollback;
+ config.foldingMode = foldingMode;
+ config.buildMaterializations = buildMaterializations;
+ config.attachDebugMaterializationKind = attachDebugMaterializationKind;
+ DumpNotifications dumpNotifications;
+ config.listener = &dumpNotifications;
+
// Handle a partial conversion.
if (mode == ConversionMode::Partial) {
DenseSet<Operation *> unlegalizedOps;
- ConversionConfig config;
- config.allowPatternRollback = allowPatternRollback;
- DumpNotifications dumpNotifications;
- config.listener = &dumpNotifications;
config.unlegalizedOps = &unlegalizedOps;
- config.foldingMode = foldingMode;
if (failed(applyPartialConversion(getOperation(), target,
std::move(patterns), config))) {
getOperation()->emitRemark() << "applyPartialConversion failed";
@@ -1600,11 +1604,6 @@ struct TestLegalizePatternDriver
return (bool)op->getAttrOfType<UnitAttr>("test.dynamically_legal");
});
- ConversionConfig config;
- config.allowPatternRollback = allowPatternRollback;
- DumpNotifications dumpNotifications;
- config.foldingMode = foldingMode;
- config.listener = &dumpNotifications;
if (failed(applyFullConversion(getOperation(), target,
std::move(patterns), config))) {
getOperation()->emitRemark() << "applyFullConversion failed";
@@ -1617,9 +1616,6 @@ struct TestLegalizePatternDriver
// Analyze the convertible operations.
DenseSet<Operation *> legalizedOps;
- ConversionConfig config;
- config.foldingMode = foldingMode;
- config.allowPatternRollback = allowPatternRollback;
config.legalizableOps = &legalizedOps;
if (failed(applyAnalysisConversion(getOperation(), target,
std::move(patterns), config)))
@@ -1658,6 +1654,16 @@ struct TestLegalizePatternDriver
Option<bool> allowPatternRollback{*this, "allow-pattern-rollback",
llvm::cl::desc("Allow pattern rollback"),
llvm::cl::init(true)};
+ Option<bool> attachDebugMaterializationKind{
+ *this, "attach-debug-materialization-kind",
+ llvm::cl::desc(
+ "Attach materialization kind to unrealized_conversion_cast ops"),
+ llvm::cl::init(false)};
+ Option<bool> buildMaterializations{
+ *this, "build-materializations",
+ llvm::cl::desc(
+ "If set to 'false', leave unrealized_conversion_cast ops in place"),
+ llvm::cl::init(true)};
};
} // namespace
>From 1650e4a73c4363a7a98a29c9a181dee57ce0ba64 Mon Sep 17 00:00:00 2001
From: Aiden Grossman <aidengrossman at google.com>
Date: Mon, 18 Aug 2025 06:31:16 -0700
Subject: [PATCH 008/112] [X86] Remove TuningPOPCNTFalseDeps from AlderLake
(#154004)
According to the data from uops.info, this false dependency issue was fixed
in CannonLake. This is confirmed not to be an issue by the benchmarking data
in #153983. Setting this flag can potentially lead to extra xor instructions,
which could consume extra frontend/renaming resources. None of the other CPUs
that have had this fixed have the tuning flag.
Fixes #153983.
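The tuning list is built by concatenating the inherited list with the
additional entries and then removing the unwanted flag. The sketch below models
that concat-then-remove step in plain C++. It is only an illustration:
`listConcat`, `listRemove` and `FeatureList` are hypothetical helpers, not
TableGen or LLVM APIs, and the contents of the inherited list are an
assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

using FeatureList = std::vector<std::string>;

// Append b to a copy of a, preserving order (models list concatenation).
FeatureList listConcat(FeatureList a, const FeatureList &b) {
  a.insert(a.end(), b.begin(), b.end());
  return a;
}

// Return a copy of `from` without any element that also occurs in `drop`.
FeatureList listRemove(const FeatureList &from, const FeatureList &drop) {
  FeatureList out;
  for (const std::string &feature : from)
    if (std::find(drop.begin(), drop.end(), feature) == drop.end())
      out.push_back(feature);
  return out;
}

// Usage check with placeholder contents for the inherited list (assumed).
inline void tuningUsage() {
  FeatureList inherited = {"TuningPOPCNTFalseDeps", "SomeInheritedTuning"};
  FeatureList additional = {"TuningPERMFalseDeps"};
  FeatureList result =
      listRemove(listConcat(inherited, additional), {"TuningPOPCNTFalseDeps"});
  assert(result.size() == 2);
}
```

Removing after concatenation leaves the inherited list untouched, so other
CPUs that share it are unaffected.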
---
llvm/lib/Target/X86/X86.td | 4 +++-
llvm/test/CodeGen/X86/bitcnt-false-dep.ll | 9 +++++++++
2 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/llvm/lib/Target/X86/X86.td b/llvm/lib/Target/X86/X86.td
index 990b381341f07..3d34ea3bed318 100644
--- a/llvm/lib/Target/X86/X86.td
+++ b/llvm/lib/Target/X86/X86.td
@@ -1291,7 +1291,9 @@ def ProcessorFeatures {
list<SubtargetFeature> ADLAdditionalTuning = [TuningPERMFalseDeps,
TuningPreferMovmskOverVTest,
TuningFastImmVectorShift];
- list<SubtargetFeature> ADLTuning = !listconcat(SKLTuning, ADLAdditionalTuning);
+ list<SubtargetFeature> ADLRemoveTuning = [TuningPOPCNTFalseDeps];
+ list<SubtargetFeature> ADLTuning =
+ !listremove(!listconcat(SKLTuning, ADLAdditionalTuning), ADLRemoveTuning);
list<SubtargetFeature> ADLFeatures =
!listconcat(TRMFeatures, ADLAdditionalFeatures);
diff --git a/llvm/test/CodeGen/X86/bitcnt-false-dep.ll b/llvm/test/CodeGen/X86/bitcnt-false-dep.ll
index 5f576c8586285..793cbb8f75bdc 100644
--- a/llvm/test/CodeGen/X86/bitcnt-false-dep.ll
+++ b/llvm/test/CodeGen/X86/bitcnt-false-dep.ll
@@ -1,6 +1,7 @@
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=haswell | FileCheck %s --check-prefix=HSW
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake | FileCheck %s --check-prefix=SKL
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skx | FileCheck %s --check-prefix=SKL
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=alderlake | FileCheck %s --check-prefix=ADL
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=silvermont -mattr=+lzcnt,+bmi | FileCheck %s --check-prefix=SKL
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=goldmont -mattr=+lzcnt,+bmi | FileCheck %s --check-prefix=SKL
@@ -37,6 +38,10 @@ ret:
;SKL-LABEL:@loopdep_popcnt32
;SKL: xorl [[GPR0:%e[a-d]x]], [[GPR0]]
;SKL-NEXT: popcntl {{.*}}, [[GPR0]]
+
+;ADL-LABEL:@loopdep_popcnt32
+;ADL-NOT: xor
+;ADL: popcntl
}
define i64 @loopdep_popcnt64(ptr nocapture %x, ptr nocapture %y) nounwind {
@@ -63,6 +68,10 @@ ret:
;SKL-LABEL:@loopdep_popcnt64
;SKL: xorl %e[[GPR0:[a-d]x]], %e[[GPR0]]
;SKL-NEXT: popcntq {{.*}}, %r[[GPR0]]
+
+;ADL-LABEL:@loopdep_popcnt64
+;ADL-NOT: xor
+;ADL: popcntq
}
define i32 @loopdep_tzct32(ptr nocapture %x, ptr nocapture %y) nounwind {
>From 340fa3e1bb723de53e9074f50aed13eb15820b47 Mon Sep 17 00:00:00 2001
From: Erich Keane <ekeane at nvidia.com>
Date: Mon, 18 Aug 2025 06:33:40 -0700
Subject: [PATCH 009/112] [OpenACC] Implement firstprivate lowering except
init. (#153847)
This patch implements the basic lowering infrastructure, but does not
quite implement the copy initialization, which requires #153622.
It does, however, pass verification for the 'copy' section, which just
contains a yield.
---
clang/lib/CIR/CodeGen/CIRGenOpenACCClause.cpp | 74 ++-
.../combined-firstprivate-clause.cpp | 571 ++++++++++++++++++
.../compute-firstprivate-clause-templates.cpp | 90 +++
.../compute-firstprivate-clause.cpp | 508 ++++++++++++++++
.../openacc-not-implemented.cpp | 3 -
5 files changed, 1226 insertions(+), 20 deletions(-)
create mode 100644 clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp
create mode 100644 clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp
create mode 100644 clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp
diff --git a/clang/lib/CIR/CodeGen/CIRGenOpenACCClause.cpp b/clang/lib/CIR/CodeGen/CIRGenOpenACCClause.cpp
index 9194b522114bc..72e2c533254c9 100644
--- a/clang/lib/CIR/CodeGen/CIRGenOpenACCClause.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenOpenACCClause.cpp
@@ -387,6 +387,20 @@ class OpenACCClauseCIREmitter final
return recipeName;
}
+ void createFirstprivateRecipeCopy(
+ mlir::Location loc, mlir::Location locEnd, mlir::Value mainOp,
+ CIRGenFunction::AutoVarEmission tempDeclEmission,
+ mlir::acc::FirstprivateRecipeOp recipe, const VarDecl *varRecipe,
+ const VarDecl *temporary) {
+ builder.createBlock(&recipe.getCopyRegion(), recipe.getCopyRegion().end(),
+ {mainOp.getType(), mainOp.getType()}, {loc, loc});
+ builder.setInsertionPointToEnd(&recipe.getCopyRegion().back());
+
+ // TODO: OpenACC: Implement this copy to actually do something.
+
+ mlir::acc::YieldOp::create(builder, locEnd);
+ }
+
// Create the 'init' section of the recipe, including the 'copy' section for
// 'firstprivate'.
template <typename RecipeTy>
@@ -401,12 +415,6 @@ class OpenACCClauseCIREmitter final
cgf.cgm.errorNYI(exprRange, "OpenACC Reduction recipe init");
}
- if constexpr (std::is_same_v<RecipeTy, mlir::acc::FirstprivateRecipeOp>) {
- // We haven't implemented the 'init'/copy recipe for firstprivate yet, so
- // NYI it.
- cgf.cgm.errorNYI(exprRange, "OpenACC firstprivate recipe init");
- }
-
CIRGenFunction::AutoVarEmission tempDeclEmission{
CIRGenFunction::AutoVarEmission::invalid()};
@@ -442,17 +450,12 @@ class OpenACCClauseCIREmitter final
mlir::acc::YieldOp::create(builder, locEnd);
if constexpr (std::is_same_v<RecipeTy, mlir::acc::FirstprivateRecipeOp>) {
- if (!varRecipe->getInit()) {
- // If we don't have any initialization recipe, we failed during Sema to
- // initialize this correctly. If we disable the
- // Sema::TentativeAnalysisScopes in SemaOpenACC::CreateInitRecipe, it'll
- // emit an error to tell us. However, emitting those errors during
- // production is a violation of the standard, so we cannot do them.
- cgf.cgm.errorNYI(
- exprRange, "firstprivate copy-init recipe not properly generated");
- }
-
- cgf.cgm.errorNYI(exprRange, "firstprivate copy section generation");
+ // TODO: OpenACC: we should have an errorNYI call here if
+ // !varRecipe->getInit(), but as that generation isn't currently
+ // implemented, it ends up being too noisy. So when we implement copy-init
+ // generation both in Sema and here, we should have a diagnostic here.
+ createFirstprivateRecipeCopy(loc, locEnd, mainOp, tempDeclEmission,
+ recipe, varRecipe, temporary);
}
// Make sure we cleanup after ourselves here.
@@ -1155,6 +1158,43 @@ class OpenACCClauseCIREmitter final
llvm_unreachable("Unknown construct kind in VisitPrivateClause");
}
}
+
+ void VisitFirstPrivateClause(const OpenACCFirstPrivateClause &clause) {
+ if constexpr (isOneOfTypes<OpTy, mlir::acc::ParallelOp,
+ mlir::acc::SerialOp>) {
+ for (const auto [varExpr, varRecipe] :
+ llvm::zip_equal(clause.getVarList(), clause.getInitRecipes())) {
+ CIRGenFunction::OpenACCDataOperandInfo opInfo =
+ cgf.getOpenACCDataOperandInfo(varExpr);
+ auto firstPrivateOp = mlir::acc::FirstprivateOp::create(
+ builder, opInfo.beginLoc, opInfo.varValue, /*structured=*/true,
+ /*implicit=*/false, opInfo.name, opInfo.bounds);
+
+ firstPrivateOp.setDataClause(mlir::acc::DataClause::acc_firstprivate);
+
+ {
+ mlir::OpBuilder::InsertionGuard guardCase(builder);
+ auto recipe = getOrCreateRecipe<mlir::acc::FirstprivateRecipeOp>(
+ cgf.getContext(), varExpr, varRecipe.RecipeDecl,
+ varRecipe.InitFromTemporary,
+ Decl::castToDeclContext(cgf.curFuncDecl), opInfo.baseType,
+ firstPrivateOp.getResult());
+
+ // TODO: OpenACC: The dialect is going to change in the near future to
+ // have these be on a different operation, so when that changes, we
+ // probably need to change these here.
+ operation.addFirstPrivatization(builder.getContext(), firstPrivateOp,
+ recipe);
+ }
+ }
+ } else if constexpr (isCombinedType<OpTy>) {
+ // Unlike 'private', 'firstprivate' applies to the compute op, not the
+ // loop op.
+ applyToComputeOp(clause);
+ } else {
+ llvm_unreachable("Unknown construct kind in VisitFirstPrivateClause");
+ }
+ }
};
template <typename OpTy>
diff --git a/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp b/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp
new file mode 100644
index 0000000000000..6d15abc2fefd4
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp
@@ -0,0 +1,571 @@
+// RUN: %clang_cc1 -fopenacc -triple x86_64-linux-gnu -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir -triple x86_64-linux-pc %s -o - | FileCheck %s
+
+struct NoCopyConstruct {};
+
+struct CopyConstruct {
+ CopyConstruct() = default;
+ CopyConstruct(const CopyConstruct&);
+};
+
+struct NonDefaultCtor {
+ NonDefaultCtor();
+};
+
+struct HasDtor {
+ ~HasDtor();
+};
+
+// CHECK: acc.firstprivate.recipe @firstprivatization__ZTSA5_7HasDtor : !cir.ptr<!cir.array<!rec_HasDtor x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.array<!rec_HasDtor x 5>, !cir.ptr<!cir.array<!rec_HasDtor x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+//
+// CHECK-NEXT: } destroy {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}):
+// CHECK-NEXT: %[[LAST_IDX:.*]] = cir.const #cir.int<4> : !u64i
+// CHECK-NEXT: %[[ARRPTR:.*]] = cir.cast(array_to_ptrdecay, %[[ARG]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>), !cir.ptr<!rec_HasDtor>
+// CHECK-NEXT: %[[ELEM:.*]] = cir.ptr_stride(%[[ARRPTR]] : !cir.ptr<!rec_HasDtor>, %[[LAST_IDX]] : !u64i), !cir.ptr<!rec_HasDtor>
+// CHECK-NEXT: %[[ITR:.*]] = cir.alloca !cir.ptr<!rec_HasDtor>, !cir.ptr<!cir.ptr<!rec_HasDtor>>, ["__array_idx"]
+// CHECK-NEXT: cir.store %[[ELEM]], %[[ITR]] : !cir.ptr<!rec_HasDtor>, !cir.ptr<!cir.ptr<!rec_HasDtor>>
+// CHECK-NEXT: cir.do {
+// CHECK-NEXT: %[[ELEM_LOAD:.*]] = cir.load %[[ITR]] : !cir.ptr<!cir.ptr<!rec_HasDtor>>, !cir.ptr<!rec_HasDtor>
+// CHECK-NEXT: cir.call @_ZN7HasDtorD1Ev(%[[ELEM_LOAD]]) nothrow : (!cir.ptr<!rec_HasDtor>) -> ()
+// CHECK-NEXT: %[[NEG_ONE:.*]] = cir.const #cir.int<-1> : !s64i
+// CHECK-NEXT: %[[PREVELEM:.*]] = cir.ptr_stride(%[[ELEM_LOAD]] : !cir.ptr<!rec_HasDtor>, %[[NEG_ONE]] : !s64i), !cir.ptr<!rec_HasDtor>
+// CHECK-NEXT: cir.store %[[PREVELEM]], %[[ITR]] : !cir.ptr<!rec_HasDtor>, !cir.ptr<!cir.ptr<!rec_HasDtor>>
+// CHECK-NEXT: cir.yield
+// CHECK-NEXT: } while {
+// CHECK-NEXT: %[[ELEM_LOAD:.*]] = cir.load %[[ITR]] : !cir.ptr<!cir.ptr<!rec_HasDtor>>, !cir.ptr<!rec_HasDtor>
+// CHECK-NEXT: %[[CMP:.*]] = cir.cmp(ne, %[[ELEM_LOAD]], %[[ARRPTR]]) : !cir.ptr<!rec_HasDtor>, !cir.bool
+// CHECK-NEXT: cir.condition(%[[CMP]])
+// CHECK-NEXT: }
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_14NonDefaultCtor : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}):
+// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !cir.array<!rec_NonDefaultCtor x 5>, !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_13CopyConstruct : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.array<!rec_CopyConstruct x 5>, !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_15NoCopyConstruct : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.array<!rec_NoCopyConstruct x 5>, !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_f : !cir.ptr<!cir.array<!cir.float x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.array<!cir.float x 5>, !cir.ptr<!cir.array<!cir.float x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_i : !cir.ptr<!cir.array<!s32i x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.array<!s32i x 5>, !cir.ptr<!cir.array<!s32i x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS7HasDtor : !cir.ptr<!rec_HasDtor> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
+// CHECK-NEXT: cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } destroy {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
+// CHECK-NEXT: cir.call @_ZN7HasDtorD1Ev(%[[ARG]]) nothrow : (!cir.ptr<!rec_HasDtor>) -> ()
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS14NonDefaultCtor : !cir.ptr<!rec_NonDefaultCtor> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
+// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS13CopyConstruct : !cir.ptr<!rec_CopyConstruct> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
+// CHECK-NEXT: cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS15NoCopyConstruct : !cir.ptr<!rec_NoCopyConstruct> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}):
+// CHECK-NEXT: cir.alloca !rec_NoCopyConstruct, !cir.ptr<!rec_NoCopyConstruct>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSf : !cir.ptr<!cir.float> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.float> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.float, !cir.ptr<!cir.float>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.float> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.float> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSi : !cir.ptr<!s32i> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!s32i> {{.*}}):
+// CHECK-NEXT: cir.alloca !s32i, !cir.ptr<!s32i>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!s32i> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!s32i> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+
+extern "C" void acc_combined() {
+ // CHECK: cir.func{{.*}} @acc_combined() {
+
+ int someInt;
+ // CHECK-NEXT: %[[SOMEINT:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["someInt"]
+ float someFloat;
+ // CHECK-NEXT: %[[SOMEFLOAT:.*]] = cir.alloca !cir.float, !cir.ptr<!cir.float>, ["someFloat"]
+ NoCopyConstruct noCopy;
+ // CHECK-NEXT: %[[NOCOPY:.*]] = cir.alloca !rec_NoCopyConstruct, !cir.ptr<!rec_NoCopyConstruct>, ["noCopy"]
+ CopyConstruct hasCopy;
+ // CHECK-NEXT: %[[HASCOPY:.*]] = cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["hasCopy"]
+ NonDefaultCtor notDefCtor;
+ // CHECK-NEXT: %[[NOTDEFCTOR:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["notDefCtor", init]
+ HasDtor dtor;
+ // CHECK-NEXT: %[[DTOR:.*]] = cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["dtor"]
+ int someIntArr[5];
+ // CHECK-NEXT: %[[INTARR:.*]] = cir.alloca !cir.array<!s32i x 5>, !cir.ptr<!cir.array<!s32i x 5>>, ["someIntArr"]
+ float someFloatArr[5];
+ // CHECK-NEXT: %[[FLOATARR:.*]] = cir.alloca !cir.array<!cir.float x 5>, !cir.ptr<!cir.array<!cir.float x 5>>, ["someFloatArr"]
+ NoCopyConstruct noCopyArr[5];
+ // CHECK-NEXT: %[[NOCOPYARR:.*]] = cir.alloca !cir.array<!rec_NoCopyConstruct x 5>, !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>, ["noCopyArr"]
+ CopyConstruct hasCopyArr[5];
+ // CHECK-NEXT: %[[HASCOPYARR:.*]] = cir.alloca !cir.array<!rec_CopyConstruct x 5>, !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>, ["hasCopyArr"]
+ NonDefaultCtor notDefCtorArr[5];
+ // CHECK-NEXT: %[[NOTDEFCTORARR:.*]] = cir.alloca !cir.array<!rec_NonDefaultCtor x 5>, !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>, ["notDefCtorArr", init]
+ HasDtor dtorArr[5];
+ // CHECK-NEXT: %[[DTORARR:.*]] = cir.alloca !cir.array<!rec_HasDtor x 5>, !cir.ptr<!cir.array<!rec_HasDtor x 5>>, ["dtorArr"]
+ // CHECK-NEXT: cir.call @_ZN14NonDefaultCtorC1Ev(%[[NOTDEFCTOR]]) : (!cir.ptr<!rec_NonDefaultCtor>) -> ()
+
+#pragma acc parallel loop firstprivate(someInt)
+ for(int i = 0; i < 5; ++i);
+ // CHECK: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[SOMEINT]] : !cir.ptr<!s32i>) -> !cir.ptr<!s32i> {name = "someInt"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSi -> %[[PRIVATE]] : !cir.ptr<!s32i>) {
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial loop firstprivate(someFloat)
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[SOMEFLOAT]] : !cir.ptr<!cir.float>) -> !cir.ptr<!cir.float> {name = "someFloat"}
+ // CHECK-NEXT: acc.serial combined(loop) firstprivate(@firstprivatization__ZTSf -> %[[PRIVATE]] : !cir.ptr<!cir.float>) {
+ // CHECK-NEXT: acc.loop combined(serial)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+
+#pragma acc parallel loop firstprivate(noCopy)
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOCOPY]] : !cir.ptr<!rec_NoCopyConstruct>) -> !cir.ptr<!rec_NoCopyConstruct> {name = "noCopy"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTS15NoCopyConstruct -> %[[PRIVATE]] : !cir.ptr<!rec_NoCopyConstruct>) {
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial loop firstprivate(hasCopy)
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[HASCOPY]] : !cir.ptr<!rec_CopyConstruct>) -> !cir.ptr<!rec_CopyConstruct> {name = "hasCopy"}
+ // CHECK-NEXT: acc.serial combined(loop) firstprivate(@firstprivatization__ZTS13CopyConstruct -> %[[PRIVATE]] : !cir.ptr<!rec_CopyConstruct>) {
+ // CHECK-NEXT: acc.loop combined(serial)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial loop firstprivate(notDefCtor)
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTOR]] : !cir.ptr<!rec_NonDefaultCtor>) -> !cir.ptr<!rec_NonDefaultCtor> {name = "notDefCtor"}
+ // CHECK-NEXT: acc.serial combined(loop) firstprivate(@firstprivatization__ZTS14NonDefaultCtor -> %[[PRIVATE]] : !cir.ptr<!rec_NonDefaultCtor>) {
+ // CHECK-NEXT: acc.loop combined(serial)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial loop firstprivate(dtor)
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[DTOR]] : !cir.ptr<!rec_HasDtor>) -> !cir.ptr<!rec_HasDtor> {name = "dtor"}
+ // CHECK-NEXT: acc.serial combined(loop) firstprivate(@firstprivatization__ZTS7HasDtor -> %[[PRIVATE]] : !cir.ptr<!rec_HasDtor>) {
+ // CHECK-NEXT: acc.loop combined(serial)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+
+#pragma acc parallel loop firstprivate(someInt, someFloat, noCopy, hasCopy, notDefCtor, dtor)
+ for(int i = 0; i < 5; ++i);
+ // CHECK: %[[PRIVATE1:.*]] = acc.firstprivate varPtr(%[[SOMEINT]] : !cir.ptr<!s32i>) -> !cir.ptr<!s32i> {name = "someInt"}
+ // CHECK-NEXT: %[[PRIVATE2:.*]] = acc.firstprivate varPtr(%[[SOMEFLOAT]] : !cir.ptr<!cir.float>) -> !cir.ptr<!cir.float> {name = "someFloat"}
+ // CHECK-NEXT: %[[PRIVATE3:.*]] = acc.firstprivate varPtr(%[[NOCOPY]] : !cir.ptr<!rec_NoCopyConstruct>) -> !cir.ptr<!rec_NoCopyConstruct> {name = "noCopy"}
+ // CHECK-NEXT: %[[PRIVATE4:.*]] = acc.firstprivate varPtr(%[[HASCOPY]] : !cir.ptr<!rec_CopyConstruct>) -> !cir.ptr<!rec_CopyConstruct> {name = "hasCopy"}
+ // CHECK-NEXT: %[[PRIVATE5:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTOR]] : !cir.ptr<!rec_NonDefaultCtor>) -> !cir.ptr<!rec_NonDefaultCtor> {name = "notDefCtor"}
+ // CHECK-NEXT: %[[PRIVATE6:.*]] = acc.firstprivate varPtr(%[[DTOR]] : !cir.ptr<!rec_HasDtor>) -> !cir.ptr<!rec_HasDtor> {name = "dtor"}
+ // CHECK: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSi -> %[[PRIVATE1]] : !cir.ptr<!s32i>,
+ // CHECK-SAME: @firstprivatization__ZTSf -> %[[PRIVATE2]] : !cir.ptr<!cir.float>,
+ // CHECK-SAME: @firstprivatization__ZTS15NoCopyConstruct -> %[[PRIVATE3]] : !cir.ptr<!rec_NoCopyConstruct>,
+ // CHECK-SAME: @firstprivatization__ZTS13CopyConstruct -> %[[PRIVATE4]] : !cir.ptr<!rec_CopyConstruct>,
+ // CHECK-SAME: @firstprivatization__ZTS14NonDefaultCtor -> %[[PRIVATE5]] : !cir.ptr<!rec_NonDefaultCtor>,
+ // CHECK-SAME: @firstprivatization__ZTS7HasDtor -> %[[PRIVATE6]] : !cir.ptr<!rec_HasDtor>)
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+
+#pragma acc serial loop firstprivate(someIntArr[1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[INTARR]] : !cir.ptr<!cir.array<!s32i x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!s32i x 5>> {name = "someIntArr[1]"}
+ // CHECK-NEXT: acc.serial combined(loop) firstprivate(@firstprivatization__ZTSA5_i -> %[[PRIVATE]] : !cir.ptr<!cir.array<!s32i x 5>>) {
+ // CHECK-NEXT: acc.loop combined(serial)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel loop firstprivate(someFloatArr[1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[FLOATARR]] : !cir.ptr<!cir.array<!cir.float x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!cir.float x 5>> {name = "someFloatArr[1]"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSA5_f -> %[[PRIVATE]] : !cir.ptr<!cir.array<!cir.float x 5>>) {
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial loop firstprivate(noCopyArr[1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOCOPYARR]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {name = "noCopyArr[1]"}
+ // CHECK-NEXT: acc.serial combined(loop) firstprivate(@firstprivatization__ZTSA5_15NoCopyConstruct -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>) {
+ // CHECK-NEXT: acc.loop combined(serial)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel loop firstprivate(hasCopyArr[1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[HASCOPYARR]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {name = "hasCopyArr[1]"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSA5_13CopyConstruct -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>) {
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel loop firstprivate(notDefCtorArr[1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTORARR]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {name = "notDefCtorArr[1]"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSA5_14NonDefaultCtor -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>) {
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel loop firstprivate(dtorArr[1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[DTORARR]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_HasDtor x 5>> {name = "dtorArr[1]"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSA5_7HasDtor -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>) {
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial loop firstprivate(someIntArr[1], someFloatArr[1], noCopyArr[1], hasCopyArr[1], notDefCtorArr[1], dtorArr[1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE1:.*]] = acc.firstprivate varPtr(%[[INTARR]] : !cir.ptr<!cir.array<!s32i x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!s32i x 5>> {name = "someIntArr[1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE2:.*]] = acc.firstprivate varPtr(%[[FLOATARR]] : !cir.ptr<!cir.array<!cir.float x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!cir.float x 5>> {name = "someFloatArr[1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE3:.*]] = acc.firstprivate varPtr(%[[NOCOPYARR]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {name = "noCopyArr[1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE4:.*]] = acc.firstprivate varPtr(%[[HASCOPYARR]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {name = "hasCopyArr[1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE5:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTORARR]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {name = "notDefCtorArr[1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE6:.*]] = acc.firstprivate varPtr(%[[DTORARR]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_HasDtor x 5>> {name = "dtorArr[1]"}
+ // CHECK-NEXT: acc.serial combined(loop) firstprivate(@firstprivatization__ZTSA5_i -> %[[PRIVATE1]] : !cir.ptr<!cir.array<!s32i x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_f -> %[[PRIVATE2]] : !cir.ptr<!cir.array<!cir.float x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_15NoCopyConstruct -> %[[PRIVATE3]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_13CopyConstruct -> %[[PRIVATE4]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_14NonDefaultCtor -> %[[PRIVATE5]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_7HasDtor -> %[[PRIVATE6]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>)
+ // CHECK-NEXT: acc.loop combined(serial)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+
+#pragma acc parallel loop firstprivate(someIntArr[1:1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[INTARR]] : !cir.ptr<!cir.array<!s32i x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!s32i x 5>> {name = "someIntArr[1:1]"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSA5_i -> %[[PRIVATE]] : !cir.ptr<!cir.array<!s32i x 5>>) {
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial loop firstprivate(someFloatArr[1:1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[FLOATARR]] : !cir.ptr<!cir.array<!cir.float x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!cir.float x 5>> {name = "someFloatArr[1:1]"}
+ // CHECK-NEXT: acc.serial combined(loop) firstprivate(@firstprivatization__ZTSA5_f -> %[[PRIVATE]] : !cir.ptr<!cir.array<!cir.float x 5>>) {
+ // CHECK-NEXT: acc.loop combined(serial)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel loop firstprivate(noCopyArr[1:1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOCOPYARR]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {name = "noCopyArr[1:1]"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSA5_15NoCopyConstruct -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>) {
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial loop firstprivate(hasCopyArr[1:1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[HASCOPYARR]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {name = "hasCopyArr[1:1]"}
+ // CHECK-NEXT: acc.serial combined(loop) firstprivate(@firstprivatization__ZTSA5_13CopyConstruct -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>) {
+ // CHECK-NEXT: acc.loop combined(serial)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel loop firstprivate(notDefCtorArr[1:1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTORARR]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {name = "notDefCtorArr[1:1]"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSA5_14NonDefaultCtor -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>) {
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel loop firstprivate(dtorArr[1:1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[DTORARR]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_HasDtor x 5>> {name = "dtorArr[1:1]"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSA5_7HasDtor -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>) {
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel loop firstprivate(someIntArr[1:1], someFloatArr[1:1], noCopyArr[1:1], hasCopyArr[1:1], notDefCtorArr[1:1], dtorArr[1:1])
+ for(int i = 0; i < 5; ++i);
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE1:.*]] = acc.firstprivate varPtr(%[[INTARR]] : !cir.ptr<!cir.array<!s32i x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!s32i x 5>> {name = "someIntArr[1:1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE2:.*]] = acc.firstprivate varPtr(%[[FLOATARR]] : !cir.ptr<!cir.array<!cir.float x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!cir.float x 5>> {name = "someFloatArr[1:1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE3:.*]] = acc.firstprivate varPtr(%[[NOCOPYARR]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {name = "noCopyArr[1:1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE4:.*]] = acc.firstprivate varPtr(%[[HASCOPYARR]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {name = "hasCopyArr[1:1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE5:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTORARR]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {name = "notDefCtorArr[1:1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE6:.*]] = acc.firstprivate varPtr(%[[DTORARR]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_HasDtor x 5>> {name = "dtorArr[1:1]"}
+ // CHECK-NEXT: acc.parallel combined(loop) firstprivate(@firstprivatization__ZTSA5_i -> %[[PRIVATE1]] : !cir.ptr<!cir.array<!s32i x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_f -> %[[PRIVATE2]] : !cir.ptr<!cir.array<!cir.float x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_15NoCopyConstruct -> %[[PRIVATE3]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_13CopyConstruct -> %[[PRIVATE4]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_14NonDefaultCtor -> %[[PRIVATE5]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_7HasDtor -> %[[PRIVATE6]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>)
+ // CHECK-NEXT: acc.loop combined(parallel)
+ // CHECK: acc.yield
+ // CHECK-NEXT: } loc
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+}
diff --git a/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp
new file mode 100644
index 0000000000000..a9f0dd99e3bd4
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp
@@ -0,0 +1,90 @@
+// RUN: %clang_cc1 -fopenacc -triple x86_64-linux-gnu -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir -triple x86_64-linux-pc %s -o - | FileCheck %s
+
+struct CopyConstruct {
+ CopyConstruct() = default;
+ CopyConstruct(const CopyConstruct&);
+};
+
+struct NonDefaultCtor {
+ NonDefaultCtor();
+};
+
+struct HasDtor {
+ ~HasDtor();
+};
+
+// CHECK: acc.firstprivate.recipe @firstprivatization__ZTSi : !cir.ptr<!s32i> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!s32i> {{.*}}):
+// CHECK-NEXT: cir.alloca !s32i, !cir.ptr<!s32i>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!s32i> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!s32i> {{.*}}):
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS7HasDtor : !cir.ptr<!rec_HasDtor> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
+// CHECK-NEXT: cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } destroy {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
+// CHECK-NEXT: cir.call @_ZN7HasDtorD1Ev(%[[ARG]]) nothrow : (!cir.ptr<!rec_HasDtor>) -> ()
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS14NonDefaultCtor : !cir.ptr<!rec_NonDefaultCtor> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
+// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS13CopyConstruct : !cir.ptr<!rec_CopyConstruct> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
+// CHECK-NEXT: cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+
+template<typename T, typename U, typename V, typename W>
+void dependent_version(const T &cc, const U &ndc, const V &dtor, const W &someInt) {
+ // CHECK: cir.func {{.*}}@_Z17dependent_versionI13CopyConstruct14NonDefaultCtor7HasDtoriEvRKT_RKT0_RKT1_RKT2_(%[[ARG0:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}, %[[ARG1:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}, %[[ARG2:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG3:.*]]: !cir.ptr<!s32i> {{.*}}) {
+ // CHECK-NEXT: %[[CC:.*]] = cir.alloca !cir.ptr<!rec_CopyConstruct>, !cir.ptr<!cir.ptr<!rec_CopyConstruct>>, ["cc", init, const]
+ // CHECK-NEXT: %[[NDC:.*]] = cir.alloca !cir.ptr<!rec_NonDefaultCtor>, !cir.ptr<!cir.ptr<!rec_NonDefaultCtor>>, ["ndc", init, const]
+ // CHECK-NEXT: %[[DTOR:.*]] = cir.alloca !cir.ptr<!rec_HasDtor>, !cir.ptr<!cir.ptr<!rec_HasDtor>>, ["dtor", init, const]
+ // CHECK-NEXT: %[[SOMEINT:.*]] = cir.alloca !cir.ptr<!s32i>, !cir.ptr<!cir.ptr<!s32i>>, ["someInt", init, const]
+ // % 3 = cir.alloca !cir.ptr<!s32i>, !cir.ptr<!cir.ptr<!s32i>>, ["someInt", init, const]
+
+#pragma acc parallel firstprivate(cc, ndc, dtor, someInt)
+ ;
+ // CHECK: %[[PRIV_LOAD:.*]] = cir.load %[[CC]] : !cir.ptr<!cir.ptr<!rec_CopyConstruct>>, !cir.ptr<!rec_CopyConstruct>
+ // CHECK-NEXT: %[[PRIVATE1:.*]] = acc.firstprivate varPtr(%[[PRIV_LOAD]] : !cir.ptr<!rec_CopyConstruct>) -> !cir.ptr<!rec_CopyConstruct> {name = "cc"}
+ // CHECK-NEXT: %[[PRIV_LOAD:.*]] = cir.load %[[NDC]] : !cir.ptr<!cir.ptr<!rec_NonDefaultCtor>>, !cir.ptr<!rec_NonDefaultCtor>
+ // CHECK-NEXT: %[[PRIVATE2:.*]] = acc.firstprivate varPtr(%[[PRIV_LOAD]] : !cir.ptr<!rec_NonDefaultCtor>) -> !cir.ptr<!rec_NonDefaultCtor> {name = "ndc"}
+ // CHECK-NEXT: %[[PRIV_LOAD:.*]] = cir.load %[[DTOR]] : !cir.ptr<!cir.ptr<!rec_HasDtor>>, !cir.ptr<!rec_HasDtor>
+ // CHECK-NEXT: %[[PRIVATE3:.*]] = acc.firstprivate varPtr(%[[PRIV_LOAD]] : !cir.ptr<!rec_HasDtor>) -> !cir.ptr<!rec_HasDtor> {name = "dtor"}
+ // CHECK-NEXT: %[[PRIV_LOAD:.*]] = cir.load %[[SOMEINT]] : !cir.ptr<!cir.ptr<!s32i>>, !cir.ptr<!s32i>
+ // CHECK-NEXT: %[[PRIVATE4:.*]] = acc.firstprivate varPtr(%[[PRIV_LOAD]] : !cir.ptr<!s32i>) -> !cir.ptr<!s32i> {name = "someInt"}
+
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTS13CopyConstruct -> %[[PRIVATE1]] : !cir.ptr<!rec_CopyConstruct>,
+ // CHECK-SAME: @firstprivatization__ZTS14NonDefaultCtor -> %[[PRIVATE2]] : !cir.ptr<!rec_NonDefaultCtor>,
+ // CHECK-SAME: @firstprivatization__ZTS7HasDtor -> %[[PRIVATE3]] : !cir.ptr<!rec_HasDtor>,
+ // CHECK-SAME: @firstprivatization__ZTSi -> %[[PRIVATE4]] : !cir.ptr<!s32i>) {
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+}
+
+void use() {
+ CopyConstruct cc;
+ NonDefaultCtor ndc;
+ HasDtor dtor;
+ int i;
+ dependent_version(cc, ndc, dtor, i);
+}
diff --git a/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp
new file mode 100644
index 0000000000000..d25208c65ac20
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp
@@ -0,0 +1,508 @@
+// RUN: %clang_cc1 -fopenacc -triple x86_64-linux-gnu -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir -triple x86_64-linux-pc %s -o - | FileCheck %s
+
+struct NoCopyConstruct {};
+
+struct CopyConstruct {
+ CopyConstruct() = default;
+ CopyConstruct(const CopyConstruct&);
+};
+
+struct NonDefaultCtor {
+ NonDefaultCtor();
+};
+
+struct HasDtor {
+ ~HasDtor();
+};
+
+// CHECK: acc.firstprivate.recipe @firstprivatization__ZTSA5_7HasDtor : !cir.ptr<!cir.array<!rec_HasDtor x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.array<!rec_HasDtor x 5>, !cir.ptr<!cir.array<!rec_HasDtor x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+//
+// CHECK-NEXT: } destroy {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}):
+// CHECK-NEXT: %[[LAST_IDX:.*]] = cir.const #cir.int<4> : !u64i
+// CHECK-NEXT: %[[ARRPTR:.*]] = cir.cast(array_to_ptrdecay, %[[ARG]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>), !cir.ptr<!rec_HasDtor>
+// CHECK-NEXT: %[[ELEM:.*]] = cir.ptr_stride(%[[ARRPTR]] : !cir.ptr<!rec_HasDtor>, %[[LAST_IDX]] : !u64i), !cir.ptr<!rec_HasDtor>
+// CHECK-NEXT: %[[ITR:.*]] = cir.alloca !cir.ptr<!rec_HasDtor>, !cir.ptr<!cir.ptr<!rec_HasDtor>>, ["__array_idx"]
+// CHECK-NEXT: cir.store %[[ELEM]], %[[ITR]] : !cir.ptr<!rec_HasDtor>, !cir.ptr<!cir.ptr<!rec_HasDtor>>
+// CHECK-NEXT: cir.do {
+// CHECK-NEXT: %[[ELEM_LOAD:.*]] = cir.load %[[ITR]] : !cir.ptr<!cir.ptr<!rec_HasDtor>>, !cir.ptr<!rec_HasDtor>
+// CHECK-NEXT: cir.call @_ZN7HasDtorD1Ev(%[[ELEM_LOAD]]) nothrow : (!cir.ptr<!rec_HasDtor>) -> ()
+// CHECK-NEXT: %[[NEG_ONE:.*]] = cir.const #cir.int<-1> : !s64i
+// CHECK-NEXT: %[[PREVELEM:.*]] = cir.ptr_stride(%[[ELEM_LOAD]] : !cir.ptr<!rec_HasDtor>, %[[NEG_ONE]] : !s64i), !cir.ptr<!rec_HasDtor>
+// CHECK-NEXT: cir.store %[[PREVELEM]], %[[ITR]] : !cir.ptr<!rec_HasDtor>, !cir.ptr<!cir.ptr<!rec_HasDtor>>
+// CHECK-NEXT: cir.yield
+// CHECK-NEXT: } while {
+// CHECK-NEXT: %[[ELEM_LOAD:.*]] = cir.load %[[ITR]] : !cir.ptr<!cir.ptr<!rec_HasDtor>>, !cir.ptr<!rec_HasDtor>
+// CHECK-NEXT: %[[CMP:.*]] = cir.cmp(ne, %[[ELEM_LOAD]], %[[ARRPTR]]) : !cir.ptr<!rec_HasDtor>, !cir.bool
+// CHECK-NEXT: cir.condition(%[[CMP]])
+// CHECK-NEXT: }
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_14NonDefaultCtor : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}):
+// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !cir.array<!rec_NonDefaultCtor x 5>, !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_13CopyConstruct : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.array<!rec_CopyConstruct x 5>, !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_15NoCopyConstruct : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.array<!rec_NoCopyConstruct x 5>, !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_f : !cir.ptr<!cir.array<!cir.float x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.array<!cir.float x 5>, !cir.ptr<!cir.array<!cir.float x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_i : !cir.ptr<!cir.array<!s32i x 5>> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.array<!s32i x 5>, !cir.ptr<!cir.array<!s32i x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS7HasDtor : !cir.ptr<!rec_HasDtor> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
+// CHECK-NEXT: cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } destroy {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
+// CHECK-NEXT: cir.call @_ZN7HasDtorD1Ev(%[[ARG]]) nothrow : (!cir.ptr<!rec_HasDtor>) -> ()
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS14NonDefaultCtor : !cir.ptr<!rec_NonDefaultCtor> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
+// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS13CopyConstruct : !cir.ptr<!rec_CopyConstruct> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
+// CHECK-NEXT: cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS15NoCopyConstruct : !cir.ptr<!rec_NoCopyConstruct> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}):
+// CHECK-NEXT: cir.alloca !rec_NoCopyConstruct, !cir.ptr<!rec_NoCopyConstruct>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSf : !cir.ptr<!cir.float> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.float> {{.*}}):
+// CHECK-NEXT: cir.alloca !cir.float, !cir.ptr<!cir.float>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.float> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.float> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+//
+// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSi : !cir.ptr<!s32i> init {
+// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!s32i> {{.*}}):
+// CHECK-NEXT: cir.alloca !s32i, !cir.ptr<!s32i>, ["openacc.private.init"]
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: } copy {
+// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!s32i> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!s32i> {{.*}}):
+//
+// CHECK-NEXT: acc.yield
+// CHECK-NEXT: }
+
+extern "C" void acc_compute() {
+ // CHECK: cir.func{{.*}} @acc_compute() {
+
+ int someInt;
+ // CHECK-NEXT: %[[SOMEINT:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["someInt"]
+ float someFloat;
+ // CHECK-NEXT: %[[SOMEFLOAT:.*]] = cir.alloca !cir.float, !cir.ptr<!cir.float>, ["someFloat"]
+ NoCopyConstruct noCopy;
+ // CHECK-NEXT: %[[NOCOPY:.*]] = cir.alloca !rec_NoCopyConstruct, !cir.ptr<!rec_NoCopyConstruct>, ["noCopy"]
+ CopyConstruct hasCopy;
+ // CHECK-NEXT: %[[HASCOPY:.*]] = cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["hasCopy"]
+ NonDefaultCtor notDefCtor;
+ // CHECK-NEXT: %[[NOTDEFCTOR:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["notDefCtor", init]
+ HasDtor dtor;
+ // CHECK-NEXT: %[[DTOR:.*]] = cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["dtor"]
+ int someIntArr[5];
+ // CHECK-NEXT: %[[INTARR:.*]] = cir.alloca !cir.array<!s32i x 5>, !cir.ptr<!cir.array<!s32i x 5>>, ["someIntArr"]
+ float someFloatArr[5];
+ // CHECK-NEXT: %[[FLOATARR:.*]] = cir.alloca !cir.array<!cir.float x 5>, !cir.ptr<!cir.array<!cir.float x 5>>, ["someFloatArr"]
+ NoCopyConstruct noCopyArr[5];
+ // CHECK-NEXT: %[[NOCOPYARR:.*]] = cir.alloca !cir.array<!rec_NoCopyConstruct x 5>, !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>, ["noCopyArr"]
+ CopyConstruct hasCopyArr[5];
+ // CHECK-NEXT: %[[HASCOPYARR:.*]] = cir.alloca !cir.array<!rec_CopyConstruct x 5>, !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>, ["hasCopyArr"]
+ NonDefaultCtor notDefCtorArr[5];
+ // CHECK-NEXT: %[[NOTDEFCTORARR:.*]] = cir.alloca !cir.array<!rec_NonDefaultCtor x 5>, !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>, ["notDefCtorArr", init]
+ HasDtor dtorArr[5];
+ // CHECK-NEXT: %[[DTORARR:.*]] = cir.alloca !cir.array<!rec_HasDtor x 5>, !cir.ptr<!cir.array<!rec_HasDtor x 5>>, ["dtorArr"]
+ // CHECK-NEXT: cir.call @_ZN14NonDefaultCtorC1Ev(%[[NOTDEFCTOR]]) : (!cir.ptr<!rec_NonDefaultCtor>) -> ()
+
+#pragma acc parallel firstprivate(someInt)
+ ;
+ // CHECK: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[SOMEINT]] : !cir.ptr<!s32i>) -> !cir.ptr<!s32i> {name = "someInt"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSi -> %[[PRIVATE]] : !cir.ptr<!s32i>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial firstprivate(someFloat)
+ ;
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[SOMEFLOAT]] : !cir.ptr<!cir.float>) -> !cir.ptr<!cir.float> {name = "someFloat"}
+ // CHECK-NEXT: acc.serial firstprivate(@firstprivatization__ZTSf -> %[[PRIVATE]] : !cir.ptr<!cir.float>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+
+#pragma acc parallel firstprivate(noCopy)
+ ;
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOCOPY]] : !cir.ptr<!rec_NoCopyConstruct>) -> !cir.ptr<!rec_NoCopyConstruct> {name = "noCopy"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTS15NoCopyConstruct -> %[[PRIVATE]] : !cir.ptr<!rec_NoCopyConstruct>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial firstprivate(hasCopy)
+ ;
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[HASCOPY]] : !cir.ptr<!rec_CopyConstruct>) -> !cir.ptr<!rec_CopyConstruct> {name = "hasCopy"}
+ // CHECK-NEXT: acc.serial firstprivate(@firstprivatization__ZTS13CopyConstruct -> %[[PRIVATE]] : !cir.ptr<!rec_CopyConstruct>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial firstprivate(notDefCtor)
+ ;
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTOR]] : !cir.ptr<!rec_NonDefaultCtor>) -> !cir.ptr<!rec_NonDefaultCtor> {name = "notDefCtor"}
+ // CHECK-NEXT: acc.serial firstprivate(@firstprivatization__ZTS14NonDefaultCtor -> %[[PRIVATE]] : !cir.ptr<!rec_NonDefaultCtor>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial firstprivate(dtor)
+ ;
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[DTOR]] : !cir.ptr<!rec_HasDtor>) -> !cir.ptr<!rec_HasDtor> {name = "dtor"}
+ // CHECK-NEXT: acc.serial firstprivate(@firstprivatization__ZTS7HasDtor -> %[[PRIVATE]] : !cir.ptr<!rec_HasDtor>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+
+#pragma acc parallel firstprivate(someInt, someFloat, noCopy, hasCopy, notDefCtor, dtor)
+ ;
+ // CHECK: %[[PRIVATE1:.*]] = acc.firstprivate varPtr(%[[SOMEINT]] : !cir.ptr<!s32i>) -> !cir.ptr<!s32i> {name = "someInt"}
+ // CHECK-NEXT: %[[PRIVATE2:.*]] = acc.firstprivate varPtr(%[[SOMEFLOAT]] : !cir.ptr<!cir.float>) -> !cir.ptr<!cir.float> {name = "someFloat"}
+ // CHECK-NEXT: %[[PRIVATE3:.*]] = acc.firstprivate varPtr(%[[NOCOPY]] : !cir.ptr<!rec_NoCopyConstruct>) -> !cir.ptr<!rec_NoCopyConstruct> {name = "noCopy"}
+ // CHECK-NEXT: %[[PRIVATE4:.*]] = acc.firstprivate varPtr(%[[HASCOPY]] : !cir.ptr<!rec_CopyConstruct>) -> !cir.ptr<!rec_CopyConstruct> {name = "hasCopy"}
+ // CHECK-NEXT: %[[PRIVATE5:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTOR]] : !cir.ptr<!rec_NonDefaultCtor>) -> !cir.ptr<!rec_NonDefaultCtor> {name = "notDefCtor"}
+ // CHECK-NEXT: %[[PRIVATE6:.*]] = acc.firstprivate varPtr(%[[DTOR]] : !cir.ptr<!rec_HasDtor>) -> !cir.ptr<!rec_HasDtor> {name = "dtor"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSi -> %[[PRIVATE1]] : !cir.ptr<!s32i>,
+ // CHECK-SAME: @firstprivatization__ZTSf -> %[[PRIVATE2]] : !cir.ptr<!cir.float>,
+ // CHECK-SAME: @firstprivatization__ZTS15NoCopyConstruct -> %[[PRIVATE3]] : !cir.ptr<!rec_NoCopyConstruct>,
+ // CHECK-SAME: @firstprivatization__ZTS13CopyConstruct -> %[[PRIVATE4]] : !cir.ptr<!rec_CopyConstruct>,
+ // CHECK-SAME: @firstprivatization__ZTS14NonDefaultCtor -> %[[PRIVATE5]] : !cir.ptr<!rec_NonDefaultCtor>,
+ // CHECK-SAME: @firstprivatization__ZTS7HasDtor -> %[[PRIVATE6]] : !cir.ptr<!rec_HasDtor>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+
+#pragma acc serial firstprivate(someIntArr[1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[INTARR]] : !cir.ptr<!cir.array<!s32i x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!s32i x 5>> {name = "someIntArr[1]"}
+ // CHECK-NEXT: acc.serial firstprivate(@firstprivatization__ZTSA5_i -> %[[PRIVATE]] : !cir.ptr<!cir.array<!s32i x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel firstprivate(someFloatArr[1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[FLOATARR]] : !cir.ptr<!cir.array<!cir.float x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!cir.float x 5>> {name = "someFloatArr[1]"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSA5_f -> %[[PRIVATE]] : !cir.ptr<!cir.array<!cir.float x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial firstprivate(noCopyArr[1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOCOPYARR]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {name = "noCopyArr[1]"}
+ // CHECK-NEXT: acc.serial firstprivate(@firstprivatization__ZTSA5_15NoCopyConstruct -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel firstprivate(hasCopyArr[1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[HASCOPYARR]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {name = "hasCopyArr[1]"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSA5_13CopyConstruct -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel firstprivate(notDefCtorArr[1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTORARR]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {name = "notDefCtorArr[1]"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSA5_14NonDefaultCtor -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel firstprivate(dtorArr[1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[DTORARR]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_HasDtor x 5>> {name = "dtorArr[1]"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSA5_7HasDtor -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial firstprivate(someIntArr[1], someFloatArr[1], noCopyArr[1], hasCopyArr[1], notDefCtorArr[1], dtorArr[1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE1:.*]] = acc.firstprivate varPtr(%[[INTARR]] : !cir.ptr<!cir.array<!s32i x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!s32i x 5>> {name = "someIntArr[1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE2:.*]] = acc.firstprivate varPtr(%[[FLOATARR]] : !cir.ptr<!cir.array<!cir.float x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!cir.float x 5>> {name = "someFloatArr[1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE3:.*]] = acc.firstprivate varPtr(%[[NOCOPYARR]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {name = "noCopyArr[1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE4:.*]] = acc.firstprivate varPtr(%[[HASCOPYARR]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {name = "hasCopyArr[1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE5:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTORARR]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {name = "notDefCtorArr[1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE_CONST:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CONST]] : i64) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE6:.*]] = acc.firstprivate varPtr(%[[DTORARR]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_HasDtor x 5>> {name = "dtorArr[1]"}
+ // CHECK-NEXT: acc.serial firstprivate(@firstprivatization__ZTSA5_i -> %[[PRIVATE1]] : !cir.ptr<!cir.array<!s32i x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_f -> %[[PRIVATE2]] : !cir.ptr<!cir.array<!cir.float x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_15NoCopyConstruct -> %[[PRIVATE3]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_13CopyConstruct -> %[[PRIVATE4]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_14NonDefaultCtor -> %[[PRIVATE5]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_7HasDtor -> %[[PRIVATE6]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+
+#pragma acc parallel firstprivate(someIntArr[1:1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[INTARR]] : !cir.ptr<!cir.array<!s32i x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!s32i x 5>> {name = "someIntArr[1:1]"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSA5_i -> %[[PRIVATE]] : !cir.ptr<!cir.array<!s32i x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial firstprivate(someFloatArr[1:1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[FLOATARR]] : !cir.ptr<!cir.array<!cir.float x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!cir.float x 5>> {name = "someFloatArr[1:1]"}
+ // CHECK-NEXT: acc.serial firstprivate(@firstprivatization__ZTSA5_f -> %[[PRIVATE]] : !cir.ptr<!cir.array<!cir.float x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel firstprivate(noCopyArr[1:1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOCOPYARR]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {name = "noCopyArr[1:1]"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSA5_15NoCopyConstruct -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc serial firstprivate(hasCopyArr[1:1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[HASCOPYARR]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {name = "hasCopyArr[1:1]"}
+ // CHECK-NEXT: acc.serial firstprivate(@firstprivatization__ZTSA5_13CopyConstruct -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel firstprivate(notDefCtorArr[1:1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTORARR]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {name = "notDefCtorArr[1:1]"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSA5_14NonDefaultCtor -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel firstprivate(dtorArr[1:1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE:.*]] = acc.firstprivate varPtr(%[[DTORARR]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_HasDtor x 5>> {name = "dtorArr[1:1]"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSA5_7HasDtor -> %[[PRIVATE]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+#pragma acc parallel firstprivate(someIntArr[1:1], someFloatArr[1:1], noCopyArr[1:1], hasCopyArr[1:1], notDefCtorArr[1:1], dtorArr[1:1])
+ ;
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE1:.*]] = acc.firstprivate varPtr(%[[INTARR]] : !cir.ptr<!cir.array<!s32i x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!s32i x 5>> {name = "someIntArr[1:1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE2:.*]] = acc.firstprivate varPtr(%[[FLOATARR]] : !cir.ptr<!cir.array<!cir.float x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!cir.float x 5>> {name = "someFloatArr[1:1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE3:.*]] = acc.firstprivate varPtr(%[[NOCOPYARR]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {name = "noCopyArr[1:1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE4:.*]] = acc.firstprivate varPtr(%[[HASCOPYARR]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {name = "hasCopyArr[1:1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE5:.*]] = acc.firstprivate varPtr(%[[NOTDEFCTORARR]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {name = "notDefCtorArr[1:1]"}
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+ // CHECK-NEXT: %[[ONE_CAST2:.*]] = builtin.unrealized_conversion_cast %[[ONE]] : !s32i to si32
+ // CHECK-NEXT: %[[ZERO_CONST:.*]] = arith.constant 0
+ // CHECK-NEXT: %[[ONE_CONST2:.*]] = arith.constant 1
+ // CHECK-NEXT: %[[BOUNDS:.*]] = acc.bounds lowerbound(%[[ONE_CAST]] : si32) extent(%[[ONE_CAST2]] : si32) stride(%[[ONE_CONST2]] : i64) startIdx(%[[ZERO_CONST]] : i64)
+ // CHECK-NEXT: %[[PRIVATE6:.*]] = acc.firstprivate varPtr(%[[DTORARR]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>) bounds(%[[BOUNDS]]) -> !cir.ptr<!cir.array<!rec_HasDtor x 5>> {name = "dtorArr[1:1]"}
+ // CHECK-NEXT: acc.parallel firstprivate(@firstprivatization__ZTSA5_i -> %[[PRIVATE1]] : !cir.ptr<!cir.array<!s32i x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_f -> %[[PRIVATE2]] : !cir.ptr<!cir.array<!cir.float x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_15NoCopyConstruct -> %[[PRIVATE3]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_13CopyConstruct -> %[[PRIVATE4]] : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_14NonDefaultCtor -> %[[PRIVATE5]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>,
+ // CHECK-SAME: @firstprivatization__ZTSA5_7HasDtor -> %[[PRIVATE6]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>>)
+ // CHECK-NEXT: acc.yield
+ // CHECK-NEXT: } loc
+}
diff --git a/clang/test/CIR/CodeGenOpenACC/openacc-not-implemented.cpp b/clang/test/CIR/CodeGenOpenACC/openacc-not-implemented.cpp
index 0bf932ea62ceb..da45aca13e7f9 100644
--- a/clang/test/CIR/CodeGenOpenACC/openacc-not-implemented.cpp
+++ b/clang/test/CIR/CodeGenOpenACC/openacc-not-implemented.cpp
@@ -10,9 +10,6 @@ void HelloWorld(int *A, int *B, int *C, int N) {
// expected-error@+1{{ClangIR code gen Not Yet Implemented: OpenACC Declare Construct}}
#pragma acc declare create(A)
- // expected-error@+1{{ClangIR code gen Not Yet Implemented: OpenACC Clause: firstprivate}}
-#pragma acc parallel loop firstprivate(A)
- for(int i = 0; i <5; ++i);
// expected-error@+1{{ClangIR code gen Not Yet Implemented: OpenACC Clause: reduction}}
#pragma acc parallel loop reduction(+:A)
for(int i = 0; i <5; ++i);
>From 31d2db2a68ae6e810f3e5532b521b913b50cc25e Mon Sep 17 00:00:00 2001
From: Timm Baeder <tbaeder at redhat.com>
Date: Mon, 18 Aug 2025 15:40:44 +0200
Subject: [PATCH 010/112] [clang][bytecode][NFC] Use UnsignedOrNone for
Block::DeclID (#154104)
---
clang/lib/AST/ByteCode/InterpBlock.h | 13 ++++++-------
clang/lib/AST/ByteCode/Pointer.h | 2 +-
clang/lib/AST/ByteCode/Program.h | 2 +-
3 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/clang/lib/AST/ByteCode/InterpBlock.h b/clang/lib/AST/ByteCode/InterpBlock.h
index 7ded1e8649fdf..8f30a6ece74ee 100644
--- a/clang/lib/AST/ByteCode/InterpBlock.h
+++ b/clang/lib/AST/ByteCode/InterpBlock.h
@@ -50,9 +50,9 @@ class Block final {
public:
/// Creates a new block.
- Block(unsigned EvalID, const std::optional<unsigned> &DeclID,
- const Descriptor *Desc, bool IsStatic = false, bool IsExtern = false,
- bool IsWeak = false, bool IsDummy = false)
+ Block(unsigned EvalID, UnsignedOrNone DeclID, const Descriptor *Desc,
+ bool IsStatic = false, bool IsExtern = false, bool IsWeak = false,
+ bool IsDummy = false)
: Desc(Desc), DeclID(DeclID), EvalID(EvalID), IsStatic(IsStatic) {
assert(Desc);
AccessFlags |= (ExternFlag * IsExtern);
@@ -62,8 +62,7 @@ class Block final {
Block(unsigned EvalID, const Descriptor *Desc, bool IsStatic = false,
bool IsExtern = false, bool IsWeak = false, bool IsDummy = false)
- : Desc(Desc), DeclID((unsigned)-1), EvalID(EvalID), IsStatic(IsStatic),
- IsDynamic(false) {
+ : Desc(Desc), EvalID(EvalID), IsStatic(IsStatic), IsDynamic(false) {
assert(Desc);
AccessFlags |= (ExternFlag * IsExtern);
AccessFlags |= (WeakFlag * IsWeak);
@@ -87,7 +86,7 @@ class Block final {
/// Returns the size of the block.
unsigned getSize() const { return Desc->getAllocSize(); }
/// Returns the declaration ID.
- std::optional<unsigned> getDeclID() const { return DeclID; }
+ UnsignedOrNone getDeclID() const { return DeclID; }
/// Returns whether the data of this block has been initialized via
/// invoking the Ctor func.
bool isInitialized() const { return IsInitialized; }
@@ -177,7 +176,7 @@ class Block final {
/// Start of the chain of pointers.
Pointer *Pointers = nullptr;
/// Unique identifier of the declaration.
- std::optional<unsigned> DeclID;
+ UnsignedOrNone DeclID = std::nullopt;
const unsigned EvalID = ~0u;
/// Flag indicating if the block has static storage duration.
bool IsStatic = false;
diff --git a/clang/lib/AST/ByteCode/Pointer.h b/clang/lib/AST/ByteCode/Pointer.h
index 94c83a0d87bc4..1f6f1cbce5391 100644
--- a/clang/lib/AST/ByteCode/Pointer.h
+++ b/clang/lib/AST/ByteCode/Pointer.h
@@ -593,7 +593,7 @@ class Pointer {
}
/// Returns the declaration ID.
- std::optional<unsigned> getDeclID() const {
+ UnsignedOrNone getDeclID() const {
if (isBlockPointer()) {
assert(asBlockPointer().Pointee);
return asBlockPointer().Pointee->getDeclID();
diff --git a/clang/lib/AST/ByteCode/Program.h b/clang/lib/AST/ByteCode/Program.h
index 207ceef91da43..b63a70ed8113a 100644
--- a/clang/lib/AST/ByteCode/Program.h
+++ b/clang/lib/AST/ByteCode/Program.h
@@ -152,7 +152,7 @@ class Program final {
};
/// Returns the current declaration ID.
- std::optional<unsigned> getCurrentDecl() const {
+ UnsignedOrNone getCurrentDecl() const {
if (CurrentDeclaration == NoDeclaration)
return std::nullopt;
return CurrentDeclaration;
>From f38c83c582cb9de04556c32bc6b18ad1aeda74af Mon Sep 17 00:00:00 2001
From: Jonathan Thackray <jonathan.thackray at arm.com>
Date: Mon, 18 Aug 2025 14:41:41 +0100
Subject: [PATCH 011/112] [AArch64][llvm] Disassemble instructions in `SYS`
alias encoding space more correctly (#153905)
For instructions in the `SYS` alias encoding space which take no
register operands, if the unused 5 register bits are not all set
(0x1f, 0b11111), disassemble to the generic `SYS` form and not the
named instruction alias, since that encoding is not considered valid.
This is because it is specified in the Arm ARM in text similar to this
(e.g. page C5-1037 of DDI0487L.b for `TLBI ALLE1`, or page C5-1585 for
`GCSPOPX`):
```
Rt should be encoded as 0b11111. If the Rt field is not set to 0b11111,
it is CONSTRAINED UNPREDICTABLE whether:
* The instruction is UNDEFINED.
* The instruction behaves as if the Rt field is set to 0b11111.
```
Since we want to follow "should" directives, and not encourage undefined
behaviour, only assemble or disassemble instructions considered valid.
Add an extra test case for this; all existing test cases continue to
pass.
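As a rough illustration of the rule above (not the code in this patch), the check reduces to inspecting the low five bits of the instruction word. The helper names `rtField` and `printsAsNamedAlias` and the `needsReg` parameter are hypothetical; the sketch assumes the usual A64 layout in which Rt occupies bits [4:0].

```cpp
#include <cassert>
#include <cstdint>

// Rt occupies bits [4:0] of a 32-bit A64 instruction word.
constexpr uint32_t rtField(uint32_t insn) { return insn & 0x1fu; }

// Hypothetical helper: true when the instruction may be printed under its
// named alias (e.g. `tlbi ...`), false when the generic `sys` form is used.
constexpr bool printsAsNamedAlias(uint32_t insn, bool needsReg) {
  // An alias that takes no register operand is only valid with Rt == 0b11111.
  return needsReg || rtField(insn) == 0x1fu;
}

// `ic ialluis` is encoded as [0x1f,0x71,0x08,0xd5], i.e. 0xd508711f.
static_assert(rtField(0xd508711fu) == 31u, "Rt is all ones");
static_assert(printsAsNamedAlias(0xd508711fu, /*needsReg=*/false), "");
// The same word with Rt = 30 (x30) falls back to the `sys` form.
static_assert(!printsAsNamedAlias(0xd508711eu, /*needsReg=*/false), "");
```

The `static_assert` lines use the `ic ialluis` encoding from the existing test file as a known-good input.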
---
.../AArch64/MCTargetDesc/AArch64InstPrinter.cpp | 16 ++++++++++++----
llvm/test/MC/AArch64/arm64-aliases.s | 14 ++++++++++++++
2 files changed, 26 insertions(+), 4 deletions(-)
diff --git a/llvm/lib/Target/AArch64/MCTargetDesc/AArch64InstPrinter.cpp b/llvm/lib/Target/AArch64/MCTargetDesc/AArch64InstPrinter.cpp
index 3c8b5712c1f0c..54b58e948daf2 100644
--- a/llvm/lib/Target/AArch64/MCTargetDesc/AArch64InstPrinter.cpp
+++ b/llvm/lib/Target/AArch64/MCTargetDesc/AArch64InstPrinter.cpp
@@ -1017,14 +1017,22 @@ bool AArch64InstPrinter::printSysAlias(const MCInst *MI,
else
return false;
+ StringRef Reg = getRegisterName(MI->getOperand(4).getReg());
+ bool NotXZR = Reg != "xzr";
+
+ // If a mandatory register operand is not specified in the TableGen
+ // (i.e. no register operand should be present), and the register value
+ // is not xzr/x31, then disassemble to a SYS alias instead.
+ if (NotXZR && !NeedsReg)
+ return false;
+
std::string Str = Ins + Name;
llvm::transform(Str, Str.begin(), ::tolower);
O << '\t' << Str;
- if (NeedsReg) {
- O << ", ";
- printRegName(O, MI->getOperand(4).getReg());
- }
+
+ if (NeedsReg)
+ O << ", " << Reg;
return true;
}
diff --git a/llvm/test/MC/AArch64/arm64-aliases.s b/llvm/test/MC/AArch64/arm64-aliases.s
index 3ace7a0f7183b..ae157c676c95f 100644
--- a/llvm/test/MC/AArch64/arm64-aliases.s
+++ b/llvm/test/MC/AArch64/arm64-aliases.s
@@ -512,6 +512,20 @@ foo:
sys #4, c8, c3, #6
; CHECK: tlbi vmalls12e1is
+; Check that all 5 register bits are set (0x1f):
+; (from Arm ARM regarding TLBI instructions without operands)
+; "Rt should be encoded as 0b11111. If the Rt field is not set to 0b11111,
+; it is CONSTRAINED UNPREDICTABLE whether:
+; * The instruction is UNDEFINED.
+; * The instruction behaves as if the Rt field is set to 0b11111."
+;
+; Do not disassemble this to `tlbi` but a SYS alias instead
+;
+ sys #4, c8, c7, #6, x30
+; CHECK: sys #0x4, c8, c7, #0x6, x30
+ sys #4, c8, c7, #6, x31
+; CHECK: tlbi vmalls12e1
+
ic ialluis
; CHECK: ic ialluis ; encoding: [0x1f,0x71,0x08,0xd5]
ic iallu
>From e37eff5dcd1124730da94f9c447b394810afd3e9 Mon Sep 17 00:00:00 2001
From: Shilei Tian <i at tianshilei.me>
Date: Mon, 18 Aug 2025 09:44:20 -0400
Subject: [PATCH 012/112] [AMDGPU] Add an option to completely disable kernel
argument preload (#153975)
The existing `amdgpu-kernarg-preload-count` option cannot act as an off
switch: setting it to 0 does not disable kernel argument preloading.
This PR adds a separate option, `amdgpu-kernarg-preload`, to turn it off.
Fixes SWDEV-550147.
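A minimal sketch of why a separate boolean is needed, assuming only what the diff and test in this patch show: the count defaults to 0, yet the default run still marks hidden arguments `inreg`. `PreloadOptions` and `passMayRewriteKernels` are hypothetical names used for illustration, not LLVM APIs.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two command-line options.
struct PreloadOptions {
  unsigned KernargPreloadCount = 0;  // models -amdgpu-kernarg-preload-count
  bool EnableKernargPreload = true;  // models -amdgpu-kernarg-preload
};

// Mirrors the early exit added to markKernelArgsAsInreg: with the switch
// off, the pass reports "no change" before looking at any function.
// Note that the count plays no part in this decision.
inline bool passMayRewriteKernels(const PreloadOptions &Opts) {
  return Opts.EnableKernargPreload;
}
```

With default options the count is already 0 and the pass still runs, which is the situation the new flag addresses.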
---
.../AMDGPU/AMDGPUPreloadKernelArguments.cpp | 8 +++++
.../AMDGPU/disable-preload-kernargs.ll | 29 +++++++++++++++++++
2 files changed, 37 insertions(+)
create mode 100644 llvm/test/CodeGen/AMDGPU/disable-preload-kernargs.ll
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPreloadKernelArguments.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPreloadKernelArguments.cpp
index 984c1ee89309e..a386fe621a553 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPreloadKernelArguments.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPreloadKernelArguments.cpp
@@ -37,6 +37,11 @@ static cl::opt<unsigned> KernargPreloadCount(
"amdgpu-kernarg-preload-count",
cl::desc("How many kernel arguments to preload onto SGPRs"), cl::init(0));
+static cl::opt<bool>
+ EnableKernargPreload("amdgpu-kernarg-preload",
+ cl::desc("Enable preload kernel arguments to SGPRs"),
+ cl::init(true));
+
namespace {
class AMDGPUPreloadKernelArgumentsLegacy : public ModulePass {
@@ -275,6 +280,9 @@ AMDGPUPreloadKernelArgumentsLegacy::AMDGPUPreloadKernelArgumentsLegacy(
: ModulePass(ID), TM(TM) {}
static bool markKernelArgsAsInreg(Module &M, const TargetMachine &TM) {
+ if (!EnableKernargPreload)
+ return false;
+
SmallVector<Function *, 4> FunctionsToErase;
bool Changed = false;
for (auto &F : M) {
diff --git a/llvm/test/CodeGen/AMDGPU/disable-preload-kernargs.ll b/llvm/test/CodeGen/AMDGPU/disable-preload-kernargs.ll
new file mode 100644
index 0000000000000..75aaec6f1fa70
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/disable-preload-kernargs.ll
@@ -0,0 +1,29 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -passes=amdgpu-preload-kernel-arguments -amdgpu-kernarg-preload=0 %s -o - | FileCheck -check-prefix=NO-PRELOAD %s
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -passes=amdgpu-preload-kernel-arguments %s -o - | FileCheck -check-prefix=DEFAULT-PRELOAD %s
+
+@g1 = protected addrspace(1) externally_initialized global i16 0, align 2
+
+define amdgpu_kernel void @test_kernel_with_zero_kernel_arg() {
+; NO-PRELOAD-LABEL: define amdgpu_kernel void @test_kernel_with_zero_kernel_arg(
+; NO-PRELOAD-SAME: ) #[[ATTR0:[0-9]+]] {
+; NO-PRELOAD-NEXT: [[IMPLICITARG_PTR:%.*]] = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
+; NO-PRELOAD-NEXT: [[GEP:%.*]] = getelementptr inbounds i8, ptr addrspace(4) [[IMPLICITARG_PTR]], i64 12
+; NO-PRELOAD-NEXT: [[GROUP_SIZE_X:%.*]] = load i16, ptr addrspace(4) [[GEP]], align 2
+; NO-PRELOAD-NEXT: store i16 [[GROUP_SIZE_X]], ptr addrspace(1) @g1, align 2
+; NO-PRELOAD-NEXT: ret void
+;
+; DEFAULT-PRELOAD-LABEL: define amdgpu_kernel void @test_kernel_with_zero_kernel_arg(
+; DEFAULT-PRELOAD-SAME: i32 inreg "amdgpu-hidden-argument" [[_HIDDEN_BLOCK_COUNT_X:%.*]], i32 inreg "amdgpu-hidden-argument" [[_HIDDEN_BLOCK_COUNT_Y:%.*]], i32 inreg "amdgpu-hidden-argument" [[_HIDDEN_BLOCK_COUNT_Z:%.*]], i16 inreg "amdgpu-hidden-argument" [[_HIDDEN_GROUP_SIZE_X:%.*]]) #[[ATTR0:[0-9]+]] {
+; DEFAULT-PRELOAD-NEXT: [[IMPLICITARG_PTR:%.*]] = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
+; DEFAULT-PRELOAD-NEXT: [[GEP:%.*]] = getelementptr inbounds i8, ptr addrspace(4) [[IMPLICITARG_PTR]], i64 12
+; DEFAULT-PRELOAD-NEXT: [[GROUP_SIZE_X:%.*]] = load i16, ptr addrspace(4) [[GEP]], align 2
+; DEFAULT-PRELOAD-NEXT: store i16 [[_HIDDEN_GROUP_SIZE_X]], ptr addrspace(1) @g1, align 2
+; DEFAULT-PRELOAD-NEXT: ret void
+;
+ %implicitarg.ptr = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
+ %gep = getelementptr inbounds i8, ptr addrspace(4) %implicitarg.ptr, i64 12
+ %group_size_x = load i16, ptr addrspace(4) %gep
+ store i16 %group_size_x, ptr addrspace(1) @g1
+ ret void
+}
>From 5b2c3aac90450ecb78394f61afc7e9c5e955abc7 Mon Sep 17 00:00:00 2001
From: Aiden Grossman <aidengrossman at google.com>
Date: Mon, 18 Aug 2025 06:44:43 -0700
Subject: [PATCH 013/112] [MCA][X86] Pretend To Have a Stack Engine (#153348)
This patch removes RSP dependencies from push and pop instructions to
pretend that we have a stack engine. Because of their complexity, it
does not model implementation details such as sync uops. It is enabled
on all X86 CPUs, since LLVM does not have a scheduling model for any
X86 CPU that lacks a stack engine.
This fixes #152008.
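The core of the approach can be sketched outside of llvm-mca as follows. This is a simplified model, not the patch itself: `WriteDef`, `Reg` and `dropStackPointerDef` are hypothetical stand-ins for the real `WriteState` list returned by `Inst->getDefs()`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for llvm-mca's register writes.
enum Reg : unsigned { RAX = 1, RSP = 2 };
struct WriteDef {
  unsigned RegisterID;
};

// Drop the first RSP write, if any, so that later stack instructions no
// longer wait on it. Returns true when a definition was removed.
inline bool dropStackPointerDef(std::vector<WriteDef> &Defs) {
  auto It = std::find_if(Defs.begin(), Defs.end(), [](const WriteDef &D) {
    return D.RegisterID == RSP;
  });
  if (It == Defs.end())
    return false;
  Defs.erase(It);
  return true;
}
```

For a `popq %rax`, modeled here as writing both RAX and RSP, only the RAX write remains afterwards, so back-to-back pops are no longer serialized through RSP.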
---
.../lib/Target/X86/MCA/X86CustomBehaviour.cpp | 24 ++++-
llvm/lib/Target/X86/MCA/X86CustomBehaviour.h | 5 +
.../tools/llvm-mca/X86/stack-engine-pop.s | 92 +++++++++++++++++++
.../tools/llvm-mca/X86/stack-engine-push.s | 92 +++++++++++++++++++
4 files changed, 211 insertions(+), 2 deletions(-)
create mode 100644 llvm/test/tools/llvm-mca/X86/stack-engine-pop.s
create mode 100644 llvm/test/tools/llvm-mca/X86/stack-engine-push.s
diff --git a/llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp b/llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp
index 817e88d8a0bc0..e2a1bbf383b3c 100644
--- a/llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp
+++ b/llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp
@@ -36,11 +36,31 @@ void X86InstrPostProcess::setMemBarriers(std::unique_ptr<Instruction> &Inst,
}
}
+void X86InstrPostProcess::useStackEngine(std::unique_ptr<Instruction> &Inst,
+ const MCInst &MCI) {
+ // TODO(boomanaiden154): We currently do not handle PUSHF/POPF because we
+ // have not done the necessary benchmarking to see if they are also
+ // optimized by the stack engine.
+ // TODO: We currently just remove all RSP writes from stack operations. This
+ // is not fully correct because we do not model sync uops which will
+ // delay subsequent rsp using non-stack instructions.
+ if (X86::isPOP(MCI.getOpcode()) || X86::isPUSH(MCI.getOpcode())) {
+ auto *StackRegisterDef =
+ llvm::find_if(Inst->getDefs(), [](const WriteState &State) {
+ return State.getRegisterID() == X86::RSP;
+ });
+ assert(
+ StackRegisterDef != Inst->getDefs().end() &&
+ "Expected push instruction to implicitly use stack pointer register.");
+ Inst->getDefs().erase(StackRegisterDef);
+ }
+}
+
void X86InstrPostProcess::postProcessInstruction(
std::unique_ptr<Instruction> &Inst, const MCInst &MCI) {
- // Currently, we only modify certain instructions' IsALoadBarrier and
- // IsAStoreBarrier flags.
+ // Set IsALoadBarrier and IsAStoreBarrier flags.
setMemBarriers(Inst, MCI);
+ useStackEngine(Inst, MCI);
}
} // namespace mca
diff --git a/llvm/lib/Target/X86/MCA/X86CustomBehaviour.h b/llvm/lib/Target/X86/MCA/X86CustomBehaviour.h
index 4a83ba848dd88..c5459e42dfc9f 100644
--- a/llvm/lib/Target/X86/MCA/X86CustomBehaviour.h
+++ b/llvm/lib/Target/X86/MCA/X86CustomBehaviour.h
@@ -28,6 +28,11 @@ class X86InstrPostProcess : public InstrPostProcess {
/// as load and store barriers.
void setMemBarriers(std::unique_ptr<Instruction> &Inst, const MCInst &MCI);
+ /// Called within X86InstrPostProcess to remove rsp write operands (defs)
+ /// on stack instructions to better simulate the stack engine. We currently
+ /// do not model features of the stack engine like sync uops.
+ void useStackEngine(std::unique_ptr<Instruction> &Inst, const MCInst &MCI);
+
public:
X86InstrPostProcess(const MCSubtargetInfo &STI, const MCInstrInfo &MCII)
: InstrPostProcess(STI, MCII) {}
diff --git a/llvm/test/tools/llvm-mca/X86/stack-engine-pop.s b/llvm/test/tools/llvm-mca/X86/stack-engine-pop.s
new file mode 100644
index 0000000000000..2ffb52ae61fc4
--- /dev/null
+++ b/llvm/test/tools/llvm-mca/X86/stack-engine-pop.s
@@ -0,0 +1,92 @@
+# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
+# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake -timeline -iterations=2 < %s | FileCheck %s
+
+movq $0x80, %rsp
+popq %rax
+popq %rcx
+popq %rdx
+popq %rbx
+popq %r12
+
+# CHECK: Iterations: 2
+# CHECK-NEXT: Instructions: 12
+# CHECK-NEXT: Total Cycles: 14
+# CHECK-NEXT: Total uOps: 22
+
+# CHECK: Dispatch Width: 6
+# CHECK-NEXT: uOps Per Cycle: 1.57
+# CHECK-NEXT: IPC: 0.86
+# CHECK-NEXT: Block RThroughput: 2.5
+
+# CHECK: Instruction Info:
+# CHECK-NEXT: [1]: #uOps
+# CHECK-NEXT: [2]: Latency
+# CHECK-NEXT: [3]: RThroughput
+# CHECK-NEXT: [4]: MayLoad
+# CHECK-NEXT: [5]: MayStore
+# CHECK-NEXT: [6]: HasSideEffects (U)
+
+# CHECK: [1] [2] [3] [4] [5] [6] Instructions:
+# CHECK-NEXT: 1 1 0.25 movq $128, %rsp
+# CHECK-NEXT: 2 6 0.50 * popq %rax
+# CHECK-NEXT: 2 6 0.50 * popq %rcx
+# CHECK-NEXT: 2 6 0.50 * popq %rdx
+# CHECK-NEXT: 2 6 0.50 * popq %rbx
+# CHECK-NEXT: 2 6 0.50 * popq %r12
+
+# CHECK: Resources:
+# CHECK-NEXT: [0] - SKLDivider
+# CHECK-NEXT: [1] - SKLFPDivider
+# CHECK-NEXT: [2] - SKLPort0
+# CHECK-NEXT: [3] - SKLPort1
+# CHECK-NEXT: [4] - SKLPort2
+# CHECK-NEXT: [5] - SKLPort3
+# CHECK-NEXT: [6] - SKLPort4
+# CHECK-NEXT: [7] - SKLPort5
+# CHECK-NEXT: [8] - SKLPort6
+# CHECK-NEXT: [9] - SKLPort7
+
+# CHECK: Resource pressure per iteration:
+# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
+# CHECK-NEXT: - - 1.50 1.50 2.50 2.50 - 1.50 1.50 -
+
+# CHECK: Resource pressure by instruction:
+# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
+# CHECK-NEXT: - - - - - - - 0.50 0.50 - movq $128, %rsp
+# CHECK-NEXT: - - 0.50 - 0.50 0.50 - 0.50 - - popq %rax
+# CHECK-NEXT: - - - 0.50 0.50 0.50 - - 0.50 - popq %rcx
+# CHECK-NEXT: - - 0.50 - 0.50 0.50 - 0.50 - - popq %rdx
+# CHECK-NEXT: - - - 0.50 0.50 0.50 - - 0.50 - popq %rbx
+# CHECK-NEXT: - - 0.50 0.50 0.50 0.50 - - - - popq %r12
+
+# CHECK: Timeline view:
+# CHECK-NEXT: 0123
+# CHECK-NEXT: Index 0123456789
+
+# CHECK: [0,0] DeER . . . movq $128, %rsp
+# CHECK-NEXT: [0,1] D=eeeeeeER. . popq %rax
+# CHECK-NEXT: [0,2] D=eeeeeeER. . popq %rcx
+# CHECK-NEXT: [0,3] .D=eeeeeeER . popq %rdx
+# CHECK-NEXT: [0,4] .D=eeeeeeER . popq %rbx
+# CHECK-NEXT: [0,5] .D==eeeeeeER . popq %r12
+# CHECK-NEXT: [1,0] . DeE------R . movq $128, %rsp
+# CHECK-NEXT: [1,1] . D=eeeeeeER . popq %rax
+# CHECK-NEXT: [1,2] . D==eeeeeeER. popq %rcx
+# CHECK-NEXT: [1,3] . D=eeeeeeER. popq %rdx
+# CHECK-NEXT: [1,4] . D==eeeeeeER popq %rbx
+# CHECK-NEXT: [1,5] . D==eeeeeeER popq %r12
+
+# CHECK: Average Wait times (based on the timeline view):
+# CHECK-NEXT: [0]: Executions
+# CHECK-NEXT: [1]: Average time spent waiting in a scheduler's queue
+# CHECK-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
+# CHECK-NEXT: [3]: Average time elapsed from WB until retire stage
+
+# CHECK: [0] [1] [2] [3]
+# CHECK-NEXT: 0. 2 1.0 1.0 3.0 movq $128, %rsp
+# CHECK-NEXT: 1. 2 2.0 0.0 0.0 popq %rax
+# CHECK-NEXT: 2. 2 2.5 0.5 0.0 popq %rcx
+# CHECK-NEXT: 3. 2 2.0 1.0 0.0 popq %rdx
+# CHECK-NEXT: 4. 2 2.5 1.5 0.0 popq %rbx
+# CHECK-NEXT: 5. 2 3.0 2.0 0.0 popq %r12
+# CHECK-NEXT: 2 2.2 1.0 0.5 <total>
diff --git a/llvm/test/tools/llvm-mca/X86/stack-engine-push.s b/llvm/test/tools/llvm-mca/X86/stack-engine-push.s
new file mode 100644
index 0000000000000..fc394d4c1e7d3
--- /dev/null
+++ b/llvm/test/tools/llvm-mca/X86/stack-engine-push.s
@@ -0,0 +1,92 @@
+# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
+# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake -timeline -iterations=2 < %s | FileCheck %s
+
+movq $0x80, %rsp
+pushq %rax
+pushq %rcx
+pushq %rdx
+pushq %rbx
+pushq %r12
+
+# CHECK: Iterations: 2
+# CHECK-NEXT: Instructions: 12
+# CHECK-NEXT: Total Cycles: 15
+# CHECK-NEXT: Total uOps: 32
+
+# CHECK: Dispatch Width: 6
+# CHECK-NEXT: uOps Per Cycle: 2.13
+# CHECK-NEXT: IPC: 0.80
+# CHECK-NEXT: Block RThroughput: 5.0
+
+# CHECK: Instruction Info:
+# CHECK-NEXT: [1]: #uOps
+# CHECK-NEXT: [2]: Latency
+# CHECK-NEXT: [3]: RThroughput
+# CHECK-NEXT: [4]: MayLoad
+# CHECK-NEXT: [5]: MayStore
+# CHECK-NEXT: [6]: HasSideEffects (U)
+
+# CHECK: [1] [2] [3] [4] [5] [6] Instructions:
+# CHECK-NEXT: 1 1 0.25 movq $128, %rsp
+# CHECK-NEXT: 3 2 1.00 * pushq %rax
+# CHECK-NEXT: 3 2 1.00 * pushq %rcx
+# CHECK-NEXT: 3 2 1.00 * pushq %rdx
+# CHECK-NEXT: 3 2 1.00 * pushq %rbx
+# CHECK-NEXT: 3 2 1.00 * pushq %r12
+
+# CHECK: Resources:
+# CHECK-NEXT: [0] - SKLDivider
+# CHECK-NEXT: [1] - SKLFPDivider
+# CHECK-NEXT: [2] - SKLPort0
+# CHECK-NEXT: [3] - SKLPort1
+# CHECK-NEXT: [4] - SKLPort2
+# CHECK-NEXT: [5] - SKLPort3
+# CHECK-NEXT: [6] - SKLPort4
+# CHECK-NEXT: [7] - SKLPort5
+# CHECK-NEXT: [8] - SKLPort6
+# CHECK-NEXT: [9] - SKLPort7
+
+# CHECK: Resource pressure per iteration:
+# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
+# CHECK-NEXT: - - 1.50 1.50 1.50 1.50 5.00 1.50 1.50 2.00
+
+# CHECK: Resource pressure by instruction:
+# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
+# CHECK-NEXT: - - - - - - - - 1.00 - movq $128, %rsp
+# CHECK-NEXT: - - 0.50 - 0.50 - 1.00 0.50 - 0.50 pushq %rax
+# CHECK-NEXT: - - - 0.50 - 0.50 1.00 - 0.50 0.50 pushq %rcx
+# CHECK-NEXT: - - 0.50 - 0.50 0.50 1.00 0.50 - - pushq %rdx
+# CHECK-NEXT: - - - 0.50 0.50 - 1.00 0.50 - 0.50 pushq %rbx
+# CHECK-NEXT: - - 0.50 0.50 - 0.50 1.00 - - 0.50 pushq %r12
+
+# CHECK: Timeline view:
+# CHECK-NEXT: 01234
+# CHECK-NEXT: Index 0123456789
+
+# CHECK: [0,0] DeER . . . movq $128, %rsp
+# CHECK-NEXT: [0,1] D=eeER . . pushq %rax
+# CHECK-NEXT: [0,2] .D=eeER . . pushq %rcx
+# CHECK-NEXT: [0,3] .D==eeER . . pushq %rdx
+# CHECK-NEXT: [0,4] . D==eeER . . pushq %rbx
+# CHECK-NEXT: [0,5] . D===eeER. . pushq %r12
+# CHECK-NEXT: [1,0] . DeE---R. . movq $128, %rsp
+# CHECK-NEXT: [1,1] . D===eeER . pushq %rax
+# CHECK-NEXT: [1,2] . D===eeER . pushq %rcx
+# CHECK-NEXT: [1,3] . D====eeER . pushq %rdx
+# CHECK-NEXT: [1,4] . D====eeER. pushq %rbx
+# CHECK-NEXT: [1,5] . D=====eeER pushq %r12
+
+# CHECK: Average Wait times (based on the timeline view):
+# CHECK-NEXT: [0]: Executions
+# CHECK-NEXT: [1]: Average time spent waiting in a scheduler's queue
+# CHECK-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
+# CHECK-NEXT: [3]: Average time elapsed from WB until retire stage
+
+# CHECK: [0] [1] [2] [3]
+# CHECK-NEXT: 0. 2 1.0 1.0 1.5 movq $128, %rsp
+# CHECK-NEXT: 1. 2 3.0 0.5 0.0 pushq %rax
+# CHECK-NEXT: 2. 2 3.0 1.0 0.0 pushq %rcx
+# CHECK-NEXT: 3. 2 4.0 1.0 0.0 pushq %rdx
+# CHECK-NEXT: 4. 2 4.0 1.0 0.0 pushq %rbx
+# CHECK-NEXT: 5. 2 5.0 1.0 0.0 pushq %r12
+# CHECK-NEXT: 2 3.3 0.9 0.3 <total>
>From 2a02147ff563cbfc70911b2518cfb8a256131b5b Mon Sep 17 00:00:00 2001
From: halbi2 <hehiralbi at gmail.com>
Date: Mon, 18 Aug 2025 09:49:04 -0400
Subject: [PATCH 014/112] [clang] [Sema] Simplify Expr::isUnusedResultAWarning
for CXXConstructExpr (#153116)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Two tests have new warnings because `warn_unused_result` is now
respected for constructor temporaries. These tests were added in
#112521 last year. The new warnings are an improvement: previously the
attribute was not diagnosed on constructor temporaries.
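The behavior change can be summarized as a small truth-table model. This is a sketch under assumptions, not the Clang implementation: `CtorExprAttrs`, `warnedBefore` and `warnsNow` are hypothetical, and `warnsNow` assumes that `hasUnusedResultAttr` accepts any spelling of the attribute on either the class or the constructor, which is what the updated tests suggest.

```cpp
#include <cassert>

// Hypothetical summary of the attributes relevant to one constructor call.
struct CtorExprAttrs {
  bool TypeHasWarnUnused;    // warn_unused on the class
  bool TypeHasUnusedResult;  // nodiscard or warn_unused_result on the class
  bool TypeAttrIsNoDiscard;  // ... and it is spelled [[nodiscard]]
  bool CtorHasUnusedResult;  // nodiscard or warn_unused_result on the ctor
  bool CtorAttrIsNoDiscard;  // ... and it is spelled [[nodiscard]]
};

// Old logic: only the [[nodiscard]] spelling was diagnosed.
constexpr bool warnedBefore(const CtorExprAttrs &A) {
  return A.TypeHasWarnUnused ||
         (A.TypeHasUnusedResult && A.TypeAttrIsNoDiscard) ||
         (A.CtorHasUnusedResult && A.CtorAttrIsNoDiscard);
}

// New logic (assumed): any spelling of the attribute is diagnosed.
constexpr bool warnsNow(const CtorExprAttrs &A) {
  return A.TypeHasWarnUnused || A.TypeHasUnusedResult ||
         A.CtorHasUnusedResult;
}
```

A constructor marked `warn_unused_result`, as in the updated `H("Hello")` test, is the case where the two functions disagree.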
@Sirraide and @Mick235711 what do you think about it?
---
clang/lib/AST/Expr.cpp | 34 ++++++-------------
.../dcl.attr/dcl.attr.nodiscard/p2.cpp | 2 +-
clang/test/SemaCXX/warn-unused-result.cpp | 4 +--
3 files changed, 14 insertions(+), 26 deletions(-)
diff --git a/clang/lib/AST/Expr.cpp b/clang/lib/AST/Expr.cpp
index 7cac655ef151c..e14cff552c922 100644
--- a/clang/lib/AST/Expr.cpp
+++ b/clang/lib/AST/Expr.cpp
@@ -2805,32 +2805,20 @@ bool Expr::isUnusedResultAWarning(const Expr *&WarnE, SourceLocation &Loc,
case CXXTemporaryObjectExprClass:
case CXXConstructExprClass: {
- if (const CXXRecordDecl *Type = getType()->getAsCXXRecordDecl()) {
- const auto *WarnURAttr = Type->getAttr<WarnUnusedResultAttr>();
- if (Type->hasAttr<WarnUnusedAttr>() ||
- (WarnURAttr && WarnURAttr->IsCXX11NoDiscard())) {
- WarnE = this;
- Loc = getBeginLoc();
- R1 = getSourceRange();
- return true;
- }
- }
-
const auto *CE = cast<CXXConstructExpr>(this);
- if (const CXXConstructorDecl *Ctor = CE->getConstructor()) {
- const auto *WarnURAttr = Ctor->getAttr<WarnUnusedResultAttr>();
- if (WarnURAttr && WarnURAttr->IsCXX11NoDiscard()) {
- WarnE = this;
- Loc = getBeginLoc();
- R1 = getSourceRange();
+ const CXXRecordDecl *Type = getType()->getAsCXXRecordDecl();
- if (unsigned NumArgs = CE->getNumArgs())
- R2 = SourceRange(CE->getArg(0)->getBeginLoc(),
- CE->getArg(NumArgs - 1)->getEndLoc());
- return true;
- }
- }
+ if ((Type && Type->hasAttr<WarnUnusedAttr>()) ||
+ CE->hasUnusedResultAttr(Ctx)) {
+ WarnE = this;
+ Loc = getBeginLoc();
+ R1 = getSourceRange();
+ if (unsigned NumArgs = CE->getNumArgs())
+ R2 = SourceRange(CE->getArg(0)->getBeginLoc(),
+ CE->getArg(NumArgs - 1)->getEndLoc());
+ return true;
+ }
return false;
}
diff --git a/clang/test/CXX/dcl.dcl/dcl.attr/dcl.attr.nodiscard/p2.cpp b/clang/test/CXX/dcl.dcl/dcl.attr/dcl.attr.nodiscard/p2.cpp
index 0012ab976baa5..7f933a4dcc6b2 100644
--- a/clang/test/CXX/dcl.dcl/dcl.attr/dcl.attr.nodiscard/p2.cpp
+++ b/clang/test/CXX/dcl.dcl/dcl.attr/dcl.attr.nodiscard/p2.cpp
@@ -115,7 +115,7 @@ void usage() {
S(); // expected-warning {{ignoring temporary created by a constructor declared with 'nodiscard' attribute}}
S('A'); // expected-warning {{ignoring temporary created by a constructor declared with 'nodiscard' attribute: Don't let that S-Char go!}}
S(1);
- S(2.2);
+ S(2.2); // expected-warning {{ignoring temporary created by a constructor declared with 'gnu::warn_unused_result' attribute}}
Y(); // expected-warning {{ignoring temporary of type 'Y' declared with 'nodiscard' attribute: Don't throw me away either!}}
S s;
ConvertTo{}; // expected-warning {{ignoring return value of type 'ConvertTo' declared with 'nodiscard' attribute: Don't throw me away!}}
diff --git a/clang/test/SemaCXX/warn-unused-result.cpp b/clang/test/SemaCXX/warn-unused-result.cpp
index 447654eccd563..1f7913f1aa994 100644
--- a/clang/test/SemaCXX/warn-unused-result.cpp
+++ b/clang/test/SemaCXX/warn-unused-result.cpp
@@ -309,7 +309,7 @@ void use() {
S<double>(2); // no warning
S<int>(2); // expected-warning {{ignoring temporary of type 'S<int>' declared with 'nodiscard'}}
- S<const char>(2); // no warning (warn_unused_result does not diagnose constructor temporaries)
+ S<const char>(2); // expected-warning {{ignoring temporary of type 'S<const char>' declared with 'clang::warn_unused_result' attribute}}
// function should take precedence over type
obtain2(1.0); // expected-warning {{ignoring return value of function declared with 'nodiscard'}}
@@ -336,7 +336,7 @@ struct [[nodiscard]] G {
void use2() {
H{2}; // no warning
H(2.0); // expected-warning {{ignoring temporary created by a constructor declared with 'nodiscard'}}
- H("Hello"); // no warning (warn_unused_result does not diagnose constructor temporaries)
+ H("Hello"); // expected-warning {{ignoring temporary created by a constructor declared with 'warn_unused_result' attribute}}
// no warning for explicit cast to void
(void)H(2);
>From 81c06d198ebd684ab06eb28c38cc5b4aa19888b6 Mon Sep 17 00:00:00 2001
From: Benjamin Maxwell <benjamin.maxwell at arm.com>
Date: Mon, 18 Aug 2025 14:53:40 +0100
Subject: [PATCH 015/112] Reland "[AArch64][SME] Port all SME routines to
RuntimeLibcalls" (#153417)
This updates everywhere we emit/check an SME routine to use
RuntimeLibcalls to get the function name and calling convention.
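Conceptually, this gives each SME routine a single table entry from which both its name and its calling convention are derived. The sketch below models that idea with a plain lookup table; `SMECallConv`, `SMERoutine` and `lookupSMECallConv` are hypothetical names, and the grouping is taken from the `LibcallsWithCC` definitions in this patch.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical enum mirroring the three SME ABI calling conventions.
enum class SMECallConv {
  PreserveMostFromX0,
  PreserveMostFromX1,
  PreserveMostFromX2,
  Unsupported
};

struct SMERoutine {
  const char *Name;
  SMECallConv CC;
};

// One entry per routine, grouped as in RuntimeLibcalls.td.
static const SMERoutine SMERoutines[] = {
    {"__arm_tpidr2_save", SMECallConv::PreserveMostFromX0},
    {"__arm_za_disable", SMECallConv::PreserveMostFromX0},
    {"__arm_tpidr2_restore", SMECallConv::PreserveMostFromX0},
    {"__arm_get_current_vg", SMECallConv::PreserveMostFromX1},
    {"__arm_sme_state_size", SMECallConv::PreserveMostFromX1},
    {"__arm_sme_save", SMECallConv::PreserveMostFromX1},
    {"__arm_sme_restore", SMECallConv::PreserveMostFromX1},
    {"__arm_sme_state", SMECallConv::PreserveMostFromX2},
};

// Returns the calling convention for a known SME routine, or Unsupported.
inline SMECallConv lookupSMECallConv(const char *Name) {
  for (const SMERoutine &R : SMERoutines)
    if (std::strcmp(R.Name, Name) == 0)
      return R.CC;
  return SMECallConv::Unsupported;
}
```

Keeping name and convention in one place means callers no longer hard-code either, which is the motivation stated above.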
---
llvm/include/llvm/CodeGen/TargetLowering.h | 6 +++
llvm/include/llvm/IR/RuntimeLibcalls.td | 43 ++++++++++++++++-
llvm/include/llvm/IR/RuntimeLibcallsImpl.td | 3 ++
.../Target/AArch64/AArch64FrameLowering.cpp | 42 +++++++++-------
.../Target/AArch64/AArch64ISelLowering.cpp | 47 +++++++++---------
.../AArch64/AArch64TargetTransformInfo.cpp | 19 ++++----
llvm/lib/Target/AArch64/SMEABIPass.cpp | 31 ++++++++----
.../AArch64/Utils/AArch64SMEAttributes.cpp | 48 ++++++++++++-------
.../AArch64/Utils/AArch64SMEAttributes.h | 30 ++++++++----
.../Target/AArch64/SMEAttributesTest.cpp | 2 +-
10 files changed, 182 insertions(+), 89 deletions(-)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index e9bb979e44973..4480ced637456 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -3574,6 +3574,12 @@ class LLVM_ABI TargetLoweringBase {
return Libcalls.getMemcpyName().data();
}
+ /// Return the libcall implementation if this is a valid libcall for the
+ /// current module, otherwise RTLIB::Unsupported.
+ RTLIB::LibcallImpl getSupportedLibcallImpl(StringRef FuncName) const {
+ return Libcalls.getSupportedLibcallImpl(FuncName);
+ }
+
/// Get the comparison predicate that's to be used to test the result of the
/// comparison libcall against zero. This should only be used with
/// floating-point compare libcalls.
diff --git a/llvm/include/llvm/IR/RuntimeLibcalls.td b/llvm/include/llvm/IR/RuntimeLibcalls.td
index 9072a0aa1531f..9626004cbed42 100644
--- a/llvm/include/llvm/IR/RuntimeLibcalls.td
+++ b/llvm/include/llvm/IR/RuntimeLibcalls.td
@@ -406,6 +406,17 @@ multiclass LibmLongDoubleLibCall<string libcall_basename = !toupper(NAME),
def SC_MEMCPY : RuntimeLibcall;
def SC_MEMMOVE : RuntimeLibcall;
def SC_MEMSET : RuntimeLibcall;
+def SC_MEMCHR: RuntimeLibcall;
+
+// AArch64 SME ABI calls
+def SMEABI_SME_STATE : RuntimeLibcall;
+def SMEABI_TPIDR2_SAVE : RuntimeLibcall;
+def SMEABI_ZA_DISABLE : RuntimeLibcall;
+def SMEABI_TPIDR2_RESTORE : RuntimeLibcall;
+def SMEABI_GET_CURRENT_VG : RuntimeLibcall;
+def SMEABI_SME_STATE_SIZE : RuntimeLibcall;
+def SMEABI_SME_SAVE : RuntimeLibcall;
+def SMEABI_SME_RESTORE : RuntimeLibcall;
// ARM EABI calls
def AEABI_MEMCPY4 : RuntimeLibcall; // Align 4
@@ -1223,8 +1234,35 @@ defset list<RuntimeLibcallImpl> AArch64LibcallImpls = {
def __arm_sc_memcpy : RuntimeLibcallImpl<SC_MEMCPY>;
def __arm_sc_memmove : RuntimeLibcallImpl<SC_MEMMOVE>;
def __arm_sc_memset : RuntimeLibcallImpl<SC_MEMSET>;
+ def __arm_sc_memchr : RuntimeLibcallImpl<SC_MEMCHR>;
} // End AArch64LibcallImpls
+def __arm_sme_state : RuntimeLibcallImpl<SMEABI_SME_STATE>;
+def __arm_tpidr2_save : RuntimeLibcallImpl<SMEABI_TPIDR2_SAVE>;
+def __arm_za_disable : RuntimeLibcallImpl<SMEABI_ZA_DISABLE>;
+def __arm_tpidr2_restore : RuntimeLibcallImpl<SMEABI_TPIDR2_RESTORE>;
+def __arm_get_current_vg : RuntimeLibcallImpl<SMEABI_GET_CURRENT_VG>;
+def __arm_sme_state_size : RuntimeLibcallImpl<SMEABI_SME_STATE_SIZE>;
+def __arm_sme_save : RuntimeLibcallImpl<SMEABI_SME_SAVE>;
+def __arm_sme_restore : RuntimeLibcallImpl<SMEABI_SME_RESTORE>;
+
+def SMEABI_LibCalls_PreserveMost_From_X0 : LibcallsWithCC<(add
+ __arm_tpidr2_save,
+ __arm_za_disable,
+ __arm_tpidr2_restore),
+ SMEABI_PreserveMost_From_X0>;
+
+def SMEABI_LibCalls_PreserveMost_From_X1 : LibcallsWithCC<(add
+ __arm_get_current_vg,
+ __arm_sme_state_size,
+ __arm_sme_save,
+ __arm_sme_restore),
+ SMEABI_PreserveMost_From_X1>;
+
+def SMEABI_LibCalls_PreserveMost_From_X2 : LibcallsWithCC<(add
+ __arm_sme_state),
+ SMEABI_PreserveMost_From_X2>;
+
def isAArch64_ExceptArm64EC
: RuntimeLibcallPredicate<"(TT.isAArch64() && !TT.isWindowsArm64EC())">;
def isWindowsArm64EC : RuntimeLibcallPredicate<"TT.isWindowsArm64EC()">;
@@ -1244,7 +1282,10 @@ def AArch64SystemLibrary : SystemRuntimeLibrary<
LibmHasSinCosF32, LibmHasSinCosF64, LibmHasSinCosF128,
DefaultLibmExp10,
DefaultStackProtector,
- SecurityCheckCookieIfWinMSVC)
+ SecurityCheckCookieIfWinMSVC,
+ SMEABI_LibCalls_PreserveMost_From_X0,
+ SMEABI_LibCalls_PreserveMost_From_X1,
+ SMEABI_LibCalls_PreserveMost_From_X2)
>;
// Prepend a # to every name
diff --git a/llvm/include/llvm/IR/RuntimeLibcallsImpl.td b/llvm/include/llvm/IR/RuntimeLibcallsImpl.td
index 601c291daf89d..b5752c1b69ad8 100644
--- a/llvm/include/llvm/IR/RuntimeLibcallsImpl.td
+++ b/llvm/include/llvm/IR/RuntimeLibcallsImpl.td
@@ -36,6 +36,9 @@ def ARM_AAPCS : LibcallCallingConv<[{CallingConv::ARM_AAPCS}]>;
def ARM_AAPCS_VFP : LibcallCallingConv<[{CallingConv::ARM_AAPCS_VFP}]>;
def X86_STDCALL : LibcallCallingConv<[{CallingConv::X86_StdCall}]>;
def AVR_BUILTIN : LibcallCallingConv<[{CallingConv::AVR_BUILTIN}]>;
+def SMEABI_PreserveMost_From_X0 : LibcallCallingConv<[{CallingConv::AArch64_SME_ABI_Support_Routines_PreserveMost_From_X0}]>;
+def SMEABI_PreserveMost_From_X1 : LibcallCallingConv<[{CallingConv::AArch64_SME_ABI_Support_Routines_PreserveMost_From_X1}]>;
+def SMEABI_PreserveMost_From_X2 : LibcallCallingConv<[{CallingConv::AArch64_SME_ABI_Support_Routines_PreserveMost_From_X2}]>;
/// Abstract definition for functionality the compiler may need to
/// emit a call to. Emits the RTLIB::Libcall enum - This enum defines
diff --git a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
index 885f2a94f85f5..fddde668b7f1a 100644
--- a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
@@ -1475,24 +1475,26 @@ static bool requiresSaveVG(const MachineFunction &MF) {
return true;
}
-bool isVGInstruction(MachineBasicBlock::iterator MBBI) {
+static bool matchLibcall(const TargetLowering &TLI, const MachineOperand &MO,
+ RTLIB::Libcall LC) {
+ return MO.isSymbol() &&
+ StringRef(TLI.getLibcallName(LC)) == MO.getSymbolName();
+}
+
+bool isVGInstruction(MachineBasicBlock::iterator MBBI,
+ const TargetLowering &TLI) {
unsigned Opc = MBBI->getOpcode();
if (Opc == AArch64::CNTD_XPiI || Opc == AArch64::RDSVLI_XI ||
Opc == AArch64::UBFMXri)
return true;
- if (requiresGetVGCall(*MBBI->getMF())) {
- if (Opc == AArch64::ORRXrr)
- return true;
+ if (!requiresGetVGCall(*MBBI->getMF()))
+ return false;
- if (Opc == AArch64::BL) {
- auto Op1 = MBBI->getOperand(0);
- return Op1.isSymbol() &&
- (StringRef(Op1.getSymbolName()) == "__arm_get_current_vg");
- }
- }
+ if (Opc == AArch64::BL)
+ return matchLibcall(TLI, MBBI->getOperand(0), RTLIB::SMEABI_GET_CURRENT_VG);
- return false;
+ return Opc == AArch64::ORRXrr;
}
// Convert callee-save register save/restore instruction to do stack pointer
@@ -1511,9 +1513,11 @@ static MachineBasicBlock::iterator convertCalleeSaveRestoreToSPPrePostIncDec(
// functions, we need to do this for both the streaming and non-streaming
// vector length. Move past these instructions if necessary.
MachineFunction &MF = *MBB.getParent();
- if (requiresSaveVG(MF))
- while (isVGInstruction(MBBI))
+ if (requiresSaveVG(MF)) {
+ auto &TLI = *MF.getSubtarget().getTargetLowering();
+ while (isVGInstruction(MBBI, TLI))
++MBBI;
+ }
switch (MBBI->getOpcode()) {
default:
@@ -2097,11 +2101,12 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
// Move past the saves of the callee-saved registers, fixing up the offsets
// and pre-inc if we decided to combine the callee-save and local stack
// pointer bump above.
+ auto &TLI = *MF.getSubtarget().getTargetLowering();
while (MBBI != End && MBBI->getFlag(MachineInstr::FrameSetup) &&
!IsSVECalleeSave(MBBI)) {
if (CombineSPBump &&
// Only fix-up frame-setup load/store instructions.
- (!requiresSaveVG(MF) || !isVGInstruction(MBBI)))
+ (!requiresSaveVG(MF) || !isVGInstruction(MBBI, TLI)))
fixupCalleeSaveRestoreStackOffset(*MBBI, AFI->getLocalStackSize(),
NeedsWinCFI, &HasWinCFI);
++MBBI;
@@ -3468,6 +3473,7 @@ bool AArch64FrameLowering::spillCalleeSavedRegisters(
MachineBasicBlock &MBB, MachineBasicBlock::iterator MI,
ArrayRef<CalleeSavedInfo> CSI, const TargetRegisterInfo *TRI) const {
MachineFunction &MF = *MBB.getParent();
+ auto &TLI = *MF.getSubtarget<AArch64Subtarget>().getTargetLowering();
const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
bool NeedsWinCFI = needsWinCFI(MF);
@@ -3581,11 +3587,11 @@ bool AArch64FrameLowering::spillCalleeSavedRegisters(
.addReg(AArch64::X0, RegState::Implicit)
.setMIFlag(MachineInstr::FrameSetup);
- const uint32_t *RegMask = TRI->getCallPreservedMask(
- MF,
- CallingConv::AArch64_SME_ABI_Support_Routines_PreserveMost_From_X1);
+ RTLIB::Libcall LC = RTLIB::SMEABI_GET_CURRENT_VG;
+ const uint32_t *RegMask =
+ TRI->getCallPreservedMask(MF, TLI.getLibcallCallingConv(LC));
BuildMI(MBB, MI, DL, TII.get(AArch64::BL))
- .addExternalSymbol("__arm_get_current_vg")
+ .addExternalSymbol(TLI.getLibcallName(LC))
.addRegMask(RegMask)
.addReg(AArch64::X0, RegState::ImplicitDefine)
.setMIFlag(MachineInstr::FrameSetup);
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index aefbbe2534be2..95c0954174fb9 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -3083,13 +3083,12 @@ AArch64TargetLowering::EmitGetSMESaveSize(MachineInstr &MI,
AArch64FunctionInfo *FuncInfo = MF->getInfo<AArch64FunctionInfo>();
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
if (FuncInfo->isSMESaveBufferUsed()) {
+ RTLIB::Libcall LC = RTLIB::SMEABI_SME_STATE_SIZE;
const AArch64RegisterInfo *TRI = Subtarget->getRegisterInfo();
BuildMI(*BB, MI, MI.getDebugLoc(), TII->get(AArch64::BL))
- .addExternalSymbol("__arm_sme_state_size")
+ .addExternalSymbol(getLibcallName(LC))
.addReg(AArch64::X0, RegState::ImplicitDefine)
- .addRegMask(TRI->getCallPreservedMask(
- *MF, CallingConv::
- AArch64_SME_ABI_Support_Routines_PreserveMost_From_X1));
+ .addRegMask(TRI->getCallPreservedMask(*MF, getLibcallCallingConv(LC)));
BuildMI(*BB, MI, MI.getDebugLoc(), TII->get(TargetOpcode::COPY),
MI.getOperand(0).getReg())
.addReg(AArch64::X0);
@@ -3109,13 +3108,12 @@ AArch64TargetLowering::EmitEntryPStateSM(MachineInstr &MI,
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
Register ResultReg = MI.getOperand(0).getReg();
if (FuncInfo->isPStateSMRegUsed()) {
+ RTLIB::Libcall LC = RTLIB::SMEABI_SME_STATE;
const AArch64RegisterInfo *TRI = Subtarget->getRegisterInfo();
BuildMI(*BB, MI, MI.getDebugLoc(), TII->get(AArch64::BL))
- .addExternalSymbol("__arm_sme_state")
+ .addExternalSymbol(getLibcallName(LC))
.addReg(AArch64::X0, RegState::ImplicitDefine)
- .addRegMask(TRI->getCallPreservedMask(
- *MF, CallingConv::
- AArch64_SME_ABI_Support_Routines_PreserveMost_From_X2));
+ .addRegMask(TRI->getCallPreservedMask(*MF, getLibcallCallingConv(LC)));
BuildMI(*BB, MI, MI.getDebugLoc(), TII->get(TargetOpcode::COPY), ResultReg)
.addReg(AArch64::X0);
} else {
@@ -5733,15 +5731,15 @@ static SDValue getSVEPredicateBitCast(EVT VT, SDValue Op, SelectionDAG &DAG) {
SDValue AArch64TargetLowering::getRuntimePStateSM(SelectionDAG &DAG,
SDValue Chain, SDLoc DL,
EVT VT) const {
- SDValue Callee = DAG.getExternalSymbol("__arm_sme_state",
+ RTLIB::Libcall LC = RTLIB::SMEABI_SME_STATE;
+ SDValue Callee = DAG.getExternalSymbol(getLibcallName(LC),
getPointerTy(DAG.getDataLayout()));
Type *Int64Ty = Type::getInt64Ty(*DAG.getContext());
Type *RetTy = StructType::get(Int64Ty, Int64Ty);
TargetLowering::CallLoweringInfo CLI(DAG);
ArgListTy Args;
CLI.setDebugLoc(DL).setChain(Chain).setLibCallee(
- CallingConv::AArch64_SME_ABI_Support_Routines_PreserveMost_From_X2,
- RetTy, Callee, std::move(Args));
+ getLibcallCallingConv(LC), RetTy, Callee, std::move(Args));
std::pair<SDValue, SDValue> CallResult = LowerCallTo(CLI);
SDValue Mask = DAG.getConstant(/*PSTATE.SM*/ 1, DL, MVT::i64);
return DAG.getNode(ISD::AND, DL, MVT::i64, CallResult.first.getOperand(0),
@@ -8594,12 +8592,12 @@ static void analyzeCallOperands(const AArch64TargetLowering &TLI,
}
static SMECallAttrs
-getSMECallAttrs(const Function &Caller,
+getSMECallAttrs(const Function &Caller, const AArch64TargetLowering &TLI,
const TargetLowering::CallLoweringInfo &CLI) {
if (CLI.CB)
- return SMECallAttrs(*CLI.CB);
+ return SMECallAttrs(*CLI.CB, &TLI);
if (auto *ES = dyn_cast<ExternalSymbolSDNode>(CLI.Callee))
- return SMECallAttrs(SMEAttrs(Caller), SMEAttrs(ES->getSymbol()));
+ return SMECallAttrs(SMEAttrs(Caller), SMEAttrs(ES->getSymbol(), TLI));
return SMECallAttrs(SMEAttrs(Caller), SMEAttrs(SMEAttrs::Normal));
}
@@ -8621,7 +8619,7 @@ bool AArch64TargetLowering::isEligibleForTailCallOptimization(
// SME Streaming functions are not eligible for TCO as they may require
// the streaming mode or ZA to be restored after returning from the call.
- SMECallAttrs CallAttrs = getSMECallAttrs(CallerF, CLI);
+ SMECallAttrs CallAttrs = getSMECallAttrs(CallerF, *this, CLI);
if (CallAttrs.requiresSMChange() || CallAttrs.requiresLazySave() ||
CallAttrs.requiresPreservingAllZAState() ||
CallAttrs.caller().hasStreamingBody())
@@ -8913,14 +8911,14 @@ static SDValue emitSMEStateSaveRestore(const AArch64TargetLowering &TLI,
DAG.getCopyFromReg(Chain, DL, Info->getSMESaveBufferAddr(), MVT::i64),
PointerType::getUnqual(*DAG.getContext()));
- SDValue Callee =
- DAG.getExternalSymbol(IsSave ? "__arm_sme_save" : "__arm_sme_restore",
- TLI.getPointerTy(DAG.getDataLayout()));
+ RTLIB::Libcall LC =
+ IsSave ? RTLIB::SMEABI_SME_SAVE : RTLIB::SMEABI_SME_RESTORE;
+ SDValue Callee = DAG.getExternalSymbol(TLI.getLibcallName(LC),
+ TLI.getPointerTy(DAG.getDataLayout()));
auto *RetTy = Type::getVoidTy(*DAG.getContext());
TargetLowering::CallLoweringInfo CLI(DAG);
CLI.setDebugLoc(DL).setChain(Chain).setLibCallee(
- CallingConv::AArch64_SME_ABI_Support_Routines_PreserveMost_From_X1, RetTy,
- Callee, std::move(Args));
+ TLI.getLibcallCallingConv(LC), RetTy, Callee, std::move(Args));
return TLI.LowerCallTo(CLI).second;
}
@@ -9108,7 +9106,7 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
}
// Determine whether we need any streaming mode changes.
- SMECallAttrs CallAttrs = getSMECallAttrs(MF.getFunction(), CLI);
+ SMECallAttrs CallAttrs = getSMECallAttrs(MF.getFunction(), *this, CLI);
auto DescribeCallsite =
[&](OptimizationRemarkAnalysis &R) -> OptimizationRemarkAnalysis & {
@@ -9685,11 +9683,12 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
if (RequiresLazySave) {
// Conditionally restore the lazy save using a pseudo node.
+ RTLIB::Libcall LC = RTLIB::SMEABI_TPIDR2_RESTORE;
TPIDR2Object &TPIDR2 = FuncInfo->getTPIDR2Obj();
SDValue RegMask = DAG.getRegisterMask(
- TRI->SMEABISupportRoutinesCallPreservedMaskFromX0());
+ TRI->getCallPreservedMask(MF, getLibcallCallingConv(LC)));
SDValue RestoreRoutine = DAG.getTargetExternalSymbol(
- "__arm_tpidr2_restore", getPointerTy(DAG.getDataLayout()));
+ getLibcallName(LC), getPointerTy(DAG.getDataLayout()));
SDValue TPIDR2_EL0 = DAG.getNode(
ISD::INTRINSIC_W_CHAIN, DL, MVT::i64, Result,
DAG.getConstant(Intrinsic::aarch64_sme_get_tpidr2, DL, MVT::i32));
@@ -29028,7 +29027,7 @@ bool AArch64TargetLowering::fallBackToDAGISel(const Instruction &Inst) const {
// Checks to allow the use of SME instructions
if (auto *Base = dyn_cast<CallBase>(&Inst)) {
- auto CallAttrs = SMECallAttrs(*Base);
+ auto CallAttrs = SMECallAttrs(*Base, this);
if (CallAttrs.requiresSMChange() || CallAttrs.requiresLazySave() ||
CallAttrs.requiresPreservingZT0() ||
CallAttrs.requiresPreservingAllZAState())
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index fc332d5320181..17f0028e43fc3 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -220,20 +220,17 @@ static cl::opt<bool> EnableFixedwidthAutovecInStreamingMode(
static cl::opt<bool> EnableScalableAutovecInStreamingMode(
"enable-scalable-autovec-in-streaming-mode", cl::init(false), cl::Hidden);
-static bool isSMEABIRoutineCall(const CallInst &CI) {
+static bool isSMEABIRoutineCall(const CallInst &CI,
+ const AArch64TargetLowering &TLI) {
const auto *F = CI.getCalledFunction();
- return F && StringSwitch<bool>(F->getName())
- .Case("__arm_sme_state", true)
- .Case("__arm_tpidr2_save", true)
- .Case("__arm_tpidr2_restore", true)
- .Case("__arm_za_disable", true)
- .Default(false);
+ return F && SMEAttrs(F->getName(), TLI).isSMEABIRoutine();
}
/// Returns true if the function has explicit operations that can only be
/// lowered using incompatible instructions for the selected mode. This also
/// returns true if the function F may use or modify ZA state.
-static bool hasPossibleIncompatibleOps(const Function *F) {
+static bool hasPossibleIncompatibleOps(const Function *F,
+ const AArch64TargetLowering &TLI) {
for (const BasicBlock &BB : *F) {
for (const Instruction &I : BB) {
// Be conservative for now and assume that any call to inline asm or to
@@ -242,7 +239,7 @@ static bool hasPossibleIncompatibleOps(const Function *F) {
// all native LLVM instructions can be lowered to compatible instructions.
if (isa<CallInst>(I) && !I.isDebugOrPseudoInst() &&
(cast<CallInst>(I).isInlineAsm() || isa<IntrinsicInst>(I) ||
- isSMEABIRoutineCall(cast<CallInst>(I))))
+ isSMEABIRoutineCall(cast<CallInst>(I), TLI)))
return true;
}
}
@@ -290,7 +287,7 @@ bool AArch64TTIImpl::areInlineCompatible(const Function *Caller,
if (CallAttrs.requiresLazySave() || CallAttrs.requiresSMChange() ||
CallAttrs.requiresPreservingZT0() ||
CallAttrs.requiresPreservingAllZAState()) {
- if (hasPossibleIncompatibleOps(Callee))
+ if (hasPossibleIncompatibleOps(Callee, *getTLI()))
return false;
}
@@ -357,7 +354,7 @@ AArch64TTIImpl::getInlineCallPenalty(const Function *F, const CallBase &Call,
// change only once and avoid inlining of G into F.
SMEAttrs FAttrs(*F);
- SMECallAttrs CallAttrs(Call);
+ SMECallAttrs CallAttrs(Call, getTLI());
if (SMECallAttrs(FAttrs, CallAttrs.callee()).requiresSMChange()) {
if (F == Call.getCaller()) // (1)
diff --git a/llvm/lib/Target/AArch64/SMEABIPass.cpp b/llvm/lib/Target/AArch64/SMEABIPass.cpp
index 4af4d49306625..2008516885c35 100644
--- a/llvm/lib/Target/AArch64/SMEABIPass.cpp
+++ b/llvm/lib/Target/AArch64/SMEABIPass.cpp
@@ -15,11 +15,16 @@
#include "AArch64.h"
#include "Utils/AArch64SMEAttributes.h"
#include "llvm/ADT/StringRef.h"
+#include "llvm/CodeGen/TargetLowering.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicsAArch64.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
+#include "llvm/IR/RuntimeLibcalls.h"
+#include "llvm/Target/TargetMachine.h"
#include "llvm/Transforms/Utils/Cloning.h"
using namespace llvm;
@@ -33,9 +38,13 @@ struct SMEABI : public FunctionPass {
bool runOnFunction(Function &F) override;
+ void getAnalysisUsage(AnalysisUsage &AU) const override {
+ AU.addRequired<TargetPassConfig>();
+ }
+
private:
bool updateNewStateFunctions(Module *M, Function *F, IRBuilder<> &Builder,
- SMEAttrs FnAttrs);
+ SMEAttrs FnAttrs, const TargetLowering &TLI);
};
} // end anonymous namespace
@@ -51,14 +60,16 @@ FunctionPass *llvm::createSMEABIPass() { return new SMEABI(); }
//===----------------------------------------------------------------------===//
// Utility function to emit a call to __arm_tpidr2_save and clear TPIDR2_EL0.
-void emitTPIDR2Save(Module *M, IRBuilder<> &Builder, bool ZT0IsUndef = false) {
+void emitTPIDR2Save(Module *M, IRBuilder<> &Builder, const TargetLowering &TLI,
+ bool ZT0IsUndef = false) {
auto &Ctx = M->getContext();
auto *TPIDR2SaveTy =
FunctionType::get(Builder.getVoidTy(), {}, /*IsVarArgs=*/false);
auto Attrs =
AttributeList().addFnAttribute(Ctx, "aarch64_pstate_sm_compatible");
+ RTLIB::Libcall LC = RTLIB::SMEABI_TPIDR2_SAVE;
FunctionCallee Callee =
- M->getOrInsertFunction("__arm_tpidr2_save", TPIDR2SaveTy, Attrs);
+ M->getOrInsertFunction(TLI.getLibcallName(LC), TPIDR2SaveTy, Attrs);
CallInst *Call = Builder.CreateCall(Callee);
// If ZT0 is undefined (i.e. we're at the entry of a "new_zt0" function), mark
@@ -67,8 +78,7 @@ void emitTPIDR2Save(Module *M, IRBuilder<> &Builder, bool ZT0IsUndef = false) {
if (ZT0IsUndef)
Call->addFnAttr(Attribute::get(Ctx, "aarch64_zt0_undef"));
- Call->setCallingConv(
- CallingConv::AArch64_SME_ABI_Support_Routines_PreserveMost_From_X0);
+ Call->setCallingConv(TLI.getLibcallCallingConv(LC));
// A save to TPIDR2 should be followed by clearing TPIDR2_EL0.
Function *WriteIntr =
@@ -98,7 +108,8 @@ void emitTPIDR2Save(Module *M, IRBuilder<> &Builder, bool ZT0IsUndef = false) {
/// interface if it does not share ZA or ZT0.
///
bool SMEABI::updateNewStateFunctions(Module *M, Function *F,
- IRBuilder<> &Builder, SMEAttrs FnAttrs) {
+ IRBuilder<> &Builder, SMEAttrs FnAttrs,
+ const TargetLowering &TLI) {
LLVMContext &Context = F->getContext();
BasicBlock *OrigBB = &F->getEntryBlock();
Builder.SetInsertPoint(&OrigBB->front());
@@ -124,7 +135,7 @@ bool SMEABI::updateNewStateFunctions(Module *M, Function *F,
// Create a call __arm_tpidr2_save, which commits the lazy save.
Builder.SetInsertPoint(&SaveBB->back());
- emitTPIDR2Save(M, Builder, /*ZT0IsUndef=*/FnAttrs.isNewZT0());
+ emitTPIDR2Save(M, Builder, TLI, /*ZT0IsUndef=*/FnAttrs.isNewZT0());
// Enable pstate.za at the start of the function.
Builder.SetInsertPoint(&OrigBB->front());
@@ -172,10 +183,14 @@ bool SMEABI::runOnFunction(Function &F) {
if (F.isDeclaration() || F.hasFnAttribute("aarch64_expanded_pstate_za"))
return false;
+ const TargetMachine &TM =
+ getAnalysis<TargetPassConfig>().getTM<TargetMachine>();
+ const TargetLowering &TLI = *TM.getSubtargetImpl(F)->getTargetLowering();
+
bool Changed = false;
SMEAttrs FnAttrs(F);
if (FnAttrs.isNewZA() || FnAttrs.isNewZT0())
- Changed |= updateNewStateFunctions(M, &F, Builder, FnAttrs);
+ Changed |= updateNewStateFunctions(M, &F, Builder, FnAttrs, TLI);
return Changed;
}
diff --git a/llvm/lib/Target/AArch64/Utils/AArch64SMEAttributes.cpp b/llvm/lib/Target/AArch64/Utils/AArch64SMEAttributes.cpp
index 271094f935e0e..dd6fa167c6f4d 100644
--- a/llvm/lib/Target/AArch64/Utils/AArch64SMEAttributes.cpp
+++ b/llvm/lib/Target/AArch64/Utils/AArch64SMEAttributes.cpp
@@ -7,17 +7,14 @@
//===----------------------------------------------------------------------===//
#include "AArch64SMEAttributes.h"
+#include "AArch64ISelLowering.h"
#include "llvm/IR/InstrTypes.h"
+#include "llvm/IR/RuntimeLibcalls.h"
#include <cassert>
using namespace llvm;
-void SMEAttrs::set(unsigned M, bool Enable) {
- if (Enable)
- Bitmask |= M;
- else
- Bitmask &= ~M;
-
+void SMEAttrs::validate() const {
// Streaming Mode Attrs
assert(!(hasStreamingInterface() && hasStreamingCompatibleInterface()) &&
"SM_Enabled and SM_Compatible are mutually exclusive");
@@ -77,19 +74,36 @@ SMEAttrs::SMEAttrs(const AttributeList &Attrs) {
Bitmask |= encodeZT0State(StateValue::New);
}
-void SMEAttrs::addKnownFunctionAttrs(StringRef FuncName) {
+void SMEAttrs::addKnownFunctionAttrs(StringRef FuncName,
+ const AArch64TargetLowering &TLI) {
+ RTLIB::LibcallImpl Impl = TLI.getSupportedLibcallImpl(FuncName);
+ if (Impl == RTLIB::Unsupported)
+ return;
unsigned KnownAttrs = SMEAttrs::Normal;
- if (FuncName == "__arm_tpidr2_save" || FuncName == "__arm_sme_state")
- KnownAttrs |= (SMEAttrs::SM_Compatible | SMEAttrs::SME_ABI_Routine);
- if (FuncName == "__arm_tpidr2_restore")
+ RTLIB::Libcall LC = RTLIB::RuntimeLibcallsInfo::getLibcallFromImpl(Impl);
+ switch (LC) {
+ case RTLIB::SMEABI_SME_STATE:
+ case RTLIB::SMEABI_TPIDR2_SAVE:
+ case RTLIB::SMEABI_GET_CURRENT_VG:
+ case RTLIB::SMEABI_SME_STATE_SIZE:
+ case RTLIB::SMEABI_SME_SAVE:
+ case RTLIB::SMEABI_SME_RESTORE:
+ KnownAttrs |= SMEAttrs::SM_Compatible | SMEAttrs::SME_ABI_Routine;
+ break;
+ case RTLIB::SMEABI_ZA_DISABLE:
+ case RTLIB::SMEABI_TPIDR2_RESTORE:
KnownAttrs |= SMEAttrs::SM_Compatible | encodeZAState(StateValue::In) |
SMEAttrs::SME_ABI_Routine;
- if (FuncName == "__arm_sc_memcpy" || FuncName == "__arm_sc_memset" ||
- FuncName == "__arm_sc_memmove" || FuncName == "__arm_sc_memchr")
+ break;
+ case RTLIB::SC_MEMCPY:
+ case RTLIB::SC_MEMMOVE:
+ case RTLIB::SC_MEMSET:
+ case RTLIB::SC_MEMCHR:
KnownAttrs |= SMEAttrs::SM_Compatible;
- if (FuncName == "__arm_sme_save" || FuncName == "__arm_sme_restore" ||
- FuncName == "__arm_sme_state_size")
- KnownAttrs |= SMEAttrs::SM_Compatible | SMEAttrs::SME_ABI_Routine;
+ break;
+ default:
+ break;
+ }
set(KnownAttrs);
}
@@ -110,11 +124,11 @@ bool SMECallAttrs::requiresSMChange() const {
return true;
}
-SMECallAttrs::SMECallAttrs(const CallBase &CB)
+SMECallAttrs::SMECallAttrs(const CallBase &CB, const AArch64TargetLowering *TLI)
: CallerFn(*CB.getFunction()), CalledFn(SMEAttrs::Normal),
Callsite(CB.getAttributes()), IsIndirect(CB.isIndirectCall()) {
if (auto *CalledFunction = CB.getCalledFunction())
- CalledFn = SMEAttrs(*CalledFunction, SMEAttrs::InferAttrsFromName::Yes);
+ CalledFn = SMEAttrs(*CalledFunction, TLI);
// FIXME: We probably should not allow SME attributes on direct calls but
// clang duplicates streaming mode attributes at each callsite.
diff --git a/llvm/lib/Target/AArch64/Utils/AArch64SMEAttributes.h b/llvm/lib/Target/AArch64/Utils/AArch64SMEAttributes.h
index f1be0ecbee7ed..48f9da02d3182 100644
--- a/llvm/lib/Target/AArch64/Utils/AArch64SMEAttributes.h
+++ b/llvm/lib/Target/AArch64/Utils/AArch64SMEAttributes.h
@@ -13,6 +13,8 @@
namespace llvm {
+class AArch64TargetLowering;
+
class Function;
class CallBase;
class AttributeList;
@@ -48,19 +50,27 @@ class SMEAttrs {
CallSiteFlags_Mask = ZT0_Undef
};
- enum class InferAttrsFromName { No, Yes };
-
SMEAttrs() = default;
SMEAttrs(unsigned Mask) { set(Mask); }
- SMEAttrs(const Function &F, InferAttrsFromName Infer = InferAttrsFromName::No)
+ SMEAttrs(const Function &F, const AArch64TargetLowering *TLI = nullptr)
: SMEAttrs(F.getAttributes()) {
- if (Infer == InferAttrsFromName::Yes)
- addKnownFunctionAttrs(F.getName());
+ if (TLI)
+ addKnownFunctionAttrs(F.getName(), *TLI);
}
SMEAttrs(const AttributeList &L);
- SMEAttrs(StringRef FuncName) { addKnownFunctionAttrs(FuncName); };
+ SMEAttrs(StringRef FuncName, const AArch64TargetLowering &TLI) {
+ addKnownFunctionAttrs(FuncName, TLI);
+ };
- void set(unsigned M, bool Enable = true);
+ void set(unsigned M, bool Enable = true) {
+ if (Enable)
+ Bitmask |= M;
+ else
+ Bitmask &= ~M;
+#ifndef NDEBUG
+ validate();
+#endif
+ }
// Interfaces to query PSTATE.SM
bool hasStreamingBody() const { return Bitmask & SM_Body; }
@@ -146,7 +156,9 @@ class SMEAttrs {
}
private:
- void addKnownFunctionAttrs(StringRef FuncName);
+ void addKnownFunctionAttrs(StringRef FuncName,
+ const AArch64TargetLowering &TLI);
+ void validate() const;
};
/// SMECallAttrs is a utility class to hold the SMEAttrs for a callsite. It has
@@ -163,7 +175,7 @@ class SMECallAttrs {
SMEAttrs Callsite = SMEAttrs::Normal)
: CallerFn(Caller), CalledFn(Callee), Callsite(Callsite) {}
- SMECallAttrs(const CallBase &CB);
+ SMECallAttrs(const CallBase &CB, const AArch64TargetLowering *TLI);
SMEAttrs &caller() { return CallerFn; }
SMEAttrs &callee() { return IsIndirect ? Callsite : CalledFn; }
diff --git a/llvm/unittests/Target/AArch64/SMEAttributesTest.cpp b/llvm/unittests/Target/AArch64/SMEAttributesTest.cpp
index f13252f3a4c28..e90f733d79fca 100644
--- a/llvm/unittests/Target/AArch64/SMEAttributesTest.cpp
+++ b/llvm/unittests/Target/AArch64/SMEAttributesTest.cpp
@@ -78,7 +78,7 @@ TEST(SMEAttributes, Constructors) {
"ret void\n}");
CallBase &Call =
cast<CallBase>((CallModule->getFunction("foo")->begin()->front()));
- ASSERT_TRUE(SMECallAttrs(Call).callsite().hasUndefZT0());
+ ASSERT_TRUE(SMECallAttrs(Call, nullptr).callsite().hasUndefZT0());
// Invalid combinations.
EXPECT_DEBUG_DEATH(SA(SA::SM_Enabled | SA::SM_Compatible),
From 858d1dfa2c4823422c8c6b0459130954cf89fb73 Mon Sep 17 00:00:00 2001
From: Simon Pilgrim <llvm-dev at redking.me.uk>
Date: Mon, 18 Aug 2025 14:55:09 +0100
Subject: [PATCH 016/112] [DAG] visitTRUNCATE - early out from
computeKnownBits/ComputeNumSignBits failures. NFC. (#154111)
Avoid unnecessary (costly) computeKnownBits/ComputeNumSignBits calls - use MaskedValueIsZero instead of computeKnownBits directly to simplify code.
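
A minimal sketch of why the mask test can stand in for the leading-zero count in the ABDU case. It works on concrete 64-bit-or-narrower values rather than on KnownBits, so it only illustrates the arithmetic, not the DAG analysis itself. The helper names (bitsSetFrom, leadingZeros, foldViaMask, foldViaLeadingZeros) are hypothetical and are not LLVM APIs; bitsSetFrom is assumed to mimic APInt::getBitsSetFrom for widths up to 64.

```cpp
#include <cstdint>

// Hypothetical stand-in for APInt::getBitsSetFrom(SrcBits, TruncBits):
// a SrcBits-wide mask with bits [TruncBits, SrcBits) set.
static uint64_t bitsSetFrom(unsigned SrcBits, unsigned TruncBits) {
  uint64_t All = SrcBits >= 64 ? ~0ULL : ((1ULL << SrcBits) - 1);
  uint64_t Low = TruncBits >= 64 ? ~0ULL : ((1ULL << TruncBits) - 1);
  return All & ~Low;
}

// Count leading zero bits of V when viewed as a Bits-wide integer.
static unsigned leadingZeros(uint64_t V, unsigned Bits) {
  unsigned N = 0;
  for (unsigned I = Bits; I-- > 0 && !((V >> I) & 1);)
    ++N;
  return N;
}

// New-style check: every bit that truncation would drop is zero.
static bool foldViaMask(uint64_t V, unsigned SrcBits, unsigned TruncBits) {
  return (V & bitsSetFrom(SrcBits, TruncBits)) == 0;
}

// Old-style check: at least (SrcBits - TruncBits) leading zeros.
static bool foldViaLeadingZeros(uint64_t V, unsigned SrcBits,
                                unsigned TruncBits) {
  return leadingZeros(V, SrcBits) >= SrcBits - TruncBits;
}
```

For any value the two predicates agree, because "the top NeededBits bits are zero" and "the value has at least NeededBits leading zeros" are the same statement. The mask form also lets the second query be skipped when the first one fails.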
---
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 785245b2d9e74..43d4138df8b49 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -16332,25 +16332,22 @@ SDValue DAGCombiner::visitTRUNCATE(SDNode *N) {
// (trunc (abdu/abds a, b)) -> (abdu/abds (trunc a), (trunc b))
if ((!LegalOperations || N0.hasOneUse()) &&
TLI.isOperationLegal(N0.getOpcode(), VT)) {
- EVT SrcVT = N0.getValueType();
EVT TruncVT = VT;
unsigned SrcBits = SrcVT.getScalarSizeInBits();
unsigned TruncBits = TruncVT.getScalarSizeInBits();
- unsigned NeededBits = SrcBits - TruncBits;
SDValue A = N0.getOperand(0);
SDValue B = N0.getOperand(1);
bool CanFold = false;
if (N0.getOpcode() == ISD::ABDU) {
- KnownBits KnownA = DAG.computeKnownBits(A);
- KnownBits KnownB = DAG.computeKnownBits(B);
- CanFold = KnownA.countMinLeadingZeros() >= NeededBits &&
- KnownB.countMinLeadingZeros() >= NeededBits;
+ APInt UpperBits = APInt::getBitsSetFrom(SrcBits, TruncBits);
+ CanFold = DAG.MaskedValueIsZero(B, UpperBits) &&
+ DAG.MaskedValueIsZero(A, UpperBits);
} else {
- unsigned SignBitsA = DAG.ComputeNumSignBits(A);
- unsigned SignBitsB = DAG.ComputeNumSignBits(B);
- CanFold = SignBitsA > NeededBits && SignBitsB > NeededBits;
+ unsigned NeededBits = SrcBits - TruncBits;
+ CanFold = DAG.ComputeNumSignBits(B) > NeededBits &&
+ DAG.ComputeNumSignBits(A) > NeededBits;
}
if (CanFold) {
From 0e52092ff7c1e1a1283fe8c232dd221a170e3fdc Mon Sep 17 00:00:00 2001
From: AZero13 <gfunni234 at gmail.com>
Date: Mon, 18 Aug 2025 09:56:45 -0400
Subject: [PATCH 017/112] [AArch64] Adjust comparison constant if adjusting it
means less instructions (#151024)
Prefer constants that require fewer instructions to materialize, in both
GlobalISel and SelectionDAG.
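
A rough sketch of the idea for the unsigned less-than case, under loudly stated assumptions: approxInstrsToLoadImm is a hypothetical, simplified cost model that only counts non-zero 16-bit chunks (MOVZ plus MOVK), whereas the real AArch64_IMM::expandMOVImm also considers MOVN and logical-immediate encodings. AdjustedCmp, adjustULT and evaluate are likewise illustrative names, not part of the patch, and the legal-compare-immediate check is left out.

```cpp
#include <cstdint>

// Hypothetical cost model: one instruction per non-zero 16-bit chunk of a
// 32-bit constant, with a minimum of one. This is NOT expandMOVImm.
static unsigned approxInstrsToLoadImm(uint32_t Imm) {
  unsigned N = 0;
  for (unsigned Shift = 0; Shift < 32; Shift += 16)
    if ((Imm >> Shift) & 0xFFFFu)
      ++N;
  return N == 0 ? 1 : N;
}

// Result of the adjustment: either "x u< C" (UseULE == false) or
// "x u<= C" (UseULE == true).
struct AdjustedCmp {
  bool UseULE;
  uint32_t C;
};

// Rewrite "x u< C" as "x u<= C - 1" when C != 0 and C - 1 is cheaper to
// materialize under the simplified cost model above.
static AdjustedCmp adjustULT(uint32_t C) {
  if (C != 0 && approxInstrsToLoadImm(C) > approxInstrsToLoadImm(C - 1))
    return {true, C - 1};
  return {false, C};
}

// Evaluate the (possibly adjusted) comparison for a concrete x.
static bool evaluate(AdjustedCmp Cmp, uint32_t X) {
  return Cmp.UseULE ? X <= Cmp.C : X < Cmp.C;
}
```

With this model, "x u< 0x12340001" becomes "x u<= 0x12340000": the constant drops from two chunks to one while the comparison result stays the same for every x. The other predicates follow the same pattern with C + 1 and the matching boundary condition.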
---
.../Target/AArch64/AArch64ISelLowering.cpp | 45 +-
.../GISel/AArch64PostLegalizerLowering.cpp | 16 +-
llvm/test/CodeGen/AArch64/icmp-cst.ll | 740 ++++++------------
llvm/test/CodeGen/AArch64/srem-seteq.ll | 10 +-
.../CodeGen/AArch64/urem-seteq-optsize.ll | 5 +-
llvm/test/CodeGen/AArch64/urem-seteq.ll | 5 +-
6 files changed, 277 insertions(+), 544 deletions(-)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 95c0954174fb9..c27bf82157393 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -3518,6 +3518,13 @@ bool isLegalCmpImmed(APInt C) {
return isLegalArithImmed(C.abs().getZExtValue());
}
+unsigned numberOfInstrToLoadImm(APInt C) {
+ uint64_t Imm = C.getZExtValue();
+ SmallVector<AArch64_IMM::ImmInsnModel> Insn;
+ AArch64_IMM::expandMOVImm(Imm, 32, Insn);
+ return Insn.size();
+}
+
static bool isSafeSignedCMN(SDValue Op, SelectionDAG &DAG) {
// 0 - INT_MIN sign wraps, so no signed wrap means cmn is safe.
if (Op->getFlags().hasNoSignedWrap())
@@ -3987,6 +3994,7 @@ static SDValue getAArch64Cmp(SDValue LHS, SDValue RHS, ISD::CondCode CC,
// CC has already been adjusted.
RHS = DAG.getConstant(0, DL, VT);
} else if (!isLegalCmpImmed(C)) {
+ unsigned NumImmForC = numberOfInstrToLoadImm(C);
// Constant does not fit, try adjusting it by one?
switch (CC) {
default:
@@ -3995,43 +4003,48 @@ static SDValue getAArch64Cmp(SDValue LHS, SDValue RHS, ISD::CondCode CC,
case ISD::SETGE:
if (!C.isMinSignedValue()) {
APInt CMinusOne = C - 1;
- if (isLegalCmpImmed(CMinusOne)) {
+ if (isLegalCmpImmed(CMinusOne) ||
+ (NumImmForC > numberOfInstrToLoadImm(CMinusOne))) {
CC = (CC == ISD::SETLT) ? ISD::SETLE : ISD::SETGT;
RHS = DAG.getConstant(CMinusOne, DL, VT);
}
}
break;
case ISD::SETULT:
- case ISD::SETUGE:
- if (!C.isZero()) {
- APInt CMinusOne = C - 1;
- if (isLegalCmpImmed(CMinusOne)) {
- CC = (CC == ISD::SETULT) ? ISD::SETULE : ISD::SETUGT;
- RHS = DAG.getConstant(CMinusOne, DL, VT);
- }
+ case ISD::SETUGE: {
+ // C cannot be 0 here because 0 is a legal immediate.
+ assert(!C.isZero() && "C should not be zero here");
+ APInt CMinusOne = C - 1;
+ if (isLegalCmpImmed(CMinusOne) ||
+ (NumImmForC > numberOfInstrToLoadImm(CMinusOne))) {
+ CC = (CC == ISD::SETULT) ? ISD::SETULE : ISD::SETUGT;
+ RHS = DAG.getConstant(CMinusOne, DL, VT);
}
break;
+ }
case ISD::SETLE:
case ISD::SETGT:
if (!C.isMaxSignedValue()) {
APInt CPlusOne = C + 1;
- if (isLegalCmpImmed(CPlusOne)) {
+ if (isLegalCmpImmed(CPlusOne) ||
+ (NumImmForC > numberOfInstrToLoadImm(CPlusOne))) {
CC = (CC == ISD::SETLE) ? ISD::SETLT : ISD::SETGE;
RHS = DAG.getConstant(CPlusOne, DL, VT);
}
}
break;
case ISD::SETULE:
- case ISD::SETUGT:
- if (!C.isAllOnes()) {
- APInt CPlusOne = C + 1;
- if (isLegalCmpImmed(CPlusOne)) {
- CC = (CC == ISD::SETULE) ? ISD::SETULT : ISD::SETUGE;
- RHS = DAG.getConstant(CPlusOne, DL, VT);
- }
+ case ISD::SETUGT: {
+ assert(!C.isAllOnes() && "C should not be -1 here");
+ APInt CPlusOne = C + 1;
+ if (isLegalCmpImmed(CPlusOne) ||
+ (NumImmForC > numberOfInstrToLoadImm(CPlusOne))) {
+ CC = (CC == ISD::SETULE) ? ISD::SETULT : ISD::SETUGE;
+ RHS = DAG.getConstant(CPlusOne, DL, VT);
}
break;
}
+ }
}
}
diff --git a/llvm/lib/Target/AArch64/GISel/AArch64PostLegalizerLowering.cpp b/llvm/lib/Target/AArch64/GISel/AArch64PostLegalizerLowering.cpp
index 3ba08c8c1d988..2abe0dd0bbdc2 100644
--- a/llvm/lib/Target/AArch64/GISel/AArch64PostLegalizerLowering.cpp
+++ b/llvm/lib/Target/AArch64/GISel/AArch64PostLegalizerLowering.cpp
@@ -614,8 +614,7 @@ tryAdjustICmpImmAndPred(Register RHS, CmpInst::Predicate P,
// x uge c => x ugt c - 1
//
// When c is not zero.
- if (C == 0)
- return std::nullopt;
+ assert(C != 0 && "C should not be zero here!");
P = (P == CmpInst::ICMP_ULT) ? CmpInst::ICMP_ULE : CmpInst::ICMP_UGT;
C -= 1;
break;
@@ -640,10 +639,8 @@ tryAdjustICmpImmAndPred(Register RHS, CmpInst::Predicate P,
// x ule c => x ult c + 1
 // x ugt c => x uge c + 1
//
- // When c is not the largest possible unsigned integer.
- if ((Size == 32 && static_cast<uint32_t>(C) == UINT32_MAX) ||
- (Size == 64 && C == UINT64_MAX))
- return std::nullopt;
+ assert(C != (Size == 32 ? UINT32_MAX : UINT64_MAX) &&
+ "C should not be -1 here!");
P = (P == CmpInst::ICMP_ULE) ? CmpInst::ICMP_ULT : CmpInst::ICMP_UGE;
C += 1;
break;
@@ -656,14 +653,13 @@ tryAdjustICmpImmAndPred(Register RHS, CmpInst::Predicate P,
if (isLegalArithImmed(C))
return {{C, P}};
- auto IsMaterializableInSingleInstruction = [=](uint64_t Imm) {
+ auto NumberOfInstrToLoadImm = [=](uint64_t Imm) {
SmallVector<AArch64_IMM::ImmInsnModel> Insn;
AArch64_IMM::expandMOVImm(Imm, 32, Insn);
- return Insn.size() == 1;
+ return Insn.size();
};
- if (!IsMaterializableInSingleInstruction(OriginalC) &&
- IsMaterializableInSingleInstruction(C))
+ if (NumberOfInstrToLoadImm(OriginalC) > NumberOfInstrToLoadImm(C))
return {{C, P}};
return std::nullopt;
diff --git a/llvm/test/CodeGen/AArch64/icmp-cst.ll b/llvm/test/CodeGen/AArch64/icmp-cst.ll
index b6f452bb42cec..b75e3535bf821 100644
--- a/llvm/test/CodeGen/AArch64/icmp-cst.ll
+++ b/llvm/test/CodeGen/AArch64/icmp-cst.ll
@@ -1,687 +1,415 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
-; RUN: llc -mtriple=aarch64-linux-gnu -global-isel=0 < %s | FileCheck %s --check-prefix=CHECK-SD
-; RUN: llc -mtriple=aarch64-linux-gnu -global-isel=1 < %s | FileCheck %s --check-prefix=CHECK-GI
+; RUN: llc -mtriple=aarch64-linux-gnu -global-isel=0 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-SD
+; RUN: llc -mtriple=aarch64-linux-gnu -global-isel=1 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-GI
define i1 @ule_11111111(i32 noundef %in) {
-; CHECK-SD-LABEL: ule_11111111:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #4370 // =0x1112
-; CHECK-SD-NEXT: movk w8, #4369, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: ule_11111111:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #286331153 // =0x11111111
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: ule_11111111:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #286331153 // =0x11111111
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, 286331154
ret i1 %out
}
define i1 @ule_22222222(i32 noundef %in) {
-; CHECK-SD-LABEL: ule_22222222:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #8739 // =0x2223
-; CHECK-SD-NEXT: movk w8, #8738, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: ule_22222222:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #572662306 // =0x22222222
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: ule_22222222:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #572662306 // =0x22222222
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, 572662307
ret i1 %out
}
define i1 @ule_33333333(i32 noundef %in) {
-; CHECK-SD-LABEL: ule_33333333:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #13108 // =0x3334
-; CHECK-SD-NEXT: movk w8, #13107, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: ule_33333333:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #858993459 // =0x33333333
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: ule_33333333:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #858993459 // =0x33333333
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, 858993460
ret i1 %out
}
define i1 @ule_44444444(i32 noundef %in) {
-; CHECK-SD-LABEL: ule_44444444:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #17477 // =0x4445
-; CHECK-SD-NEXT: movk w8, #17476, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: ule_44444444:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1145324612 // =0x44444444
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: ule_44444444:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1145324612 // =0x44444444
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, 1145324613
ret i1 %out
}
define i1 @ule_55555555(i32 noundef %in) {
-; CHECK-SD-LABEL: ule_55555555:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #21846 // =0x5556
-; CHECK-SD-NEXT: movk w8, #21845, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: ule_55555555:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1431655765 // =0x55555555
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: ule_55555555:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1431655765 // =0x55555555
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, 1431655766
ret i1 %out
}
define i1 @ule_66666666(i32 noundef %in) {
-; CHECK-SD-LABEL: ule_66666666:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #26215 // =0x6667
-; CHECK-SD-NEXT: movk w8, #26214, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: ule_66666666:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1717986918 // =0x66666666
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: ule_66666666:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1717986918 // =0x66666666
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, 1717986919
ret i1 %out
}
define i1 @ule_77777777(i32 noundef %in) {
-; CHECK-SD-LABEL: ule_77777777:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #30584 // =0x7778
-; CHECK-SD-NEXT: movk w8, #30583, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: ule_77777777:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #2004318071 // =0x77777777
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: ule_77777777:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #2004318071 // =0x77777777
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, 2004318072
ret i1 %out
}
define i1 @ule_88888888(i32 noundef %in) {
-; CHECK-SD-LABEL: ule_88888888:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #34953 // =0x8889
-; CHECK-SD-NEXT: movk w8, #34952, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: ule_88888888:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #-2004318072 // =0x88888888
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: ule_88888888:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #-2004318072 // =0x88888888
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, -2004318071
ret i1 %out
}
define i1 @ule_99999999(i32 noundef %in) {
-; CHECK-SD-LABEL: ule_99999999:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #39322 // =0x999a
-; CHECK-SD-NEXT: movk w8, #39321, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: ule_99999999:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #-1717986919 // =0x99999999
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: ule_99999999:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #-1717986919 // =0x99999999
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, -1717986918
ret i1 %out
}
define i1 @uge_11111111(i32 noundef %in) {
-; CHECK-SD-LABEL: uge_11111111:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #4368 // =0x1110
-; CHECK-SD-NEXT: movk w8, #4369, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: uge_11111111:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #286331153 // =0x11111111
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: uge_11111111:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #286331153 // =0x11111111
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, 286331152
ret i1 %out
}
define i1 @uge_22222222(i32 noundef %in) {
-; CHECK-SD-LABEL: uge_22222222:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #8737 // =0x2221
-; CHECK-SD-NEXT: movk w8, #8738, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: uge_22222222:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #572662306 // =0x22222222
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: uge_22222222:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #572662306 // =0x22222222
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, 572662305
ret i1 %out
}
define i1 @uge_33333333(i32 noundef %in) {
-; CHECK-SD-LABEL: uge_33333333:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #13106 // =0x3332
-; CHECK-SD-NEXT: movk w8, #13107, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: uge_33333333:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #858993459 // =0x33333333
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: uge_33333333:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #858993459 // =0x33333333
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, 858993458
ret i1 %out
}
define i1 @uge_44444444(i32 noundef %in) {
-; CHECK-SD-LABEL: uge_44444444:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #17475 // =0x4443
-; CHECK-SD-NEXT: movk w8, #17476, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: uge_44444444:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1145324612 // =0x44444444
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: uge_44444444:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1145324612 // =0x44444444
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, 1145324611
ret i1 %out
}
define i1 @uge_55555555(i32 noundef %in) {
-; CHECK-SD-LABEL: uge_55555555:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #21844 // =0x5554
-; CHECK-SD-NEXT: movk w8, #21845, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: uge_55555555:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1431655765 // =0x55555555
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: uge_55555555:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1431655765 // =0x55555555
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, 1431655764
ret i1 %out
}
define i1 @uge_66666666(i32 noundef %in) {
-; CHECK-SD-LABEL: uge_66666666:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #26213 // =0x6665
-; CHECK-SD-NEXT: movk w8, #26214, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: uge_66666666:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1717986918 // =0x66666666
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: uge_66666666:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1717986918 // =0x66666666
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, 1717986917
ret i1 %out
}
define i1 @uge_77777777(i32 noundef %in) {
-; CHECK-SD-LABEL: uge_77777777:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #30582 // =0x7776
-; CHECK-SD-NEXT: movk w8, #30583, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: uge_77777777:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #2004318071 // =0x77777777
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: uge_77777777:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #2004318071 // =0x77777777
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, 2004318070
ret i1 %out
}
define i1 @uge_88888888(i32 noundef %in) {
-; CHECK-SD-LABEL: uge_88888888:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #34951 // =0x8887
-; CHECK-SD-NEXT: movk w8, #34952, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: uge_88888888:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #-2004318072 // =0x88888888
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: uge_88888888:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #-2004318072 // =0x88888888
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, -2004318073
ret i1 %out
}
define i1 @uge_99999999(i32 noundef %in) {
-; CHECK-SD-LABEL: uge_99999999:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #39320 // =0x9998
-; CHECK-SD-NEXT: movk w8, #39321, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: uge_99999999:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #-1717986919 // =0x99999999
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: uge_99999999:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #-1717986919 // =0x99999999
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, -1717986920
ret i1 %out
}
define i1 @sle_11111111(i32 noundef %in) {
-; CHECK-SD-LABEL: sle_11111111:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #4370 // =0x1112
-; CHECK-SD-NEXT: movk w8, #4369, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sle_11111111:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #286331153 // =0x11111111
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, le
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sle_11111111:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #286331153 // =0x11111111
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, le
+; CHECK-NEXT: ret
%out = icmp slt i32 %in, 286331154
ret i1 %out
}
define i1 @sle_22222222(i32 noundef %in) {
-; CHECK-SD-LABEL: sle_22222222:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #8739 // =0x2223
-; CHECK-SD-NEXT: movk w8, #8738, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sle_22222222:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #572662306 // =0x22222222
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, le
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sle_22222222:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #572662306 // =0x22222222
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, le
+; CHECK-NEXT: ret
%out = icmp slt i32 %in, 572662307
ret i1 %out
}
define i1 @sle_33333333(i32 noundef %in) {
-; CHECK-SD-LABEL: sle_33333333:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #13108 // =0x3334
-; CHECK-SD-NEXT: movk w8, #13107, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sle_33333333:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #858993459 // =0x33333333
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, le
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sle_33333333:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #858993459 // =0x33333333
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, le
+; CHECK-NEXT: ret
%out = icmp slt i32 %in, 858993460
ret i1 %out
}
define i1 @sle_44444444(i32 noundef %in) {
-; CHECK-SD-LABEL: sle_44444444:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #17477 // =0x4445
-; CHECK-SD-NEXT: movk w8, #17476, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sle_44444444:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1145324612 // =0x44444444
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, le
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sle_44444444:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1145324612 // =0x44444444
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, le
+; CHECK-NEXT: ret
%out = icmp slt i32 %in, 1145324613
ret i1 %out
}
define i1 @sle_55555555(i32 noundef %in) {
-; CHECK-SD-LABEL: sle_55555555:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #21846 // =0x5556
-; CHECK-SD-NEXT: movk w8, #21845, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sle_55555555:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1431655765 // =0x55555555
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, le
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sle_55555555:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1431655765 // =0x55555555
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, le
+; CHECK-NEXT: ret
%out = icmp slt i32 %in, 1431655766
ret i1 %out
}
define i1 @sle_66666666(i32 noundef %in) {
-; CHECK-SD-LABEL: sle_66666666:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #26215 // =0x6667
-; CHECK-SD-NEXT: movk w8, #26214, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sle_66666666:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1717986918 // =0x66666666
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, le
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sle_66666666:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1717986918 // =0x66666666
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, le
+; CHECK-NEXT: ret
%out = icmp slt i32 %in, 1717986919
ret i1 %out
}
define i1 @sle_77777777(i32 noundef %in) {
-; CHECK-SD-LABEL: sle_77777777:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #30584 // =0x7778
-; CHECK-SD-NEXT: movk w8, #30583, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sle_77777777:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #2004318071 // =0x77777777
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, le
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sle_77777777:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #2004318071 // =0x77777777
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, le
+; CHECK-NEXT: ret
%out = icmp slt i32 %in, 2004318072
ret i1 %out
}
define i1 @sle_88888888(i32 noundef %in) {
-; CHECK-SD-LABEL: sle_88888888:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #34953 // =0x8889
-; CHECK-SD-NEXT: movk w8, #34952, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sle_88888888:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #-2004318072 // =0x88888888
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sle_88888888:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #-2004318072 // =0x88888888
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, -2004318071
ret i1 %out
}
define i1 @sle_99999999(i32 noundef %in) {
-; CHECK-SD-LABEL: sle_99999999:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #39322 // =0x999a
-; CHECK-SD-NEXT: movk w8, #39321, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, lo
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sle_99999999:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #-1717986919 // =0x99999999
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ls
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sle_99999999:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #-1717986919 // =0x99999999
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
%out = icmp ult i32 %in, -1717986918
ret i1 %out
}
define i1 @sge_11111111(i32 noundef %in) {
-; CHECK-SD-LABEL: sge_11111111:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #4368 // =0x1110
-; CHECK-SD-NEXT: movk w8, #4369, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, gt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sge_11111111:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #286331153 // =0x11111111
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ge
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sge_11111111:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #286331153 // =0x11111111
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ge
+; CHECK-NEXT: ret
%out = icmp sgt i32 %in, 286331152
ret i1 %out
}
define i1 @sge_22222222(i32 noundef %in) {
-; CHECK-SD-LABEL: sge_22222222:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #8737 // =0x2221
-; CHECK-SD-NEXT: movk w8, #8738, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, gt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sge_22222222:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #572662306 // =0x22222222
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ge
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sge_22222222:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #572662306 // =0x22222222
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ge
+; CHECK-NEXT: ret
%out = icmp sgt i32 %in, 572662305
ret i1 %out
}
define i1 @sge_33333333(i32 noundef %in) {
-; CHECK-SD-LABEL: sge_33333333:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #13106 // =0x3332
-; CHECK-SD-NEXT: movk w8, #13107, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, gt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sge_33333333:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #858993459 // =0x33333333
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ge
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sge_33333333:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #858993459 // =0x33333333
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ge
+; CHECK-NEXT: ret
%out = icmp sgt i32 %in, 858993458
ret i1 %out
}
define i1 @sge_44444444(i32 noundef %in) {
-; CHECK-SD-LABEL: sge_44444444:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #17475 // =0x4443
-; CHECK-SD-NEXT: movk w8, #17476, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, gt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sge_44444444:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1145324612 // =0x44444444
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ge
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sge_44444444:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1145324612 // =0x44444444
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ge
+; CHECK-NEXT: ret
%out = icmp sgt i32 %in, 1145324611
ret i1 %out
}
define i1 @sge_55555555(i32 noundef %in) {
-; CHECK-SD-LABEL: sge_55555555:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #21844 // =0x5554
-; CHECK-SD-NEXT: movk w8, #21845, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, gt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sge_55555555:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1431655765 // =0x55555555
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ge
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sge_55555555:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1431655765 // =0x55555555
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ge
+; CHECK-NEXT: ret
%out = icmp sgt i32 %in, 1431655764
ret i1 %out
}
define i1 @sge_66666666(i32 noundef %in) {
-; CHECK-SD-LABEL: sge_66666666:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #26213 // =0x6665
-; CHECK-SD-NEXT: movk w8, #26214, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, gt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sge_66666666:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #1717986918 // =0x66666666
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ge
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sge_66666666:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #1717986918 // =0x66666666
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ge
+; CHECK-NEXT: ret
%out = icmp sgt i32 %in, 1717986917
ret i1 %out
}
define i1 @sge_77777777(i32 noundef %in) {
-; CHECK-SD-LABEL: sge_77777777:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #30582 // =0x7776
-; CHECK-SD-NEXT: movk w8, #30583, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, gt
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sge_77777777:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #2004318071 // =0x77777777
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, ge
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sge_77777777:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #2004318071 // =0x77777777
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, ge
+; CHECK-NEXT: ret
%out = icmp sgt i32 %in, 2004318070
ret i1 %out
}
define i1 @sge_88888888(i32 noundef %in) {
-; CHECK-SD-LABEL: sge_88888888:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #34951 // =0x8887
-; CHECK-SD-NEXT: movk w8, #34952, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sge_88888888:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #-2004318072 // =0x88888888
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sge_88888888:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #-2004318072 // =0x88888888
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, -2004318073
ret i1 %out
}
define i1 @sge_99999999(i32 noundef %in) {
-; CHECK-SD-LABEL: sge_99999999:
-; CHECK-SD: // %bb.0:
-; CHECK-SD-NEXT: mov w8, #39320 // =0x9998
-; CHECK-SD-NEXT: movk w8, #39321, lsl #16
-; CHECK-SD-NEXT: cmp w0, w8
-; CHECK-SD-NEXT: cset w0, hi
-; CHECK-SD-NEXT: ret
-;
-; CHECK-GI-LABEL: sge_99999999:
-; CHECK-GI: // %bb.0:
-; CHECK-GI-NEXT: mov w8, #-1717986919 // =0x99999999
-; CHECK-GI-NEXT: cmp w0, w8
-; CHECK-GI-NEXT: cset w0, hs
-; CHECK-GI-NEXT: ret
+; CHECK-LABEL: sge_99999999:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #-1717986919 // =0x99999999
+; CHECK-NEXT: cmp w0, w8
+; CHECK-NEXT: cset w0, hs
+; CHECK-NEXT: ret
%out = icmp ugt i32 %in, -1717986920
ret i1 %out
}
+
+define i1 @ult_20014852997121(i64 noundef %in) {
+; CHECK-LABEL: ult_20014852997121:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov x8, #305397760 // =0x12340000
+; CHECK-NEXT: movk x8, #4660, lsl #32
+; CHECK-NEXT: cmp x0, x8
+; CHECK-NEXT: cset w0, ls
+; CHECK-NEXT: ret
+ %out = icmp ult i64 %in, 20014852997121
+ ret i1 %out
+}
+
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; CHECK-GI: {{.*}}
+; CHECK-SD: {{.*}}
diff --git a/llvm/test/CodeGen/AArch64/srem-seteq.ll b/llvm/test/CodeGen/AArch64/srem-seteq.ll
index 4b8cbc46a6102..3b344feebb58e 100644
--- a/llvm/test/CodeGen/AArch64/srem-seteq.ll
+++ b/llvm/test/CodeGen/AArch64/srem-seteq.ll
@@ -166,10 +166,9 @@ define i32 @test_srem_odd_setne(i32 %X) nounwind {
; CHECK-NEXT: movk w8, #52428, lsl #16
; CHECK-NEXT: movk w9, #6553, lsl #16
; CHECK-NEXT: madd w8, w0, w8, w9
-; CHECK-NEXT: mov w9, #13106 // =0x3332
-; CHECK-NEXT: movk w9, #13107, lsl #16
+; CHECK-NEXT: mov w9, #858993459 // =0x33333333
; CHECK-NEXT: cmp w8, w9
-; CHECK-NEXT: cset w0, hi
+; CHECK-NEXT: cset w0, hs
; CHECK-NEXT: ret
%srem = srem i32 %X, 5
%cmp = icmp ne i32 %srem, 0
@@ -186,10 +185,9 @@ define i32 @test_srem_negative_odd(i32 %X) nounwind {
; CHECK-NEXT: movk w8, #52428, lsl #16
; CHECK-NEXT: movk w9, #6553, lsl #16
; CHECK-NEXT: madd w8, w0, w8, w9
-; CHECK-NEXT: mov w9, #13106 // =0x3332
-; CHECK-NEXT: movk w9, #13107, lsl #16
+; CHECK-NEXT: mov w9, #858993459 // =0x33333333
; CHECK-NEXT: cmp w8, w9
-; CHECK-NEXT: cset w0, hi
+; CHECK-NEXT: cset w0, hs
; CHECK-NEXT: ret
%srem = srem i32 %X, -5
%cmp = icmp ne i32 %srem, 0
diff --git a/llvm/test/CodeGen/AArch64/urem-seteq-optsize.ll b/llvm/test/CodeGen/AArch64/urem-seteq-optsize.ll
index 45726e92463b9..bb5aa1fd0684d 100644
--- a/llvm/test/CodeGen/AArch64/urem-seteq-optsize.ll
+++ b/llvm/test/CodeGen/AArch64/urem-seteq-optsize.ll
@@ -22,14 +22,13 @@ define i32 @test_optsize(i32 %X) optsize nounwind readnone {
; CHECK-LABEL: test_optsize:
; CHECK: // %bb.0:
; CHECK-NEXT: mov w8, #52429 // =0xcccd
-; CHECK-NEXT: mov w9, #13108 // =0x3334
+; CHECK-NEXT: mov w9, #858993459 // =0x33333333
; CHECK-NEXT: movk w8, #52428, lsl #16
-; CHECK-NEXT: movk w9, #13107, lsl #16
; CHECK-NEXT: mul w8, w0, w8
; CHECK-NEXT: cmp w8, w9
; CHECK-NEXT: mov w8, #-10 // =0xfffffff6
; CHECK-NEXT: mov w9, #42 // =0x2a
-; CHECK-NEXT: csel w0, w9, w8, lo
+; CHECK-NEXT: csel w0, w9, w8, ls
; CHECK-NEXT: ret
%rem = urem i32 %X, 5
%cmp = icmp eq i32 %rem, 0
diff --git a/llvm/test/CodeGen/AArch64/urem-seteq.ll b/llvm/test/CodeGen/AArch64/urem-seteq.ll
index df87e60c4f8d5..5473991e77c3e 100644
--- a/llvm/test/CodeGen/AArch64/urem-seteq.ll
+++ b/llvm/test/CodeGen/AArch64/urem-seteq.ll
@@ -9,12 +9,11 @@ define i32 @test_urem_odd(i32 %X) nounwind {
; CHECK-LABEL: test_urem_odd:
; CHECK: // %bb.0:
; CHECK-NEXT: mov w8, #52429 // =0xcccd
-; CHECK-NEXT: mov w9, #13108 // =0x3334
+; CHECK-NEXT: mov w9, #858993459 // =0x33333333
; CHECK-NEXT: movk w8, #52428, lsl #16
-; CHECK-NEXT: movk w9, #13107, lsl #16
; CHECK-NEXT: mul w8, w0, w8
; CHECK-NEXT: cmp w8, w9
-; CHECK-NEXT: cset w0, lo
+; CHECK-NEXT: cset w0, ls
; CHECK-NEXT: ret
%urem = urem i32 %X, 5
%cmp = icmp eq i32 %urem, 0
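
The AArch64 change in the diff above rewrites a compare against a constant C into an equivalent compare against C - 1 or C + 1 whenever the adjusted constant takes fewer instructions to materialize. A minimal sketch of that idea for the unsigned `ult` case follows. It assumes a simplified cost model of one MOVZ/MOVK per non-zero 16-bit chunk; the patch itself asks `AArch64_IMM::expandMOVImm`, which knows more encodings. The names `approxInstrsToLoadImm`, `Pred`, `Cmp`, `adjustULT`, and `eval` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified cost model (an assumption, not the backend's):
// one MOVZ/MOVK per non-zero 16-bit chunk of the immediate.
static unsigned approxInstrsToLoadImm(uint64_t Imm) {
  unsigned N = 0;
  for (unsigned Shift = 0; Shift < 64; Shift += 16)
    if ((Imm >> Shift) & 0xFFFF)
      ++N;
  return N == 0 ? 1 : N; // Zero still needs one move.
}

enum class Pred { ULT, ULE };

struct Cmp {
  Pred P;
  uint64_t C;
};

// x ult c  ==>  x ule c - 1, valid only when c != 0.
// The rewrite is taken only if c - 1 is cheaper to load than c.
static Cmp adjustULT(uint64_t C) {
  assert(C != 0 && "C should not be zero here");
  if (approxInstrsToLoadImm(C) > approxInstrsToLoadImm(C - 1))
    return {Pred::ULE, C - 1};
  return {Pred::ULT, C};
}

// Evaluate the (possibly rewritten) comparison for a concrete x.
static bool eval(Cmp K, uint64_t X) {
  return K.P == Pred::ULT ? X < K.C : X <= K.C;
}
```

Under this model the constant from the new `ult_20014852997121` test, 0x123412340001, needs three moves, while 0x123412340000 needs two, so the sketch selects `ule` against the smaller constant. That matches the `mov`/`movk`/`cset ls` sequence in the test's expected output.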
From 07eb7b76928d6873c60859a0339591ed9e0f512a Mon Sep 17 00:00:00 2001
From: Kazu Hirata <kazu at google.com>
Date: Mon, 18 Aug 2025 07:01:29 -0700
Subject: [PATCH 018/112] [llvm] Replace SmallSet with SmallPtrSet (NFC)
(#154068)
This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>. Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:

  template <typename PointeeType, unsigned N>
  class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N>
  {};

We only have 140 instances that rely on this "redirection", with the
vast majority of them under llvm/. Since relying on the redirection
doesn't improve readability, this patch replaces SmallSet with
SmallPtrSet for pointer element types.
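
The "redirection" described above is a class template partial specialization. A self-contained sketch of the pattern follows, using toy stand-ins rather than the real LLVM containers. `ToySet` and `ToyPtrSet` are hypothetical names, and `std::set` is used for storage purely as an assumption to keep the example standalone.

```cpp
#include <set>
#include <type_traits>

// Hypothetical stand-in for a pointer-only set (plays the role of SmallPtrSet).
template <typename PtrT, unsigned N>
class ToyPtrSet {
  static_assert(std::is_pointer<PtrT>::value, "element type must be a pointer");
  std::set<PtrT> Storage;

public:
  // Returns true if the pointer was newly inserted.
  bool insert(PtrT P) { return Storage.insert(P).second; }
  bool contains(PtrT P) const { return Storage.count(P) != 0; }
};

// Hypothetical stand-in for the general-purpose set (plays the role of SmallSet).
template <typename T, unsigned N>
class ToySet {
  std::set<T> Storage;

public:
  bool insert(const T &V) { return Storage.insert(V).second; }
  bool contains(const T &V) const { return Storage.count(V) != 0; }
};

// Partial specialization: for pointer element types, ToySet is just ToyPtrSet.
template <typename PointeeType, unsigned N>
class ToySet<PointeeType *, N> : public ToyPtrSet<PointeeType *, N> {};

// The pointer case is redirected; the non-pointer case is not.
static_assert(std::is_base_of<ToyPtrSet<int *, 4>, ToySet<int *, 4>>::value,
              "pointer element types redirect to ToyPtrSet");
static_assert(!std::is_base_of<ToyPtrSet<int *, 4>, ToySet<int, 4>>::value,
              "non-pointer element types use the primary template");
```

In this sketch, a reader who sees `ToySet<int *, 4>` at a use site cannot tell that the pointer-set implementation is what actually runs. That is the readability concern the patch addresses by spelling out the pointer-set type directly.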
---
.../llvm/Analysis/GenericDomTreeUpdaterImpl.h | 2 +-
.../llvm/CodeGen/GlobalISel/LoadStoreOpt.h | 2 +-
llvm/include/llvm/CodeGen/MachinePipeliner.h | 2 +-
llvm/include/llvm/CodeGen/ScheduleDAG.h | 2 +-
llvm/lib/Analysis/CallPrinter.cpp | 4 +--
llvm/lib/Analysis/CaptureTracking.cpp | 2 +-
llvm/lib/Analysis/ScalarEvolution.cpp | 2 +-
llvm/lib/Analysis/ValueTracking.cpp | 6 ++--
llvm/lib/CodeGen/CodeGenPrepare.cpp | 15 ++++-----
llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp | 2 +-
llvm/lib/CodeGen/MachineCopyPropagation.cpp | 2 +-
llvm/lib/CodeGen/MachineDebugify.cpp | 2 +-
llvm/lib/CodeGen/MachinePipeliner.cpp | 6 ++--
llvm/lib/CodeGen/MacroFusion.cpp | 2 +-
.../SelectionDAG/SelectionDAGBuilder.cpp | 2 +-
llvm/lib/CodeGen/SwiftErrorValueTracking.cpp | 2 +-
.../Orc/Debugging/DebuggerSupportPlugin.cpp | 2 +-
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 2 +-
llvm/lib/IR/AutoUpgrade.cpp | 2 +-
llvm/lib/IR/Verifier.cpp | 4 +--
llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp | 2 +-
.../Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp | 4 +--
.../Target/AMDGPU/AMDGPUSetWavePriority.cpp | 2 +-
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | 4 +--
llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp | 4 +--
llvm/lib/Target/ARM/ARMConstantIslandPass.cpp | 2 +-
.../ARM/MVETPAndVPTOptimisationsPass.cpp | 2 +-
.../Target/CSKY/CSKYConstantIslandPass.cpp | 2 +-
llvm/lib/Target/Hexagon/HexagonGenInsert.cpp | 2 +-
.../Hexagon/HexagonLoopIdiomRecognition.cpp | 2 +-
llvm/lib/Target/Hexagon/HexagonSubtarget.cpp | 10 +++---
llvm/lib/Target/Hexagon/HexagonSubtarget.h | 3 +-
.../Target/Mips/MipsConstantIslandPass.cpp | 2 +-
llvm/lib/Target/PowerPC/PPCCTRLoopsVerify.cpp | 2 +-
llvm/lib/Target/PowerPC/PPCISelLowering.cpp | 4 +--
.../Target/PowerPC/PPCLoopInstrFormPrep.cpp | 17 +++++-----
llvm/lib/Target/RISCV/RISCVISelLowering.cpp | 2 +-
.../X86LoadValueInjectionLoadHardening.cpp | 2 +-
llvm/lib/Target/X86/X86PreTileConfig.cpp | 2 +-
llvm/lib/Transforms/IPO/FunctionAttrs.cpp | 32 +++++++++----------
.../Transforms/Scalar/DFAJumpThreading.cpp | 12 +++----
llvm/lib/Transforms/Scalar/GVN.cpp | 4 +--
llvm/lib/Transforms/Scalar/GuardWidening.cpp | 4 +--
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp | 2 +-
.../Scalar/LowerMatrixIntrinsics.cpp | 2 +-
.../lib/Transforms/Scalar/MemCpyOptimizer.cpp | 4 +--
llvm/lib/Transforms/Scalar/Reassociate.cpp | 2 +-
llvm/lib/Transforms/Scalar/StructurizeCFG.cpp | 8 ++---
.../Utils/CanonicalizeFreezeInLoops.cpp | 2 +-
.../lib/Transforms/Utils/ControlFlowUtils.cpp | 2 +-
llvm/lib/Transforms/Utils/Local.cpp | 6 ++--
.../Utils/PromoteMemoryToRegister.cpp | 22 ++++++-------
.../Utils/ScalarEvolutionExpander.cpp | 2 +-
.../Transforms/Vectorize/LoopVectorize.cpp | 2 +-
.../Transforms/Vectorize/SLPVectorizer.cpp | 2 +-
55 files changed, 120 insertions(+), 123 deletions(-)
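
Note on the redirection mentioned in the commit message: the sketch below
shows one way a generic set template can forward pointer element types to a
pointer-specialized set through a partial specialization. It is an
illustration only, not the LLVM implementation. SetSketch, PtrSetSketch and
countDistinct are hypothetical names, the storage uses standard containers,
and insert() returns a plain bool for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <type_traits>
#include <unordered_set>

// Hypothetical stand-in for a set specialized for pointer elements.
template <typename PtrT, std::size_t N>
class PtrSetSketch {
  static_assert(std::is_pointer<PtrT>::value,
                "PtrSetSketch requires a pointer element type");
  std::unordered_set<PtrT> Storage; // N is unused in this simplified sketch.

public:
  // Returns true if the pointer was not already present.
  bool insert(PtrT P) { return Storage.insert(P).second; }
  bool contains(PtrT P) const { return Storage.count(P) != 0; }
  std::size_t size() const { return Storage.size(); }
};

// Hypothetical generic set: the primary template handles non-pointer types.
template <typename T, std::size_t N>
class SetSketch {
  std::set<T> Storage;

public:
  bool insert(const T &V) { return Storage.insert(V).second; }
  bool contains(const T &V) const { return Storage.count(V) != 0; }
  std::size_t size() const { return Storage.size(); }
};

// The "redirection": for pointer element types the generic name simply
// inherits from the pointer-specialized set.
template <typename PointeeT, std::size_t N>
class SetSketch<PointeeT *, N> : public PtrSetSketch<PointeeT *, N> {};

// Spelling the pointer set directly, as the patch does, names the container
// that is actually used. Inserts two distinct pointers (one of them twice)
// and returns the number of distinct elements.
inline std::size_t countDistinct() {
  int A = 0, B = 0;
  PtrSetSketch<int *, 8> Seen;
  Seen.insert(&A);
  Seen.insert(&B);
  Seen.insert(&A); // duplicate, not stored again
  return Seen.size();
}
```

With a setup like this, SetSketch<int *, 8> and PtrSetSketch<int *, 8> behave
the same, so writing the pointer set directly costs nothing and tells the
reader which container is in play.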
diff --git a/llvm/include/llvm/Analysis/GenericDomTreeUpdaterImpl.h b/llvm/include/llvm/Analysis/GenericDomTreeUpdaterImpl.h
index 896b68c5021b3..6bfad783b529b 100644
--- a/llvm/include/llvm/Analysis/GenericDomTreeUpdaterImpl.h
+++ b/llvm/include/llvm/Analysis/GenericDomTreeUpdaterImpl.h
@@ -383,7 +383,7 @@ void GenericDomTreeUpdater<DerivedT, DomTreeT, PostDomTreeT>::
// field of all the elements of Edges.
// I.e., forall elt in Edges, it exists BB in NewBBs
// such as BB == elt.NewBB.
- SmallSet<BasicBlockT *, 32> NewBBs;
+ SmallPtrSet<BasicBlockT *, 32> NewBBs;
for (auto &Edge : Edges)
NewBBs.insert(Edge.NewBB);
// For each element in Edges, remember whether or not element
diff --git a/llvm/include/llvm/CodeGen/GlobalISel/LoadStoreOpt.h b/llvm/include/llvm/CodeGen/GlobalISel/LoadStoreOpt.h
index cee779a5fd5d1..4b7506e013762 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/LoadStoreOpt.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/LoadStoreOpt.h
@@ -162,7 +162,7 @@ class LLVM_ABI LoadStoreOpt : public MachineFunctionPass {
DenseMap<unsigned, BitVector> LegalStoreSizes;
bool IsPreLegalizer = false;
/// Contains instructions to be erased at the end of a block scan.
- SmallSet<MachineInstr *, 16> InstsToErase;
+ SmallPtrSet<MachineInstr *, 16> InstsToErase;
public:
LoadStoreOpt();
diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index e50443d25cc60..c90ff4f3daa47 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -830,7 +830,7 @@ class SMSchedule {
return ScheduledInstrs[cycle];
}
- SmallSet<SUnit *, 8>
+ SmallPtrSet<SUnit *, 8>
computeUnpipelineableNodes(SwingSchedulerDAG *SSD,
TargetInstrInfo::PipelinerLoopInfo *PLI);
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAG.h b/llvm/include/llvm/CodeGen/ScheduleDAG.h
index 122b7be96b46a..aee1514581485 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAG.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAG.h
@@ -237,7 +237,7 @@ class TargetRegisterInfo;
};
/// Keep record of which SUnit are in the same cluster group.
- typedef SmallSet<SUnit *, 8> ClusterInfo;
+ typedef SmallPtrSet<SUnit *, 8> ClusterInfo;
constexpr unsigned InvalidClusterId = ~0u;
/// Return whether the input cluster ID's are the same and valid.
diff --git a/llvm/lib/Analysis/CallPrinter.cpp b/llvm/lib/Analysis/CallPrinter.cpp
index 672dae1642cb3..99d8b11f0c4ba 100644
--- a/llvm/lib/Analysis/CallPrinter.cpp
+++ b/llvm/lib/Analysis/CallPrinter.cpp
@@ -70,7 +70,7 @@ class CallGraphDOTInfo {
for (Function &F : M->getFunctionList()) {
uint64_t localSumFreq = 0;
- SmallSet<Function *, 16> Callers;
+ SmallPtrSet<Function *, 16> Callers;
for (User *U : F.users())
if (isa<CallInst>(U))
Callers.insert(cast<Instruction>(U)->getFunction());
@@ -99,7 +99,7 @@ class CallGraphDOTInfo {
bool FoundParallelEdge = true;
while (FoundParallelEdge) {
- SmallSet<Function *, 16> Visited;
+ SmallPtrSet<Function *, 16> Visited;
FoundParallelEdge = false;
for (auto CI = Node->begin(), CE = Node->end(); CI != CE; CI++) {
if (!(Visited.insert(CI->second->getFunction())).second) {
diff --git a/llvm/lib/Analysis/CaptureTracking.cpp b/llvm/lib/Analysis/CaptureTracking.cpp
index bd0d417b1ed33..b6acda3a9f259 100644
--- a/llvm/lib/Analysis/CaptureTracking.cpp
+++ b/llvm/lib/Analysis/CaptureTracking.cpp
@@ -405,7 +405,7 @@ void llvm::PointerMayBeCaptured(const Value *V, CaptureTracker *Tracker,
SmallVector<const Use *, 20> Worklist;
Worklist.reserve(getDefaultMaxUsesToExploreForCaptureTracking());
- SmallSet<const Use *, 20> Visited;
+ SmallPtrSet<const Use *, 20> Visited;
auto AddUses = [&](const Value *V) {
for (const Use &U : V->uses()) {
diff --git a/llvm/lib/Analysis/ScalarEvolution.cpp b/llvm/lib/Analysis/ScalarEvolution.cpp
index ce4d4ad7a0ab0..d2c445f1ffaa0 100644
--- a/llvm/lib/Analysis/ScalarEvolution.cpp
+++ b/llvm/lib/Analysis/ScalarEvolution.cpp
@@ -7284,7 +7284,7 @@ ScalarEvolution::getDefiningScopeBound(ArrayRef<const SCEV *> Ops,
bool &Precise) {
Precise = true;
// Do a bounded search of the def relation of the requested SCEVs.
- SmallSet<const SCEV *, 16> Visited;
+ SmallPtrSet<const SCEV *, 16> Visited;
SmallVector<const SCEV *> Worklist;
auto pushOp = [&](const SCEV *S) {
if (!Visited.insert(S).second)
diff --git a/llvm/lib/Analysis/ValueTracking.cpp b/llvm/lib/Analysis/ValueTracking.cpp
index b0e4b009f3501..50e43a53def6c 100644
--- a/llvm/lib/Analysis/ValueTracking.cpp
+++ b/llvm/lib/Analysis/ValueTracking.cpp
@@ -7785,7 +7785,7 @@ bool llvm::mustExecuteUBIfPoisonOnPathTo(Instruction *Root,
// The set of all recursive users we've visited (which are assumed to all be
// poison because of said visit)
- SmallSet<const Value *, 16> KnownPoison;
+ SmallPtrSet<const Value *, 16> KnownPoison;
SmallVector<const Instruction*, 16> Worklist;
Worklist.push_back(Root);
while (!Worklist.empty()) {
@@ -8140,8 +8140,8 @@ static bool programUndefinedIfUndefOrPoison(const Value *V,
// Set of instructions that we have proved will yield poison if Inst
// does.
- SmallSet<const Value *, 16> YieldsPoison;
- SmallSet<const BasicBlock *, 4> Visited;
+ SmallPtrSet<const Value *, 16> YieldsPoison;
+ SmallPtrSet<const BasicBlock *, 4> Visited;
YieldsPoison.insert(V);
Visited.insert(BB);
diff --git a/llvm/lib/CodeGen/CodeGenPrepare.cpp b/llvm/lib/CodeGen/CodeGenPrepare.cpp
index 9223739fc0098..0e40a92fd8d64 100644
--- a/llvm/lib/CodeGen/CodeGenPrepare.cpp
+++ b/llvm/lib/CodeGen/CodeGenPrepare.cpp
@@ -377,7 +377,7 @@ class CodeGenPrepare {
/// to be optimized again.
/// Note: Consider building time in this pass, when a BB updated, we need
/// to insert such BB into FreshBBs for huge function.
- SmallSet<BasicBlock *, 32> FreshBBs;
+ SmallPtrSet<BasicBlock *, 32> FreshBBs;
void releaseMemory() {
// Clear per function information.
@@ -1105,7 +1105,7 @@ bool CodeGenPrepare::canMergeBlocks(const BasicBlock *BB,
/// Replace all old uses with new ones, and push the updated BBs into FreshBBs.
static void replaceAllUsesWith(Value *Old, Value *New,
- SmallSet<BasicBlock *, 32> &FreshBBs,
+ SmallPtrSet<BasicBlock *, 32> &FreshBBs,
bool IsHuge) {
auto *OldI = dyn_cast<Instruction>(Old);
if (OldI) {
@@ -2135,7 +2135,7 @@ static bool isRemOfLoopIncrementWithLoopInvariant(
// Rem = rem == RemAmtLoopInvariant ? 0 : Rem;
static bool foldURemOfLoopIncrement(Instruction *Rem, const DataLayout *DL,
const LoopInfo *LI,
- SmallSet<BasicBlock *, 32> &FreshBBs,
+ SmallPtrSet<BasicBlock *, 32> &FreshBBs,
bool IsHuge) {
Value *AddOffset, *RemAmt, *AddInst;
PHINode *LoopIncrPN;
@@ -2534,11 +2534,10 @@ static bool OptimizeExtractBits(BinaryOperator *ShiftI, ConstantInt *CI,
/// %ctz = phi i64 [ 64, %entry ], [ %z, %cond.false ]
///
/// If the transform is performed, return true and set ModifiedDT to true.
-static bool despeculateCountZeros(IntrinsicInst *CountZeros,
- LoopInfo &LI,
+static bool despeculateCountZeros(IntrinsicInst *CountZeros, LoopInfo &LI,
const TargetLowering *TLI,
const DataLayout *DL, ModifyDT &ModifiedDT,
- SmallSet<BasicBlock *, 32> &FreshBBs,
+ SmallPtrSet<BasicBlock *, 32> &FreshBBs,
bool IsHugeFunc) {
// If a zero input is undefined, it doesn't make sense to despeculate that.
if (match(CountZeros->getOperand(1), m_One()))
@@ -4351,7 +4350,7 @@ class AddressingModeCombiner {
PhiNodeSet &PhiNodesToMatch) {
SmallVector<PHIPair, 8> WorkList;
Matcher.insert({PHI, Candidate});
- SmallSet<PHINode *, 8> MatchedPHIs;
+ SmallPtrSet<PHINode *, 8> MatchedPHIs;
MatchedPHIs.insert(PHI);
WorkList.push_back({PHI, Candidate});
SmallSet<PHIPair, 8> Visited;
@@ -8635,7 +8634,7 @@ static bool tryUnmergingGEPsAcrossIndirectBr(GetElementPtrInst *GEPI,
}
static bool optimizeBranch(BranchInst *Branch, const TargetLowering &TLI,
- SmallSet<BasicBlock *, 32> &FreshBBs,
+ SmallPtrSet<BasicBlock *, 32> &FreshBBs,
bool IsHugeFunc) {
// Try and convert
// %c = icmp ult %x, 8
diff --git a/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp b/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
index 64c19fab1a023..7ca02ad756f51 100644
--- a/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
@@ -3517,7 +3517,7 @@ void IRTranslator::finishPendingPhis() {
Verifier.setCurrentInst(PI);
#endif // ifndef NDEBUG
- SmallSet<const MachineBasicBlock *, 16> SeenPreds;
+ SmallPtrSet<const MachineBasicBlock *, 16> SeenPreds;
for (unsigned i = 0; i < PI->getNumIncomingValues(); ++i) {
auto IRPred = PI->getIncomingBlock(i);
ArrayRef<Register> ValRegs = getOrCreateVRegs(*PI->getIncomingValue(i));
diff --git a/llvm/lib/CodeGen/MachineCopyPropagation.cpp b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
index 742de1101faa2..e35983138550f 100644
--- a/llvm/lib/CodeGen/MachineCopyPropagation.cpp
+++ b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
@@ -490,7 +490,7 @@ class MachineCopyPropagation {
SmallSetVector<MachineInstr *, 8> MaybeDeadCopies;
/// Multimap tracking debug users in current BB
- DenseMap<MachineInstr *, SmallSet<MachineInstr *, 2>> CopyDbgUsers;
+ DenseMap<MachineInstr *, SmallPtrSet<MachineInstr *, 2>> CopyDbgUsers;
CopyTracker Tracker;
diff --git a/llvm/lib/CodeGen/MachineDebugify.cpp b/llvm/lib/CodeGen/MachineDebugify.cpp
index 1a20fe586e951..307f49468eb39 100644
--- a/llvm/lib/CodeGen/MachineDebugify.cpp
+++ b/llvm/lib/CodeGen/MachineDebugify.cpp
@@ -87,7 +87,7 @@ bool applyDebugifyMetadataToMachineFunction(MachineModuleInfo &MMI,
// Do this by introducing debug uses of each register definition. If that is
// not possible (e.g. we have a phi or a meta instruction), emit a constant.
uint64_t NextImm = 0;
- SmallSet<DILocalVariable *, 16> VarSet;
+ SmallPtrSet<DILocalVariable *, 16> VarSet;
const MCInstrDesc &DbgValDesc = TII.get(TargetOpcode::DBG_VALUE);
for (MachineBasicBlock &MBB : MF) {
MachineBasicBlock::iterator FirstNonPHIIt = MBB.getFirstNonPHI();
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index 90005bd181f3a..3a9651c5cee04 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -3466,9 +3466,9 @@ bool SMSchedule::onlyHasLoopCarriedOutputOrOrderPreds(
}
/// Determine transitive dependences of unpipelineable instructions
-SmallSet<SUnit *, 8> SMSchedule::computeUnpipelineableNodes(
+SmallPtrSet<SUnit *, 8> SMSchedule::computeUnpipelineableNodes(
SwingSchedulerDAG *SSD, TargetInstrInfo::PipelinerLoopInfo *PLI) {
- SmallSet<SUnit *, 8> DoNotPipeline;
+ SmallPtrSet<SUnit *, 8> DoNotPipeline;
SmallVector<SUnit *, 8> Worklist;
for (auto &SU : SSD->SUnits)
@@ -3498,7 +3498,7 @@ SmallSet<SUnit *, 8> SMSchedule::computeUnpipelineableNodes(
// and ensure that they are in stage 0. If unable to do so, return false.
bool SMSchedule::normalizeNonPipelinedInstructions(
SwingSchedulerDAG *SSD, TargetInstrInfo::PipelinerLoopInfo *PLI) {
- SmallSet<SUnit *, 8> DNP = computeUnpipelineableNodes(SSD, PLI);
+ SmallPtrSet<SUnit *, 8> DNP = computeUnpipelineableNodes(SSD, PLI);
int NewLastCycle = INT_MIN;
for (SUnit &SU : SSD->SUnits) {
diff --git a/llvm/lib/CodeGen/MacroFusion.cpp b/llvm/lib/CodeGen/MacroFusion.cpp
index 975a3fe71abad..1db53017e6cef 100644
--- a/llvm/lib/CodeGen/MacroFusion.cpp
+++ b/llvm/lib/CodeGen/MacroFusion.cpp
@@ -79,7 +79,7 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
FirstSU.ParentClusterIdx = Clusters.size();
SecondSU.ParentClusterIdx = Clusters.size();
- SmallSet<SUnit *, 8> Cluster{{&FirstSU, &SecondSU}};
+ SmallPtrSet<SUnit *, 8> Cluster{{&FirstSU, &SecondSU}};
Clusters.push_back(Cluster);
// TODO - If we want to chain more than two instructions, we need to create
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 2d2b37925d8fd..901f10d1256d1 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -3573,7 +3573,7 @@ void SelectionDAGBuilder::visitIndirectBr(const IndirectBrInst &I) {
MachineBasicBlock *IndirectBrMBB = FuncInfo.MBB;
// Update machine-CFG edges with unique successors.
- SmallSet<BasicBlock*, 32> Done;
+ SmallPtrSet<BasicBlock *, 32> Done;
for (unsigned i = 0, e = I.getNumSuccessors(); i != e; ++i) {
BasicBlock *BB = I.getSuccessor(i);
bool Inserted = Done.insert(BB).second;
diff --git a/llvm/lib/CodeGen/SwiftErrorValueTracking.cpp b/llvm/lib/CodeGen/SwiftErrorValueTracking.cpp
index decffdc7dfe45..ff4b568b5ee20 100644
--- a/llvm/lib/CodeGen/SwiftErrorValueTracking.cpp
+++ b/llvm/lib/CodeGen/SwiftErrorValueTracking.cpp
@@ -179,7 +179,7 @@ void SwiftErrorValueTracking::propagateVRegs() {
// Check whether we have a single vreg def from all predecessors.
// Otherwise we need a phi.
SmallVector<std::pair<MachineBasicBlock *, Register>, 4> VRegs;
- SmallSet<const MachineBasicBlock *, 8> Visited;
+ SmallPtrSet<const MachineBasicBlock *, 8> Visited;
for (auto *Pred : MBB->predecessors()) {
if (!Visited.insert(Pred).second)
continue;
diff --git a/llvm/lib/ExecutionEngine/Orc/Debugging/DebuggerSupportPlugin.cpp b/llvm/lib/ExecutionEngine/Orc/Debugging/DebuggerSupportPlugin.cpp
index 1bafed79d6968..ba27aa87b7c7a 100644
--- a/llvm/lib/ExecutionEngine/Orc/Debugging/DebuggerSupportPlugin.cpp
+++ b/llvm/lib/ExecutionEngine/Orc/Debugging/DebuggerSupportPlugin.cpp
@@ -64,7 +64,7 @@ class MachODebugObjectSynthesizerBase
LLVM_DEBUG({
dbgs() << " Preserving debug section " << Sec.getName() << "\n";
});
- SmallSet<Block *, 8> PreservedBlocks;
+ SmallPtrSet<Block *, 8> PreservedBlocks;
for (auto *Sym : Sec.symbols()) {
bool NewPreservedBlock =
PreservedBlocks.insert(&Sym->getBlock()).second;
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index c16b0dde1a3da..e9147a42452d0 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -5930,7 +5930,7 @@ void OpenMPIRBuilder::applySimd(CanonicalLoopInfo *CanonicalLoop,
createIfVersion(CanonicalLoop, IfCond, VMap, LIA, LI, L, "simd");
}
- SmallSet<BasicBlock *, 8> Reachable;
+ SmallPtrSet<BasicBlock *, 8> Reachable;
// Get the basic blocks from the loop in which memref instructions
// can be found.
diff --git a/llvm/lib/IR/AutoUpgrade.cpp b/llvm/lib/IR/AutoUpgrade.cpp
index b91fd70bd9467..e200f3626e69d 100644
--- a/llvm/lib/IR/AutoUpgrade.cpp
+++ b/llvm/lib/IR/AutoUpgrade.cpp
@@ -5391,7 +5391,7 @@ void llvm::UpgradeNVVMAnnotations(Module &M) {
return;
SmallVector<MDNode *, 8> NewNodes;
- SmallSet<const MDNode *, 8> SeenNodes;
+ SmallPtrSet<const MDNode *, 8> SeenNodes;
for (MDNode *MD : NamedMD->operands()) {
if (!SeenNodes.insert(MD).second)
continue;
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index 5a93228faa3ac..9d9b51db98702 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -4636,7 +4636,7 @@ void Verifier::visitEHPadPredecessors(Instruction &I) {
}
// The edge may exit from zero or more nested pads.
- SmallSet<Value *, 8> Seen;
+ SmallPtrSet<Value *, 8> Seen;
for (;; FromPad = getParentPad(FromPad)) {
Check(FromPad != ToPad,
"EH pad cannot handle exceptions raised within it", FromPad, TI);
@@ -4764,7 +4764,7 @@ void Verifier::visitFuncletPadInst(FuncletPadInst &FPI) {
User *FirstUser = nullptr;
Value *FirstUnwindPad = nullptr;
SmallVector<FuncletPadInst *, 8> Worklist({&FPI});
- SmallSet<FuncletPadInst *, 8> Seen;
+ SmallPtrSet<FuncletPadInst *, 8> Seen;
while (!Worklist.empty()) {
FuncletPadInst *CurrentPad = Worklist.pop_back_val();
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp b/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
index e65dd1b04cc48..dfe7c53aaca06 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp
@@ -384,7 +384,7 @@ bool isClobberedInFunction(const LoadInst *Load, MemorySSA *MSSA,
AAResults *AA) {
MemorySSAWalker *Walker = MSSA->getWalker();
SmallVector<MemoryAccess *> WorkList{Walker->getClobberingMemoryAccess(Load)};
- SmallSet<MemoryAccess *, 8> Visited;
+ SmallPtrSet<MemoryAccess *, 8> Visited;
MemoryLocation Loc(MemoryLocation::get(Load));
LLVM_DEBUG(dbgs() << "Checking clobbering of: " << *Load << '\n');
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
index 3a3751892c8b6..28d5400fd1807 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
@@ -134,8 +134,8 @@ static std::pair<const Value *, const Type *> getMemoryInstrPtrAndType(
bool AMDGPUPerfHint::isIndirectAccess(const Instruction *Inst) const {
LLVM_DEBUG(dbgs() << "[isIndirectAccess] " << *Inst << '\n');
- SmallSet<const Value *, 32> WorkSet;
- SmallSet<const Value *, 32> Visited;
+ SmallPtrSet<const Value *, 32> WorkSet;
+ SmallPtrSet<const Value *, 32> Visited;
if (const Value *MO = getMemoryInstrPtrAndType(Inst).first) {
if (isGlobalAddr(MO))
WorkSet.insert(MO);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
index b60ded33a4ac3..56aa3f6db83ad 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
@@ -195,7 +195,7 @@ bool AMDGPUSetWavePriority::run(MachineFunction &MF) {
// Lower the priority on edges where control leaves blocks from which
// the VMEM loads are reachable.
- SmallSet<MachineBasicBlock *, 16> PriorityLoweringBlocks;
+ SmallPtrSet<MachineBasicBlock *, 16> PriorityLoweringBlocks;
for (MachineBasicBlock &MBB : MF) {
if (MBBInfos[&MBB].MayReachVMEMLoad) {
if (MBB.succ_empty())
diff --git a/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp b/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
index f018f77bc83e1..dce4e6f993005 100644
--- a/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
@@ -460,7 +460,7 @@ static bool hoistAndMergeSGPRInits(unsigned Reg,
// List of clobbering instructions.
SmallVector<MachineInstr*, 8> Clobbers;
// List of instructions marked for deletion.
- SmallSet<MachineInstr*, 8> MergedInstrs;
+ SmallPtrSet<MachineInstr *, 8> MergedInstrs;
bool Changed = false;
@@ -808,7 +808,7 @@ bool SIFixSGPRCopies::run(MachineFunction &MF) {
void SIFixSGPRCopies::processPHINode(MachineInstr &MI) {
bool AllAGPRUses = true;
SetVector<const MachineInstr *> worklist;
- SmallSet<const MachineInstr *, 4> Visited;
+ SmallPtrSet<const MachineInstr *, 4> Visited;
SetVector<MachineInstr *> PHIOperands;
worklist.insert(&MI);
Visited.insert(&MI);
diff --git a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
index f7a9a584a6b51..e97536d36bab2 100644
--- a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
@@ -81,7 +81,7 @@ class SILowerControlFlow {
MachineRegisterInfo *MRI = nullptr;
SetVector<MachineInstr*> LoweredEndCf;
DenseSet<Register> LoweredIf;
- SmallSet<MachineBasicBlock *, 4> KillBlocks;
+ SmallPtrSet<MachineBasicBlock *, 4> KillBlocks;
SmallSet<Register, 8> RecomputeRegs;
const TargetRegisterClass *BoolRC = nullptr;
@@ -460,7 +460,7 @@ MachineBasicBlock::iterator
SILowerControlFlow::skipIgnoreExecInstsTrivialSucc(
MachineBasicBlock &MBB, MachineBasicBlock::iterator It) const {
- SmallSet<const MachineBasicBlock *, 4> Visited;
+ SmallPtrSet<const MachineBasicBlock *, 4> Visited;
MachineBasicBlock *B = &MBB;
do {
if (!Visited.insert(B).second)
diff --git a/llvm/lib/Target/ARM/ARMConstantIslandPass.cpp b/llvm/lib/Target/ARM/ARMConstantIslandPass.cpp
index ef690838f0f3b..c53e2158f4c73 100644
--- a/llvm/lib/Target/ARM/ARMConstantIslandPass.cpp
+++ b/llvm/lib/Target/ARM/ARMConstantIslandPass.cpp
@@ -109,7 +109,7 @@ namespace {
/// NewWaterList - The subset of WaterList that was created since the
/// previous iteration by inserting unconditional branches.
- SmallSet<MachineBasicBlock*, 4> NewWaterList;
+ SmallPtrSet<MachineBasicBlock *, 4> NewWaterList;
using water_iterator = std::vector<MachineBasicBlock *>::iterator;
diff --git a/llvm/lib/Target/ARM/MVETPAndVPTOptimisationsPass.cpp b/llvm/lib/Target/ARM/MVETPAndVPTOptimisationsPass.cpp
index 0b4e7dfebe369..5eeb4fe995485 100644
--- a/llvm/lib/Target/ARM/MVETPAndVPTOptimisationsPass.cpp
+++ b/llvm/lib/Target/ARM/MVETPAndVPTOptimisationsPass.cpp
@@ -922,7 +922,7 @@ bool MVETPAndVPTOptimisations::ReplaceConstByVPNOTs(MachineBasicBlock &MBB,
// the function.
unsigned LastVPTImm = 0;
Register LastVPTReg = 0;
- SmallSet<MachineInstr *, 4> DeadInstructions;
+ SmallPtrSet<MachineInstr *, 4> DeadInstructions;
for (MachineInstr &Instr : MBB.instrs()) {
// Look for predicated MVE instructions.
diff --git a/llvm/lib/Target/CSKY/CSKYConstantIslandPass.cpp b/llvm/lib/Target/CSKY/CSKYConstantIslandPass.cpp
index e55d9b227d1cd..7885d93cbad98 100644
--- a/llvm/lib/Target/CSKY/CSKYConstantIslandPass.cpp
+++ b/llvm/lib/Target/CSKY/CSKYConstantIslandPass.cpp
@@ -116,7 +116,7 @@ class CSKYConstantIslands : public MachineFunctionPass {
/// NewWaterList - The subset of WaterList that was created since the
/// previous iteration by inserting unconditional branches.
- SmallSet<MachineBasicBlock *, 4> NewWaterList;
+ SmallPtrSet<MachineBasicBlock *, 4> NewWaterList;
using water_iterator = std::vector<MachineBasicBlock *>::iterator;
diff --git a/llvm/lib/Target/Hexagon/HexagonGenInsert.cpp b/llvm/lib/Target/Hexagon/HexagonGenInsert.cpp
index a9201460d8e2e..b2218abcaaa3c 100644
--- a/llvm/lib/Target/Hexagon/HexagonGenInsert.cpp
+++ b/llvm/lib/Target/Hexagon/HexagonGenInsert.cpp
@@ -1273,7 +1273,7 @@ void HexagonGenInsert::selectCandidates() {
for (unsigned R = AllRMs.find_first(); R; R = AllRMs.find_next(R)) {
using use_iterator = MachineRegisterInfo::use_nodbg_iterator;
- using InstrSet = SmallSet<const MachineInstr *, 16>;
+ using InstrSet = SmallPtrSet<const MachineInstr *, 16>;
InstrSet UIs;
// Count as the number of instructions in which R is used, not the
diff --git a/llvm/lib/Target/Hexagon/HexagonLoopIdiomRecognition.cpp b/llvm/lib/Target/Hexagon/HexagonLoopIdiomRecognition.cpp
index c34eecd3fcb09..a3717bb97d14b 100644
--- a/llvm/lib/Target/Hexagon/HexagonLoopIdiomRecognition.cpp
+++ b/llvm/lib/Target/Hexagon/HexagonLoopIdiomRecognition.cpp
@@ -2289,7 +2289,7 @@ bool HexagonLoopIdiomRecognize::processCopyingStore(Loop *CurLoop,
// the instructions in Insts are removed.
bool HexagonLoopIdiomRecognize::coverLoop(Loop *L,
SmallVectorImpl<Instruction*> &Insts) const {
- SmallSet<BasicBlock*,8> LoopBlocks;
+ SmallPtrSet<BasicBlock *, 8> LoopBlocks;
LoopBlocks.insert_range(L->blocks());
SetVector<Instruction *> Worklist(llvm::from_range, Insts);
diff --git a/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp b/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp
index ecc1b5d2ebe35..6a05b5ab2c21c 100644
--- a/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp
+++ b/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp
@@ -445,8 +445,8 @@ void HexagonSubtarget::adjustSchedDependency(
const HexagonInstrInfo *QII = getInstrInfo();
// Instructions with .new operands have zero latency.
- SmallSet<SUnit *, 4> ExclSrc;
- SmallSet<SUnit *, 4> ExclDst;
+ SmallPtrSet<SUnit *, 4> ExclSrc;
+ SmallPtrSet<SUnit *, 4> ExclDst;
if (QII->canExecuteInBundle(*SrcInst, *DstInst) &&
isBestZeroLatency(Src, Dst, QII, ExclSrc, ExclDst)) {
Dep.setLatency(0);
@@ -630,9 +630,9 @@ static SUnit *getZeroLatency(SUnit *N, SmallVector<SDep, 4> &Deps) {
// together with a zero latency. Only one dependence should have a zero
// latency. If there are multiple choices, choose the best, and change
// the others, if needed.
-bool HexagonSubtarget::isBestZeroLatency(SUnit *Src, SUnit *Dst,
- const HexagonInstrInfo *TII, SmallSet<SUnit*, 4> &ExclSrc,
- SmallSet<SUnit*, 4> &ExclDst) const {
+bool HexagonSubtarget::isBestZeroLatency(
+ SUnit *Src, SUnit *Dst, const HexagonInstrInfo *TII,
+ SmallPtrSet<SUnit *, 4> &ExclSrc, SmallPtrSet<SUnit *, 4> &ExclDst) const {
MachineInstr &SrcInst = *Src->getInstr();
MachineInstr &DstInst = *Dst->getInstr();
diff --git a/llvm/lib/Target/Hexagon/HexagonSubtarget.h b/llvm/lib/Target/Hexagon/HexagonSubtarget.h
index 41555db4ac662..b111471a9696c 100644
--- a/llvm/lib/Target/Hexagon/HexagonSubtarget.h
+++ b/llvm/lib/Target/Hexagon/HexagonSubtarget.h
@@ -366,7 +366,8 @@ class HexagonSubtarget : public HexagonGenSubtargetInfo {
void restoreLatency(SUnit *Src, SUnit *Dst) const;
void changeLatency(SUnit *Src, SUnit *Dst, unsigned Lat) const;
bool isBestZeroLatency(SUnit *Src, SUnit *Dst, const HexagonInstrInfo *TII,
- SmallSet<SUnit*, 4> &ExclSrc, SmallSet<SUnit*, 4> &ExclDst) const;
+ SmallPtrSet<SUnit *, 4> &ExclSrc,
+ SmallPtrSet<SUnit *, 4> &ExclDst) const;
};
} // end namespace llvm
diff --git a/llvm/lib/Target/Mips/MipsConstantIslandPass.cpp b/llvm/lib/Target/Mips/MipsConstantIslandPass.cpp
index 8067dbc54170b..2a2ccf7d43b8e 100644
--- a/llvm/lib/Target/Mips/MipsConstantIslandPass.cpp
+++ b/llvm/lib/Target/Mips/MipsConstantIslandPass.cpp
@@ -232,7 +232,7 @@ namespace {
/// NewWaterList - The subset of WaterList that was created since the
/// previous iteration by inserting unconditional branches.
- SmallSet<MachineBasicBlock*, 4> NewWaterList;
+ SmallPtrSet<MachineBasicBlock *, 4> NewWaterList;
using water_iterator = std::vector<MachineBasicBlock *>::iterator;
diff --git a/llvm/lib/Target/PowerPC/PPCCTRLoopsVerify.cpp b/llvm/lib/Target/PowerPC/PPCCTRLoopsVerify.cpp
index 46aa27e1450a6..c8e576f976f67 100644
--- a/llvm/lib/Target/PowerPC/PPCCTRLoopsVerify.cpp
+++ b/llvm/lib/Target/PowerPC/PPCCTRLoopsVerify.cpp
@@ -93,7 +93,7 @@ static bool clobbersCTR(const MachineInstr &MI) {
static bool verifyCTRBranch(MachineBasicBlock *MBB,
MachineBasicBlock::iterator I) {
MachineBasicBlock::iterator BI = I;
- SmallSet<MachineBasicBlock *, 16> Visited;
+ SmallPtrSet<MachineBasicBlock *, 16> Visited;
SmallVector<MachineBasicBlock *, 8> Preds;
bool CheckPreds;
diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
index b97d0e235c019..652edd4e04c60 100644
--- a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
@@ -14814,9 +14814,9 @@ static bool findConsecutiveLoad(LoadSDNode *LD, SelectionDAG &DAG) {
SDValue Chain = LD->getChain();
EVT VT = LD->getMemoryVT();
- SmallSet<SDNode *, 16> LoadRoots;
+ SmallPtrSet<SDNode *, 16> LoadRoots;
SmallVector<SDNode *, 8> Queue(1, Chain.getNode());
- SmallSet<SDNode *, 16> Visited;
+ SmallPtrSet<SDNode *, 16> Visited;
// First, search up the chain, branching to follow all token-factor operands.
// If we find a consecutive load, then we're done, otherwise, record all
diff --git a/llvm/lib/Target/PowerPC/PPCLoopInstrFormPrep.cpp b/llvm/lib/Target/PowerPC/PPCLoopInstrFormPrep.cpp
index 709d7e7e9b47a..adf9436b34ccf 100644
--- a/llvm/lib/Target/PowerPC/PPCLoopInstrFormPrep.cpp
+++ b/llvm/lib/Target/PowerPC/PPCLoopInstrFormPrep.cpp
@@ -264,9 +264,8 @@ namespace {
bool prepareBasesForCommoningChains(Bucket &BucketChain);
/// Rewrite load/store according to the common chains.
- bool
- rewriteLoadStoresForCommoningChains(Loop *L, Bucket &Bucket,
- SmallSet<BasicBlock *, 16> &BBChanged);
+ bool rewriteLoadStoresForCommoningChains(
+ Loop *L, Bucket &Bucket, SmallPtrSet<BasicBlock *, 16> &BBChanged);
/// Collect condition matched(\p isValidCandidate() returns true)
/// candidates in Loop \p L.
@@ -309,7 +308,7 @@ namespace {
/// Rewrite load/store instructions in \p BucketChain according to
/// preparation.
bool rewriteLoadStores(Loop *L, Bucket &BucketChain,
- SmallSet<BasicBlock *, 16> &BBChanged,
+ SmallPtrSet<BasicBlock *, 16> &BBChanged,
PrepForm Form);
/// Rewrite for the base load/store of a chain.
@@ -523,7 +522,7 @@ bool PPCLoopInstrFormPrep::chainCommoning(Loop *L,
if (Buckets.empty())
return MadeChange;
- SmallSet<BasicBlock *, 16> BBChanged;
+ SmallPtrSet<BasicBlock *, 16> BBChanged;
for (auto &Bucket : Buckets) {
if (prepareBasesForCommoningChains(Bucket))
@@ -537,7 +536,7 @@ bool PPCLoopInstrFormPrep::chainCommoning(Loop *L,
}
bool PPCLoopInstrFormPrep::rewriteLoadStoresForCommoningChains(
- Loop *L, Bucket &Bucket, SmallSet<BasicBlock *, 16> &BBChanged) {
+ Loop *L, Bucket &Bucket, SmallPtrSet<BasicBlock *, 16> &BBChanged) {
bool MadeChange = false;
assert(Bucket.Elements.size() ==
@@ -1006,7 +1005,7 @@ bool PPCLoopInstrFormPrep::prepareBaseForUpdateFormChain(Bucket &BucketChain) {
}
bool PPCLoopInstrFormPrep::rewriteLoadStores(
- Loop *L, Bucket &BucketChain, SmallSet<BasicBlock *, 16> &BBChanged,
+ Loop *L, Bucket &BucketChain, SmallPtrSet<BasicBlock *, 16> &BBChanged,
PrepForm Form) {
bool MadeChange = false;
@@ -1089,7 +1088,7 @@ bool PPCLoopInstrFormPrep::updateFormPrep(Loop *L,
bool MadeChange = false;
if (Buckets.empty())
return MadeChange;
- SmallSet<BasicBlock *, 16> BBChanged;
+ SmallPtrSet<BasicBlock *, 16> BBChanged;
for (auto &Bucket : Buckets)
// The base address of each bucket is transformed into a phi and the others
// are rewritten based on new base.
@@ -1110,7 +1109,7 @@ bool PPCLoopInstrFormPrep::dispFormPrep(Loop *L,
if (Buckets.empty())
return MadeChange;
- SmallSet<BasicBlock *, 16> BBChanged;
+ SmallPtrSet<BasicBlock *, 16> BBChanged;
for (auto &Bucket : Buckets) {
if (Bucket.Elements.size() < DispFormPrepMinThreshold)
continue;
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index aedba7e52e3ab..ce03818b49502 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -17525,7 +17525,7 @@ static SDValue combineOp_VLToVWOp_VL(SDNode *N,
return SDValue();
SmallVector<SDNode *> Worklist;
- SmallSet<SDNode *, 8> Inserted;
+ SmallPtrSet<SDNode *, 8> Inserted;
Worklist.push_back(N);
Inserted.insert(N);
SmallVector<CombineResult> CombinesToApply;
diff --git a/llvm/lib/Target/X86/X86LoadValueInjectionLoadHardening.cpp b/llvm/lib/Target/X86/X86LoadValueInjectionLoadHardening.cpp
index cf055cf3be0aa..090060eaa65e1 100644
--- a/llvm/lib/Target/X86/X86LoadValueInjectionLoadHardening.cpp
+++ b/llvm/lib/Target/X86/X86LoadValueInjectionLoadHardening.cpp
@@ -491,7 +491,7 @@ X86LoadValueInjectionLoadHardeningPass::getGadgetGraph(
NumGadgets += GadgetCount;
// Traverse CFG to build the rest of the graph
- SmallSet<MachineBasicBlock *, 8> BlocksVisited;
+ SmallPtrSet<MachineBasicBlock *, 8> BlocksVisited;
std::function<void(MachineBasicBlock *, GraphIter, unsigned)> TraverseCFG =
[&](MachineBasicBlock *MBB, GraphIter GI, unsigned ParentDepth) {
unsigned LoopDepth = MLI.getLoopDepth(MBB);
diff --git a/llvm/lib/Target/X86/X86PreTileConfig.cpp b/llvm/lib/Target/X86/X86PreTileConfig.cpp
index 3b4e531f25388..2a1c49957bf7a 100644
--- a/llvm/lib/Target/X86/X86PreTileConfig.cpp
+++ b/llvm/lib/Target/X86/X86PreTileConfig.cpp
@@ -100,7 +100,7 @@ struct BBInfo {
class X86PreTileConfig : public MachineFunctionPass {
MachineRegisterInfo *MRI = nullptr;
const MachineLoopInfo *MLI = nullptr;
- SmallSet<MachineInstr *, 8> DefVisited;
+ SmallPtrSet<MachineInstr *, 8> DefVisited;
DenseMap<MachineBasicBlock *, BBInfo> BBVisitedInfo;
DenseMap<MachineBasicBlock *, SmallVector<MIRef, 8>> ShapeBBs;
diff --git a/llvm/lib/Transforms/IPO/FunctionAttrs.cpp b/llvm/lib/Transforms/IPO/FunctionAttrs.cpp
index 8262c8c3a90f2..44394f6deb9a2 100644
--- a/llvm/lib/Transforms/IPO/FunctionAttrs.cpp
+++ b/llvm/lib/Transforms/IPO/FunctionAttrs.cpp
@@ -273,7 +273,7 @@ MemoryEffects llvm::computeFunctionBodyMemoryAccess(Function &F,
/// Deduce readonly/readnone/writeonly attributes for the SCC.
template <typename AARGetterT>
static void addMemoryAttrs(const SCCNodeSet &SCCNodes, AARGetterT &&AARGetter,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
MemoryEffects ME = MemoryEffects::none();
MemoryEffects RecursiveArgME = MemoryEffects::none();
for (Function *F : SCCNodes) {
@@ -1002,7 +1002,7 @@ determinePointerAccessAttrs(Argument *A,
/// Deduce returned attributes for the SCC.
static void addArgumentReturnedAttrs(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
// Check each function in turn, determining if an argument is always returned.
for (Function *F : SCCNodes) {
// We can infer and propagate function attributes only when we know that the
@@ -1238,7 +1238,7 @@ static bool inferInitializes(Argument &A, Function &F) {
/// Deduce nocapture attributes for the SCC.
static void addArgumentAttrs(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed,
+ SmallPtrSet<Function *, 8> &Changed,
bool SkipInitializes) {
ArgumentGraph AG;
@@ -1510,7 +1510,7 @@ static bool isFunctionMallocLike(Function *F, const SCCNodeSet &SCCNodes) {
/// Deduce noalias attributes for the SCC.
static void addNoAliasAttrs(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
// Check each function in turn, determining which functions return noalias
// pointers.
for (Function *F : SCCNodes) {
@@ -1623,7 +1623,7 @@ static bool isReturnNonNull(Function *F, const SCCNodeSet &SCCNodes,
/// Deduce nonnull attributes for the SCC.
static void addNonNullAttrs(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
// Speculative that all functions in the SCC return only nonnull
// pointers. We may refute this as we analyze functions.
bool SCCReturnsNonNull = true;
@@ -1680,7 +1680,7 @@ static void addNonNullAttrs(const SCCNodeSet &SCCNodes,
/// Deduce noundef attributes for the SCC.
static void addNoUndefAttrs(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
// Check each function in turn, determining which functions return noundef
// values.
for (Function *F : SCCNodes) {
@@ -1788,13 +1788,13 @@ class AttributeInferer {
InferenceDescriptors.push_back(AttrInference);
}
- void run(const SCCNodeSet &SCCNodes, SmallSet<Function *, 8> &Changed);
+ void run(const SCCNodeSet &SCCNodes, SmallPtrSet<Function *, 8> &Changed);
};
/// Perform all the requested attribute inference actions according to the
/// attribute predicates stored before.
void AttributeInferer::run(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
SmallVector<InferenceDescriptor, 4> InferInSCC = InferenceDescriptors;
// Go through all the functions in SCC and check corresponding attribute
// assumptions for each of them. Attributes that are invalid for this SCC
@@ -1969,7 +1969,7 @@ static bool InstrBreaksNoSync(Instruction &I, const SCCNodeSet &SCCNodes) {
///
/// Returns true if any changes to function attributes were made.
static void inferConvergent(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
AttributeInferer AI;
// Request to remove the convergent attribute from all functions in the SCC
@@ -2000,7 +2000,7 @@ static void inferConvergent(const SCCNodeSet &SCCNodes,
///
/// Returns true if any changes to function attributes were made.
static void inferAttrsFromFunctionBodies(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
AttributeInferer AI;
if (!DisableNoUnwindInference)
@@ -2069,7 +2069,7 @@ static void inferAttrsFromFunctionBodies(const SCCNodeSet &SCCNodes,
}
static void addNoRecurseAttrs(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
// Try and identify functions that do not recurse.
// If the SCC contains multiple nodes we know for sure there is recursion.
@@ -2105,7 +2105,7 @@ static void addNoRecurseAttrs(const SCCNodeSet &SCCNodes,
// Set the noreturn function attribute if possible.
static void addNoReturnAttrs(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
for (Function *F : SCCNodes) {
if (!F || !F->hasExactDefinition() || F->hasFnAttribute(Attribute::Naked) ||
F->doesNotReturn())
@@ -2166,7 +2166,7 @@ static bool allPathsGoThroughCold(Function &F) {
// Set the cold function attribute if possible.
static void addColdAttrs(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
for (Function *F : SCCNodes) {
if (!F || !F->hasExactDefinition() || F->hasFnAttribute(Attribute::Naked) ||
F->hasFnAttribute(Attribute::Cold) || F->hasFnAttribute(Attribute::Hot))
@@ -2213,7 +2213,7 @@ static bool functionWillReturn(const Function &F) {
// Set the willreturn function attribute if possible.
static void addWillReturn(const SCCNodeSet &SCCNodes,
- SmallSet<Function *, 8> &Changed) {
+ SmallPtrSet<Function *, 8> &Changed) {
for (Function *F : SCCNodes) {
if (!F || F->willReturn() || !functionWillReturn(*F))
continue;
@@ -2239,7 +2239,7 @@ static SCCNodesResult createSCCNodeSet(ArrayRef<Function *> Functions) {
}
template <typename AARGetterT>
-static SmallSet<Function *, 8>
+static SmallPtrSet<Function *, 8>
deriveAttrsInPostOrder(ArrayRef<Function *> Functions, AARGetterT &&AARGetter,
bool ArgAttrsOnly) {
SCCNodesResult Nodes = createSCCNodeSet(Functions);
@@ -2248,7 +2248,7 @@ deriveAttrsInPostOrder(ArrayRef<Function *> Functions, AARGetterT &&AARGetter,
if (Nodes.SCCNodes.empty())
return {};
- SmallSet<Function *, 8> Changed;
+ SmallPtrSet<Function *, 8> Changed;
if (ArgAttrsOnly) {
// ArgAttrsOnly means to only infer attributes that may aid optimizations
// on the *current* function. "initializes" attribute is to aid
diff --git a/llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp b/llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
index 938aab5879044..ac59ae182896b 100644
--- a/llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
+++ b/llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
@@ -447,7 +447,7 @@ struct MainSwitch {
/// Also, collect select instructions to unfold.
bool isCandidate(const SwitchInst *SI) {
std::deque<std::pair<Value *, BasicBlock *>> Q;
- SmallSet<Value *, 16> SeenValues;
+ SmallPtrSet<Value *, 16> SeenValues;
SelectInsts.clear();
Value *SICond = SI->getCondition();
@@ -511,7 +511,7 @@ struct MainSwitch {
void addToQueue(Value *Val, BasicBlock *BB,
std::deque<std::pair<Value *, BasicBlock *>> &Q,
- SmallSet<Value *, 16> &SeenValues) {
+ SmallPtrSet<Value *, 16> &SeenValues) {
if (SeenValues.insert(Val).second)
Q.push_back({Val, BB});
}
@@ -713,7 +713,7 @@ struct AllSwitchPaths {
// Some blocks have multiple edges to the same successor, and this set
// is used to prevent a duplicate path from being generated
- SmallSet<BasicBlock *, 4> Successors;
+ SmallPtrSet<BasicBlock *, 4> Successors;
for (BasicBlock *Succ : successors(BB)) {
if (!Successors.insert(Succ).second)
continue;
@@ -762,7 +762,7 @@ struct AllSwitchPaths {
SmallVector<PHINode *, 8> Stack;
Stack.push_back(FirstDef);
- SmallSet<Value *, 16> SeenValues;
+ SmallPtrSet<Value *, 16> SeenValues;
while (!Stack.empty()) {
PHINode *CurPhi = Stack.pop_back_val();
@@ -955,7 +955,7 @@ struct TransformDFA {
DuplicateBlockMap DuplicateMap;
DefMap NewDefs;
- SmallSet<BasicBlock *, 16> BlocksToClean;
+ SmallPtrSet<BasicBlock *, 16> BlocksToClean;
BlocksToClean.insert_range(successors(SwitchBlock));
for (ThreadingPath &TPath : SwitchPaths->getThreadingPaths()) {
@@ -984,7 +984,7 @@ struct TransformDFA {
/// the predecessors, and phis in the successor blocks.
void createExitPath(DefMap &NewDefs, ThreadingPath &Path,
DuplicateBlockMap &DuplicateMap,
- SmallSet<BasicBlock *, 16> &BlocksToClean,
+ SmallPtrSet<BasicBlock *, 16> &BlocksToClean,
DomTreeUpdater *DTU) {
APInt NextState = Path.getExitValue();
const BasicBlock *Determinator = Path.getDeterminatorBB();
diff --git a/llvm/lib/Transforms/Scalar/GVN.cpp b/llvm/lib/Transforms/Scalar/GVN.cpp
index 7704e49c499da..4baa3b3eb8242 100644
--- a/llvm/lib/Transforms/Scalar/GVN.cpp
+++ b/llvm/lib/Transforms/Scalar/GVN.cpp
@@ -978,7 +978,7 @@ static bool IsValueFullyAvailableInBlock(
unsigned NumNewNewSpeculativelyAvailableBBs = 0;
#ifndef NDEBUG
- SmallSet<BasicBlock *, 32> NewSpeculativelyAvailableBBs;
+ SmallPtrSet<BasicBlock *, 32> NewSpeculativelyAvailableBBs;
SmallVector<BasicBlock *, 32> AvailableBBs;
#endif
@@ -1222,7 +1222,7 @@ static bool liesBetween(const Instruction *From, Instruction *Between,
const Instruction *To, const DominatorTree *DT) {
if (From->getParent() == Between->getParent())
return DT->dominates(From, Between);
- SmallSet<BasicBlock *, 1> Exclusion;
+ SmallPtrSet<BasicBlock *, 1> Exclusion;
Exclusion.insert(Between->getParent());
return !isPotentiallyReachable(From, To, &Exclusion, DT);
}
diff --git a/llvm/lib/Transforms/Scalar/GuardWidening.cpp b/llvm/lib/Transforms/Scalar/GuardWidening.cpp
index 3ba5b79293bcd..d99f1eb9c93cd 100644
--- a/llvm/lib/Transforms/Scalar/GuardWidening.cpp
+++ b/llvm/lib/Transforms/Scalar/GuardWidening.cpp
@@ -642,9 +642,9 @@ Value *GuardWideningImpl::freezeAndPush(Value *Orig,
return FI;
}
- SmallSet<Value *, 16> Visited;
+ SmallPtrSet<Value *, 16> Visited;
SmallVector<Value *, 16> Worklist;
- SmallSet<Instruction *, 16> DropPoisonFlags;
+ SmallPtrSet<Instruction *, 16> DropPoisonFlags;
SmallVector<Value *, 16> NeedFreeze;
DenseMap<Value *, FreezeInst *> CacheOfFreezes;
diff --git a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
index 334c911191cb8..6720cb1ef8998 100644
--- a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
+++ b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
@@ -1613,7 +1613,7 @@ bool IndVarSimplify::optimizeLoopExits(Loop *L, SCEVExpander &Rewriter) {
if (CurrMaxExit == MaxBECount)
SkipLastIter = true;
};
- SmallSet<const SCEV *, 8> DominatingExactExitCounts;
+ SmallPtrSet<const SCEV *, 8> DominatingExactExitCounts;
for (BasicBlock *ExitingBB : ExitingBlocks) {
const SCEV *ExactExitCount = SE->getExitCount(L, ExitingBB);
const SCEV *MaxExitCount = SE->getExitCount(
diff --git a/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp b/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
index c68149b780807..5795c761b3bee 100644
--- a/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
+++ b/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
@@ -1209,7 +1209,7 @@ class LowerMatrixIntrinsics {
//
// For verification, we keep track of where we changed uses to poison in
// PoisonedInsts and then check that we in fact remove them.
- SmallSet<Instruction *, 16> PoisonedInsts;
+ SmallPtrSet<Instruction *, 16> PoisonedInsts;
for (auto *Inst : reverse(ToRemove)) {
for (Use &U : llvm::make_early_inc_range(Inst->uses())) {
if (auto *Poisoned = dyn_cast<Instruction>(U.getUser()))
diff --git a/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp b/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
index f237322f90455..e043d072a7638 100644
--- a/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
+++ b/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
@@ -1530,7 +1530,7 @@ bool MemCpyOptPass::performStackMoveOptzn(Instruction *Load, Instruction *Store,
// to remove them.
SmallVector<Instruction *, 4> LifetimeMarkers;
- SmallSet<Instruction *, 4> AAMetadataInstrs;
+ SmallPtrSet<Instruction *, 4> AAMetadataInstrs;
bool SrcNotDom = false;
auto CaptureTrackingWithModRef =
@@ -1540,7 +1540,7 @@ bool MemCpyOptPass::performStackMoveOptzn(Instruction *Load, Instruction *Store,
Worklist.push_back(AI);
unsigned MaxUsesToExplore = getDefaultMaxUsesToExploreForCaptureTracking();
Worklist.reserve(MaxUsesToExplore);
- SmallSet<const Use *, 20> Visited;
+ SmallPtrSet<const Use *, 20> Visited;
while (!Worklist.empty()) {
Instruction *I = Worklist.pop_back_val();
for (const Use &U : I->uses()) {
diff --git a/llvm/lib/Transforms/Scalar/Reassociate.cpp b/llvm/lib/Transforms/Scalar/Reassociate.cpp
index 343da5b2e4704..ba58b8e4eda5e 100644
--- a/llvm/lib/Transforms/Scalar/Reassociate.cpp
+++ b/llvm/lib/Transforms/Scalar/Reassociate.cpp
@@ -878,7 +878,7 @@ static Value *NegateValue(Value *V, Instruction *BI,
// only that it mostly looks like one.
static bool isLoadCombineCandidate(Instruction *Or) {
SmallVector<Instruction *, 8> Worklist;
- SmallSet<Instruction *, 8> Visited;
+ SmallPtrSet<Instruction *, 8> Visited;
auto Enqueue = [&](Value *V) {
auto *I = dyn_cast<Instruction>(V);
diff --git a/llvm/lib/Transforms/Scalar/StructurizeCFG.cpp b/llvm/lib/Transforms/Scalar/StructurizeCFG.cpp
index 44e63a0583d1a..b17dcb7869420 100644
--- a/llvm/lib/Transforms/Scalar/StructurizeCFG.cpp
+++ b/llvm/lib/Transforms/Scalar/StructurizeCFG.cpp
@@ -328,7 +328,7 @@ class StructurizeCFG {
void addPhiValues(BasicBlock *From, BasicBlock *To);
void findUndefBlocks(BasicBlock *PHIBlock,
- const SmallSet<BasicBlock *, 8> &Incomings,
+ const SmallPtrSet<BasicBlock *, 8> &Incomings,
SmallVector<BasicBlock *> &UndefBlks) const;
void mergeIfCompatible(EquivalenceClasses<PHINode *> &PhiClasses, PHINode *A,
@@ -762,7 +762,7 @@ void StructurizeCFG::addPhiValues(BasicBlock *From, BasicBlock *To) {
/// from some blocks as undefined. The function will find out all such blocks
/// and return in \p UndefBlks.
void StructurizeCFG::findUndefBlocks(
- BasicBlock *PHIBlock, const SmallSet<BasicBlock *, 8> &Incomings,
+ BasicBlock *PHIBlock, const SmallPtrSet<BasicBlock *, 8> &Incomings,
SmallVector<BasicBlock *> &UndefBlks) const {
// We may get a post-structured CFG like below:
//
@@ -788,7 +788,7 @@ void StructurizeCFG::findUndefBlocks(
// path N->F2->F3->B. For example, the threads take the branch F1->N may
// always take the branch F2->P2. So, when we are reconstructing a PHI
// originally in B, we can safely say the incoming value from N is undefined.
- SmallSet<BasicBlock *, 8> VisitedBlock;
+ SmallPtrSet<BasicBlock *, 8> VisitedBlock;
SmallVector<BasicBlock *, 8> Stack;
if (PHIBlock == ParentRegion->getExit()) {
for (auto P : predecessors(PHIBlock)) {
@@ -884,7 +884,7 @@ void StructurizeCFG::setPhiValues() {
PhiMap &BlkPhis = OldPhiIt->second;
SmallVector<BasicBlock *> &UndefBlks = UndefBlksMap[To];
- SmallSet<BasicBlock *, 8> Incomings;
+ SmallPtrSet<BasicBlock *, 8> Incomings;
// Get the undefined blocks shared by all the phi nodes.
if (!BlkPhis.empty()) {
diff --git a/llvm/lib/Transforms/Utils/CanonicalizeFreezeInLoops.cpp b/llvm/lib/Transforms/Utils/CanonicalizeFreezeInLoops.cpp
index 40010aee9c111..8044f611e89f0 100644
--- a/llvm/lib/Transforms/Utils/CanonicalizeFreezeInLoops.cpp
+++ b/llvm/lib/Transforms/Utils/CanonicalizeFreezeInLoops.cpp
@@ -193,7 +193,7 @@ bool CanonicalizeFreezeInLoopsImpl::run() {
if (Candidates.empty())
return false;
- SmallSet<PHINode *, 8> ProcessedPHIs;
+ SmallPtrSet<PHINode *, 8> ProcessedPHIs;
for (const auto &Info : Candidates) {
PHINode *PHI = Info.PHI;
if (!ProcessedPHIs.insert(Info.PHI).second)
diff --git a/llvm/lib/Transforms/Utils/ControlFlowUtils.cpp b/llvm/lib/Transforms/Utils/ControlFlowUtils.cpp
index 4b0065d0030cd..8954de618bc2d 100644
--- a/llvm/lib/Transforms/Utils/ControlFlowUtils.cpp
+++ b/llvm/lib/Transforms/Utils/ControlFlowUtils.cpp
@@ -276,7 +276,7 @@ std::pair<BasicBlock *, bool> ControlFlowHub::finalize(
DomTreeUpdater *DTU, SmallVectorImpl<BasicBlock *> &GuardBlocks,
const StringRef Prefix, std::optional<unsigned> MaxControlFlowBooleans) {
#ifndef NDEBUG
- SmallSet<BasicBlock *, 8> Incoming;
+ SmallPtrSet<BasicBlock *, 8> Incoming;
#endif
SetVector<BasicBlock *> Outgoing;
diff --git a/llvm/lib/Transforms/Utils/Local.cpp b/llvm/lib/Transforms/Utils/Local.cpp
index b559212de71d7..ac344904f90f0 100644
--- a/llvm/lib/Transforms/Utils/Local.cpp
+++ b/llvm/lib/Transforms/Utils/Local.cpp
@@ -275,7 +275,7 @@ bool llvm::ConstantFoldTerminator(BasicBlock *BB, bool DeleteDeadConditions,
Builder.CreateBr(TheOnlyDest);
BasicBlock *BB = SI->getParent();
- SmallSet<BasicBlock *, 8> RemovedSuccessors;
+ SmallPtrSet<BasicBlock *, 8> RemovedSuccessors;
// Remove entries from PHI nodes which we no longer branch to...
BasicBlock *SuccToKeep = TheOnlyDest;
@@ -343,7 +343,7 @@ bool llvm::ConstantFoldTerminator(BasicBlock *BB, bool DeleteDeadConditions,
if (auto *BA =
dyn_cast<BlockAddress>(IBI->getAddress()->stripPointerCasts())) {
BasicBlock *TheOnlyDest = BA->getBasicBlock();
- SmallSet<BasicBlock *, 8> RemovedSuccessors;
+ SmallPtrSet<BasicBlock *, 8> RemovedSuccessors;
// Insert the new branch.
Builder.CreateBr(TheOnlyDest);
@@ -2518,7 +2518,7 @@ unsigned llvm::changeToUnreachable(Instruction *I, bool PreserveLCSSA,
if (MSSAU)
MSSAU->changeToUnreachable(I);
- SmallSet<BasicBlock *, 8> UniqueSuccessors;
+ SmallPtrSet<BasicBlock *, 8> UniqueSuccessors;
// Loop over all of the successors, removing BB's entry from any PHI
// nodes.
diff --git a/llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp b/llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp
index d96f1d6c23d47..10c162bc6463a 100644
--- a/llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp
+++ b/llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp
@@ -136,7 +136,7 @@ class AssignmentTrackingInfo {
/// \p ToDelete that stores to this alloca.
void updateForDeletedStore(
StoreInst *ToDelete, DIBuilder &DIB,
- SmallSet<DbgVariableRecord *, 8> *DVRAssignsToDelete) const {
+ SmallPtrSet<DbgVariableRecord *, 8> *DVRAssignsToDelete) const {
// There's nothing to do if the alloca doesn't have any variables using
// assignment tracking.
if (DVRAssigns.empty())
@@ -382,7 +382,7 @@ struct PromoteMem2Reg {
SmallVector<AssignmentTrackingInfo, 8> AllocaATInfo;
/// A set of dbg.assigns to delete because they've been demoted to
/// dbg.values. Call cleanUpDbgAssigns to delete them.
- SmallSet<DbgVariableRecord *, 8> DVRAssignsToDelete;
+ SmallPtrSet<DbgVariableRecord *, 8> DVRAssignsToDelete;
/// The set of basic blocks the renamer has already visited.
BitVector Visited;
@@ -533,11 +533,10 @@ static void removeIntrinsicUsers(AllocaInst *AI) {
/// false there were some loads which were not dominated by the single store
/// and thus must be phi-ed with undef. We fall back to the standard alloca
/// promotion algorithm in that case.
-static bool
-rewriteSingleStoreAlloca(AllocaInst *AI, AllocaInfo &Info, LargeBlockInfo &LBI,
- const DataLayout &DL, DominatorTree &DT,
- AssumptionCache *AC,
- SmallSet<DbgVariableRecord *, 8> *DVRAssignsToDelete) {
+static bool rewriteSingleStoreAlloca(
+ AllocaInst *AI, AllocaInfo &Info, LargeBlockInfo &LBI, const DataLayout &DL,
+ DominatorTree &DT, AssumptionCache *AC,
+ SmallPtrSet<DbgVariableRecord *, 8> *DVRAssignsToDelete) {
StoreInst *OnlyStore = Info.OnlyStore;
Value *ReplVal = OnlyStore->getOperand(0);
// Loads may either load the stored value or uninitialized memory (undef).
@@ -647,11 +646,10 @@ rewriteSingleStoreAlloca(AllocaInst *AI, AllocaInfo &Info, LargeBlockInfo &LBI,
/// use(t);
/// *A = 42;
/// }
-static bool
-promoteSingleBlockAlloca(AllocaInst *AI, const AllocaInfo &Info,
- LargeBlockInfo &LBI, const DataLayout &DL,
- DominatorTree &DT, AssumptionCache *AC,
- SmallSet<DbgVariableRecord *, 8> *DVRAssignsToDelete) {
+static bool promoteSingleBlockAlloca(
+ AllocaInst *AI, const AllocaInfo &Info, LargeBlockInfo &LBI,
+ const DataLayout &DL, DominatorTree &DT, AssumptionCache *AC,
+ SmallPtrSet<DbgVariableRecord *, 8> *DVRAssignsToDelete) {
// The trickiest case to handle is when we have large blocks. Because of this,
// this code is optimized assuming that large blocks happen. This does not
// significantly pessimize the small block case. This uses LargeBlockInfo to
diff --git a/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp b/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
index 1eb8996fca031..e218db30d92b4 100644
--- a/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
+++ b/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
@@ -1346,7 +1346,7 @@ Value *SCEVExpander::visitAddRecExpr(const SCEVAddRecExpr *S) {
CanonicalIV->insertBefore(Header->begin());
rememberInstruction(CanonicalIV);
- SmallSet<BasicBlock *, 4> PredSeen;
+ SmallPtrSet<BasicBlock *, 4> PredSeen;
Constant *One = ConstantInt::get(Ty, 1);
for (pred_iterator HPI = HPB; HPI != HPE; ++HPI) {
BasicBlock *HP = *HPI;
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 675a230bd2c94..e009b81afd0ed 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8111,7 +8111,7 @@ void VPRecipeBuilder::collectScaledReductions(VFRange &Range) {
// extends are intended to be lowered along with the reduction itself.
// Build up a set of partial reduction ops for efficient use checking.
- SmallSet<User *, 4> PartialReductionOps;
+ SmallPtrSet<User *, 4> PartialReductionOps;
for (const auto &[PartialRdx, _] : PartialReductionChains)
PartialReductionOps.insert(PartialRdx.ExtendUser);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index b88de09a3e447..37dc41413966d 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -24477,7 +24477,7 @@ class HorizontalReduction {
// correct, replace internal uses with undef, and mark for eventual
// deletion.
#ifndef NDEBUG
- SmallSet<Value *, 4> IgnoreSet;
+ SmallPtrSet<Value *, 4> IgnoreSet;
for (ArrayRef<Value *> RdxOps : ReductionOps)
IgnoreSet.insert_range(RdxOps);
#endif
>From c48ec7fb60b5e0b4100731d75f82ea63c0ec7b45 Mon Sep 17 00:00:00 2001
From: Kazu Hirata <kazu at google.com>
Date: Mon, 18 Aug 2025 07:02:15 -0700
Subject: [PATCH 019/112] [clang] Proofread SourceBasedCodeCoverage.rst
(#154050)
---
clang/docs/SourceBasedCodeCoverage.rst | 32 +++++++++++++-------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/clang/docs/SourceBasedCodeCoverage.rst b/clang/docs/SourceBasedCodeCoverage.rst
index 3e8642479a56d..2f114070a8fb2 100644
--- a/clang/docs/SourceBasedCodeCoverage.rst
+++ b/clang/docs/SourceBasedCodeCoverage.rst
@@ -66,17 +66,17 @@ supported. Uninstrumented code simply won't be accounted for in reports.
To compile code with Modified Condition/Decision Coverage (MC/DC) enabled,
pass ``-fcoverage-mcdc`` in addition to the clang options specified above.
-MC/DC is an advanced form of code coverage most applicable in the embedded
+MC/DC is an advanced form of code coverage most applicable to the embedded
space.
Running the instrumented program
================================
-The next step is to run the instrumented program. When the program exits it
+The next step is to run the instrumented program. When the program exits, it
will write a **raw profile** to the path specified by the ``LLVM_PROFILE_FILE``
environment variable. If that variable does not exist, the profile is written
to ``default.profraw`` in the current directory of the program. If
-``LLVM_PROFILE_FILE`` contains a path to a non-existent directory, the missing
+``LLVM_PROFILE_FILE`` specifies a path to a non-existent directory, the missing
directory structure will be created. Additionally, the following special
**pattern strings** are rewritten:
@@ -97,7 +97,7 @@ directory structure will be created. Additionally, the following special
* "%b" expands out to the binary ID (build ID). It can be used with "%Nm" to
avoid binary signature collisions. To use it, the program should be compiled
with the build ID linker option (``--build-id`` for GNU ld or LLD,
- ``/build-id`` for lld-link on Windows). Linux, Windows and AIX are supported.
+ ``/build-id`` for lld-link on Windows). Linux, Windows, and AIX are supported.
* "%c" expands out to nothing, but enables a mode in which profile counter
updates are continuously synced to a file. This means that if the
@@ -128,7 +128,7 @@ and set bias to the offset between the original and the new counter location,
at which point every subsequent counter access will be to the new location,
which allows updating profile directly akin to the continuous mode.
-The advantage of this approach is that doesn't require any special OS support.
+The advantage of this approach is that it doesn't require any special OS support.
The disadvantage is the extra overhead due to additional instructions required
for each counter access (overhead both in terms of binary size and performance)
plus duplication of counters (i.e. one copy in the binary itself and another
@@ -137,7 +137,7 @@ other platforms by passing the ``-runtime-counter-relocation`` option to the
backend during compilation.
For a program such as the `Lit <https://llvm.org/docs/CommandGuide/lit.html>`_
-testing tool which invokes other programs, it may be necessary to set
+testing tool, which invokes other programs, it may be necessary to set
``LLVM_PROFILE_FILE`` for each invocation. The pattern strings "%p" or "%Nm"
may help to avoid corruption due to concurrency. Note that "%p" is also a Lit
token and needs to be escaped as "%%p".
@@ -149,7 +149,7 @@ token and needs to be escaped as "%%p".
Creating coverage reports
=========================
-Raw profiles have to be **indexed** before they can be used to generate
+Raw profiles must be **indexed** before they can be used to generate
coverage reports. This is done using the "merge" tool in ``llvm-profdata``
(which can combine multiple raw profiles and index them at the same time):
@@ -240,13 +240,13 @@ line-oriented report, try:
TOTAL 13 0 100.00% 3 0 100.00% 13 0 100.00% 12 2 83.33%
The ``llvm-cov`` tool supports specifying a custom demangler, writing out
-reports in a directory structure, and generating html reports. For the full
+reports in a directory structure, and generating HTML reports. For the full
list of options, please refer to the `command guide
<https://llvm.org/docs/CommandGuide/llvm-cov.html>`_.
A few final notes:
-* The ``-sparse`` flag is optional but can result in dramatically smaller
+* The ``-sparse`` flag is optional but can produce dramatically smaller
indexed profiles. This option should not be used if the indexed profile will
be reused for PGO.
@@ -255,7 +255,7 @@ A few final notes:
information directly into an existing raw profile on disk. The details are
out of scope.
-* The ``llvm-profdata`` tool can be used to merge together multiple raw or
+* The ``llvm-profdata`` tool can be used to merge multiple raw or
indexed profiles. To combine profiling data from multiple runs of a program,
try e.g:
@@ -299,7 +299,7 @@ There are six statistics tracked in a coverage summary:
source code that may each evaluate to either "true" or "false". These
conditions may comprise larger boolean expressions linked by boolean logical
operators. For example, "x = (y == 2) || (z < 10)" is a boolean expression
- that is comprised of two individual conditions, each of which evaluates to
+ comprised of two individual conditions, each of which evaluates to
either true or false, producing four total branch outcomes.
* Modified Condition/Decision Coverage (MC/DC) is the percentage of individual
@@ -316,7 +316,7 @@ There are six statistics tracked in a coverage summary:
``-show-mcdc-summary`` option as long as code was also compiled using the
clang option ``-fcoverage-mcdc``.
- * Boolean expressions that are only comprised of one condition (and therefore
+ * Boolean expressions comprised of only one condition (and therefore
have no logical operators) are not included in MC/DC analysis and are
trivially deducible using branch coverage.
@@ -366,7 +366,7 @@ By default the compiler runtime uses a static initializer to determine the
profile output path and to register a writer function. To collect profiles
without using static initializers, do this manually:
-* Export a ``int __llvm_profile_runtime`` symbol from each instrumented shared
+* Export an ``int __llvm_profile_runtime`` symbol from each instrumented shared
library and executable. When the linker finds a definition of this symbol, it
knows to skip loading the object which contains the profiling runtime's
static initializer.
@@ -380,7 +380,7 @@ without using static initializers, do this manually:
to ``__llvm_profile_write_file``.
* Forward-declare ``int __llvm_profile_write_file(void)`` and call it to write
- out a profile. This function returns 0 when it succeeds, and a non-zero value
+ out a profile. This function returns 0 on success, and a non-zero value
otherwise. Calling this function multiple times appends profile data to an
existing on-disk raw profile.
@@ -418,7 +418,7 @@ Collecting coverage reports for the llvm project
================================================
To prepare a coverage report for llvm (and any of its sub-projects), add
-``-DLLVM_BUILD_INSTRUMENTED_COVERAGE=On`` to the cmake configuration. Raw
+``-DLLVM_BUILD_INSTRUMENTED_COVERAGE=On`` to the CMake configuration. Raw
profiles will be written to ``$BUILD_DIR/profiles/``. To prepare an html
report, run ``llvm/utils/prepare-code-coverage-artifact.py``.
@@ -429,7 +429,7 @@ To specify an alternate directory for raw profiles, use
Drawbacks and limitations
=========================
-* Prior to version 2.26, the GNU binutils BFD linker is not able link programs
+* Prior to version 2.26, the GNU binutils BFD linker cannot link programs
compiled with ``-fcoverage-mapping`` in its ``--gc-sections`` mode. Possible
workarounds include disabling ``--gc-sections``, upgrading to a newer version
of BFD, or using the Gold linker.
>From f8cd5825346acda345df4767eed54d27d2089217 Mon Sep 17 00:00:00 2001
From: Aiden Grossman <aidengrossman at google.com>
Date: Mon, 18 Aug 2025 07:05:39 -0700
Subject: [PATCH 020/112] [Github] Remove call to llvm-project-tests.yml from
mlir-spirv-tests.yml
This will eventually allow for removing llvm-project-tests.yml. This
should significantly reduce the complexity of this workflow (including
the complexity of llvm-project-tests.yml) at the cost of a little bit of
duplication.
Reviewers: IgWod-IMG, kuhar
Reviewed By: kuhar
Pull Request: https://github.com/llvm/llvm-project/pull/153871
---
.github/workflows/mlir-spirv-tests.yml | 31 +++++++++++++++++++++-----
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/.github/workflows/mlir-spirv-tests.yml b/.github/workflows/mlir-spirv-tests.yml
index 48b6c69a61f50..78952ccad2642 100644
--- a/.github/workflows/mlir-spirv-tests.yml
+++ b/.github/workflows/mlir-spirv-tests.yml
@@ -24,9 +24,28 @@ jobs:
check_spirv:
if: github.repository_owner == 'llvm'
name: Test MLIR SPIR-V
- uses: ./.github/workflows/llvm-project-tests.yml
- with:
- build_target: check-mlir
- projects: mlir
- extra_cmake_args: '-DLLVM_TARGETS_TO_BUILD="host" -DLLVM_INCLUDE_SPIRV_TOOLS_TESTS=ON'
- os_list: '["ubuntu-24.04"]'
+ runs-on: ubuntu-24.04
+ container:
+ image: ghcr.io/llvm/ci-ubuntu-24.04:latest
+ steps:
+ - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
+ - name: Setup ccache
+ uses: hendrikmuhs/ccache-action@a1209f81afb8c005c13b4296c32e363431bffea5 # v1.2.17
+ with:
+ max-size: 2G
+ key: spirv-mlir-ubuntu-24.04
+ variant: sccache
+ - name: Build and Test
+ run: |
+ mkdir build
+ cmake -GNinja \
+ -S llvm \
+ -B build \
+ -DCMAKE_BUILD_TYPE=Release \
+ -DLLVM_ENABLE_ASSERTIONS=ON \
+ -DCMAKE_C_COMPILER_LAUNCHER=sccache \
+ -DCMAKE_CXX_COMPILER_LAUNCHER=sccache \
+ -DLLVM_TARGETS_TO_BUILD="host" \
+ -DLLVM_INCLUDE_SPIRV_TOOLS_TESTS=ON \
+ -DLLVM_ENABLE_PROJECTS=mlir
+ ninja -C build check-mlir
>From 2497864e0973ba8c8fd16c8cbef7869e622256fa Mon Sep 17 00:00:00 2001
From: Aiden Grossman <aidengrossman at google.com>
Date: Mon, 18 Aug 2025 07:07:26 -0700
Subject: [PATCH 021/112] [Github] Remove call to llvm-project-tests from
libclang tests
This allows for removing llvm-project-tests.yml. This significantly
reduces the complexity of this workflow (including the complexity of
llvm-project-tests.yml) at the cost of a little bit of duplication with
the other workflows that were also using llvm-project-tests.yml.
Reviewers: tstellar, DeinAlptraum
Reviewed By: DeinAlptraum
Pull Request: https://github.com/llvm/llvm-project/pull/153876
---
.github/workflows/libclang-python-tests.yml | 38 +++++++++++++++------
1 file changed, 27 insertions(+), 11 deletions(-)
diff --git a/.github/workflows/libclang-python-tests.yml b/.github/workflows/libclang-python-tests.yml
index 50ef4acf2feb1..e168928325561 100644
--- a/.github/workflows/libclang-python-tests.yml
+++ b/.github/workflows/libclang-python-tests.yml
@@ -4,7 +4,6 @@ permissions:
contents: read
on:
- workflow_dispatch:
push:
branches:
- 'main'
@@ -13,29 +12,46 @@ on:
- 'clang/tools/libclang/**'
- 'clang/CMakeList.txt'
- '.github/workflows/libclang-python-tests.yml'
- - '.github/workflows/llvm-project-tests.yml'
pull_request:
paths:
- 'clang/bindings/python/**'
- 'clang/tools/libclang/**'
- 'clang/CMakeList.txt'
- '.github/workflows/libclang-python-tests.yml'
- - '.github/workflows/llvm-project-tests.yml'
jobs:
check-clang-python:
# Build libclang and then run the libclang Python binding's unit tests.
+ # There is an issue running on "windows-2019".
+ # See https://github.com/llvm/llvm-project/issues/76601#issuecomment-1873049082.
name: Build and run Python unit tests
if: github.repository == 'llvm/llvm-project'
+ runs-on: ubuntu-24.04
strategy:
fail-fast: false
matrix:
python-version: ["3.8", "3.13"]
- uses: ./.github/workflows/llvm-project-tests.yml
- with:
- build_target: check-clang-python
- projects: clang
- # There is an issue running on "windows-2019".
- # See https://github.com/llvm/llvm-project/issues/76601#issuecomment-1873049082.
- os_list: '["ubuntu-24.04"]'
- python_version: ${{ matrix.python-version }}
+ steps:
+ - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
+ - name: Setup Python
+ uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
+ with:
+ python-version: ${{ matrix.python-version }}
+ - name: Setup ccache
+ uses: hendrikmuhs/ccache-action@a1209f81afb8c005c13b4296c32e363431bffea5 # v1.2.17
+ with:
+ max-size: 2G
+ key: spirv-ubuntu-24.04
+ variant: sccache
+ - name: Build and Test
+ run: |
+ mkdir build
+ cmake -GNinja \
+ -S llvm \
+ -B build \
+ -DCMAKE_BUILD_TYPE=Release \
+ -DLLVM_ENABLE_ASSERTIONS=ON \
+ -DCMAKE_C_COMPILER_LAUNCHER=sccache \
+ -DCMAKE_CXX_COMPILER_LAUNCHER=sccache \
+ -DLLVM_ENABLE_PROJECTS=clang
+ ninja -C build check-clang-python
From ae75884130ceb31c6a0f8520e906ebbfd6636124 Mon Sep 17 00:00:00 2001
From: Krzysztof Parzyszek <Krzysztof.Parzyszek at amd.com>
Date: Mon, 18 Aug 2025 09:13:27 -0500
Subject: [PATCH 022/112] [Frontend][OpenMP] Add 6.1 as a valid OpenMP version
(#153628)
Co-authored-by: Michael Klemm <michael.klemm at amd.com>
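The version check in the diff below can be sketched in isolation. This is a minimal illustration, not the flang implementation: the enum and function names are hypothetical, and the separate handling of old versions in the real code is folded into a single "invalid" result here. Only the version list and the two thresholds (31 and 60) are taken from the patch.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical names for this sketch; only the version list and the two
// thresholds mirror the values used in the patch.
enum class VersionStatus {
  Invalid,          // not in the list of accepted versions
  FullySupported,   // no diagnostic
  Incomplete,       // finalized spec, support still incomplete
  UnderDevelopment  // spec not finalized yet (the new warning)
};

constexpr std::array<unsigned, 8> kVersions{31, 40, 45, 50, 51, 52, 60, 61};
constexpr unsigned kNewestFullySupported = 31;
constexpr unsigned kLatestFinalized = 60;

inline VersionStatus classifyOpenMPVersion(unsigned version) {
  if (std::find(kVersions.begin(), kVersions.end(), version) ==
      kVersions.end())
    return VersionStatus::Invalid;
  // Check the "future" case first so that it takes priority over the
  // "incomplete support" warning, as in the patch.
  if (version > kLatestFinalized)
    return VersionStatus::UnderDevelopment;
  if (version > kNewestFullySupported)
    return VersionStatus::Incomplete;
  return VersionStatus::FullySupported;
}
```

Under these assumptions, 61 is the only value that reaches the new "under development" branch; 40 through 60 keep the existing incomplete-support warning.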
---
flang/lib/Frontend/CompilerInvocation.cpp | 17 ++++++++++++++++-
flang/test/Driver/fopenmp-version.F90 | 6 +++++-
llvm/lib/Frontend/OpenMP/OMP.cpp | 2 +-
3 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/flang/lib/Frontend/CompilerInvocation.cpp b/flang/lib/Frontend/CompilerInvocation.cpp
index 265ba8e031a62..4719a242035ed 100644
--- a/flang/lib/Frontend/CompilerInvocation.cpp
+++ b/flang/lib/Frontend/CompilerInvocation.cpp
@@ -1187,6 +1187,7 @@ static bool parseOpenMPArgs(CompilerInvocation &res, llvm::opt::ArgList &args,
llvm::Triple t(res.getTargetOpts().triple);
constexpr unsigned newestFullySupported = 31;
+ constexpr unsigned latestFinalized = 60;
// By default OpenMP is set to the most recent fully supported version
res.getLangOpts().OpenMPVersion = newestFullySupported;
res.getFrontendOpts().features.Enable(
@@ -1209,12 +1210,26 @@ static bool parseOpenMPArgs(CompilerInvocation &res, llvm::opt::ArgList &args,
diags.Report(diagID) << value << arg->getAsString(args) << versions.str();
};
+ auto reportFutureVersion = [&](llvm::StringRef value) {
+ const unsigned diagID = diags.getCustomDiagID(
+ clang::DiagnosticsEngine::Warning,
+ "The specification for OpenMP version %0 is still under development; "
+ "the syntax and semantics of new features may be subject to change");
+ std::string buffer;
+ llvm::raw_string_ostream versions(buffer);
+ llvm::interleaveComma(ompVersions, versions);
+
+ diags.Report(diagID) << value;
+ };
+
llvm::StringRef value = arg->getValue();
if (!value.getAsInteger(/*radix=*/10, version)) {
if (llvm::is_contained(ompVersions, version)) {
res.getLangOpts().OpenMPVersion = version;
- if (version > newestFullySupported)
+ if (version > latestFinalized)
+ reportFutureVersion(value);
+ else if (version > newestFullySupported)
diags.Report(clang::diag::warn_openmp_incomplete) << version;
} else if (llvm::is_contained(oldVersions, version)) {
const unsigned diagID =
diff --git a/flang/test/Driver/fopenmp-version.F90 b/flang/test/Driver/fopenmp-version.F90
index c2866561461b7..59406d3dd32c8 100644
--- a/flang/test/Driver/fopenmp-version.F90
+++ b/flang/test/Driver/fopenmp-version.F90
@@ -22,4 +22,8 @@
!RUN: not %flang -c -fopenmp -fopenmp-version=29 %s 2>&1 | FileCheck --check-prefix=ERR-BAD %s
-!ERR-BAD: error: '29' is not a valid OpenMP version in '-fopenmp-version=29', valid versions are 31, 40, 45, 50, 51, 52, 60
+!ERR-BAD: error: '29' is not a valid OpenMP version in '-fopenmp-version=29', valid versions are 31, 40, 45, 50, 51, 52, 60, 61
+
+!RUN: %flang -c -fopenmp -fopenmp-version=61 %s 2>&1 | FileCheck --check-prefix=FUTURE %s
+
+!FUTURE: The specification for OpenMP version 61 is still under development; the syntax and semantics of new features may be subject to change
diff --git a/llvm/lib/Frontend/OpenMP/OMP.cpp b/llvm/lib/Frontend/OpenMP/OMP.cpp
index 555e2a61e411e..9e625b809de9e 100644
--- a/llvm/lib/Frontend/OpenMP/OMP.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMP.cpp
@@ -190,7 +190,7 @@ bool isCombinedConstruct(Directive D) {
}
ArrayRef<unsigned> getOpenMPVersions() {
- static unsigned Versions[]{31, 40, 45, 50, 51, 52, 60};
+ static unsigned Versions[]{31, 40, 45, 50, 51, 52, 60, 61};
return Versions;
}
From b368e7f6a5db365aa8d9a514db018be9607f97d1 Mon Sep 17 00:00:00 2001
From: Connector Switch <c8ef at outlook.com>
Date: Mon, 18 Aug 2025 22:15:52 +0800
Subject: [PATCH 023/112] [flang] optimize `acosd` precision (#154118)
Part of https://github.com/llvm/llvm-project/issues/150452.
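The precision issue can be illustrated outside the compiler. The sketch below is an analogy in plain C++, assuming `long double` stands in for a wider result type such as `real(16)`; the function names are hypothetical and none of this is flang code. It compares the radians-to-degrees factor computed in `double` and then widened against the same factor computed directly in the wider type.

```cpp
#include <cassert>
#include <limits>

// Radians-to-degrees factor computed entirely in double, analogous to the
// old code path that built an f64 constant and converted it afterwards.
inline double factorViaDouble() {
  return 180.0 / 3.141592653589793;
}

// The same factor computed directly in the wider type, analogous to building
// the constant with the float semantics of the result type.
inline long double factorInLongDouble() {
  return 180.0L / 3.141592653589793238462643383279502884L;
}

// True when widening the double constant gives a different value than
// computing the constant natively in the wider type.
inline bool wideningLosesPrecision() {
  return static_cast<long double>(factorViaDouble()) != factorInLongDouble();
}
```

On platforms where `long double` has more mantissa bits than `double`, the two values differ in the low-order digits. On platforms where the two types are the same, no difference is expected.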
---
flang/lib/Optimizer/Builder/IntrinsicCall.cpp | 9 +++++----
flang/test/Lower/Intrinsics/acosd.f90 | 18 ++++++++++++++----
2 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/flang/lib/Optimizer/Builder/IntrinsicCall.cpp b/flang/lib/Optimizer/Builder/IntrinsicCall.cpp
index 319ab1912cd3d..22193f0de88a1 100644
--- a/flang/lib/Optimizer/Builder/IntrinsicCall.cpp
+++ b/flang/lib/Optimizer/Builder/IntrinsicCall.cpp
@@ -2672,10 +2672,11 @@ mlir::Value IntrinsicLibrary::genAcosd(mlir::Type resultType,
mlir::FunctionType::get(context, {resultType}, {args[0].getType()});
mlir::Value result =
getRuntimeCallGenerator("acos", ftype)(builder, loc, {args[0]});
- llvm::APFloat pi = llvm::APFloat(llvm::numbers::pi);
- mlir::Value dfactor = builder.createRealConstant(
- loc, mlir::Float64Type::get(context), llvm::APFloat(180.0) / pi);
- mlir::Value factor = builder.createConvert(loc, args[0].getType(), dfactor);
+ const llvm::fltSemantics &fltSem =
+ llvm::cast<mlir::FloatType>(resultType).getFloatSemantics();
+ llvm::APFloat pi = llvm::APFloat(fltSem, llvm::numbers::pis);
+ mlir::Value factor = builder.createRealConstant(
+ loc, resultType, llvm::APFloat(fltSem, "180.0") / pi);
return mlir::arith::MulFOp::create(builder, loc, result, factor);
}
diff --git a/flang/test/Lower/Intrinsics/acosd.f90 b/flang/test/Lower/Intrinsics/acosd.f90
index 7dfa28fd6494e..175a4902620b8 100644
--- a/flang/test/Lower/Intrinsics/acosd.f90
+++ b/flang/test/Lower/Intrinsics/acosd.f90
@@ -1,3 +1,4 @@
+! REQUIRES: flang-supports-f128-math
! RUN: %flang_fc1 -emit-fir %s -o - | FileCheck %s --check-prefixes="CHECK"
function test_real4(x)
@@ -6,9 +7,8 @@ function test_real4(x)
end function
! CHECK-LABEL: @_QPtest_real4
-! CHECK: %[[dfactor:.*]] = arith.constant 57.295779513082323 : f64
+! CHECK: %[[factor:.*]] = arith.constant 57.2957763 : f32
! CHECK: %[[result:.*]] = math.acos %{{.*}} fastmath<contract> : f32
-! CHECK: %[[factor:.*]] = fir.convert %[[dfactor]] : (f64) -> f32
! CHECK: %[[arg:.*]] = arith.mulf %[[result]], %[[factor]] fastmath<contract> : f32
function test_real8(x)
@@ -17,6 +17,16 @@ function test_real8(x)
end function
! CHECK-LABEL: @_QPtest_real8
-! CHECK: %[[dfactor:.*]] = arith.constant 57.295779513082323 : f64
+! CHECK: %[[factor:.*]] = arith.constant 57.295779513082323 : f64
! CHECK: %[[result:.*]] = math.acos %{{.*}} fastmath<contract> : f64
-! CHECK: %[[arg:.*]] = arith.mulf %[[result]], %[[dfactor]] fastmath<contract> : f64
+! CHECK: %[[arg:.*]] = arith.mulf %[[result]], %[[factor]] fastmath<contract> : f64
+
+function test_real16(x)
+ real(16) :: x, test_real16
+ test_real16 = acosd(x)
+end function
+
+! CHECK-LABEL: @_QPtest_real16
+! CHECK: %[[factor:.*]] = arith.constant 57.295779513082320876798154814105{{.*}} : f128
+! CHECK: %[[result:.*]] = fir.call @_FortranAAcosF128({{.*}}) fastmath<contract> : (f128) -> f128
+! CHECK: %[[arg:.*]] = arith.mulf %[[result]], %[[factor]] fastmath<contract> : f128
From f5dc3021cda339f7695272ad6e02b79f193c50c4 Mon Sep 17 00:00:00 2001
From: Aaron Ballman <aaron at aaronballman.com>
Date: Mon, 18 Aug 2025 10:22:31 -0400
Subject: [PATCH 024/112] [C] Fix failing assertion with designated inits
(#154120)
The diagnostic checks for incompatible pointer-to-integer conversions
would trigger an assertion when the designated initializer is for an
array of unknown bound.
Fixes #154046
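The shape of the fix, a checked downcast in place of an asserting one, can be sketched with standard C++. This is an analogy only: it uses `dynamic_cast` where the patch uses LLVM's `dyn_cast`, and the type and function names are hypothetical stand-ins rather than Clang's classes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for an array type hierarchy; not Clang's classes.
struct ArrayType {
  virtual ~ArrayType() = default;
};
struct ConstantArrayType : ArrayType {
  explicit ConstantArrayType(std::uint64_t n) : size(n) {}
  std::uint64_t size;
};
struct IncompleteArrayType : ArrayType {};

// Clamp the string length to the array bound only when the array actually
// has a constant bound. An unconditional downcast would be invalid for an
// array of unknown bound, which is the case that asserted before the fix.
inline std::uint64_t clampedStringLength(const ArrayType &at,
                                         std::uint64_t strLen) {
  if (const auto *cat = dynamic_cast<const ConstantArrayType *>(&at);
      cat && cat->size < strLen)
    return cat->size;
  return strLen;
}
```

With an array of unknown bound the length passes through unchanged, which matches the behaviour the new tests exercise with `(const char[]) { [0] = "", ... }`.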
---
clang/docs/ReleaseNotes.rst | 2 ++
clang/lib/Sema/SemaInit.cpp | 10 ++++++----
clang/test/Sema/designated-initializers.c | 7 +++++++
clang/test/SemaObjC/exprs.m | 7 +++++++
4 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index e04cc326b8a0a..9ea9fcdf889df 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -193,6 +193,8 @@ Bug Fixes in This Version
targets that treat ``_Float16``/``__fp16`` as native scalar types. Previously
the warning was silently lost because the operands differed only by an implicit
cast chain. (#GH149967).
+- Fixed a crash with incompatible pointer to integer conversions in designated
+ initializers involving string literals. (#GH154046)
Bug Fixes to Compiler Builtins
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/clang/lib/Sema/SemaInit.cpp b/clang/lib/Sema/SemaInit.cpp
index d7cca4bc65d2c..60f9d449fc037 100644
--- a/clang/lib/Sema/SemaInit.cpp
+++ b/clang/lib/Sema/SemaInit.cpp
@@ -3294,8 +3294,9 @@ InitListChecker::CheckDesignatedInitializer(const InitializedEntity &Entity,
if (StringLiteral *SL = dyn_cast<StringLiteral>(SubExpr)) {
// Get the length of the string.
uint64_t StrLen = SL->getLength();
- if (cast<ConstantArrayType>(AT)->getSize().ult(StrLen))
- StrLen = cast<ConstantArrayType>(AT)->getZExtSize();
+ if (const auto *CAT = dyn_cast<ConstantArrayType>(AT);
+ CAT && CAT->getSize().ult(StrLen))
+ StrLen = CAT->getZExtSize();
StructuredList->resizeInits(Context, StrLen);
// Build a literal for each character in the string, and put them into
@@ -3317,8 +3318,9 @@ InitListChecker::CheckDesignatedInitializer(const InitializedEntity &Entity,
// Get the length of the string.
uint64_t StrLen = Str.size();
- if (cast<ConstantArrayType>(AT)->getSize().ult(StrLen))
- StrLen = cast<ConstantArrayType>(AT)->getZExtSize();
+ if (const auto *CAT = dyn_cast<ConstantArrayType>(AT);
+ CAT && CAT->getSize().ult(StrLen))
+ StrLen = CAT->getZExtSize();
StructuredList->resizeInits(Context, StrLen);
// Build a literal for each character in the string, and put them into
diff --git a/clang/test/Sema/designated-initializers.c b/clang/test/Sema/designated-initializers.c
index 31a3380b5db7d..11dc3a2308dee 100644
--- a/clang/test/Sema/designated-initializers.c
+++ b/clang/test/Sema/designated-initializers.c
@@ -368,3 +368,10 @@ struct {
.b = 0, // expected-warning {{initializer overrides prior initialization of this subobject}}
},
};
+
+void gh154046(void) {
+ (void)(const char[]) {
+ [0] = "", // expected-error {{incompatible pointer to integer conversion initializing 'const char' with an expression of type 'char[1]'}}
+ [1] = "" // expected-error {{incompatible pointer to integer conversion initializing 'const char' with an expression of type 'char[1]'}}
+ }[1];
+}
diff --git a/clang/test/SemaObjC/exprs.m b/clang/test/SemaObjC/exprs.m
index dcf46d3cdbfbc..c42d270657c10 100644
--- a/clang/test/SemaObjC/exprs.m
+++ b/clang/test/SemaObjC/exprs.m
@@ -36,3 +36,10 @@ void test_encode(void) {
(void)@encode(Incomplete_ObjC_class*);
(void)@encode(id);
}
+
+void gh154046(void) {
+ (void)(const char[]) {
+ [0] = @encode(int), // expected-error {{incompatible pointer to integer conversion initializing 'const char' with an expression of type 'char[2]'}}
+ [1] = @encode(float) // expected-error {{incompatible pointer to integer conversion initializing 'const char' with an expression of type 'char[2]'}}
+ }[1];
+}
From 0dbcdf33b835615144b308f2e7cc7f24657218eb Mon Sep 17 00:00:00 2001
From: erichkeane <ekeane at nvidia.com>
Date: Mon, 18 Aug 2025 06:51:38 -0700
Subject: [PATCH 025/112] [OpenACC] Fix racing commit test failures for
firstprivate lowering
The original patch implementing basic lowering for firstprivate did not
include the Sema work that renames the generated variable from
openacc.private.init to openacc.firstprivate.init. I forgot about that
when I merged the Sema changes this morning, so the tests now fail. This
patch updates them.
Additionally, as suggested post-commit on #153622, the APInt used for the
index should have a width that matches the size type, so this patch
switches to that.
---
clang/lib/Sema/SemaOpenACC.cpp | 5 +++-
.../combined-firstprivate-clause.cpp | 24 +++++++++----------
.../compute-firstprivate-clause-templates.cpp | 8 +++----
.../compute-firstprivate-clause.cpp | 24 +++++++++----------
4 files changed, 32 insertions(+), 29 deletions(-)
diff --git a/clang/lib/Sema/SemaOpenACC.cpp b/clang/lib/Sema/SemaOpenACC.cpp
index c2af456224bec..3f870ba528ad0 100644
--- a/clang/lib/Sema/SemaOpenACC.cpp
+++ b/clang/lib/Sema/SemaOpenACC.cpp
@@ -2674,7 +2674,10 @@ SemaOpenACC::CreateInitRecipe(OpenACCClauseKind CK, const Expr *VarExpr) {
// DeclRefExpr).
auto *Idx = IntegerLiteral::Create(
- getASTContext(), llvm::APInt(sizeof(std::size_t) * 8, I),
+ getASTContext(),
+ llvm::APInt(
+ getASTContext().getTypeSize(getASTContext().getSizeType()),
+ I),
getASTContext().getSizeType(), VarExpr->getBeginLoc());
Expr *Subscript = new (getASTContext()) ArraySubscriptExpr(
diff --git a/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp b/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp
index 6d15abc2fefd4..7571e5e3306f7 100644
--- a/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp
+++ b/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp
@@ -17,7 +17,7 @@ struct HasDtor {
// CHECK: acc.firstprivate.recipe @firstprivatization__ZTSA5_7HasDtor : !cir.ptr<!cir.array<!rec_HasDtor x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.array<!rec_HasDtor x 5>, !cir.ptr<!cir.array<!rec_HasDtor x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.array<!rec_HasDtor x 5>, !cir.ptr<!cir.array<!rec_HasDtor x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}):
@@ -48,7 +48,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_14NonDefaultCtor : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}):
-// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !cir.array<!rec_NonDefaultCtor x 5>, !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !cir.array<!rec_NonDefaultCtor x 5>, !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}):
@@ -58,7 +58,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_13CopyConstruct : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.array<!rec_CopyConstruct x 5>, !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.array<!rec_CopyConstruct x 5>, !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}):
@@ -68,7 +68,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_15NoCopyConstruct : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.array<!rec_NoCopyConstruct x 5>, !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.array<!rec_NoCopyConstruct x 5>, !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}):
@@ -78,7 +78,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_f : !cir.ptr<!cir.array<!cir.float x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.array<!cir.float x 5>, !cir.ptr<!cir.array<!cir.float x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.array<!cir.float x 5>, !cir.ptr<!cir.array<!cir.float x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}):
@@ -88,7 +88,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_i : !cir.ptr<!cir.array<!s32i x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.array<!s32i x 5>, !cir.ptr<!cir.array<!s32i x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.array<!s32i x 5>, !cir.ptr<!cir.array<!s32i x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}):
@@ -98,7 +98,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS7HasDtor : !cir.ptr<!rec_HasDtor> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
-// CHECK-NEXT: cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
@@ -112,7 +112,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS14NonDefaultCtor : !cir.ptr<!rec_NonDefaultCtor> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
-// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["openacc.private.init"]
+// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
@@ -122,7 +122,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS13CopyConstruct : !cir.ptr<!rec_CopyConstruct> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
-// CHECK-NEXT: cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
@@ -132,7 +132,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS15NoCopyConstruct : !cir.ptr<!rec_NoCopyConstruct> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}):
-// CHECK-NEXT: cir.alloca !rec_NoCopyConstruct, !cir.ptr<!rec_NoCopyConstruct>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !rec_NoCopyConstruct, !cir.ptr<!rec_NoCopyConstruct>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}):
@@ -142,7 +142,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSf : !cir.ptr<!cir.float> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.float> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.float, !cir.ptr<!cir.float>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.float, !cir.ptr<!cir.float>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.float> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.float> {{.*}}):
@@ -152,7 +152,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSi : !cir.ptr<!s32i> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!s32i> {{.*}}):
-// CHECK-NEXT: cir.alloca !s32i, !cir.ptr<!s32i>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !s32i, !cir.ptr<!s32i>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!s32i> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!s32i> {{.*}}):
diff --git a/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp
index a9f0dd99e3bd4..00aaaba3663f5 100644
--- a/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp
+++ b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp
@@ -15,7 +15,7 @@ struct HasDtor {
// CHECK: acc.firstprivate.recipe @firstprivatization__ZTSi : !cir.ptr<!s32i> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!s32i> {{.*}}):
-// CHECK-NEXT: cir.alloca !s32i, !cir.ptr<!s32i>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !s32i, !cir.ptr<!s32i>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!s32i> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!s32i> {{.*}}):
@@ -24,7 +24,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS7HasDtor : !cir.ptr<!rec_HasDtor> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
-// CHECK-NEXT: cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
@@ -37,7 +37,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS14NonDefaultCtor : !cir.ptr<!rec_NonDefaultCtor> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
-// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["openacc.private.init"]
+// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
@@ -46,7 +46,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS13CopyConstruct : !cir.ptr<!rec_CopyConstruct> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
-// CHECK-NEXT: cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
diff --git a/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp
index d25208c65ac20..924dbf6254ee4 100644
--- a/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp
+++ b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp
@@ -17,7 +17,7 @@ struct HasDtor {
// CHECK: acc.firstprivate.recipe @firstprivatization__ZTSA5_7HasDtor : !cir.ptr<!cir.array<!rec_HasDtor x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.array<!rec_HasDtor x 5>, !cir.ptr<!cir.array<!rec_HasDtor x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.array<!rec_HasDtor x 5>, !cir.ptr<!cir.array<!rec_HasDtor x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_HasDtor x 5>> {{.*}}):
@@ -48,7 +48,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_14NonDefaultCtor : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}):
-// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !cir.array<!rec_NonDefaultCtor x 5>, !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !cir.array<!rec_NonDefaultCtor x 5>, !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> {{.*}}):
@@ -58,7 +58,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_13CopyConstruct : !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.array<!rec_CopyConstruct x 5>, !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.array<!rec_CopyConstruct x 5>, !cir.ptr<!cir.array<!rec_CopyConstruct x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_CopyConstruct x 5>> {{.*}}):
@@ -68,7 +68,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_15NoCopyConstruct : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.array<!rec_NoCopyConstruct x 5>, !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.array<!rec_NoCopyConstruct x 5>, !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> {{.*}}):
@@ -78,7 +78,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_f : !cir.ptr<!cir.array<!cir.float x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.array<!cir.float x 5>, !cir.ptr<!cir.array<!cir.float x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.array<!cir.float x 5>, !cir.ptr<!cir.array<!cir.float x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!cir.float x 5>> {{.*}}):
@@ -88,7 +88,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSA5_i : !cir.ptr<!cir.array<!s32i x 5>> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.array<!s32i x 5>, !cir.ptr<!cir.array<!s32i x 5>>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.array<!s32i x 5>, !cir.ptr<!cir.array<!s32i x 5>>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FORM:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.array<!s32i x 5>> {{.*}}):
@@ -98,7 +98,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS7HasDtor : !cir.ptr<!rec_HasDtor> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
-// CHECK-NEXT: cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !rec_HasDtor, !cir.ptr<!rec_HasDtor>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
@@ -112,7 +112,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS14NonDefaultCtor : !cir.ptr<!rec_NonDefaultCtor> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
-// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["openacc.private.init"]
+// CHECK-NEXT: %[[ALLOCA:.*]] = cir.alloca !rec_NonDefaultCtor, !cir.ptr<!rec_NonDefaultCtor>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
@@ -122,7 +122,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS13CopyConstruct : !cir.ptr<!rec_CopyConstruct> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
-// CHECK-NEXT: cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !rec_CopyConstruct, !cir.ptr<!rec_CopyConstruct>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_CopyConstruct> {{.*}}):
@@ -132,7 +132,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTS15NoCopyConstruct : !cir.ptr<!rec_NoCopyConstruct> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}):
-// CHECK-NEXT: cir.alloca !rec_NoCopyConstruct, !cir.ptr<!rec_NoCopyConstruct>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !rec_NoCopyConstruct, !cir.ptr<!rec_NoCopyConstruct>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}):
@@ -142,7 +142,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSf : !cir.ptr<!cir.float> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!cir.float> {{.*}}):
-// CHECK-NEXT: cir.alloca !cir.float, !cir.ptr<!cir.float>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !cir.float, !cir.ptr<!cir.float>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!cir.float> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!cir.float> {{.*}}):
@@ -152,7 +152,7 @@ struct HasDtor {
//
// CHECK-NEXT: acc.firstprivate.recipe @firstprivatization__ZTSi : !cir.ptr<!s32i> init {
// CHECK-NEXT: ^bb0(%[[ARG:.*]]: !cir.ptr<!s32i> {{.*}}):
-// CHECK-NEXT: cir.alloca !s32i, !cir.ptr<!s32i>, ["openacc.private.init"]
+// CHECK-NEXT: cir.alloca !s32i, !cir.ptr<!s32i>, ["openacc.firstprivate.init"]
// CHECK-NEXT: acc.yield
// CHECK-NEXT: } copy {
// CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!s32i> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!s32i> {{.*}}):
>From 8b52e5ac22aa82bd81dc0ac165ec7d2a64b769d8 Mon Sep 17 00:00:00 2001
From: David Green <david.green at arm.com>
Date: Mon, 18 Aug 2025 15:30:23 +0100
Subject: [PATCH 026/112] [AArch64] Update and cleanup
irtranslator-reductions.ll. NFC
---
.../GlobalISel/irtranslator-reductions.ll | 268 +++++++++---------
1 file changed, 132 insertions(+), 136 deletions(-)
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-reductions.ll b/llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-reductions.ll
index 16762dc4fd3fe..c38e03b41dc06 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-reductions.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-reductions.ll
@@ -1,19 +1,17 @@
; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
; RUN: llc -O0 -mtriple=aarch64-apple-ios -global-isel -disable-expand-reductions -stop-after=irtranslator %s -o - | FileCheck %s
-declare float @llvm.vector.reduce.fadd.v4f32(float, <4 x float>)
-declare double @llvm.vector.reduce.fmul.v4f64(double, <4 x double>)
-
define float @fadd_seq(float %start, <4 x float> %vec) {
; CHECK-LABEL: name: fadd_seq
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q1, $s0
- ; CHECK: [[COPY:%[0-9]+]]:_(s32) = COPY $s0
- ; CHECK: [[COPY1:%[0-9]+]]:_(<2 x s64>) = COPY $q1
- ; CHECK: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY1]](<2 x s64>)
- ; CHECK: [[VECREDUCE_SEQ_FADD:%[0-9]+]]:_(s32) = G_VECREDUCE_SEQ_FADD [[COPY]](s32), [[BITCAST]](<4 x s32>)
- ; CHECK: $s0 = COPY [[VECREDUCE_SEQ_FADD]](s32)
- ; CHECK: RET_ReallyLR implicit $s0
+ ; CHECK-NEXT: liveins: $q1, $s0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $s0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<2 x s64>) = COPY $q1
+ ; CHECK-NEXT: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY1]](<2 x s64>)
+ ; CHECK-NEXT: [[VECREDUCE_SEQ_FADD:%[0-9]+]]:_(s32) = G_VECREDUCE_SEQ_FADD [[COPY]](s32), [[BITCAST]](<4 x s32>)
+ ; CHECK-NEXT: $s0 = COPY [[VECREDUCE_SEQ_FADD]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $s0
%res = call float @llvm.vector.reduce.fadd.v4f32(float %start, <4 x float> %vec)
ret float %res
}
@@ -21,14 +19,15 @@ define float @fadd_seq(float %start, <4 x float> %vec) {
define float @fadd_fast(float %start, <4 x float> %vec) {
; CHECK-LABEL: name: fadd_fast
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q1, $s0
- ; CHECK: [[COPY:%[0-9]+]]:_(s32) = COPY $s0
- ; CHECK: [[COPY1:%[0-9]+]]:_(<2 x s64>) = COPY $q1
- ; CHECK: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY1]](<2 x s64>)
- ; CHECK: [[VECREDUCE_FADD:%[0-9]+]]:_(s32) = reassoc G_VECREDUCE_FADD [[BITCAST]](<4 x s32>)
- ; CHECK: [[FADD:%[0-9]+]]:_(s32) = reassoc G_FADD [[COPY]], [[VECREDUCE_FADD]]
- ; CHECK: $s0 = COPY [[FADD]](s32)
- ; CHECK: RET_ReallyLR implicit $s0
+ ; CHECK-NEXT: liveins: $q1, $s0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $s0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<2 x s64>) = COPY $q1
+ ; CHECK-NEXT: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY1]](<2 x s64>)
+ ; CHECK-NEXT: [[VECREDUCE_FADD:%[0-9]+]]:_(s32) = reassoc G_VECREDUCE_FADD [[BITCAST]](<4 x s32>)
+ ; CHECK-NEXT: [[FADD:%[0-9]+]]:_(s32) = reassoc G_FADD [[COPY]], [[VECREDUCE_FADD]]
+ ; CHECK-NEXT: $s0 = COPY [[FADD]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $s0
%res = call reassoc float @llvm.vector.reduce.fadd.v4f32(float %start, <4 x float> %vec)
ret float %res
}
@@ -36,14 +35,15 @@ define float @fadd_fast(float %start, <4 x float> %vec) {
define double @fmul_seq(double %start, <4 x double> %vec) {
; CHECK-LABEL: name: fmul_seq
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $d0, $q1, $q2
- ; CHECK: [[COPY:%[0-9]+]]:_(s64) = COPY $d0
- ; CHECK: [[COPY1:%[0-9]+]]:_(<2 x s64>) = COPY $q1
- ; CHECK: [[COPY2:%[0-9]+]]:_(<2 x s64>) = COPY $q2
- ; CHECK: [[CONCAT_VECTORS:%[0-9]+]]:_(<4 x s64>) = G_CONCAT_VECTORS [[COPY1]](<2 x s64>), [[COPY2]](<2 x s64>)
- ; CHECK: [[VECREDUCE_SEQ_FMUL:%[0-9]+]]:_(s64) = G_VECREDUCE_SEQ_FMUL [[COPY]](s64), [[CONCAT_VECTORS]](<4 x s64>)
- ; CHECK: $d0 = COPY [[VECREDUCE_SEQ_FMUL]](s64)
- ; CHECK: RET_ReallyLR implicit $d0
+ ; CHECK-NEXT: liveins: $d0, $q1, $q2
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s64) = COPY $d0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<2 x s64>) = COPY $q1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(<2 x s64>) = COPY $q2
+ ; CHECK-NEXT: [[CONCAT_VECTORS:%[0-9]+]]:_(<4 x s64>) = G_CONCAT_VECTORS [[COPY1]](<2 x s64>), [[COPY2]](<2 x s64>)
+ ; CHECK-NEXT: [[VECREDUCE_SEQ_FMUL:%[0-9]+]]:_(s64) = G_VECREDUCE_SEQ_FMUL [[COPY]](s64), [[CONCAT_VECTORS]](<4 x s64>)
+ ; CHECK-NEXT: $d0 = COPY [[VECREDUCE_SEQ_FMUL]](s64)
+ ; CHECK-NEXT: RET_ReallyLR implicit $d0
%res = call double @llvm.vector.reduce.fmul.v4f64(double %start, <4 x double> %vec)
ret double %res
}
@@ -51,33 +51,30 @@ define double @fmul_seq(double %start, <4 x double> %vec) {
define double @fmul_fast(double %start, <4 x double> %vec) {
; CHECK-LABEL: name: fmul_fast
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $d0, $q1, $q2
- ; CHECK: [[COPY:%[0-9]+]]:_(s64) = COPY $d0
- ; CHECK: [[COPY1:%[0-9]+]]:_(<2 x s64>) = COPY $q1
- ; CHECK: [[COPY2:%[0-9]+]]:_(<2 x s64>) = COPY $q2
- ; CHECK: [[CONCAT_VECTORS:%[0-9]+]]:_(<4 x s64>) = G_CONCAT_VECTORS [[COPY1]](<2 x s64>), [[COPY2]](<2 x s64>)
- ; CHECK: [[VECREDUCE_FMUL:%[0-9]+]]:_(s64) = reassoc G_VECREDUCE_FMUL [[CONCAT_VECTORS]](<4 x s64>)
- ; CHECK: [[FMUL:%[0-9]+]]:_(s64) = reassoc G_FMUL [[COPY]], [[VECREDUCE_FMUL]]
- ; CHECK: $d0 = COPY [[FMUL]](s64)
- ; CHECK: RET_ReallyLR implicit $d0
+ ; CHECK-NEXT: liveins: $d0, $q1, $q2
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s64) = COPY $d0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<2 x s64>) = COPY $q1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(<2 x s64>) = COPY $q2
+ ; CHECK-NEXT: [[CONCAT_VECTORS:%[0-9]+]]:_(<4 x s64>) = G_CONCAT_VECTORS [[COPY1]](<2 x s64>), [[COPY2]](<2 x s64>)
+ ; CHECK-NEXT: [[VECREDUCE_FMUL:%[0-9]+]]:_(s64) = reassoc G_VECREDUCE_FMUL [[CONCAT_VECTORS]](<4 x s64>)
+ ; CHECK-NEXT: [[FMUL:%[0-9]+]]:_(s64) = reassoc G_FMUL [[COPY]], [[VECREDUCE_FMUL]]
+ ; CHECK-NEXT: $d0 = COPY [[FMUL]](s64)
+ ; CHECK-NEXT: RET_ReallyLR implicit $d0
%res = call reassoc double @llvm.vector.reduce.fmul.v4f64(double %start, <4 x double> %vec)
ret double %res
}
-declare float @llvm.vector.reduce.fmax.v4f32(<4 x float>)
-declare float @llvm.vector.reduce.fmin.v4f32(<4 x float>)
-declare float @llvm.vector.reduce.fmaximum.v4f32(<4 x float>)
-declare float @llvm.vector.reduce.fminimum.v4f32(<4 x float>)
-
define float @fmax(<4 x float> %vec) {
; CHECK-LABEL: name: fmax
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
- ; CHECK: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
- ; CHECK: [[VECREDUCE_FMAX:%[0-9]+]]:_(s32) = G_VECREDUCE_FMAX [[BITCAST]](<4 x s32>)
- ; CHECK: $s0 = COPY [[VECREDUCE_FMAX]](s32)
- ; CHECK: RET_ReallyLR implicit $s0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
+ ; CHECK-NEXT: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
+ ; CHECK-NEXT: [[VECREDUCE_FMAX:%[0-9]+]]:_(s32) = G_VECREDUCE_FMAX [[BITCAST]](<4 x s32>)
+ ; CHECK-NEXT: $s0 = COPY [[VECREDUCE_FMAX]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $s0
%res = call float @llvm.vector.reduce.fmax.v4f32(<4 x float> %vec)
ret float %res
}
@@ -85,12 +82,13 @@ define float @fmax(<4 x float> %vec) {
define float @fmin(<4 x float> %vec) {
; CHECK-LABEL: name: fmin
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
- ; CHECK: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
- ; CHECK: [[VECREDUCE_FMIN:%[0-9]+]]:_(s32) = G_VECREDUCE_FMIN [[BITCAST]](<4 x s32>)
- ; CHECK: $s0 = COPY [[VECREDUCE_FMIN]](s32)
- ; CHECK: RET_ReallyLR implicit $s0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
+ ; CHECK-NEXT: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
+ ; CHECK-NEXT: [[VECREDUCE_FMIN:%[0-9]+]]:_(s32) = G_VECREDUCE_FMIN [[BITCAST]](<4 x s32>)
+ ; CHECK-NEXT: $s0 = COPY [[VECREDUCE_FMIN]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $s0
%res = call float @llvm.vector.reduce.fmin.v4f32(<4 x float> %vec)
ret float %res
}
@@ -98,12 +96,13 @@ define float @fmin(<4 x float> %vec) {
define float @fmin_nnan(<4 x float> %vec) {
; CHECK-LABEL: name: fmin_nnan
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
- ; CHECK: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
- ; CHECK: [[VECREDUCE_FMIN:%[0-9]+]]:_(s32) = nnan G_VECREDUCE_FMIN [[BITCAST]](<4 x s32>)
- ; CHECK: $s0 = COPY [[VECREDUCE_FMIN]](s32)
- ; CHECK: RET_ReallyLR implicit $s0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
+ ; CHECK-NEXT: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
+ ; CHECK-NEXT: [[VECREDUCE_FMIN:%[0-9]+]]:_(s32) = nnan G_VECREDUCE_FMIN [[BITCAST]](<4 x s32>)
+ ; CHECK-NEXT: $s0 = COPY [[VECREDUCE_FMIN]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $s0
%res = call nnan float @llvm.vector.reduce.fmin.v4f32(<4 x float> %vec)
ret float %res
}
@@ -111,12 +110,13 @@ define float @fmin_nnan(<4 x float> %vec) {
define float @fmaximum(<4 x float> %vec) {
; CHECK-LABEL: name: fmaximum
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
- ; CHECK: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
- ; CHECK: [[VECREDUCE_FMAX:%[0-9]+]]:_(s32) = G_VECREDUCE_FMAXIMUM [[BITCAST]](<4 x s32>)
- ; CHECK: $s0 = COPY [[VECREDUCE_FMAX]](s32)
- ; CHECK: RET_ReallyLR implicit $s0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
+ ; CHECK-NEXT: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
+ ; CHECK-NEXT: [[VECREDUCE_FMAXIMUM:%[0-9]+]]:_(s32) = G_VECREDUCE_FMAXIMUM [[BITCAST]](<4 x s32>)
+ ; CHECK-NEXT: $s0 = COPY [[VECREDUCE_FMAXIMUM]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $s0
%res = call float @llvm.vector.reduce.fmaximum.v4f32(<4 x float> %vec)
ret float %res
}
@@ -124,12 +124,13 @@ define float @fmaximum(<4 x float> %vec) {
define float @fminimum(<4 x float> %vec) {
; CHECK-LABEL: name: fminimum
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
- ; CHECK: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
- ; CHECK: [[VECREDUCE_FMIN:%[0-9]+]]:_(s32) = G_VECREDUCE_FMINIMUM [[BITCAST]](<4 x s32>)
- ; CHECK: $s0 = COPY [[VECREDUCE_FMIN]](s32)
- ; CHECK: RET_ReallyLR implicit $s0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
+ ; CHECK-NEXT: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
+ ; CHECK-NEXT: [[VECREDUCE_FMINIMUM:%[0-9]+]]:_(s32) = G_VECREDUCE_FMINIMUM [[BITCAST]](<4 x s32>)
+ ; CHECK-NEXT: $s0 = COPY [[VECREDUCE_FMINIMUM]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $s0
%res = call float @llvm.vector.reduce.fminimum.v4f32(<4 x float> %vec)
ret float %res
}
@@ -137,99 +138,91 @@ define float @fminimum(<4 x float> %vec) {
define float @fminimum_nnan(<4 x float> %vec) {
; CHECK-LABEL: name: fminimum_nnan
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
- ; CHECK: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
- ; CHECK: [[VECREDUCE_FMIN:%[0-9]+]]:_(s32) = nnan G_VECREDUCE_FMINIMUM [[BITCAST]](<4 x s32>)
- ; CHECK: $s0 = COPY [[VECREDUCE_FMIN]](s32)
- ; CHECK: RET_ReallyLR implicit $s0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<2 x s64>) = COPY $q0
+ ; CHECK-NEXT: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[COPY]](<2 x s64>)
+ ; CHECK-NEXT: [[VECREDUCE_FMINIMUM:%[0-9]+]]:_(s32) = nnan G_VECREDUCE_FMINIMUM [[BITCAST]](<4 x s32>)
+ ; CHECK-NEXT: $s0 = COPY [[VECREDUCE_FMINIMUM]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $s0
%res = call nnan float @llvm.vector.reduce.fminimum.v4f32(<4 x float> %vec)
ret float %res
}
-declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)
-
define i32 @add(<4 x i32> %vec) {
; CHECK-LABEL: name: add
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
- ; CHECK: [[VECREDUCE_ADD:%[0-9]+]]:_(s32) = G_VECREDUCE_ADD [[COPY]](<4 x s32>)
- ; CHECK: $w0 = COPY [[VECREDUCE_ADD]](s32)
- ; CHECK: RET_ReallyLR implicit $w0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[VECREDUCE_ADD:%[0-9]+]]:_(s32) = G_VECREDUCE_ADD [[COPY]](<4 x s32>)
+ ; CHECK-NEXT: $w0 = COPY [[VECREDUCE_ADD]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $w0
%res = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %vec)
ret i32 %res
}
-declare i32 @llvm.vector.reduce.mul.v4i32(<4 x i32>)
-
define i32 @mul(<4 x i32> %vec) {
; CHECK-LABEL: name: mul
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
- ; CHECK: [[VECREDUCE_MUL:%[0-9]+]]:_(s32) = G_VECREDUCE_MUL [[COPY]](<4 x s32>)
- ; CHECK: $w0 = COPY [[VECREDUCE_MUL]](s32)
- ; CHECK: RET_ReallyLR implicit $w0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[VECREDUCE_MUL:%[0-9]+]]:_(s32) = G_VECREDUCE_MUL [[COPY]](<4 x s32>)
+ ; CHECK-NEXT: $w0 = COPY [[VECREDUCE_MUL]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $w0
%res = call i32 @llvm.vector.reduce.mul.v4i32(<4 x i32> %vec)
ret i32 %res
}
-declare i32 @llvm.vector.reduce.and.v4i32(<4 x i32>)
-
define i32 @and(<4 x i32> %vec) {
; CHECK-LABEL: name: and
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
- ; CHECK: [[VECREDUCE_AND:%[0-9]+]]:_(s32) = G_VECREDUCE_AND [[COPY]](<4 x s32>)
- ; CHECK: $w0 = COPY [[VECREDUCE_AND]](s32)
- ; CHECK: RET_ReallyLR implicit $w0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[VECREDUCE_AND:%[0-9]+]]:_(s32) = G_VECREDUCE_AND [[COPY]](<4 x s32>)
+ ; CHECK-NEXT: $w0 = COPY [[VECREDUCE_AND]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $w0
%res = call i32 @llvm.vector.reduce.and.v4i32(<4 x i32> %vec)
ret i32 %res
}
-declare i32 @llvm.vector.reduce.or.v4i32(<4 x i32>)
-
define i32 @or(<4 x i32> %vec) {
; CHECK-LABEL: name: or
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
- ; CHECK: [[VECREDUCE_OR:%[0-9]+]]:_(s32) = G_VECREDUCE_OR [[COPY]](<4 x s32>)
- ; CHECK: $w0 = COPY [[VECREDUCE_OR]](s32)
- ; CHECK: RET_ReallyLR implicit $w0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[VECREDUCE_OR:%[0-9]+]]:_(s32) = G_VECREDUCE_OR [[COPY]](<4 x s32>)
+ ; CHECK-NEXT: $w0 = COPY [[VECREDUCE_OR]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $w0
%res = call i32 @llvm.vector.reduce.or.v4i32(<4 x i32> %vec)
ret i32 %res
}
-declare i32 @llvm.vector.reduce.xor.v4i32(<4 x i32>)
-
define i32 @xor(<4 x i32> %vec) {
; CHECK-LABEL: name: xor
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
- ; CHECK: [[VECREDUCE_XOR:%[0-9]+]]:_(s32) = G_VECREDUCE_XOR [[COPY]](<4 x s32>)
- ; CHECK: $w0 = COPY [[VECREDUCE_XOR]](s32)
- ; CHECK: RET_ReallyLR implicit $w0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[VECREDUCE_XOR:%[0-9]+]]:_(s32) = G_VECREDUCE_XOR [[COPY]](<4 x s32>)
+ ; CHECK-NEXT: $w0 = COPY [[VECREDUCE_XOR]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $w0
%res = call i32 @llvm.vector.reduce.xor.v4i32(<4 x i32> %vec)
ret i32 %res
}
-declare i32 @llvm.vector.reduce.smax.v4i32(<4 x i32>)
-declare i32 @llvm.vector.reduce.smin.v4i32(<4 x i32>)
-declare i32 @llvm.vector.reduce.umax.v4i32(<4 x i32>)
-declare i32 @llvm.vector.reduce.umin.v4i32(<4 x i32>)
-
define i32 @smax(<4 x i32> %vec) {
; CHECK-LABEL: name: smax
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
- ; CHECK: [[VECREDUCE_SMAX:%[0-9]+]]:_(s32) = G_VECREDUCE_SMAX [[COPY]](<4 x s32>)
- ; CHECK: $w0 = COPY [[VECREDUCE_SMAX]](s32)
- ; CHECK: RET_ReallyLR implicit $w0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[VECREDUCE_SMAX:%[0-9]+]]:_(s32) = G_VECREDUCE_SMAX [[COPY]](<4 x s32>)
+ ; CHECK-NEXT: $w0 = COPY [[VECREDUCE_SMAX]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $w0
%res = call i32 @llvm.vector.reduce.smax.v4i32(<4 x i32> %vec)
ret i32 %res
}
@@ -237,11 +230,12 @@ define i32 @smax(<4 x i32> %vec) {
define i32 @smin(<4 x i32> %vec) {
; CHECK-LABEL: name: smin
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
- ; CHECK: [[VECREDUCE_SMIN:%[0-9]+]]:_(s32) = G_VECREDUCE_SMIN [[COPY]](<4 x s32>)
- ; CHECK: $w0 = COPY [[VECREDUCE_SMIN]](s32)
- ; CHECK: RET_ReallyLR implicit $w0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[VECREDUCE_SMIN:%[0-9]+]]:_(s32) = G_VECREDUCE_SMIN [[COPY]](<4 x s32>)
+ ; CHECK-NEXT: $w0 = COPY [[VECREDUCE_SMIN]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $w0
%res = call i32 @llvm.vector.reduce.smin.v4i32(<4 x i32> %vec)
ret i32 %res
}
@@ -249,11 +243,12 @@ define i32 @smin(<4 x i32> %vec) {
define i32 @umax(<4 x i32> %vec) {
; CHECK-LABEL: name: umax
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
- ; CHECK: [[VECREDUCE_UMAX:%[0-9]+]]:_(s32) = G_VECREDUCE_UMAX [[COPY]](<4 x s32>)
- ; CHECK: $w0 = COPY [[VECREDUCE_UMAX]](s32)
- ; CHECK: RET_ReallyLR implicit $w0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[VECREDUCE_UMAX:%[0-9]+]]:_(s32) = G_VECREDUCE_UMAX [[COPY]](<4 x s32>)
+ ; CHECK-NEXT: $w0 = COPY [[VECREDUCE_UMAX]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $w0
%res = call i32 @llvm.vector.reduce.umax.v4i32(<4 x i32> %vec)
ret i32 %res
}
@@ -261,11 +256,12 @@ define i32 @umax(<4 x i32> %vec) {
define i32 @umin(<4 x i32> %vec) {
; CHECK-LABEL: name: umin
; CHECK: bb.1 (%ir-block.0):
- ; CHECK: liveins: $q0
- ; CHECK: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
- ; CHECK: [[VECREDUCE_UMIN:%[0-9]+]]:_(s32) = G_VECREDUCE_UMIN [[COPY]](<4 x s32>)
- ; CHECK: $w0 = COPY [[VECREDUCE_UMIN]](s32)
- ; CHECK: RET_ReallyLR implicit $w0
+ ; CHECK-NEXT: liveins: $q0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[VECREDUCE_UMIN:%[0-9]+]]:_(s32) = G_VECREDUCE_UMIN [[COPY]](<4 x s32>)
+ ; CHECK-NEXT: $w0 = COPY [[VECREDUCE_UMIN]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $w0
%res = call i32 @llvm.vector.reduce.umin.v4i32(<4 x i32> %vec)
ret i32 %res
}
>From 8f0da9b8bd342f200a8b97cb19c2ca1588175299 Mon Sep 17 00:00:00 2001
From: Timm Baeder <tbaeder at redhat.com>
Date: Mon, 18 Aug 2025 16:32:50 +0200
Subject: [PATCH 027/112] [clang][bytecode] Disable EndLifetime op for array
elements (#154119)
This breaks a ton of libc++ tests otherwise, since calling
std::destroy_at will currently end the lifetime of the entire array, not
just the given element.
See https://github.com/llvm/llvm-project/issues/147528
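
As a rough illustration of the limitation described above, here is a toy
model (not Clang's actual data structures; BlockLifetime and
ElementLifetime are hypothetical names) contrasting a single lifetime flag
for the whole array with one flag per element:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical toy model, not the bytecode interpreter's real representation.
// One flag covers the whole array, so ending the lifetime of any element
// makes every element look dead.
struct BlockLifetime {
  bool Alive = true;
  void endElement(std::size_t /*Index*/) { Alive = false; }
  bool isAlive(std::size_t /*Index*/) const { return Alive; }
};

// One flag per element: only the destroyed element is affected.
template <std::size_t N> struct ElementLifetime {
  std::array<bool, N> Alive;
  ElementLifetime() { Alive.fill(true); }
  void endElement(std::size_t Index) { Alive[Index] = false; }
  bool isAlive(std::size_t Index) const { return Alive[Index]; }
};
```

With the single-flag model, ending the lifetime of element 3 also makes
element 1 unreadable, which is the behaviour this patch sidesteps by
treating EndLifetime on array elements as a no-op for now.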
---
clang/lib/AST/ByteCode/Interp.cpp | 10 ++++++++
clang/test/AST/ByteCode/builtin-functions.cpp | 10 ++++----
clang/test/AST/ByteCode/lifetimes26.cpp | 23 ++++++++++++++++---
3 files changed, 36 insertions(+), 7 deletions(-)
diff --git a/clang/lib/AST/ByteCode/Interp.cpp b/clang/lib/AST/ByteCode/Interp.cpp
index 931d3879f0ff8..aeab9ff381711 100644
--- a/clang/lib/AST/ByteCode/Interp.cpp
+++ b/clang/lib/AST/ByteCode/Interp.cpp
@@ -1852,6 +1852,11 @@ bool EndLifetime(InterpState &S, CodePtr OpPC) {
const auto &Ptr = S.Stk.peek<Pointer>();
if (Ptr.isBlockPointer() && !CheckDummy(S, OpPC, Ptr.block(), AK_Destroy))
return false;
+
+ // FIXME: We need per-element lifetime information for primitive arrays.
+ if (Ptr.isArrayElement())
+ return true;
+
endLifetimeRecurse(Ptr.narrow());
return true;
}
@@ -1861,6 +1866,11 @@ bool EndLifetimePop(InterpState &S, CodePtr OpPC) {
const auto &Ptr = S.Stk.pop<Pointer>();
if (Ptr.isBlockPointer() && !CheckDummy(S, OpPC, Ptr.block(), AK_Destroy))
return false;
+
+ // FIXME: We need per-element lifetime information for primitive arrays.
+ if (Ptr.isArrayElement())
+ return true;
+
endLifetimeRecurse(Ptr.narrow());
return true;
}
diff --git a/clang/test/AST/ByteCode/builtin-functions.cpp b/clang/test/AST/ByteCode/builtin-functions.cpp
index 878c0d1a40f26..3277ef65a880b 100644
--- a/clang/test/AST/ByteCode/builtin-functions.cpp
+++ b/clang/test/AST/ByteCode/builtin-functions.cpp
@@ -1789,9 +1789,11 @@ namespace WithinLifetime {
} xstd; // both-error {{is not a constant expression}} \
// both-note {{in call to}}
+ /// FIXME: We do not have per-element lifetime information for primitive arrays.
+ /// See https://github.com/llvm/llvm-project/issues/147528
consteval bool test_dynamic(bool read_after_deallocate) {
std::allocator<int> a;
- int* p = a.allocate(1);
+ int* p = a.allocate(1); // expected-note 2{{allocation performed here was not deallocated}}
// a.allocate starts the lifetime of an array,
// the complete object of *p has started its lifetime
if (__builtin_is_within_lifetime(p))
@@ -1804,12 +1806,12 @@ namespace WithinLifetime {
return false;
a.deallocate(p, 1);
if (read_after_deallocate)
- __builtin_is_within_lifetime(p); // both-note {{read of heap allocated object that has been deleted}}
+ __builtin_is_within_lifetime(p); // ref-note {{read of heap allocated object that has been deleted}}
return true;
}
- static_assert(test_dynamic(false));
+ static_assert(test_dynamic(false)); // expected-error {{not an integral constant expression}}
static_assert(test_dynamic(true)); // both-error {{not an integral constant expression}} \
- // both-note {{in call to}}
+ // ref-note {{in call to}}
}
#ifdef __SIZEOF_INT128__
diff --git a/clang/test/AST/ByteCode/lifetimes26.cpp b/clang/test/AST/ByteCode/lifetimes26.cpp
index a5203ae77bc13..c3163f8a562bf 100644
--- a/clang/test/AST/ByteCode/lifetimes26.cpp
+++ b/clang/test/AST/ByteCode/lifetimes26.cpp
@@ -17,8 +17,8 @@ namespace std {
constexpr void *operator new(std::size_t, void *p) { return p; }
namespace std {
- template<typename T> constexpr T *construct(T *p) { return new (p) T; }
- template<typename T> constexpr void destroy(T *p) { p->~T(); }
+ template<typename T> constexpr T *construct_at(T *p) { return new (p) T; }
+ template<typename T> constexpr void destroy_at(T *p) { p->~T(); }
}
constexpr bool foo() {
@@ -43,7 +43,24 @@ constexpr void destroy_pointer() {
using T = int*;
T p;
p.~T();
- std::construct(&p);
+ std::construct_at(&p);
}
static_assert((destroy_pointer(), true));
+
+namespace DestroyArrayElem {
+ /// This is proof that std::destroy_at'ing an array element
+ /// ends the lifetime of the entire array.
+ /// See https://github.com/llvm/llvm-project/issues/147528
+ /// Using destroy_at on array elements is currently a no-op due to this.
+ constexpr int test() {
+ int a[4] = {};
+
+ std::destroy_at(&a[3]);
+ int r = a[1];
+ std::construct_at(&a[3]);
+
+ return r;
+ }
+ static_assert(test() == 0);
+}
>From 8fc80519cdb97c7ad762c750e3e59c622b181599 Mon Sep 17 00:00:00 2001
From: erichkeane <ekeane at nvidia.com>
Date: Mon, 18 Aug 2025 07:28:39 -0700
Subject: [PATCH 028/112] [OpenACC] Fix crash on error recovery of variable in
OpenACC mode
As reported, OpenACC's variable declaration handling was assuming some
semblance of legality in the example, so it didn't properly handle an
error case. This patch fixes its assumptions so that we don't crash.
Fixes #154008
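
The shape of the fix, sketched with standard C++ only: the types below are
hypothetical stand-ins for the DeclContext hierarchy, and dynamic_cast
plays the role that dyn_cast plays in the real patch. The point is that a
checked cast can return null, and the caller bails out instead of assuming
a function context.

```cpp
#include <cassert>

// Hypothetical stand-ins for Clang's declaration contexts.
struct Context {
  virtual ~Context() = default;
};
struct FunctionContext : Context {};
struct BlockContext : Context {};

// Returns true if the static-local check could run, false if we gave up
// because the enclosing context is not a function.
bool checkStaticLocal(const Context *Cur) {
  const auto *Fn = dynamic_cast<const FunctionContext *>(Cur);
  if (!Fn)
    return false; // e.g. a block context reached during error recovery
  return true;
}
```

An unchecked cast in the same position would have no way to report the
mismatch, which is roughly what the crash in the bug report amounts to.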
---
clang/lib/Sema/SemaOpenACC.cpp | 9 +++++++--
clang/test/SemaOpenACC/gh154008.cpp | 3 +++
2 files changed, 10 insertions(+), 2 deletions(-)
create mode 100644 clang/test/SemaOpenACC/gh154008.cpp
diff --git a/clang/lib/Sema/SemaOpenACC.cpp b/clang/lib/Sema/SemaOpenACC.cpp
index 3f870ba528ad0..07713992da352 100644
--- a/clang/lib/Sema/SemaOpenACC.cpp
+++ b/clang/lib/Sema/SemaOpenACC.cpp
@@ -1921,8 +1921,13 @@ void SemaOpenACC::ActOnVariableDeclarator(VarDecl *VD) {
return;
// This cast should be safe, since a static-local can only happen in a
- // function declaration.
- auto *ContextDecl = cast<FunctionDecl>(getCurContext());
+ // function declaration. However, in error cases (or perhaps ObjC/C++?), this
+ // could possibly be something like a 'block' decl, so if this is NOT a
+ // function decl, just give up.
+ auto *ContextDecl = dyn_cast<FunctionDecl>(getCurContext());
+
+ if (!ContextDecl)
+ return;
// OpenACC 3.3 2.15:
// In C and C++, function static variables are not supported in functions to
diff --git a/clang/test/SemaOpenACC/gh154008.cpp b/clang/test/SemaOpenACC/gh154008.cpp
new file mode 100644
index 0000000000000..653f0f7839c02
--- /dev/null
+++ b/clang/test/SemaOpenACC/gh154008.cpp
@@ -0,0 +1,3 @@
+// RUN: %clang_cc1 %s -fopenacc -verify
+
+void *a = ^ { static int b };
>From 98e8f01d183177a4f54187c23183da50a7cf6daf Mon Sep 17 00:00:00 2001
From: Craig Topper <craig.topper at sifive.com>
Date: Mon, 18 Aug 2025 07:38:10 -0700
Subject: [PATCH 029/112] [RISCV] Rename MIPS_PREFETCH->MIPS_PREF. NFC
(#154062)
This matches the instruction's assembler mnemonic.
---
llvm/lib/Target/RISCV/RISCVInstrInfoXMips.td | 10 +++++-----
llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp | 2 +-
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoXMips.td b/llvm/lib/Target/RISCV/RISCVInstrInfoXMips.td
index 0c8487c2f5dbe..889ea98022572 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoXMips.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoXMips.td
@@ -129,20 +129,20 @@ class Mips_prefetch_ri<dag outs, dag ins, string opcodestr, string argstr>
// MIPS extensions
//===----------------------------------------------------------------------===//
let Predicates = [HasVendorXMIPSCBOP] ,DecoderNamespace = "Xmipscbop" in {
- def MIPS_PREFETCH : Mips_prefetch_ri<(outs), (ins GPR:$rs1, uimm9:$imm9, uimm5:$hint),
- "mips.pref", "$hint, ${imm9}(${rs1})">,
- Sched<[]>;
+ def MIPS_PREF : Mips_prefetch_ri<(outs), (ins GPR:$rs1, uimm9:$imm9, uimm5:$hint),
+ "mips.pref", "$hint, ${imm9}(${rs1})">,
+ Sched<[]>;
}
let Predicates = [HasVendorXMIPSCBOP] in {
// Prefetch Data Write.
def : Pat<(prefetch (AddrRegImm9 (XLenVT GPR:$rs1), uimm9:$imm9),
(i32 1), timm, (i32 1)),
- (MIPS_PREFETCH GPR:$rs1, uimm9:$imm9, 9)>;
+ (MIPS_PREF GPR:$rs1, uimm9:$imm9, 9)>;
// Prefetch Data Read.
def : Pat<(prefetch (AddrRegImm9 (XLenVT GPR:$rs1), uimm9:$imm9),
(i32 0), timm, (i32 1)),
- (MIPS_PREFETCH GPR:$rs1, uimm9:$imm9, 8)>;
+ (MIPS_PREF GPR:$rs1, uimm9:$imm9, 8)>;
}
let Predicates = [HasVendorXMIPSCMov], hasSideEffects = 0, mayLoad = 0, mayStore = 0,
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp b/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
index 7e58b6f342689..8a3c8e2a1c1cf 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
@@ -589,7 +589,7 @@ bool RISCVRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
(Lo12 & 0b11111) != 0) {
// Prefetch instructions require the offset to be 32 byte aligned.
MI.getOperand(FIOperandNum + 1).ChangeToImmediate(0);
- } else if (Opc == RISCV::MIPS_PREFETCH && !isUInt<9>(Val)) {
+ } else if (Opc == RISCV::MIPS_PREF && !isUInt<9>(Val)) {
// MIPS Prefetch instructions require the offset to be 9 bits encoded.
MI.getOperand(FIOperandNum + 1).ChangeToImmediate(0);
} else if ((Opc == RISCV::PseudoRV32ZdinxLD ||
>From ec227050e3f94bb0c40b456c4207797459de8c42 Mon Sep 17 00:00:00 2001
From: erichkeane <ekeane at nvidia.com>
Date: Mon, 18 Aug 2025 07:48:25 -0700
Subject: [PATCH 030/112] [OpenACC] Fix verify lines from 8fc80519cdb97c
Like a big dummy, I completely skipped running this test locally and
forgot it would need check lines. *sigh*, Looks like SOMEONE has a case
of the Mondays!
Anyway, this patch fixes it by adding the proper verify lines.
---
clang/test/SemaOpenACC/gh154008.cpp | 2 ++
1 file changed, 2 insertions(+)
diff --git a/clang/test/SemaOpenACC/gh154008.cpp b/clang/test/SemaOpenACC/gh154008.cpp
index 653f0f7839c02..1ec114c000b3f 100644
--- a/clang/test/SemaOpenACC/gh154008.cpp
+++ b/clang/test/SemaOpenACC/gh154008.cpp
@@ -1,3 +1,5 @@
// RUN: %clang_cc1 %s -fopenacc -verify
+// expected-error at +2{{expected ';'}}
+// expected-error at +1{{blocks support disabled}}
void *a = ^ { static int b };
>From ad064bc5c384ea61a978af8d1d20d6cca7edc86a Mon Sep 17 00:00:00 2001
From: LLVM GN Syncbot <llvmgnsyncbot at gmail.com>
Date: Mon, 18 Aug 2025 14:52:18 +0000
Subject: [PATCH 031/112] [gn build] Port a0f325bd41c9
---
.../gn/secondary/clang-tools-extra/clang-tidy/misc/BUILD.gn | 1 +
1 file changed, 1 insertion(+)
diff --git a/llvm/utils/gn/secondary/clang-tools-extra/clang-tidy/misc/BUILD.gn b/llvm/utils/gn/secondary/clang-tools-extra/clang-tidy/misc/BUILD.gn
index 0dc5efc981c87..a6848b3c9f241 100644
--- a/llvm/utils/gn/secondary/clang-tools-extra/clang-tidy/misc/BUILD.gn
+++ b/llvm/utils/gn/secondary/clang-tools-extra/clang-tidy/misc/BUILD.gn
@@ -46,6 +46,7 @@ static_library("misc") {
"NoRecursionCheck.cpp",
"NonCopyableObjects.cpp",
"NonPrivateMemberVariablesInClassesCheck.cpp",
+ "OverrideWithDifferentVisibilityCheck.cpp",
"RedundantExpressionCheck.cpp",
"StaticAssertCheck.cpp",
"ThrowByValueCatchByReferenceCheck.cpp",
>From f4b5c24022ca5805eeafaaeb417a35a8b6d6c03d Mon Sep 17 00:00:00 2001
From: LLVM GN Syncbot <llvmgnsyncbot at gmail.com>
Date: Mon, 18 Aug 2025 14:52:19 +0000
Subject: [PATCH 032/112] [gn build] Port e6e874ce8f05
---
llvm/utils/gn/secondary/clang/unittests/Lex/BUILD.gn | 1 +
1 file changed, 1 insertion(+)
diff --git a/llvm/utils/gn/secondary/clang/unittests/Lex/BUILD.gn b/llvm/utils/gn/secondary/clang/unittests/Lex/BUILD.gn
index 16abe7a6e95e4..a0f72494a2bd9 100644
--- a/llvm/utils/gn/secondary/clang/unittests/Lex/BUILD.gn
+++ b/llvm/utils/gn/secondary/clang/unittests/Lex/BUILD.gn
@@ -20,6 +20,7 @@ unittest("LexTests") {
"LexHLSLRootSignatureTest.cpp",
"LexerTest.cpp",
"ModuleDeclStateTest.cpp",
+ "NoTrivialPPDirectiveTracerTest.cpp",
"PPCallbacksTest.cpp",
"PPConditionalDirectiveRecordTest.cpp",
"PPDependencyDirectivesTest.cpp",
From 03912a1de59876011387de9ac5ec968c58018da0 Mon Sep 17 00:00:00 2001
From: David Green <david.green at arm.com>
Date: Mon, 18 Aug 2025 15:59:44 +0100
Subject: [PATCH 033/112] [GlobalISel] Translate scalar sequential
vecreduce.fadd/fmul as fadd/fmul. (#153966)
An llvm.vector.reduce.fadd(float, <1 x float>) is translated to
G_VECREDUCE_SEQ_FADD with two scalar operands, which is illegal
according to the verifier. This patch makes sure we generate a
G_FADD/G_FMUL instead when the vector source is not a vector type.
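
As a rough illustration of the selection logic described above, here is a
minimal standalone sketch. The enums and the function name are hypothetical
stand-ins, not LLVM's actual types; the srcIsVector parameter plays the role
of the MRI->getType(VecSrc).isVector() check in the patch.

```cpp
#include <cassert>

// Hypothetical stand-ins for the intrinsic ID and the generic opcodes.
enum class ReduceKind { FAdd, FMul };
enum class Opcode { VecReduceSeqFAdd, VecReduceSeqFMul, FAdd, FMul };

// Choose the opcode for a sequential (ordered) floating-point reduction.
// When the "vector" operand is really a scalar (e.g. <1 x float>), fall
// back to the plain binary operation instead of the reduction opcode.
Opcode selectSeqReduceOpcode(ReduceKind kind, bool srcIsVector) {
  if (srcIsVector)
    return kind == ReduceKind::FAdd ? Opcode::VecReduceSeqFAdd
                                    : Opcode::VecReduceSeqFMul;
  return kind == ReduceKind::FAdd ? Opcode::FAdd : Opcode::FMul;
}
```

The operands stay the same in both cases (start value, then source), which
is why the patch only needs to swap the opcode.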
---
llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp | 3 ++
.../GlobalISel/irtranslator-reductions.ll | 29 +++++++++++++++++++
2 files changed, 32 insertions(+)
diff --git a/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp b/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
index 7ca02ad756f51..8424a8108d76e 100644
--- a/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
@@ -2522,6 +2522,9 @@ bool IRTranslator::translateKnownIntrinsic(const CallInst &CI, Intrinsic::ID ID,
Opc = ID == Intrinsic::vector_reduce_fadd
? TargetOpcode::G_VECREDUCE_SEQ_FADD
: TargetOpcode::G_VECREDUCE_SEQ_FMUL;
+ if (!MRI->getType(VecSrc).isVector())
+ Opc = ID == Intrinsic::vector_reduce_fadd ? TargetOpcode::G_FADD
+ : TargetOpcode::G_FMUL;
MIRBuilder.buildInstr(Opc, {Dst}, {ScalarSrc, VecSrc},
MachineInstr::copyFlagsFromInstruction(CI));
return true;
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-reductions.ll b/llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-reductions.ll
index c38e03b41dc06..c791e35946f72 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-reductions.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-reductions.ll
@@ -16,6 +16,21 @@ define float @fadd_seq(float %start, <4 x float> %vec) {
ret float %res
}
+define float @fadd_seq_scalar(float %start, <1 x float> %vec) {
+ ; CHECK-LABEL: name: fadd_seq_scalar
+ ; CHECK: bb.1 (%ir-block.0):
+ ; CHECK-NEXT: liveins: $d1, $s0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $s0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<2 x s32>) = COPY $d1
+ ; CHECK-NEXT: [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[COPY1]](<2 x s32>)
+ ; CHECK-NEXT: [[FADD:%[0-9]+]]:_(s32) = G_FADD [[COPY]], [[UV]]
+ ; CHECK-NEXT: $s0 = COPY [[FADD]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit $s0
+ %res = call float @llvm.vector.reduce.fadd.v1f32(float %start, <1 x float> %vec)
+ ret float %res
+}
+
define float @fadd_fast(float %start, <4 x float> %vec) {
; CHECK-LABEL: name: fadd_fast
; CHECK: bb.1 (%ir-block.0):
@@ -48,6 +63,20 @@ define double @fmul_seq(double %start, <4 x double> %vec) {
ret double %res
}
+define double @fmul_seq_scalar(double %start, <1 x double> %vec) {
+ ; CHECK-LABEL: name: fmul_seq_scalar
+ ; CHECK: bb.1 (%ir-block.0):
+ ; CHECK-NEXT: liveins: $d0, $d1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s64) = COPY $d0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY $d1
+ ; CHECK-NEXT: [[FMUL:%[0-9]+]]:_(s64) = G_FMUL [[COPY]], [[COPY1]]
+ ; CHECK-NEXT: $d0 = COPY [[FMUL]](s64)
+ ; CHECK-NEXT: RET_ReallyLR implicit $d0
+ %res = call double @llvm.vector.reduce.fmul.v1f64(double %start, <1 x double> %vec)
+ ret double %res
+}
+
define double @fmul_fast(double %start, <4 x double> %vec) {
; CHECK-LABEL: name: fmul_fast
; CHECK: bb.1 (%ir-block.0):
From 7c53c6162bd43d952546a3ef7d019babd5244c29 Mon Sep 17 00:00:00 2001
From: Brox Chen <guochen2 at amd.com>
Date: Mon, 18 Aug 2025 11:01:57 -0400
Subject: [PATCH 034/112] [AMDGPU][True16][CodeGen] use vgpr16 for zext
patterns (#153894)
Update true16 mode so that zext patterns use vgpr16 for 16-bit data types.
This stops isel from inserting an invalid "vgpr32 = copy vgpr16".
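
To make the intent concrete, the following sketch models the two ways of
zero-extending a 16-bit value to 32 bits as plain integer arithmetic. It is
only an analogy under that assumption: the function names are hypothetical
and nothing here is AMDGPU-specific. The first function mirrors the AND with
0xffff used by the non-true16 patterns; the second mirrors building the
32-bit value from the source as the low half and an explicit zero as the
high half.

```cpp
#include <cassert>
#include <cstdint>

// Mask-based form: the 16-bit value lives in the low half of a 32-bit
// register whose high half may hold garbage, so clear it with an AND.
uint32_t zextViaAnd(uint32_t reg) { return reg & 0xffffu; }

// Register-sequence form: compose the result from two 16-bit halves,
// the source as the low half and a constant zero as the high half.
uint32_t zextViaHalves(uint16_t lo) {
  const uint16_t hi = 0;
  return (static_cast<uint32_t>(hi) << 16) | lo;
}
```

Both forms give the same 32-bit result; the second never treats a 16-bit
register as a 32-bit one, which is the property the true16 patterns need.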
---
llvm/lib/Target/AMDGPU/SIInstructions.td | 22 +
llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll | 2 +-
.../CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll | 11901 ++++++++--------
.../CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll | 1148 +-
.../CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll | 1320 +-
.../CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll | 2886 ++--
.../CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll | 240 +-
.../CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll | 5414 ++++---
.../CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll | 637 +-
.../CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll | 594 +-
.../AMDGPU/amdgpu-llvm-debuginfo-analyzer.ll | 1 +
.../atomic_optimizations_global_pointer.ll | 64 +-
llvm/test/CodeGen/AMDGPU/bf16.ll | 14 +-
.../buffer-fat-pointer-atomicrmw-fadd.ll | 42 +-
.../buffer-fat-pointer-atomicrmw-fmax.ll | 42 +-
.../buffer-fat-pointer-atomicrmw-fmin.ll | 42 +-
.../CodeGen/AMDGPU/calling-conventions.ll | 100 +-
llvm/test/CodeGen/AMDGPU/clamp-modifier.ll | 4 +-
llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll | 42 +-
.../test/CodeGen/AMDGPU/dynamic_stackalloc.ll | 5 +-
.../CodeGen/AMDGPU/flat-atomicrmw-fadd.ll | 106 +-
.../CodeGen/AMDGPU/flat-atomicrmw-fmax.ll | 110 +-
.../CodeGen/AMDGPU/flat-atomicrmw-fmin.ll | 110 +-
.../CodeGen/AMDGPU/flat-atomicrmw-fsub.ll | 106 +-
llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll | 2 +-
.../AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll | 6 +-
llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll | 6 +-
llvm/test/CodeGen/AMDGPU/function-args.ll | 251 +-
.../AMDGPU/gfx-callable-argument-types.ll | 222 +-
.../CodeGen/AMDGPU/global-atomicrmw-fadd.ll | 106 +-
.../CodeGen/AMDGPU/global-atomicrmw-fmax.ll | 110 +-
.../CodeGen/AMDGPU/global-atomicrmw-fmin.ll | 110 +-
.../CodeGen/AMDGPU/global-atomicrmw-fsub.ll | 106 +-
llvm/test/CodeGen/AMDGPU/idot4u.ll | 41 +-
.../CodeGen/AMDGPU/integer-mad-patterns.ll | 28 +-
.../CodeGen/AMDGPU/local-atomicrmw-fadd.ll | 60 +-
.../CodeGen/AMDGPU/local-atomicrmw-fmax.ll | 68 +-
.../CodeGen/AMDGPU/local-atomicrmw-fmin.ll | 68 +-
.../CodeGen/AMDGPU/local-atomicrmw-fsub.ll | 60 +-
llvm/test/CodeGen/AMDGPU/mad-mix-lo.ll | 31 +-
llvm/test/CodeGen/AMDGPU/mad.u16.ll | 7 +-
llvm/test/CodeGen/AMDGPU/preserve-hi16.ll | 54 +-
.../CodeGen/AMDGPU/shrink-add-sub-constant.ll | 6 +-
llvm/test/CodeGen/AMDGPU/vector-reduce-add.ll | 126 +-
.../test/CodeGen/AMDGPU/vector-reduce-umin.ll | 78 +-
45 files changed, 12480 insertions(+), 14018 deletions(-)
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index bd5dfa92a8e43..6488fa3dacfb3 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -3056,6 +3056,8 @@ def : GCNPat<
}
} // AddedComplexity = 1
+foreach p = [NotHasTrue16BitInsts, UseFakeTrue16Insts] in
+let True16Predicate = p in {
def : GCNPat<
(i32 (DivergentUnaryFrag<zext> i16:$src)),
(V_AND_B32_e64 (S_MOV_B32 (i32 0xffff)), $src)
@@ -3071,6 +3073,26 @@ def : GCNPat<
def : GCNPat<
(i32 (zext (i16 (bitconvert fp16_zeros_high_16bits:$src)))),
(COPY VSrc_b16:$src)>;
+}
+
+let True16Predicate = UseRealTrue16Insts in {
+def : GCNPat<
+ (i32 (DivergentUnaryFrag<zext> i16:$src)),
+ (REG_SEQUENCE VGPR_32, $src, lo16, (V_MOV_B16_t16_e64 0, (i16 0), 0), hi16)
+>;
+
+def : GCNPat<
+ (i64 (DivergentUnaryFrag<zext> i16:$src)),
+ (REG_SEQUENCE VReg_64,
+ (REG_SEQUENCE VGPR_32, $src, lo16, (V_MOV_B16_t16_e64 0, (i16 0), 0), hi16), sub0,
+ (S_MOV_B32 (i32 0)), sub1)
+>;
+
+def : GCNPat<
+ (i32 (zext (i16 (bitconvert fp16_zeros_high_16bits:$src)))),
+ (REG_SEQUENCE VGPR_32, $src, lo16, (V_MOV_B16_t16_e64 0, (i16 0), 0), hi16)
+>;
+}
def : GCNPat <
(i32 (trunc i64:$a)),
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
index 01854c8560ce2..637aaf7529364 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
@@ -164,7 +164,7 @@ define zeroext i16 @v_mul_i16_zeroext(i16 zeroext %num, i16 zeroext %den) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: v_mul_i16_zeroext:
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll
index 0d5f538215f18..d03d6a8940b2f 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll
@@ -6309,64 +6309,64 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
@@ -6394,50 +6394,50 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
@@ -6498,50 +6498,50 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v1
; GFX11-TRUE16-NEXT: .LBB12_4: ; %end
@@ -6549,307 +6549,266 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v1.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v66, 0xffff, v66
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v55, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v65.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v66, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v55, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v4.l, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v55, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v55, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v7.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v135.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v53, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v131.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v53, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v53
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v53
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v118.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v51, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v51, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v50, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v114.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v50, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v113.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v49, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v50
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v18.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v50
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v49, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v19.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v101.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v48, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v49
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v82.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v48, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v49
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v21.l, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v22.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v97.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v24.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v25.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v84.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v27.l, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v28.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v80.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v71.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v69.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v30.l, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v39
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v31.l, v31.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v68.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v32.l, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v39
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
@@ -15413,63 +15372,63 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:384
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:380
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:376
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:372
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:368
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:364
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:360
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:356
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:352
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:348
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:344
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:372
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:368
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:364
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:360
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:356
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:352
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:348
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:344
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:340
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v52, off, s32 offset:336
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:332
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v53, off, s32 offset:328
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:324
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:324
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v53, off, s32 offset:320
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:316
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:316
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v54, off, s32 offset:312
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:308
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v54, off, s32 offset:304
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:300
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v55, off, s32 offset:296
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:292
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:288
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:284
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:280
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:276
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:272
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:268
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:264
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:292
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:288
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:284
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:280
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:276
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:272
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:268
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:264
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v39, off, s32 offset:260
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:256
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:256
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v48, off, s32 offset:252
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:248
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v48, off, s32 offset:244
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:240
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:240
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v49, off, s32 offset:236
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:232
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:228
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:224
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:220
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:216
-; GFX11-TRUE16-NEXT: scratch_load_b32 v103, off, s32 offset:388
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:232
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:228
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:224
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:220
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:216
+; GFX11-TRUE16-NEXT: scratch_load_b32 v114, off, s32 offset:388
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v82, off, s32 offset:8
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:16
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v83, off, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v99, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:96
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v101, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:112
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v160, off, s32 offset:120
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v160, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v161, off, s32 offset:136
@@ -15483,144 +15442,143 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v164, off, s32 offset:192
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v165, off, s32 offset:200
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v165, off, s32 offset:208
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:212
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:204
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:196
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:188
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:180
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:172
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:164
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:212
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:204
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:196
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:188
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:180
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:172
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:164
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v84, off, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v86, off, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v96, off, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v102, off, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v112, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v114, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v103, off, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v103, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v113, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v115, off, s32 offset:12
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v116, off, s32 offset:4
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v115, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v117.l, v30.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v118.h, v28.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.l, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v128.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v129.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.h, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v132.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v147.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v148.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v150.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v151.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v146.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v134.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.h, 8, v29.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(62)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v145.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v144.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v131.h, 8, v29.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(54)
-; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v103
+; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v114
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(53)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.l, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v81.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(52)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v82.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(51)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v82.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.h, 8, v83.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(50)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v83.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(49)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v84.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.h, 8, v85.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(48)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v119.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v85.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(47)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v86.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v87.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(46)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v87.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(45)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v96.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.h, 8, v97.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(44)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v115.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.l, 8, v97.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(43)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v98.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v98.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(42)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v112.h, 8, v99.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(41)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.l, 8, v99.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v100.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(40)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.l, 8, v101.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(39)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v101.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.h, 8, v102.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(38)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v100.h, 8, v160.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(37)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v160.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v160.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(36)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.h, 8, v161.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(35)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v161.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v161.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(34)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.l, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v162.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v162.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.h, 8, v162.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(32)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v96.h, 8, v163.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v163.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v163.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.h, 8, v163.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(30)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v86.h, 8, v164.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v164.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v164.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.h, 8, v164.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(28)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v84.h, 8, v165.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.l, 8, v165.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v165.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.l, 8, v70.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v68.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v165.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v81.h, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v67.h, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v64.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v65.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v54.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v53.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v52.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v49.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.h, 8, v54.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v53.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v52.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v51.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v50.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
; GFX11-TRUE16-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX11-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -15634,746 +15592,660 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB14_3: ; %cmp.false
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v151.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v148.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v150.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v146.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v151.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v150.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v147.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v0.h, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v149.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v144.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v133.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v148.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v144.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v145.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v1.h, v147.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v149.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v146.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v146.l
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v0.l, v151.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v151.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v0.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v150.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v145.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v144.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v149, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v1.h, v150.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v132.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v131.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v148.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v130.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v149, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v2.l, v148.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v149.h
; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v134.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v131.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v2.l, v146.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v131.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v135.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v3.l, v145.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v128.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v118.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v134.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v145.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v130.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v149, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v3.l, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v147.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v135.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v119.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v118.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v149, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v4.l, v144.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v4.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v133.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v117.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v116.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v4.l, v135.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v129.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v134.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v5.l, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v132.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v114.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v6.l, v132.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v130.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v112.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v7.l, v130.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v129.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v102.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v8.l, v128.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v112.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v119.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v100.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v9.l, v118.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v102.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v117.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v97.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v10.l, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v100.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v115.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v87.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v11.l, v114.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v98.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v11.h, v113.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v85.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v12.l, v113.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v103.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v83.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v13.l, v103.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v101.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v81.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v14.l, v101.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v14.h, v99.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v71.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v15.l, v99.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v15.h, v98.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v69.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v16.l, v97.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v71.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v16.h, v96.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v67.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v17.l, v87.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v115.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v149, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v5.l, v135.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v5.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v132.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v129.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v113.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v112.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v149, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v6.l, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v6.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v103.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v128.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v103.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v101.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v149, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v7.l, v131.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v7.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v118.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v100.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v99.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v98.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v149, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v8.l, v129.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v116.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v114.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v96.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v149, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v9.l, v128.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v9.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v113.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v86.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v149, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v10.l, v117.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v10.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v102.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v84.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v82.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v81.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v149, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v11.l, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v11.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v101.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v99.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v80.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v80.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v149, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v12.l, v114.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v12.l, v149.h
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v69.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v17.h, v86.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v64.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v18.l, v85.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v67.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v18.h, v84.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v51.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v19.l, v83.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v97.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v69.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v68.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v149, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v13.l, v112.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v13.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v87.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v67.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v65.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v19.h, v82.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v20.l, v82.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v20.h, v80.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v21.l, v80.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v21.h, v70.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v22.l, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v22.h, v68.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v23.l, v68.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v23.h, v66.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v24.l, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v24.h, v65.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v27, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v25.l, v64.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v25.h, v55.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v28, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v26.l, v55.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v26.h, v54.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v27.l, v54.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v65.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v149, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v14.l, v102.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v14.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v85.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v83.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v55.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v149, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v15.l, v100.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v15.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v82.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v149, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v16.l, v98.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v16.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v71.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v149, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v17.l, v97.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v17.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v70.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v68.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v149, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v18.l, v87.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v18.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v66.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v149, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v19.l, v85.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v19.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v149, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v20.l, v83.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v20.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v54.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v149, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v21.l, v81.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v21.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v149, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v22.l, v71.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v22.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v32.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v28.h, v53.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v28.l, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v28.h, v52.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v33.l, v29.h, v51.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v34
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v29.l, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v149, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v23.l, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v23.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v51.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v149, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v24.l, v67.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v24.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v149, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v25.l, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v25.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v149, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v26.l, v64.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v26.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v149, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v27.l, v54.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v27.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v149, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v28.l, v53.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v28.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v149, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v29.l, v52.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v29.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v30.l, v50.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v32.l, v49.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v149, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v30.l, v51.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v30.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v149, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v31.l, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v31.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v149, v31
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB14_2
; GFX11-TRUE16-NEXT: .LBB14_4: ; %cmp.true
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v151.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v148.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v150.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v146.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v147.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v149.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v149.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v146.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v146.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v144.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v149.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v151.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v150.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v145.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v134.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v31, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v134.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v148.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v148.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v149.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v147.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v134.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v133.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v146.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v148.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v145.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v31, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v31.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v147.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v147.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v132.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v131.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v145.h, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v131.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v31, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v130.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v130.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v144.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v145.l, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v128.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v135.h, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v135.l, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v129.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.l, v32.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v134.l, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v118.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v133.l, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v31, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v135.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v135.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v119.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v31, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v118.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v117.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v133.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v133.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v132.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v31, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v131.h, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v132.l, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v116.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v132.l, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v117.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v130.h, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v114.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v130.l, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v115.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v31, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v113.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v112.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v129.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v129.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v129.l, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v112.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v128.l, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v112.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v119.h, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v102.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v118.l, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v102.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v31, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v128.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v128.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v103.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v103.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v31, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v101.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v100.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v117.h, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v118.l, v9.h
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v117.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v100.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v116.l, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v100.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v115.h, v11.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v97.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v114.h, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v98.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v31, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v116.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v116.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v99.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v98.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v31, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v96.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v96.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v114.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v114.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v113.h, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v87.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v113.l, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v96.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v103.h, v13.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v85.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v103.l, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v31, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v112.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v113.l, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v86.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v31, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v84.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v84.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v102.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v102.h, v13.h
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v101.h, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, v83.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v101.l, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v84.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, 0x300, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v99.h, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v31, v16
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v100.h, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v101.l, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v82.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v81.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v99.l, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v81.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v31, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v80.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v80.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v98.h, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v99.l, v15.h
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v98.l, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, v71.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v97.l, v16.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, v71.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v16.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, 0x300, v16.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v96.h, v17.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v31, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v97.l, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v97.h, v16.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v31, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v68.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v67.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v87.h, v17.h
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v86.h, v18.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, v67.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v85.l, v18.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, v67.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v18.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, 0x300, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v84.h, v19.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v64.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v83.h, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v31, v20
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v85.l, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v85.h, v18.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v65.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v65.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v31, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v55.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v50.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v83.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v83.h, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v82.h, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, v51.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v82.l, v20.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, v52.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v20.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, 0x300, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v80.h, v21.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v80.l, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v31, v22
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v81.h, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v82.l, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v49.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v31, v23
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v48.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v71.l, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v71.h, v21.h
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v70.h, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, v39.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v70.l, v22.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, v48.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v24
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v22.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v68.h, v23.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v68.l, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v31, v24
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v70.l, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v70.h, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v39.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v31, v25
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v38.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v67.h, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v68.l, v23.h
; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v66.l, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, v37.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v66.h, v24.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, v38.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v64.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v26, v31
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v65.l, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v31, v26
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v66.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v66.h, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v37.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v31, v27
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v36.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v64.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v64.h, v25.h
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v27, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v55.l, v26.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, v36.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v55.h, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, v35.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v28
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, 0x300, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v54.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v53.h, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v31, v28
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v55.l, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v35.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v31, v29
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v34.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v53.h, v27.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v54.l, v27.h
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, v34.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, v33.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, 0x300, v28.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v31, v30
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v34.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v53.l, v28.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v33.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v53.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v31, v34
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v50.h, v29.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v32.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v51.h, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v52.l, v29.h
; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v51.l, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v50.l, v30.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.l, 0x300, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v30.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v49.h, v30.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v32.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v31, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v50.h, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v51.l, v30.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v31, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.h, 0x300, v32.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v31, v32
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -42156,64 +42028,64 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
@@ -42241,50 +42113,50 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
@@ -42328,50 +42200,50 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v1
; GFX11-TRUE16-NEXT: .LBB36_4: ; %end
@@ -42379,307 +42251,266 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v1.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v66, 0xffff, v66
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v55, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v65.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v66, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v55, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v4.l, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v55, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v55, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v7.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v135.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v53, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v131.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v53, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v53
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v53
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v118.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v51, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v51, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v50, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v114.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v50, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v113.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v49, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v50
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v18.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v50
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v49, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v19.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v101.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v48, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v49
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v82.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v48, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v49
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v21.l, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v22.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v97.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v24.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v25.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v84.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v27.l, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v28.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v80.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v71.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v69.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v30.l, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v39
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v31.l, v31.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v68.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v32.l, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v39
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
@@ -52210,63 +52041,63 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:384
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:380
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:376
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:372
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:368
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:364
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:360
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:356
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:352
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:348
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:344
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:372
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:368
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:364
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:360
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:356
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:352
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:348
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:344
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:340
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v52, off, s32 offset:336
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:332
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v53, off, s32 offset:328
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:324
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:324
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v53, off, s32 offset:320
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:316
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:316
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v54, off, s32 offset:312
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:308
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v54, off, s32 offset:304
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:300
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v55, off, s32 offset:296
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:292
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:288
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:284
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:280
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:276
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:272
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:268
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:264
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:292
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:288
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:284
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:280
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:276
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:272
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:268
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:264
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v39, off, s32 offset:260
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:256
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:256
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v48, off, s32 offset:252
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:248
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v48, off, s32 offset:244
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:240
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:240
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v49, off, s32 offset:236
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:232
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:228
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:224
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:220
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:216
-; GFX11-TRUE16-NEXT: scratch_load_b32 v103, off, s32 offset:388
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:232
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:228
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:224
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:220
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:216
+; GFX11-TRUE16-NEXT: scratch_load_b32 v114, off, s32 offset:388
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v82, off, s32 offset:8
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:16
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v83, off, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v99, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:96
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v101, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:112
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v160, off, s32 offset:120
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v160, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v161, off, s32 offset:136
@@ -52280,144 +52111,143 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v164, off, s32 offset:192
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v165, off, s32 offset:200
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v165, off, s32 offset:208
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:212
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:204
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:196
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:188
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:180
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:172
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:164
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:212
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:204
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:196
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:188
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:180
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:172
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:164
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v84, off, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v86, off, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v96, off, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v102, off, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v112, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v114, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v103, off, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v103, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v113, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v115, off, s32 offset:12
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v116, off, s32 offset:4
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v115, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v117.l, v30.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v118.h, v28.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.l, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v128.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v129.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.h, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v132.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v147.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v148.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v150.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v151.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v146.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v134.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.h, 8, v29.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(62)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v145.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v144.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v131.h, 8, v29.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(54)
-; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v103
+; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v114
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(53)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.l, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v81.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(52)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v82.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(51)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v82.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.h, 8, v83.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(50)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v83.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(49)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v84.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.h, 8, v85.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(48)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v119.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v85.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(47)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v86.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v87.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(46)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v87.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(45)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v96.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.h, 8, v97.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(44)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v115.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.l, 8, v97.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(43)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v98.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v98.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(42)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v112.h, 8, v99.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(41)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.l, 8, v99.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v100.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(40)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.l, 8, v101.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(39)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v101.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.h, 8, v102.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(38)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v100.h, 8, v160.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(37)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v160.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v160.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(36)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.h, 8, v161.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(35)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v161.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v161.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(34)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.l, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v162.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v162.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.h, 8, v162.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(32)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v96.h, 8, v163.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v163.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v163.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.h, 8, v163.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(30)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v86.h, 8, v164.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v164.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v164.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.h, 8, v164.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(28)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v84.h, 8, v165.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.l, 8, v165.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v165.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.l, 8, v70.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v68.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v165.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v81.h, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v67.h, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v64.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v65.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v54.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v53.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v52.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v49.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.h, 8, v54.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v53.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v52.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v51.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v50.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
; GFX11-TRUE16-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX11-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -52431,746 +52261,660 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB38_3: ; %cmp.false
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v151.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v148.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v150.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v146.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v151.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v150.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v147.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v0.h, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v149.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v144.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v133.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v148.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v144.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v145.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v1.h, v147.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v149.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v146.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v146.l
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v0.l, v151.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v151.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v0.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v150.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v145.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v144.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v149, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v1.h, v150.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v132.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v131.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v148.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v130.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v149, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v2.l, v148.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v149.h
; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v134.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v131.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v2.l, v146.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v131.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v135.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v3.l, v145.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v128.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v118.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v134.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v145.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v130.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v149, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v3.l, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v147.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v135.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v119.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v118.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v149, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v4.l, v144.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v4.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v133.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v117.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v116.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v4.l, v135.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v129.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v134.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v5.l, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v132.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v114.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v6.l, v132.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v130.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v112.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v7.l, v130.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v129.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v102.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v8.l, v128.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v112.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v119.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v100.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v9.l, v118.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v102.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v117.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v97.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v10.l, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v100.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v115.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v87.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v11.l, v114.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v98.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v11.h, v113.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v85.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v12.l, v113.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v103.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v83.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v13.l, v103.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v101.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v81.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v14.l, v101.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v14.h, v99.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v71.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v15.l, v99.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v15.h, v98.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v69.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v16.l, v97.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v71.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v16.h, v96.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v67.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v17.l, v87.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v115.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v149, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v5.l, v135.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v5.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v132.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v129.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v113.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v112.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v149, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v6.l, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v6.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v103.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v128.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v103.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v101.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v149, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v7.l, v131.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v7.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v118.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v100.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v99.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v98.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v149, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v8.l, v129.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v116.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v114.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v96.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v149, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v9.l, v128.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v9.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v113.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v86.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v149, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v10.l, v117.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v10.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v102.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v84.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v82.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v81.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v149, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v11.l, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v11.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v101.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v99.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v80.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v80.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v149, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v12.l, v114.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v12.l, v149.h
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v69.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v17.h, v86.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v64.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v18.l, v85.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v67.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v18.h, v84.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v51.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v19.l, v83.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v97.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v69.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v68.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v149, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v13.l, v112.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v13.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v87.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v67.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v65.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v19.h, v82.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v20.l, v82.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v20.h, v80.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v21.l, v80.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v21.h, v70.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v22.l, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v22.h, v68.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v23.l, v68.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v23.h, v66.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v24.l, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v24.h, v65.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v27, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v25.l, v64.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v25.h, v55.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v28, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v26.l, v55.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v26.h, v54.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v27.l, v54.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v65.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v149, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v14.l, v102.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v14.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v85.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v83.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v55.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v149, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v15.l, v100.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v15.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v82.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v149, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v16.l, v98.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v16.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v71.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v149, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v17.l, v97.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v17.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v70.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v68.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v149, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v18.l, v87.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v18.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v66.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v149, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v19.l, v85.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v19.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v149, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v20.l, v83.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v20.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v54.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v149, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v21.l, v81.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v21.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v149, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v22.l, v71.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v22.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v32.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v28.h, v53.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v28.l, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v28.h, v52.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v33.l, v29.h, v51.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v34
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v29.l, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v149, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v23.l, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v23.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v51.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v149, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v24.l, v67.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v24.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v149, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v25.l, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v25.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v149, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v26.l, v64.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v26.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v149, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v27.l, v54.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v27.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v149, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v28.l, v53.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v28.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v149, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v29.l, v52.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v29.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v30.l, v50.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v32.l, v49.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v149, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v30.l, v51.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v30.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v149, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v31.l, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v31.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v149, v31
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB38_2
; GFX11-TRUE16-NEXT: .LBB38_4: ; %cmp.true
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v151.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v148.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v150.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v146.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v147.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v149.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v149.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v146.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v146.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v144.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v149.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v151.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v150.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v145.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v134.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v31, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v134.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v148.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v148.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v149.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v147.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v134.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v133.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v146.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v148.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v145.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v31, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v31.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v147.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v147.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v132.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v131.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v145.h, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v131.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v31, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v130.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v130.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v144.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v145.l, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v128.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v135.h, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v135.l, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v129.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.l, v32.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v134.l, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v118.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v133.l, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v31, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v135.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v135.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v119.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v31, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v118.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v117.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v133.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v133.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v132.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v31, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v131.h, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v132.l, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v116.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v132.l, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v117.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v130.h, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v114.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v130.l, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v115.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v31, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v113.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v112.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v129.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v129.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v129.l, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v112.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v128.l, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v112.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v119.h, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v102.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v118.l, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v102.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v31, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v128.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v128.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v103.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v103.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v31, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v101.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v100.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v117.h, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v118.l, v9.h
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v117.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v100.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v116.l, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v100.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v115.h, v11.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v97.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v114.h, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v98.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v31, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v116.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v116.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v99.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v98.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v31, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v96.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v96.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v114.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v114.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v113.h, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v87.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v113.l, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v96.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v103.h, v13.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v85.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v103.l, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v31, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v112.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v113.l, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v86.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v31, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v84.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v84.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v102.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v102.h, v13.h
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v101.h, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, v83.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v101.l, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v84.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, 0x300, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v99.h, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v31, v16
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v100.h, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v101.l, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v82.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v81.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v99.l, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v81.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v31, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v80.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v80.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v98.h, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v99.l, v15.h
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v98.l, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, v71.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v97.l, v16.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, v71.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v16.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, 0x300, v16.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v96.h, v17.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v31, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v97.l, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v97.h, v16.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v31, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v68.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v67.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v87.h, v17.h
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v86.h, v18.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, v67.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v85.l, v18.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, v67.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v18.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, 0x300, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v84.h, v19.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v64.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v83.h, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v31, v20
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v85.l, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v85.h, v18.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v65.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v65.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v31, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v55.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v50.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v83.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v83.h, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v82.h, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, v51.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v82.l, v20.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, v52.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v20.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, 0x300, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v80.h, v21.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v80.l, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v31, v22
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v81.h, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v82.l, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v49.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v31, v23
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v48.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v71.l, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v71.h, v21.h
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v70.h, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, v39.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v70.l, v22.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, v48.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v24
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v22.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v68.h, v23.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v68.l, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v31, v24
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v70.l, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v70.h, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v39.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v31, v25
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v38.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v67.h, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v68.l, v23.h
; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v66.l, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, v37.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v66.h, v24.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, v38.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v64.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v26, v31
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v65.l, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v31, v26
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v66.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v66.h, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v37.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v31, v27
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v36.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v64.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v64.h, v25.h
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v27, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v55.l, v26.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, v36.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v55.h, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, v35.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v28
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, 0x300, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v54.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v53.h, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v31, v28
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v55.l, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v35.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v31, v29
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v34.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v53.h, v27.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v54.l, v27.h
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, v34.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, v33.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, 0x300, v28.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v31, v30
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v34.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v53.l, v28.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v33.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v53.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v31, v34
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v50.h, v29.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v32.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v51.h, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v52.l, v29.h
; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v51.l, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v50.l, v30.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.l, 0x300, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v30.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v49.h, v30.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v32.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v31, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v50.h, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v51.l, v30.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v31, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.h, 0x300, v32.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v31, v32
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -77938,64 +77682,64 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
@@ -78023,50 +77767,50 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
@@ -78135,50 +77879,50 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v1
; GFX11-TRUE16-NEXT: .LBB56_4: ; %end
@@ -78186,307 +77930,266 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v1.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v66, 0xffff, v66
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v55, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v65.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v66, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v55, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v4.l, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v55, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v55, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v7.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v135.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v53, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v131.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v53, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v53
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v53
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v118.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v51, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v51, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v50, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v114.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v50, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v113.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v49, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v50
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v18.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v50
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v49, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v19.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v101.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v48, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v49
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v82.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v48, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v49
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v21.l, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v22.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v97.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v24.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v25.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v84.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v27.l, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v28.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v80.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v71.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v69.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v30.l, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v39
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v31.l, v31.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v68.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v32.l, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v39
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
@@ -87060,63 +86763,63 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:384
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:380
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:376
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:372
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:368
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:364
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:360
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:356
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:352
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:348
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:344
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:372
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:368
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:364
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:360
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:356
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:352
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:348
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:344
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:340
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v52, off, s32 offset:336
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:332
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v53, off, s32 offset:328
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:324
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:324
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v53, off, s32 offset:320
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:316
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:316
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v54, off, s32 offset:312
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:308
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v54, off, s32 offset:304
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:300
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v55, off, s32 offset:296
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:292
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:288
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:284
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:280
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:276
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:272
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:268
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:264
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:292
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:288
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:284
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:280
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:276
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:272
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:268
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:264
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v39, off, s32 offset:260
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:256
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:256
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v48, off, s32 offset:252
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:248
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v48, off, s32 offset:244
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:240
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:240
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v49, off, s32 offset:236
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:232
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:228
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:224
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:220
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:216
-; GFX11-TRUE16-NEXT: scratch_load_b32 v103, off, s32 offset:388
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:232
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:228
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:224
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:220
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:216
+; GFX11-TRUE16-NEXT: scratch_load_b32 v114, off, s32 offset:388
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v82, off, s32 offset:8
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:16
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v83, off, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v99, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:96
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v101, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:112
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v160, off, s32 offset:120
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v160, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v161, off, s32 offset:136
@@ -87130,144 +86833,143 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v164, off, s32 offset:192
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v165, off, s32 offset:200
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v165, off, s32 offset:208
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:212
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:204
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:196
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:188
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:180
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:172
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:164
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:212
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:204
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:196
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:188
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:180
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:172
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:164
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v84, off, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v86, off, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v96, off, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v102, off, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v112, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v114, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v103, off, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v103, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v113, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v115, off, s32 offset:12
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v116, off, s32 offset:4
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v115, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v117.l, v30.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v118.h, v28.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.l, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v128.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v129.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.h, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v132.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v147.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v148.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v150.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v151.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v146.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v134.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.h, 8, v29.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(62)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v145.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v144.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v131.h, 8, v29.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(54)
-; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v103
+; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v114
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(53)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.l, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v81.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(52)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v82.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(51)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v82.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.h, 8, v83.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(50)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v83.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(49)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v84.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.h, 8, v85.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(48)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v119.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v85.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(47)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v86.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v87.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(46)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v87.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(45)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v96.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.h, 8, v97.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(44)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v115.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.l, 8, v97.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(43)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v98.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v98.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(42)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v112.h, 8, v99.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(41)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.l, 8, v99.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v100.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(40)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.l, 8, v101.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(39)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v101.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.h, 8, v102.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(38)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v100.h, 8, v160.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(37)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v160.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v160.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(36)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.h, 8, v161.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(35)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v161.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v161.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(34)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.l, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v162.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v162.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.h, 8, v162.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(32)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v96.h, 8, v163.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v163.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v163.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.h, 8, v163.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(30)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v86.h, 8, v164.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v164.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v164.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.h, 8, v164.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(28)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v84.h, 8, v165.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.l, 8, v165.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v165.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.l, 8, v70.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v68.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v165.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v81.h, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v67.h, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v64.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v65.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v54.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v53.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v52.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v49.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.h, 8, v54.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v53.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v52.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v51.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v50.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
; GFX11-TRUE16-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX11-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -87281,746 +86983,660 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB58_3: ; %cmp.false
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v151.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v148.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v150.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v146.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v151.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v150.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v147.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v0.h, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v149.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v144.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v133.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v148.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v144.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v145.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v1.h, v147.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v149.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v146.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v146.l
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v0.l, v151.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v151.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v0.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v150.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v145.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v144.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v149, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v1.h, v150.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v132.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v131.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v148.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v130.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v149, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v2.l, v148.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v149.h
; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v134.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v131.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v2.l, v146.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v131.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v135.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v3.l, v145.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v128.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v118.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v134.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v145.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v130.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v149, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v3.l, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v147.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v135.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v119.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v118.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v149, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v4.l, v144.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v4.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v133.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v117.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v116.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v4.l, v135.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v129.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v134.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v5.l, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v132.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v114.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v6.l, v132.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v130.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v112.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v7.l, v130.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v129.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v102.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v8.l, v128.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v112.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v119.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v100.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v9.l, v118.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v102.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v117.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v97.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v10.l, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v100.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v115.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v87.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v11.l, v114.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v98.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v11.h, v113.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v85.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v12.l, v113.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v103.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v83.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v13.l, v103.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v101.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v81.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v14.l, v101.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v14.h, v99.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v71.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v15.l, v99.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v15.h, v98.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v69.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v16.l, v97.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v71.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v16.h, v96.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v67.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v17.l, v87.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v115.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v149, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v5.l, v135.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v5.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v132.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v129.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v113.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v112.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v149, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v6.l, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v6.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v103.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v128.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v103.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v101.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v149, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v7.l, v131.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v7.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v118.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v100.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v99.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v98.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v149, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v8.l, v129.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v116.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v114.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v96.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v149, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v9.l, v128.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v9.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v113.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v86.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v149, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v10.l, v117.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v10.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v102.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v84.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v82.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v81.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v149, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v11.l, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v11.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v101.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v99.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v80.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v80.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v149, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v12.l, v114.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v12.l, v149.h
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v69.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v17.h, v86.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v64.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v18.l, v85.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v67.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v18.h, v84.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v51.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v19.l, v83.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v97.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v69.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v68.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v149, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v13.l, v112.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v13.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v87.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v67.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v65.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v19.h, v82.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v20.l, v82.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v20.h, v80.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v21.l, v80.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v21.h, v70.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v22.l, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v22.h, v68.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v23.l, v68.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v23.h, v66.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v24.l, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v24.h, v65.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v27, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v25.l, v64.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v25.h, v55.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v28, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v26.l, v55.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v26.h, v54.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v27.l, v54.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v65.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v149, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v14.l, v102.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v14.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v85.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v83.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v55.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v149, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v15.l, v100.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v15.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v82.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v149, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v16.l, v98.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v16.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v71.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v149, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v17.l, v97.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v17.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v70.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v68.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v149, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v18.l, v87.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v18.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v66.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v149, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v19.l, v85.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v19.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v149, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v20.l, v83.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v20.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v54.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v149, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v21.l, v81.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v21.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v149, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v22.l, v71.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v22.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v32.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v28.h, v53.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v28.l, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v28.h, v52.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v33.l, v29.h, v51.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v34
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v29.l, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v149, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v23.l, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v23.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v51.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v149, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v24.l, v67.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v24.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v149, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v25.l, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v25.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v149, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v26.l, v64.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v26.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v149, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v27.l, v54.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v27.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v149, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v28.l, v53.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v28.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v149, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v29.l, v52.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v29.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v30.l, v50.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v32.l, v49.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v149, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v30.l, v51.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v30.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v149, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v31.l, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v31.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v149, v31
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB58_2
; GFX11-TRUE16-NEXT: .LBB58_4: ; %cmp.true
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v151.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v148.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v150.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v146.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v147.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v149.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v149.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v146.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v146.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v144.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v149.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v151.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v150.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v145.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v134.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v31, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v134.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v148.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v148.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v149.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v147.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v134.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v133.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v146.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v148.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v145.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v31, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v31.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v147.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v147.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v132.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v131.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v145.h, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v131.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v31, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v130.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v130.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v144.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v145.l, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v128.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v135.h, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v135.l, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v129.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.l, v32.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v134.l, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v118.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v133.l, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v31, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v135.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v135.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v119.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v31, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v118.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v117.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v133.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v133.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v132.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v31, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v131.h, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v132.l, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v116.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v132.l, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v117.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v130.h, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v114.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v130.l, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v115.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v31, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v113.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v112.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v129.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v129.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v129.l, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v112.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v128.l, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v112.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v119.h, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v102.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v118.l, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v102.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v31, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v128.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v128.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v103.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v103.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v31, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v101.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v100.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v117.h, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v118.l, v9.h
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v117.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v100.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v116.l, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v100.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v115.h, v11.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v97.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v114.h, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v98.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v31, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v116.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v116.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v99.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v98.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v31, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v96.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v96.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v114.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v114.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v113.h, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v87.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v113.l, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v96.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v103.h, v13.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v85.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v103.l, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v31, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v112.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v113.l, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v86.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v31, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v84.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v84.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v102.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v102.h, v13.h
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v101.h, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, v83.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v101.l, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v84.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, 0x300, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v99.h, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v31, v16
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v100.h, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v101.l, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v82.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v81.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v99.l, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v81.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v31, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v80.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v80.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v98.h, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v99.l, v15.h
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v98.l, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, v71.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v97.l, v16.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, v71.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v16.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, 0x300, v16.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v96.h, v17.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v31, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v97.l, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v97.h, v16.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v31, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v68.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v67.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v87.h, v17.h
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v86.h, v18.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, v67.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v85.l, v18.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, v67.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v18.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, 0x300, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v84.h, v19.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v64.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v83.h, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v31, v20
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v85.l, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v85.h, v18.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v65.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v65.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v31, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v55.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v50.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v83.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v83.h, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v82.h, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, v51.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v82.l, v20.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, v52.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v20.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, 0x300, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v80.h, v21.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v80.l, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v31, v22
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v81.h, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v82.l, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v49.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v31, v23
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v48.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v71.l, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v71.h, v21.h
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v70.h, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, v39.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v70.l, v22.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, v48.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v24
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v22.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v68.h, v23.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v68.l, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v31, v24
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v70.l, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v70.h, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v39.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v31, v25
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v38.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v67.h, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v68.l, v23.h
; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v66.l, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, v37.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v66.h, v24.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, v38.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v64.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v26, v31
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v65.l, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v31, v26
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v66.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v66.h, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v37.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v31, v27
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v36.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v64.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v64.h, v25.h
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v27, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v55.l, v26.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, v36.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v55.h, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, v35.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v28
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, 0x300, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v54.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v53.h, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v31, v28
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v55.l, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v35.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v31, v29
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v34.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v53.h, v27.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v54.l, v27.h
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, v34.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, v33.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, 0x300, v28.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v31, v30
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v34.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v53.l, v28.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v33.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v53.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v31, v34
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v50.h, v29.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v32.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v51.h, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v52.l, v29.h
; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v51.l, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v50.l, v30.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.l, 0x300, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v30.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v49.h, v30.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v32.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v31, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v50.h, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v51.l, v30.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v31, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.h, 0x300, v32.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v31, v32
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -111800,64 +111416,64 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
@@ -111885,50 +111501,50 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
@@ -111972,50 +111588,50 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v1
; GFX11-TRUE16-NEXT: .LBB72_4: ; %end
@@ -112023,307 +111639,266 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v1.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v66, 0xffff, v66
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v55, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v65.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v66, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v55, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v4.l, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v55, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v55, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v7.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v135.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v53, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v131.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v53, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v53
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v53
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v118.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v51, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v51, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v50, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v114.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v50, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v113.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v49, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v50
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v18.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v50
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v49, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v19.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v101.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v48, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v49
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v82.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v48, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v49
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v21.l, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v22.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v97.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v24.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v25.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v84.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v27.l, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v28.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v80.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v71.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v69.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v30.l, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v39
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v31.l, v31.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v68.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v39
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v32.l, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v39
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
@@ -121839,63 +121414,63 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:384
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:380
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:376
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:372
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:368
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:364
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:360
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:356
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:352
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:348
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:344
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:372
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:368
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:364
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:360
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:356
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:352
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:348
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:344
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:340
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v52, off, s32 offset:336
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:332
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v53, off, s32 offset:328
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:324
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:324
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v53, off, s32 offset:320
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:316
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:316
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v54, off, s32 offset:312
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:308
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v54, off, s32 offset:304
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:300
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v55, off, s32 offset:296
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:292
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:288
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:284
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:280
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:276
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:272
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:268
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:264
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:292
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:288
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:284
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:280
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:276
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:272
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:268
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:264
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v39, off, s32 offset:260
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:256
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:256
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v48, off, s32 offset:252
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:248
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v48, off, s32 offset:244
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:240
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:240
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v49, off, s32 offset:236
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:232
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:228
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:224
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:220
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:216
-; GFX11-TRUE16-NEXT: scratch_load_b32 v103, off, s32 offset:388
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:232
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:228
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:224
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:220
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:216
+; GFX11-TRUE16-NEXT: scratch_load_b32 v114, off, s32 offset:388
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v82, off, s32 offset:8
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:16
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v83, off, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v99, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:96
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v101, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:112
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v160, off, s32 offset:120
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v160, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v161, off, s32 offset:136
@@ -121909,144 +121484,143 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v164, off, s32 offset:192
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v165, off, s32 offset:200
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v165, off, s32 offset:208
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:212
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:204
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:196
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:188
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:180
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:172
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:164
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:212
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:204
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:196
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:188
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:180
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:172
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:164
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v84, off, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v86, off, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v96, off, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v102, off, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v112, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v114, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v103, off, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v103, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v113, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v115, off, s32 offset:12
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v116, off, s32 offset:4
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v115, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v117.l, v30.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v118.h, v28.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.l, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v128.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v129.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.h, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v132.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v147.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v148.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v150.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v151.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v146.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v134.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.h, 8, v29.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(62)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v145.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v144.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v131.h, 8, v29.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(54)
-; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v103
+; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v114
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(53)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.l, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v81.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(52)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v82.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(51)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v82.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.h, 8, v83.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(50)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v83.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(49)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v84.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.h, 8, v85.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(48)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v119.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v85.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(47)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v86.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v87.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(46)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v87.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(45)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v96.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.h, 8, v97.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(44)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v115.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.l, 8, v97.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(43)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v98.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v98.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(42)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v112.h, 8, v99.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(41)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.l, 8, v99.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v100.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(40)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.l, 8, v101.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(39)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v101.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.h, 8, v102.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(38)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v100.h, 8, v160.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(37)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v160.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v160.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(36)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.h, 8, v161.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(35)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v161.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v161.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(34)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.l, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v162.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v162.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.h, 8, v162.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(32)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v96.h, 8, v163.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v163.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v163.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.h, 8, v163.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(30)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v86.h, 8, v164.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v164.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v164.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.h, 8, v164.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(28)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v84.h, 8, v165.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.l, 8, v165.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v165.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.l, 8, v70.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v68.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v165.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v81.h, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v67.h, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v64.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v65.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v54.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v53.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v52.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v49.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.h, 8, v54.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v53.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v52.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v51.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v50.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
; GFX11-TRUE16-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX11-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -122060,746 +121634,660 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB74_3: ; %cmp.false
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v151.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v148.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v150.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v146.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v151.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v150.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v147.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v0.h, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v149.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v144.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v133.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v148.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v144.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v145.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v1.h, v147.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v149.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v146.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v146.l
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v0.l, v151.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v151.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v0.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v150.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v145.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v144.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v149, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v1.h, v150.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v132.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v131.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v148.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v130.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v149, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v2.l, v148.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v149.h
; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v134.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v131.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v2.l, v146.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v131.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v135.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v3.l, v145.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v128.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v118.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v134.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v145.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v130.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v149, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v3.l, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v147.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v135.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v119.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v118.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v149, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v4.l, v144.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v4.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v133.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v117.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v116.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v4.l, v135.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v129.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v134.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v5.l, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v132.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v114.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v6.l, v132.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v130.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v112.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v7.l, v130.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v129.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v102.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v8.l, v128.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v112.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v119.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v100.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v9.l, v118.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v102.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v117.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v97.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v10.l, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v100.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v115.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v87.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v11.l, v114.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v98.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v11.h, v113.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v85.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v12.l, v113.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v103.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v83.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v13.l, v103.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v101.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v81.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v14.l, v101.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v14.h, v99.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v71.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v15.l, v99.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v15.h, v98.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v69.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v16.l, v97.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v71.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v16.h, v96.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v67.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v17.l, v87.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v115.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v149, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v5.l, v135.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v5.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v132.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v129.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v113.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v112.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v149, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v6.l, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v6.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v103.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v128.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v103.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v101.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v149, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v7.l, v131.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v7.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v118.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v100.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v99.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v98.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v149, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v8.l, v129.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v116.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v114.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v96.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v149, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v9.l, v128.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v9.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v113.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v86.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v149, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v10.l, v117.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v10.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v102.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v84.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v82.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v81.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v149, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v11.l, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v11.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v101.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v99.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v80.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v80.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v149, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v12.l, v114.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v12.l, v149.h
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v69.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v17.h, v86.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v64.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v18.l, v85.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v67.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v18.h, v84.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v51.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v19.l, v83.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v97.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v69.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v68.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v149, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v13.l, v112.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v13.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v87.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v67.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v65.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v19.h, v82.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v20.l, v82.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v20.h, v80.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v21.l, v80.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v21.h, v70.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v22.l, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v22.h, v68.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v23.l, v68.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v23.h, v66.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v24.l, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v24.h, v65.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v27, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v25.l, v64.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v25.h, v55.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v28, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v26.l, v55.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v26.h, v54.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v27.l, v54.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v65.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v149, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v14.l, v102.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v14.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v85.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v83.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v55.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v149, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v15.l, v100.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v15.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v82.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v149, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v16.l, v98.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v16.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v71.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v149, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v17.l, v97.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v17.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v70.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v68.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v149, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v18.l, v87.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v18.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v66.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v149, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v19.l, v85.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v19.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v149, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v20.l, v83.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v20.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v54.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v149, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v21.l, v81.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v21.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v149, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v22.l, v71.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v22.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v32.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v28.h, v53.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v28.l, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v28.h, v52.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v33.l, v29.h, v51.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v34
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v29.l, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v149, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v23.l, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v23.l, v149.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v51.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v149, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v24.l, v67.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v24.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v149, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v25.l, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v25.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v149, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v26.l, v64.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v26.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v149, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v27.l, v54.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v27.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v149, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v28.l, v53.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v28.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v149, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v29.l, v52.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v29.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v30.l, v50.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v32.l, v49.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v149, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v30.l, v51.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v30.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v149, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v31.l, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v31.l, v149.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v149, v31
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB74_2
; GFX11-TRUE16-NEXT: .LBB74_4: ; %cmp.true
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v151.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v148.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v150.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v146.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v147.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v149.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v149.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v146.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v146.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v144.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v149.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v151.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v150.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v145.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v134.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v31, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v134.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v148.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v148.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v149.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v147.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v134.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v133.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v146.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v148.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v145.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v31, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v31.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v147.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v147.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v132.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v131.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v145.h, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v131.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v31, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v130.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v130.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v144.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v145.l, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v128.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v135.h, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v135.l, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v129.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.l, v32.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v134.l, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v118.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v133.l, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v31, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v135.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v135.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v119.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v31, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v118.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v117.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v133.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v133.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v132.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v31, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v131.h, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v132.l, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v116.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v132.l, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v117.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v130.h, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v114.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v130.l, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v115.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v31, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v113.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v112.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v129.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v129.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v129.l, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v112.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v128.l, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v112.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v119.h, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v102.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v118.l, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v102.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v31, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v128.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v128.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v103.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v103.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v31, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v101.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v100.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v117.h, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v118.l, v9.h
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v117.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v100.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v116.l, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v100.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v115.h, v11.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v97.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v114.h, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v98.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v31, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v116.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v116.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v99.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v98.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v31, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v96.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v96.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v114.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v114.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v113.h, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v87.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v113.l, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v96.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v103.h, v13.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v85.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v103.l, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v31, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v112.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v113.l, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v86.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v31, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v84.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v84.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v102.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v102.h, v13.h
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v101.h, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, v83.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v101.l, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v84.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, 0x300, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v99.h, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v31, v16
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v100.h, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v101.l, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v82.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v81.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v99.l, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v81.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v31, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v80.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v80.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v98.h, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v99.l, v15.h
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v98.l, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, v71.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v97.l, v16.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, v71.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v16.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, 0x300, v16.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v96.h, v17.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v31, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v97.l, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v97.h, v16.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v31, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v68.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v67.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v87.h, v17.h
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v86.h, v18.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, v67.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v85.l, v18.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, v67.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v18.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, 0x300, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v84.h, v19.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v64.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v83.h, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v31, v20
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v85.l, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v85.h, v18.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v65.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v65.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v31, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v55.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v50.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v83.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v83.h, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v82.h, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, v51.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v82.l, v20.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, v52.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v20.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, 0x300, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v80.h, v21.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v80.l, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v31, v22
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v81.h, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v82.l, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v49.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v31, v23
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v48.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v71.l, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v71.h, v21.h
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v70.h, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, v39.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v70.l, v22.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, v48.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v24
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v22.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v68.h, v23.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v68.l, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v31, v24
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v70.l, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v70.h, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v39.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v31, v25
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v38.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v67.h, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v68.l, v23.h
; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v66.l, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, v37.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v66.h, v24.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, v38.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v64.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v26, v31
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v65.l, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v31, v26
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v66.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v66.h, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v37.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v31, v27
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v36.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v64.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v64.h, v25.h
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v27, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v55.l, v26.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, v36.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v55.h, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, v35.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v28
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, 0x300, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v54.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v53.h, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v31, v28
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v55.l, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v35.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v31, v29
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v34.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v53.h, v27.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v54.l, v27.h
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, v34.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, v33.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, 0x300, v28.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v31, v30
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v34.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v53.l, v28.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v33.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v53.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v31, v34
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v50.h, v29.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v32.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v51.h, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v52.l, v29.h
; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v51.l, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v50.l, v30.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.l, 0x300, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v30.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v49.h, v30.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v32.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v31, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v50.h, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v51.l, v30.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v31, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.h, 0x300, v32.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v31, v32
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -160089,159 +159577,162 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v40, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v41, s32 offset:152
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v42, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v43, s32 offset:144
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v44, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v45, s32 offset:136
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v46, s32 offset:132
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v47, s32 offset:128
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v56, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v57, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v58, s32 offset:116
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v59, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v60, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v61, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v62, s32 offset:100
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v63, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v72, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v73, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v74, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v75, s32 offset:80
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v76, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v77, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v78, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v79, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v88, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v89, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v90, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v91, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v92, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v93, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v94, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v95, s32 offset:32
-; GFX11-TRUE16-NEXT: s_clause 0x4
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v104, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v105, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v106, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v107, s32 offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v108, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v40, s32 offset:168
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v41, s32 offset:164
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v42, s32 offset:160
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v43, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v44, s32 offset:152
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v45, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v46, s32 offset:144
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v47, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v56, s32 offset:136
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v57, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v58, s32 offset:128
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v59, s32 offset:124
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v60, s32 offset:120
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v61, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v62, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v63, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v72, s32 offset:104
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v73, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v74, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v75, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v76, s32 offset:88
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v77, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v78, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v79, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v88, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v89, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v90, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v91, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v92, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v93, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v94, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v95, s32 offset:44
+; GFX11-TRUE16-NEXT: s_clause 0x7
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v104, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v105, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v106, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v107, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v108, s32 offset:24
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v109, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v110, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v111, s32 offset:12
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_load_b32 v33, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_b32 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: scratch_load_b32 v31, off, s32
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr111_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr106_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr105_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr108_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr104_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr107_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr105_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr164_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr106_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr94_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr90_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr180_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr91_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr165_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr95_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr93_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr166_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr88_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr75_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr47_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr78_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr76_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr179_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr72_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr43_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr74_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr177_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr63_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr178_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr59_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr60_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr73_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr58_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr57_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr44_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr56_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr41_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr42_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr89_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr43_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr61_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr183_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr57_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr167_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr104_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr176_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr78_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr166_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr77_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr95_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr93_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr47_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr44_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr45_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr92_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr79_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr40_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr74_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr62_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr63_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr59_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr182_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr62_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr180_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr108_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr176_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr60_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr46_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr45_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr91_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr165_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr89_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr163_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr110_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr40_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr107_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr109_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr181_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr182_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr177_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr94_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr90_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr79_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr163_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr77_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr75_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr72_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr61_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr58_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr56_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr46_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr42_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr183_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr181_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr179_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr167_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr164_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v33
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
@@ -160250,143 +159741,142 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB90_2
; GFX11-TRUE16-NEXT: ; %bb.1: ; %cmp.false
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v176, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v167, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v183, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v43, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v42, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v56, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v58, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v59, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v72, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v76, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v75, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v88, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v91, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v90, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v94, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v106, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v180, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v182, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v40, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v45, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v47, 8, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v57, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v60, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v63, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v74, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v76, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v78, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v88, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v93, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v95, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v104, 8, v3
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v105, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v107, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v108, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 24, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v177, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v182, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v181, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v40, 8, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v45, 8, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v46, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v60, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v63, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v62, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v74, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v79, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v132.h, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v164.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v148.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v180.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v165.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v161.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v47.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v179.h, v8.l
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v106, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v111, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v167, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v179, 8, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v181, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v183, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v42, 8, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v46, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v56, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v58, 8, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v61, 8, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v72, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v75, 8, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v77, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v79, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v90, 8, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v94, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.h, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v128.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v129.h, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v135.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v166.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v150.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v151.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v43.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v177.h, v8.l
; GFX11-TRUE16-NEXT: v_mov_b16_e64 v178.h, v8.h
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v73.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v44.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v41.h, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v89.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v61.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v57.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v104.h, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v78.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v77.h, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v95.h, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v93.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v92.h, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v71.h, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v70.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v84.h, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.h, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v86.h, v21.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v83.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v82.h, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v97.h, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v87.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v85.h, v24.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v101.h, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v98.h, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v96.h, v26.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v112.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v100.h, v28.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v99.h, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v113.h, v29.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v103.h, v30.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v102.h, v30.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v116.h, v31.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v115.h, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v114.h, v32.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v41.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v44.h, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v92.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v59.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v62.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v108.h, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v91.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v89.h, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v110.h, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v107.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v109.h, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v82.h, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v85.h, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v83.h, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v84.h, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v96.h, v21.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v86.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v87.h, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v99.h, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v97.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v98.h, v24.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v102.h, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v100.h, v26.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v101.h, v26.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v113.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v103.h, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v112.h, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v116.h, v29.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v114.h, v30.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v115.h, v30.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.h, v31.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v117.h, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v118.h, v32.h
; GFX11-TRUE16-NEXT: .LBB90_2: ; %Flow
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB90_4
; GFX11-TRUE16-NEXT: ; %bb.3: ; %cmp.true
; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff0000, v18
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v18, 16, v18
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v20
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v18, 0x40c00000, v18
; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v18, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v18
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v18, v18
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add3_u32 v37, v37, v18, 0x7fff
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v70, v37, v39 :: v_dual_add_f32 v33, 0x40c00000, v33
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v80, v37, v39 :: v_dual_add_f32 v33, 0x40c00000, v33
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfe_u32 v36, v33, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
; GFX11-TRUE16-NEXT: v_add3_u32 v36, v36, v33, 0x7fff
; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff0000, v17
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v70.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v80.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v55, v36, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v81, v36, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_dual_add_f32 v34, 0x40c00000, v34 :: v_dual_lshlrev_b32 v17, 16, v17
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v17, 0x40c00000, v17
@@ -160399,498 +159889,500 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_add3_u32 v18, v48, v34, 0x7fff
; GFX11-TRUE16-NEXT: v_add3_u32 v37, v50, v17, 0x7fff
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v71, v37, v51 :: v_dual_lshlrev_b32 v20, 16, v20
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_add_f32 v20, 0x40c00000, v20
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v82, v37, v51 :: v_dual_and_b32 v35, 0xffff0000, v20
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_lshlrev_b32 v20, 16, v20
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v34, v34
; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff0000, v11
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v71.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v82.h
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v20, 0x40c00000, v20
; GFX11-TRUE16-NEXT: v_bfe_u32 v36, v35, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v20
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v17, v18, v49, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfi_b32 v18, 0xffff, v33, v55
+; GFX11-TRUE16-NEXT: v_bfi_b32 v18, 0xffff, v33, v81
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v35
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v20, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v20
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v20, v20
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v35
; GFX11-TRUE16-NEXT: v_bfi_b32 v17, 0xffff, v34, v17
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v36, v35, 0x7fff
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v19
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v19, 16, v19
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v20, 0x7fff
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v62, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v74, 8, v18
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_add_f32 v19, 0x40c00000, v19
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v81, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v19
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v11, 16, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v79, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v90, 8, v18
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v83, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_lshlrev_b32 v19, 16, v19
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v22
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v22, 16, v22
-; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v19, 16, 1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v19, 0x40c00000, v19 :: v_dual_lshlrev_b32 v22, 16, v22
; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v80, v34, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v84, v34, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v83.h
+; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v19, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v19
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v22, 0x40c00000, v22
-; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v19, 0x7fff
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v19, v19
; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v81.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v79, 8, v17
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v84, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v94, 8, v17
+; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v19, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v22, 0x40c00000, v22 :: v_dual_cndmask_b32 v85, v33, v37
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v22, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v35, 0x40c00000, v35
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v84.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v85.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v20, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v35, 16, 1
; GFX11-TRUE16-NEXT: v_add3_u32 v20, v33, v22, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v33, 0x400000, v22
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v22, v22
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v86, v20, v33 :: v_dual_add_f32 v35, 0x40c00000, v35
+; GFX11-TRUE16-NEXT: v_bfi_b32 v20, 0xffff, v34, v84
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v86.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v35, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v35
-; GFX11-TRUE16-NEXT: v_add3_u32 v19, v19, v35, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v83, v20, v33, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX11-TRUE16-NEXT: v_bfi_b32 v20, 0xffff, v34, v80
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v82, v19, v39, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfi_b32 v19, 0xffff, v37, v36
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v24, 16, v24
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v83.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v46, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v60, 8, v20
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v24, 0x40c00000, v24
-; GFX11-TRUE16-NEXT: v_bfi_b32 v22, 0xffff, v22, v82
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v63, 8, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v36
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v72, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v75, 8, v20
+; GFX11-TRUE16-NEXT: v_add3_u32 v19, v19, v35, 0x7fff
; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v21
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v21, 16, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v181, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v40, 8, v22
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v87, v19, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfi_b32 v19, 0xffff, v37, v36
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v24
; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v21
-; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v33, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v33, 0x40c00000, v38 :: v_dual_lshlrev_b32 v24, 16, v24
+; GFX11-TRUE16-NEXT: v_bfi_b32 v22, 0xffff, v22, v87
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v21, 16, 1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v33, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v21
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v24, 0x40c00000, v24
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
-; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v33, 0x7fff
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v21, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v86, v34, v37 :: v_dual_and_b32 v37, 0xffff0000, v23
+; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v33, 0x7fff
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v36
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v56, 24, v22
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v96, v34, v37 :: v_dual_and_b32 v37, 0xffff0000, v23
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v24, 16, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v23, 16, v23
+; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v24, 16, 1
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v36, 16, 1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v58, 8, v22
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v21, v35, v38, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v24, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v24
+; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v24, 0x7fff
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v23, 0x40c00000, v23
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v24, v24
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v86.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v87, v34, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v26
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v23, 0x40c00000, v23 :: v_dual_lshlrev_b32 v26, 16, v26
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v96.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v77, 8, v19
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v97, v34, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
-; GFX11-TRUE16-NEXT: v_bfi_b32 v21, 0xffff, v35, v21
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v26, 0x40c00000, v26
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v37, 0x40c00000, v37
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v26
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v26, 16, v26
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v23, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v85, v33, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v98, v33, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfe_u32 v24, v37, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v36, 0x400000, v23
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v23, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v45, 8, v21
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v23, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v97, v34, v36, vcc_lo
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v37, 0x40c00000, v37 :: v_dual_add_f32 v34, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v97.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_bfe_u32 v24, v37, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v37
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
; GFX11-TRUE16-NEXT: v_add3_u32 v24, v24, v37, 0x7fff
-; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v34, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v26, 0x40c00000, v26
+; GFX11-TRUE16-NEXT: v_bfi_b32 v21, 0xffff, v35, v21
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v99, v34, v36, vcc_lo
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v34, 0x40c00000, v38
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v61, 8, v21
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v99.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v23, v24, v39, vcc_lo
-; GFX11-TRUE16-NEXT: v_add3_u32 v35, v37, v34, 0x7fff
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v26
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v26, v26
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v34, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
; GFX11-TRUE16-NEXT: v_bfi_b32 v23, 0xffff, v36, v23
; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v25
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v87.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v182, 8, v23
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v97.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v25, 16, v25
+; GFX11-TRUE16-NEXT: v_add3_u32 v35, v37, v34, 0x7fff
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v26
; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
-; GFX11-TRUE16-NEXT: v_bfi_b32 v24, 0xffff, v33, v85
+; GFX11-TRUE16-NEXT: v_bfi_b32 v24, 0xffff, v33, v98
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v26, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 24, v24
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v25, 0x40c00000, v25
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v26, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v46, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v183, 24, v24
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v26, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v26, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v177, 8, v24
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v98, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v42, 8, v24
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v100, v33, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v34, v34
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v25, 16, v25
-; GFX11-TRUE16-NEXT: v_add3_u32 v26, v26, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v96, v35, v38 :: v_dual_add_f32 v25, 0x40c00000, v25
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v28
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v28, 16, v28
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v98.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v25, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v25
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_add_f32 v28, 0x40c00000, v28
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v25, v25
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add3_u32 v26, v26, v36, 0x7fff
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v101, v35, v38, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v25, 0x7fff
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v25, v25
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v28
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v100.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v102, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_lshlrev_b32 v28, 16, v28
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v102.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v28, 0x40c00000, v28
; GFX11-TRUE16-NEXT: v_bfe_u32 v25, v35, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v35
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v101, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v26, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v27
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v27, 16, v27
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v28, 16, 1
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v28, v28
; GFX11-TRUE16-NEXT: v_add3_u32 v25, v25, v35, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v26, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v35
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v27, 0x40c00000, v27
; GFX11-TRUE16-NEXT: v_add3_u32 v26, v33, v28, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v33, 0x400000, v28
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v27
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v28, v28
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v27, 16, v27
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v101.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v100, v26, v33, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v103, v26, v33, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v27, 0x40c00000, v27
-; GFX11-TRUE16-NEXT: v_bfi_b32 v26, 0xffff, v34, v96
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v100.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v99, v25, v39, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfi_b32 v25, 0xffff, v37, v36
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v30
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v30, 16, v30
+; GFX11-TRUE16-NEXT: v_bfi_b32 v26, 0xffff, v34, v101
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v27, 16, 1
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v38
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v103.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v112, v25, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfi_b32 v25, 0xffff, v37, v36
+; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v27, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v27
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
-; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v33, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v27, v27
-; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v27, 0x7fff
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v30
+; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v33, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v30, 0x40c00000, v30
+; GFX11-TRUE16-NEXT: v_bfi_b32 v28, 0xffff, v28, v112
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v113, v34, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff0000, v29
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v30, 16, v30
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v33, 0x7fff
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v112, v34, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_bfi_b32 v28, 0xffff, v28, v99
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v26
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_add_f32 v37, 0x40c00000, v37
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v30, 0x40c00000, v30 :: v_dual_lshlrev_b32 v29, 16, v29
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v27, v35, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff0000, v29
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v29, 16, v29
+; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v36, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v36
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v30, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v30
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v30, v30
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v112.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v28
-; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v30, 0x7fff
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v25
-; GFX11-TRUE16-NEXT: v_bfi_b32 v27, 0xffff, v35, v27
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(1)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v103, v34, v38 :: v_dual_and_b32 v38, 0xffff0000, v32
+; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v36, 0x7fff
; GFX11-TRUE16-NEXT: v_add_f32_e32 v29, 0x40c00000, v29
+; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v30, 0x7fff
+; GFX11-TRUE16-NEXT: v_bfe_u32 v30, v37, 16, 1
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v113.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v28
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v114, v34, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v32
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v32, 16, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v27
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v29, 16, 1
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v102, v33, v39 :: v_dual_add_f32 v37, 0x40c00000, v37
; GFX11-TRUE16-NEXT: v_or_b32_e32 v36, 0x400000, v29
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v115, v33, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v29, v29
+; GFX11-TRUE16-NEXT: v_add3_u32 v30, v30, v37, 0x7fff
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v29, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v103.h
-; GFX11-TRUE16-NEXT: v_bfe_u32 v30, v37, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v37
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v32, 0x40c00000, v32
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v113, v34, v36, vcc_lo
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v114.h
+; GFX11-TRUE16-NEXT: v_bfi_b32 v27, 0xffff, v35, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v167, 24, v26
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v116, v34, v36, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT: v_add3_u32 v30, v30, v37, 0x7fff
; GFX11-TRUE16-NEXT: v_add_f32_e32 v34, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v113.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v179, 8, v26
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v116.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v29, v30, v39, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v34, 16, 1
-; GFX11-TRUE16-NEXT: v_bfi_b32 v30, 0xffff, v33, v102
-; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v32, 16, 1
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v32, v32
+; GFX11-TRUE16-NEXT: v_bfi_b32 v30, 0xffff, v33, v115
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v181, 8, v25
; GFX11-TRUE16-NEXT: v_bfi_b32 v29, 0xffff, v36, v29
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v31
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v32, 0x40c00000, v32 :: v_dual_lshlrev_b32 v31, 16, v31
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v37, v34, 0x7fff
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v30
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v31, 0x40c00000, v31
+; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v32, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v32
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v32, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v29
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v32, 0x7fff
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v31
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v30
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v115, v33, v37, vcc_lo
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_lshlrev_b32 v31, 16, v31
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v117, v33, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v34, v34
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v29
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v115.h
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v31, 0x40c00000, v31
-; GFX11-TRUE16-NEXT: v_bfe_u32 v32, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v114, v35, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v31, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v31
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v31, v31
-; GFX11-TRUE16-NEXT: v_add3_u32 v32, v32, v36, 0x7fff
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v117.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_4) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v31, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v116, v33, v37 :: v_dual_and_b32 v35, 0xffff0000, v2
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_lshlrev_b32 v2, 16, v2
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v118, v35, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v2
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v31, v31
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v35, 0x40c00000, v35
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v32, v36, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v119, v33, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v116.h
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v2, 0x40c00000, v2
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v31, v35, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v35
+; GFX11-TRUE16-NEXT: v_add3_u32 v32, v32, v36, 0x7fff
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v119.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add3_u32 v31, v31, v35, 0x7fff
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v2, 16, v2
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v32, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v1
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, 16, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v2, 0x40c00000, v2 :: v_dual_lshlrev_b32 v1, 16, v1
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v1, 0x40c00000, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v2, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX11-TRUE16-NEXT: v_add3_u32 v31, v31, v35, 0x7fff
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v35
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v1, 0x40c00000, v1
; GFX11-TRUE16-NEXT: v_add3_u32 v32, v33, v2, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v33, 0x400000, v2
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v133, v32, v33, vcc_lo
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v38
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v128, v32, v33, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX11-TRUE16-NEXT: v_bfi_b32 v32, 0xffff, v34, v114
+; GFX11-TRUE16-NEXT: v_bfi_b32 v32, 0xffff, v34, v118
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v1, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v132, v31, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v38
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v128.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v129, v31, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_bfi_b32 v31, 0xffff, v37, v36
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v4
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, 16, v4
-; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v33, 16, 1
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v1, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v1, v1
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v4
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, 16, v4
+; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v33, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
-; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v33, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v133.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 24, v32
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v146, v34, v37 :: v_dual_and_b32 v37, 0xffff0000, v3
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v4, 0x40c00000, v4
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v131, v34, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, 16, v3
-; GFX11-TRUE16-NEXT: v_bfi_b32 v2, 0xffff, v2, v132
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v37, 0x40c00000, v37
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v4, 0x40c00000, v4
+; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v33, 0x7fff
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff0000, v3
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_lshlrev_b32 v3, 16, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v4, 16, 1
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v1, v35, v38 :: v_dual_add_f32 v36, 0x40c00000, v36
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v1, v35, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v4
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v4, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v4, 0x7fff
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v3, 0x40c00000, v3
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v36, 16, 1
+; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v4, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_bfe_u32 v4, v37, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v148, v34, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v3, 16, 1
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v35.l, v131.h
+; GFX11-TRUE16-NEXT: v_bfi_b32 v2, 0xffff, v2, v129
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v36, 0x7fff
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v133, v34, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v6
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v3, 0x40c00000, v3
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v6, 16, v6
+; GFX11-TRUE16-NEXT: v_bfi_b32 v1, 0xffff, v35, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v105, 24, v2
+; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v3, 16, 1
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v135, v33, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_or_b32_e32 v36, 0x400000, v3
-; GFX11-TRUE16-NEXT: v_add3_u32 v4, v4, v37, 0x7fff
-; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v3, 0x7fff
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v6
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v144, v33, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v3, v3
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v6, 0x40c00000, v6
+; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v3, 0x7fff
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v106, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v111, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v32
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v146, v34, v36 :: v_dual_add_f32 v37, 0x40c00000, v37
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v31
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v36.l, v146.h
+; GFX11-TRUE16-NEXT: v_bfe_u32 v4, v37, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v37
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v33.l, v148.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v35.l, v146.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v105, 24, v2
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v164, v34, v36, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v34, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: v_bfi_b32 v1, 0xffff, v35, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v107, 8, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v36.l, v164.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v3, v4, v39, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v34, 16, 1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add3_u32 v4, v4, v37, 0x7fff
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v3, v4, v39 :: v_dual_add_f32 v34, 0x40c00000, v38
; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff0000, v7
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v7, 16, v7
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfi_b32 v3, 0xffff, v36, v3
; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v5
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v33.l, v133.h
+; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v34, 16, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, 16, v5
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v6, 16, v6
-; GFX11-TRUE16-NEXT: v_bfi_b32 v4, 0xffff, v33, v144
-; GFX11-TRUE16-NEXT: v_add3_u32 v35, v37, v34, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_add_f32 v5, 0x40c00000, v5
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v7, 0x40c00000, v7
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v90, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v94, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v106, 8, v3
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v6, v6
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
+; GFX11-TRUE16-NEXT: v_bfi_b32 v4, 0xffff, v33, v135
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v6, 16, 1
+; GFX11-TRUE16-NEXT: v_add3_u32 v35, v37, v34, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v6
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v6, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v108, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v32
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v5, 0x40c00000, v5
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v6, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v6, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v31
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v165, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v93, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v95, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v104, 8, v3
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v150, v33, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v34, v34
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v5, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v5
; GFX11-TRUE16-NEXT: v_add3_u32 v6, v6, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v34.l, v165.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v161, v35, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v151, v35, v38, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v5, 0x7fff
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v5, v5
; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v8
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v8, 16, v8
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v180, v33, v37 :: v_dual_add_f32 v35, 0x40c00000, v35
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v8, 0x40c00000, v8
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v34.l, v150.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v166, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_lshlrev_b32 v8, 16, v8
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v37.l, v166.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v37.l, v180.h
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v8, 0x40c00000, v8
; GFX11-TRUE16-NEXT: v_bfe_u32 v5, v35, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v8, 16, 1
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v6, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v8, v8
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v35
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v8, 16, 1
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v8, v8
; GFX11-TRUE16-NEXT: v_add3_u32 v5, v5, v35, 0x7fff
; GFX11-TRUE16-NEXT: v_add3_u32 v6, v33, v8, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v33, 0x400000, v8
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v179, v6, v33, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v177, v6, v33, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v7, 16, 1
-; GFX11-TRUE16-NEXT: v_bfi_b32 v6, 0xffff, v34, v161
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v179.h
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v178, v5, v38 :: v_dual_add_f32 v33, 0x40c00000, v39
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v39
+; GFX11-TRUE16-NEXT: v_bfi_b32 v6, 0xffff, v34, v151
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v178, v5, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_bfi_b32 v5, 0xffff, v37, v36
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff0000, v9
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v36, 16, v10
-; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v7, 0x7fff
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v7, 0x40c00000, v7 :: v_dual_lshlrev_b32 v36, 16, v10
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v33, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v177.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
+; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v7, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v7
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v33, 0x7fff
; GFX11-TRUE16-NEXT: v_bfi_b32 v8, 0xffff, v8, v178
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v47, v35, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v7, 0x7fff
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff0000, v10
; GFX11-TRUE16-NEXT: v_bfe_u32 v7, v36, 16, 1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v76, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v60, 24, v8
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v43, v35, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v59, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v72, 8, v8
; GFX11-TRUE16-NEXT: v_add3_u32 v7, v7, v36, 0x7fff
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v36
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v63, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v78, 8, v6
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v33, v34, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v47.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v75, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v88, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v91, 8, v5
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v44, v7, v37, vcc_lo
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v7, 16, v9
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v9, 0x40c00000, v39
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v7
-; GFX11-TRUE16-NEXT: v_bfe_u32 v36, v9, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v9
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v49, 0x400000, v7
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff0000, v10
-; GFX11-TRUE16-NEXT: v_add3_u32 v36, v36, v9, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v10, 0x40c00000, v10
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v43.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v88, 8, v5
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v41, v7, v37, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v10, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v10
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v10, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v10, 0x7fff
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v41, v35, v38 :: v_dual_lshlrev_b32 v10, 16, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v10, 16, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v44, v35, v38 :: v_dual_and_b32 v39, 0xffff0000, v9
; GFX11-TRUE16-NEXT: v_add_f32_e32 v37, 0x40c00000, v10
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v44.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v41.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v38, v37, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v50, 0x400000, v37
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_bfi_b32 v10, 0xffff, v35, v41
-; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v7, 16, 1
+; GFX11-TRUE16-NEXT: v_bfi_b32 v10, 0xffff, v35, v44
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_add3_u32 v38, v38, v37, 0x7fff
; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff0000, v12
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v37, 0x40c00000, v51
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v42, 24, v10
-; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v7, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v61, v38, v50 :: v_dual_add_f32 v12, 0x40c00000, v12
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v45, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v47, 8, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v59, v38, v50, vcc_lo
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v12, 0x40c00000, v12 :: v_dual_lshlrev_b32 v7, 16, v9
; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v14
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v14, 16, v14
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v61.h
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v37, 0x40c00000, v51 :: v_dual_lshlrev_b32 v14, 16, v14
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v7
; GFX11-TRUE16-NEXT: v_bfe_u32 v48, v12, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v52, 0x400000, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v14, 0x40c00000, v14
+; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v7, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v49, 0x400000, v7
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
+; GFX11-TRUE16-NEXT: v_add3_u32 v48, v48, v12, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v7, 0x7fff
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v59.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v73, v35, v49, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v12, v12
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v9, 0x40c00000, v39
; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v37, 16, 1
-; GFX11-TRUE16-NEXT: v_add3_u32 v48, v48, v12, 0x7fff
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v14, 0x40c00000, v14 :: v_dual_lshlrev_b32 v11, 16, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v56, 8, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v57, v48, v52, vcc_lo
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v9, v9
; GFX11-TRUE16-NEXT: v_bfe_u32 v49, v14, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_bfi_b32 v12, 0xffff, v7, v57
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v9, v36, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v62, v48, v52, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v36, v9, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v9
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v9, v9
+; GFX11-TRUE16-NEXT: v_bfi_b32 v12, 0xffff, v7, v62
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add3_u32 v36, v36, v9, 0x7fff
; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v11
; GFX11-TRUE16-NEXT: v_add3_u32 v11, v35, v37, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v35, 0x400000, v37
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v180, 24, v12
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v9, v36, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v73.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v39, v7, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v167, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v183, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v182, 8, v12
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v11, v11, v35, vcc_lo
; GFX11-TRUE16-NEXT: v_add_f32_e32 v35, 0x40c00000, v38
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v37, v39, v7, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v7
; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff0000, v13
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
; GFX11-TRUE16-NEXT: v_bfe_u32 v48, v35, 16, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v13, 16, v13
-; GFX11-TRUE16-NEXT: v_bfi_b32 v9, 0xffff, v36, v9
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v39
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v89, v37, v38, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v7, 0x40c00000, v39 :: v_dual_cndmask_b32 v92, v37, v38
; GFX11-TRUE16-NEXT: v_add3_u32 v37, v48, v35, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v35
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
@@ -160898,18 +160390,18 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_or_b32_e32 v48, 0x400000, v14
; GFX11-TRUE16-NEXT: v_bfe_u32 v49, v7, 16, 1
; GFX11-TRUE16-NEXT: v_add_f32_e32 v13, 0x40c00000, v13
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v77, v37, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v89, v37, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v14, v14
; GFX11-TRUE16-NEXT: v_or_b32_e32 v35, 0x400000, v7
; GFX11-TRUE16-NEXT: v_add3_u32 v14, v49, v7, 0x7fff
; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff0000, v16
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v16, 16, v16
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v78, v39, v48, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v91, v39, v48, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
; GFX11-TRUE16-NEXT: v_bfe_u32 v39, v13, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v16, 0x40c00000, v16
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v78.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v73.h
; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v7, v14, v35 :: v_dual_add_f32 v14, 0x40c00000, v37
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v37, 16, v15
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v39, v13, 0x7fff
@@ -160919,7 +160411,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_add_f32_e32 v37, 0x40c00000, v37
; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff0000, v15
; GFX11-TRUE16-NEXT: v_or_b32_e32 v49, 0x400000, v16
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v104, v35, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v108, v35, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v13, v13, v16, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v39, v37, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v16, v16
@@ -160927,405 +160419,366 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_add_f32_e32 v15, 0x40c00000, v15
; GFX11-TRUE16-NEXT: v_or_b32_e32 v51, 0x400000, v37
; GFX11-TRUE16-NEXT: v_add3_u32 v39, v39, v37, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v93, v13, v49, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v107, v13, v49, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v48, v14, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v48, 0x400000, v14
; GFX11-TRUE16-NEXT: v_bfe_u32 v50, v15, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, 0x400000, v15
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v95, v39, v51, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v110, v39, v51, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v14, v14
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v104.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v108.h
; GFX11-TRUE16-NEXT: v_add3_u32 v13, v50, v15, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v89.h
-; GFX11-TRUE16-NEXT: v_bfi_b32 v14, 0xffff, v38, v77
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v92, v35, v48, vcc_lo
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v91.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v92.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v109, v35, v48, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v15, v15
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v95.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v93.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v110.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v107.h
+; GFX11-TRUE16-NEXT: v_bfi_b32 v14, 0xffff, v38, v89
; GFX11-TRUE16-NEXT: v_bfi_b32 v11, 0xffff, v39, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v14
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v13, v13, v16, vcc_lo
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v14
-; GFX11-TRUE16-NEXT: v_bfi_b32 v16, 0xffff, v35, v92
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_bfi_b32 v9, 0xffff, v36, v9
+; GFX11-TRUE16-NEXT: v_bfi_b32 v16, 0xffff, v35, v109
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
; GFX11-TRUE16-NEXT: v_bfi_b32 v15, 0xffff, v15, v13
; GFX11-TRUE16-NEXT: v_bfi_b32 v13, 0xffff, v37, v7
; GFX11-TRUE16-NEXT: v_bfi_b32 v7, 0xffff, v34, v33
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v176, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v43, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v58, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v76, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v40, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v57, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v74, 8, v7
; GFX11-TRUE16-NEXT: .LBB90_4: ; %end
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v146.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v108.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v131.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v111.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v133.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v107.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v68.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v132.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v2.h, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v129.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v1.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v128.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v6.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v106.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v164.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v1.h, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v105.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v94.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v91.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v148.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v67.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v8, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v180.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v90.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v144.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v14
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v4.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v165.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v88.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v8, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v5.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v47.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v76.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v58.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v14
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v75.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v161.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v179.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v72.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v6.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v105.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v2.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v3.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v146.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v104.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v95.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v2
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v135.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v3.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v133.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v93.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v68.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v5.l, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v166.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v88.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v5.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v6, v4
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v78.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v150.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v151.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v76.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v66, v6, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v43.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v7.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v74.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v65.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v73.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v7.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v178.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v59.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v14
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v56.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v8.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v44.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v43.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v89.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v41.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v42.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v16, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v10.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v61.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v183.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v16, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v11.l, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v104.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v176.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v166.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v16, v14
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v167.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v57.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v78.h
-; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
-; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v12.l, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v12.h, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v16, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v160.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v77.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v95.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v150.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v64, v18, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v93.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v149.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v65, v18, v14
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v71.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v79.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v92.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v134.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v13.h, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v66, v18, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v70.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v74.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v67, v18, v14
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v46.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v84.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v63.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v16, v14
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v62.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v81.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v60.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v13.h, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v20, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v86.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v80.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v45.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v19, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v18.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v13.h, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v83.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v40.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v18.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v19, v14
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v38.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v97.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v182.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v82.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v181.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v13.h, v19.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v24
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v87.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v177.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v22, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v13.h, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v101.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v163.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v22, v14
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v85.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v162.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v98.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.h, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v26, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v28
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v112.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v145.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v25, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v26
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.h, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v100.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v135.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v25, v14
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v35.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v113.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v131.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v99.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.l, 8, v130.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v13.h, v25.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v28, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v30
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v103.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v62.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v67, v6, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v177.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v7.l, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v63.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v178.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v60.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v180.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v68, v6, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v73.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v7.l, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v57.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v69, v6, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v41.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v47.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v44.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v45.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v6, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v92.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v40.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v6, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v59.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v182.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v6, v9
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v89.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v108.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v176.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.l, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v6, v10
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v163.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v91.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v165.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v6, v11
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v110.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v109.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v6, v12
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v107.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.l, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v6, v13
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v82.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v94.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v6, v14
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v79.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v90.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v6, v15
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v85.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v77.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v84.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v6, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v72.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v75.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v6, v17
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v96.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v61.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v87.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v6, v18
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v56.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v86.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v58.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.l, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v6, v19
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v99.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v46.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v98.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v6, v20
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v183.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v97.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v42.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v6, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v102.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v181.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v101.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v6, v22
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v167.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.l, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v100.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v179.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.l, v24.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v6, v23
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v113.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v164.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v112.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v6, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v103.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.l, v26.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v6, v25
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v116.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v148.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v28, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v30
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v13.h, v27.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v116.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v128.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v27.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v28, v14
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v102.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.l, 8, v119.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v32, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v115.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v118.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v28.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v115.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v6, v26
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.l, v27.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v114.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v145.l
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v6, v27
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v13.h, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v32, v14
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v31, 0xffff, v34
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v114.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v30, 0xffff, v30
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v31, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v119.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v134.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v118.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v6, v28
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v130.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v117.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v132.l
+; GFX11-TRUE16-NEXT: s_clause 0x1
+; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
+; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[66:69], off offset:16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v6, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v30.l, v30.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v6.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v30, v14
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v6, v5
; GFX11-TRUE16-NEXT: s_clause 0x5
-; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
-; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[64:67], off offset:48
+; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[7:10], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[11:14], off offset:48
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[15:18], off offset:64
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[19:22], off offset:80
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[23:26], off offset:96
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[27:30], off offset:112
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_load_b32 v108, off, s32 offset:12
-; GFX11-TRUE16-NEXT: scratch_load_b32 v107, off, s32 offset:16
-; GFX11-TRUE16-NEXT: scratch_load_b32 v106, off, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_load_b32 v105, off, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_load_b32 v104, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_b32 v95, off, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_load_b32 v94, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_b32 v93, off, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_load_b32 v92, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_b32 v91, off, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_load_b32 v90, off, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_load_b32 v89, off, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_load_b32 v88, off, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_load_b32 v79, off, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_load_b32 v78, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_b32 v77, off, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_load_b32 v76, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_b32 v75, off, s32 offset:80
-; GFX11-TRUE16-NEXT: scratch_load_b32 v74, off, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_load_b32 v73, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_b32 v72, off, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_load_b32 v63, off, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_load_b32 v62, off, s32 offset:100
-; GFX11-TRUE16-NEXT: scratch_load_b32 v61, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_b32 v60, off, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_load_b32 v59, off, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_load_b32 v58, off, s32 offset:116
-; GFX11-TRUE16-NEXT: scratch_load_b32 v57, off, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_load_b32 v56, off, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_load_b32 v47, off, s32 offset:128
-; GFX11-TRUE16-NEXT: scratch_load_b32 v46, off, s32 offset:132
-; GFX11-TRUE16-NEXT: scratch_load_b32 v45, off, s32 offset:136
-; GFX11-TRUE16-NEXT: s_clause 0x4
-; GFX11-TRUE16-NEXT: scratch_load_b32 v44, off, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_load_b32 v43, off, s32 offset:144
-; GFX11-TRUE16-NEXT: scratch_load_b32 v42, off, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_load_b32 v41, off, s32 offset:152
-; GFX11-TRUE16-NEXT: scratch_load_b32 v40, off, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_load_b32 v111, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_b32 v110, off, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_load_b32 v109, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_b32 v108, off, s32 offset:24
+; GFX11-TRUE16-NEXT: scratch_load_b32 v107, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_b32 v106, off, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_load_b32 v105, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_b32 v104, off, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_load_b32 v95, off, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_load_b32 v94, off, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_load_b32 v93, off, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_load_b32 v92, off, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_load_b32 v91, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_b32 v90, off, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_load_b32 v89, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_b32 v88, off, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_load_b32 v79, off, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_load_b32 v78, off, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_load_b32 v77, off, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_load_b32 v76, off, s32 offset:88
+; GFX11-TRUE16-NEXT: scratch_load_b32 v75, off, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_load_b32 v74, off, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_load_b32 v73, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_b32 v72, off, s32 offset:104
+; GFX11-TRUE16-NEXT: scratch_load_b32 v63, off, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_load_b32 v62, off, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_load_b32 v61, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_b32 v60, off, s32 offset:120
+; GFX11-TRUE16-NEXT: scratch_load_b32 v59, off, s32 offset:124
+; GFX11-TRUE16-NEXT: scratch_load_b32 v58, off, s32 offset:128
+; GFX11-TRUE16-NEXT: scratch_load_b32 v57, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_b32 v56, off, s32 offset:136
+; GFX11-TRUE16-NEXT: s_clause 0x7
+; GFX11-TRUE16-NEXT: scratch_load_b32 v47, off, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_load_b32 v46, off, s32 offset:144
+; GFX11-TRUE16-NEXT: scratch_load_b32 v45, off, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_load_b32 v44, off, s32 offset:152
+; GFX11-TRUE16-NEXT: scratch_load_b32 v43, off, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_load_b32 v42, off, s32 offset:160
+; GFX11-TRUE16-NEXT: scratch_load_b32 v41, off, s32 offset:164
+; GFX11-TRUE16-NEXT: scratch_load_b32 v40, off, s32 offset:168
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -185302,69 +184755,69 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_b32 v33, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_b32 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: scratch_load_b32 v31, off, s32
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr166_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr165_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr164_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr163_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v33
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
@@ -185375,69 +184828,69 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; %bb.1: ; %cmp.false
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v17
; GFX11-TRUE16-NEXT: .LBB94_2: ; %Flow
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB94_4
@@ -185446,405 +184899,364 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_pk_add_f16 v32, 0x200, v32 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_pk_add_f16 v31, 0x200, v31 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v30, 0x200, v30 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v29, 0x200, v29 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v10, 0x200, v10 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v9, 0x200, v9 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v28, 0x200, v28 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v27, 0x200, v27 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v8, 0x200, v8 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v7, 0x200, v7 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v26, 0x200, v26 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v25, 0x200, v25 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v30, 0x200, v30 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v29, 0x200, v29 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v6, 0x200, v6 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v5, 0x200, v5 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v16, 0x200, v16 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v15, 0x200, v15 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v24, 0x200, v24 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v23, 0x200, v23 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v28, 0x200, v28 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v27, 0x200, v27 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v4, 0x200, v4 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v3, 0x200, v3 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v10, 0x200, v10 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v12, 0x200, v12 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v14, 0x200, v14 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v13, 0x200, v13 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v11, 0x200, v11 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v9, 0x200, v9 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v18, 0x200, v18 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v17, 0x200, v17 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v20, 0x200, v20 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v19, 0x200, v19 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v22, 0x200, v22 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v21, 0x200, v21 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v24, 0x200, v24 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v23, 0x200, v23 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v26, 0x200, v26 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v25, 0x200, v25 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v2, 0x200, v2 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v1, 0x200, v1 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v16, 0x200, v16 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v15, 0x200, v15 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v17
; GFX11-TRUE16-NEXT: .LBB94_4: ; %end
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v163.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v166.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v165.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v164.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v163.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v69.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v1.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v68.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v68, 0xffff, v68
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v54, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v67.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v68, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v67.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v67
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v4.l, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v67.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v67
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v67.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v66
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v65.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v66
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v7.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v131.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v65
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v65
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v118.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v116.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v103.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v51
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v97.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v49, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v49, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v84.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v48, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v145.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v18.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v19.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v128.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v21.l, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v114.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v83.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v22.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v113.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v81.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v24.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v51
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v25.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v86.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v27.l, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v82.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v28.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v81.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v71.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v30.l, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v51
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v31.l, v31.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v55.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v32.l, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v51
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
@@ -208055,69 +207467,69 @@ define <128 x i8> @bitcast_v64i16_to_v128i8(<64 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_b32 v33, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_b32 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: scratch_load_b32 v31, off, s32
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr166_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr165_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr164_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr163_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v33
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
@@ -208128,69 +207540,69 @@ define <128 x i8> @bitcast_v64i16_to_v128i8(<64 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; %bb.1: ; %cmp.false
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v17
; GFX11-TRUE16-NEXT: .LBB98_2: ; %Flow
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB98_4
@@ -208199,405 +207611,364 @@ define <128 x i8> @bitcast_v64i16_to_v128i8(<64 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_pk_add_u16 v32, v32, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_pk_add_u16 v31, v31, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v30, v30, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v29, v29, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v10, v10, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v9, v9, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v28, v28, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v27, v27, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v8, v8, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v7, v7, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v26, v26, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v25, v25, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v30, v30, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v29, v29, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v6, v6, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v5, v5, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v16, v16, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v15, v15, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v24, v24, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v23, v23, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v28, v28, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v27, v27, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v4, v4, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v3, v3, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v10, v10, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v12, v12, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v14, v14, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v13, v13, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v11, v11, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v9, v9, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v18, v18, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v17, v17, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v20, v20, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v19, v19, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v22, v22, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v21, v21, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v24, v24, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v23, v23, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v26, v26, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v25, v25, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v2, v2, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v1, v1, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v16, v16, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v15, v15, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v17
; GFX11-TRUE16-NEXT: .LBB98_4: ; %end
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v163.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v166.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v165.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v164.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v163.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v69.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v1.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v68.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v68, 0xffff, v68
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v54, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v67.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v68, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v67.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v67
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v4.l, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v67.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v67
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v67.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v66
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v65.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v66
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v7.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v131.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v65
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v65
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v118.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v116.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v54, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v64
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v103.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v54
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v51
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v97.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v49, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v49, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v52
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v84.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v39
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v48, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v145.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v18.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v19.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v128.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v21.l, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v114.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v83.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v22.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v113.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v39
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v81.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v24.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v51
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v25.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v86.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v27.l, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v82.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v28.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v81.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v51
-; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v71.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v30.l, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v51
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v31.l, v31.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v55.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v51
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v32.l, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v51
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
index 3e96ab1d597d6..21ec3ee1996a6 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
@@ -4118,19 +4118,19 @@ define <4 x i32> @bitcast_v16i8_to_v4i32(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -4144,103 +4144,95 @@ define <4 x i32> @bitcast_v16i8_to_v4i32(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB26_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB26_2
; GFX11-TRUE16-NEXT: .LBB26_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -8592,19 +8584,19 @@ define <4 x float> @bitcast_v16i8_to_v4f32(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -8618,103 +8610,95 @@ define <4 x float> @bitcast_v16i8_to_v4f32(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB50_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB50_2
; GFX11-TRUE16-NEXT: .LBB50_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -12682,19 +12666,19 @@ define <2 x i64> @bitcast_v16i8_to_v2i64(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -12708,103 +12692,95 @@ define <2 x i64> @bitcast_v16i8_to_v2i64(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB70_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB70_2
; GFX11-TRUE16-NEXT: .LBB70_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -16382,19 +16358,19 @@ define <2 x double> @bitcast_v16i8_to_v2f64(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -16408,103 +16384,95 @@ define <2 x double> @bitcast_v16i8_to_v2f64(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB86_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB86_2
; GFX11-TRUE16-NEXT: .LBB86_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -19811,19 +19779,19 @@ define <8 x i16> @bitcast_v16i8_to_v8i16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -19837,103 +19805,95 @@ define <8 x i16> @bitcast_v16i8_to_v8i16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB98_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB98_2
; GFX11-TRUE16-NEXT: .LBB98_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -22725,19 +22685,19 @@ define <8 x half> @bitcast_v16i8_to_v8f16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -22751,103 +22711,95 @@ define <8 x half> @bitcast_v16i8_to_v8f16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB106_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB106_2
; GFX11-TRUE16-NEXT: .LBB106_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -24944,19 +24896,19 @@ define <8 x bfloat> @bitcast_v16i8_to_v8bf16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -24970,103 +24922,95 @@ define <8 x bfloat> @bitcast_v16i8_to_v8bf16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB110_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB110_2
; GFX11-TRUE16-NEXT: .LBB110_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll
index f8ffaa456c2b3..38302a75fe26d 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll
@@ -6296,32 +6296,31 @@ define <8 x i32> @bitcast_v32i8_to_v8i32(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v19.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v22.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v23.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v31.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v32
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB26_3
@@ -6333,194 +6332,175 @@ define <8 x i32> @bitcast_v32i8_to_v8i32(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB26_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v0.h, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v1.h, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v2.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v11.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v12, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v4.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v0.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v1.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v2.l, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v3.l, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v4.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v5.l, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v9.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v6.l, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v7.l, v8.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v5.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v8.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v6.l, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v7.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v19
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB26_2
; GFX11-TRUE16-NEXT: .LBB26_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v21.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v19.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v20.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v18.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v17.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v16.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v15.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v17.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v19.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v18.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v17.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v15.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v13.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v15.h, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v16.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v14.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.l, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v14.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v13.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v12.l, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v12.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v18.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v11.h, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v12.h, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v13.l, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v11.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v11.h, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v10.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v21
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v11.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v9.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v10.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v9.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v9.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v9.h, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v8.h, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v21
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v21
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v8, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v8.h, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -13335,32 +13315,31 @@ define <8 x float> @bitcast_v32i8_to_v8f32(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v19.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v22.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v23.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v31.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v32
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB50_3
@@ -13372,194 +13351,175 @@ define <8 x float> @bitcast_v32i8_to_v8f32(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB50_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v0.h, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v1.h, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v2.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v11.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v12, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v4.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v0.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v1.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v2.l, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v3.l, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v4.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v5.l, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v9.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v6.l, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v7.l, v8.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v5.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v8.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v6.l, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v7.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v19
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB50_2
; GFX11-TRUE16-NEXT: .LBB50_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v21.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v19.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v20.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v18.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v17.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v16.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v15.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v17.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v19.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v18.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v17.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v15.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v13.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v15.h, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v16.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v14.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.l, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v14.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v13.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v12.l, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v12.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v18.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v11.h, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v12.h, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v13.l, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v11.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v11.h, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v10.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v21
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v11.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v9.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v10.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v9.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v9.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v9.h, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v8.h, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v21
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v21
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v8, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v8.h, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -19892,32 +19852,31 @@ define <4 x i64> @bitcast_v32i8_to_v4i64(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v19.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v22.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v23.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v31.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v32
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB70_3
@@ -19929,194 +19888,175 @@ define <4 x i64> @bitcast_v32i8_to_v4i64(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB70_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v0.h, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v1.h, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v2.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v11.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v12, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v4.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v0.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v1.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v2.l, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v3.l, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v4.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v5.l, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v9.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v6.l, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v7.l, v8.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v5.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v8.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v6.l, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v7.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v19
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB70_2
; GFX11-TRUE16-NEXT: .LBB70_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v21.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v19.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v20.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v18.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v17.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v16.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v15.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v17.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v19.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v18.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v17.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v15.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v13.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v15.h, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v16.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v14.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.l, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v14.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v13.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v12.l, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v12.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v18.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v11.h, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v12.h, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v13.l, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v11.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v11.h, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v10.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v21
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v11.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v9.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v10.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v9.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v9.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v9.h, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v8.h, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v21
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v21
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v8, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v8.h, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -25939,32 +25879,31 @@ define <4 x double> @bitcast_v32i8_to_v4f64(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v19.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v22.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v23.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v31.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v32
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB86_3
@@ -25976,194 +25915,175 @@ define <4 x double> @bitcast_v32i8_to_v4f64(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB86_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v0.h, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v1.h, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v2.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v11.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v12, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v4.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v0.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v1.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v2.l, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v3.l, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v4.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v5.l, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v9.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v6.l, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v7.l, v8.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v5.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v8.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v6.l, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v7.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v19
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB86_2
; GFX11-TRUE16-NEXT: .LBB86_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v21.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v19.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v20.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v18.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v17.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v16.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v15.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v17.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v19.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v18.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v17.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v15.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v13.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v15.h, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v16.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v14.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.l, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v14.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v13.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v12.l, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v12.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v18.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v11.h, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v12.h, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v13.l, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v11.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v11.h, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v10.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v21
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v11.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v9.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v10.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v9.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v9.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v9.h, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v8.h, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v21
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v21
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v8, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v8.h, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll
index 0cefbc1c2dee5..436b1a038b274 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll
@@ -2966,20 +2966,20 @@ define <40 x i8> @bitcast_v10i32_to_v40i8(<10 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -2995,17 +2995,17 @@ define <40 x i8> @bitcast_v10i32_to_v40i8(<10 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB12_2: ; %Flow
@@ -3029,17 +3029,17 @@ define <40 x i8> @bitcast_v10i32_to_v40i8(<10 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB12_4: ; %end
@@ -3047,105 +3047,93 @@ define <40 x i8> @bitcast_v10i32_to_v40i8(<10 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v10i32_to_v40i8:
@@ -5038,48 +5026,49 @@ define <10 x i32> @bitcast_v40i8_to_v10i32(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:24
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v25.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v23.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v21.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v30.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v29.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v28.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v35.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v30.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v29.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(8)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v33.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v33.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(7)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v33.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(6)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v34.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(5)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v34.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(4)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v35.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v35.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v36
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB14_3
@@ -5092,245 +5081,217 @@ define <10 x i32> @bitcast_v40i8_to_v10i32(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB14_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v25.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v0.h, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v19.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v1.h, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v2.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v3.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v0.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v19.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v27, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v1.h, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v27, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v2.l, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v27, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v3.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v27, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v4.l, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v27, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v5.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v31.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v27, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v6.l, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v27.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v4.l, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v25
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v5.l, v13.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v13.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v6.l, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v7.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v25
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v8.l, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v25
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v9.l, v10.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v10.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v27, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v7.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v27.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v27, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v8.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v27.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v27, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v9.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v27.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v25
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v27, v9
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB14_2
; GFX11-TRUE16-NEXT: .LBB14_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v26.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v25.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v25.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v22.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v23.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v22.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v21.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v18.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v24.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v23.l, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v20.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v21.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v24.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v25.l, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v23.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v23.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v19.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v19.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v25, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v16.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v20.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v21.l, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v21.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v19.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v16.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v17.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v27
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v19.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v27
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.h, v3.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v27
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v17.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v18.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v25.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v17.h, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v18.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v25, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v15.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v15.h, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v14.h, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v15.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v14.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v13.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v25, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v14.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v14.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v25, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v13.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v13.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v13.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v12.h, v6.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v27
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v25, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v12.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v12.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v32.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v32.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v27
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v12.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v25, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v11.l, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v31.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v11.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v11.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v11.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v31.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v10.h, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v27
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v8.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v10.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v27
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v27
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v25, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v10.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.h, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v25, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v25, v9
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -9951,20 +9912,20 @@ define <40 x i8> @bitcast_v10f32_to_v40i8(<10 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -9980,17 +9941,17 @@ define <40 x i8> @bitcast_v10f32_to_v40i8(<10 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB32_2: ; %Flow
@@ -10010,17 +9971,17 @@ define <40 x i8> @bitcast_v10f32_to_v40i8(<10 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[13:14], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[14:15], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB32_4: ; %end
@@ -10028,105 +9989,93 @@ define <40 x i8> @bitcast_v10f32_to_v40i8(<10 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v10f32_to_v40i8:
@@ -12037,48 +11986,49 @@ define <10 x float> @bitcast_v40i8_to_v10f32(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:24
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v25.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v23.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v21.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v30.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v29.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v28.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v35.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v30.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v29.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(8)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v33.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v33.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(7)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v33.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(6)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v34.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(5)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v34.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(4)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v35.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v35.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v36
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB34_3
@@ -12091,245 +12041,217 @@ define <10 x float> @bitcast_v40i8_to_v10f32(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB34_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v25.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v0.h, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v19.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v1.h, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v2.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v3.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v0.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v19.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v27, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v1.h, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v27, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v2.l, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v27, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v3.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v27, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v4.l, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v27, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v5.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v31.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v27, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v6.l, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v27.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v4.l, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v25
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v5.l, v13.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v13.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v6.l, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v7.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v25
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v8.l, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v25
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v9.l, v10.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v10.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v27, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v7.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v27.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v27, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v8.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v27.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v27, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v9.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v27.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v25
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v27, v9
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB34_2
; GFX11-TRUE16-NEXT: .LBB34_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v26.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v25.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v25.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v22.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v23.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v22.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v21.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v18.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v24.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v23.l, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v20.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v21.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v24.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v25.l, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v23.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v23.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v19.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v19.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v25, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v16.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v20.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v21.l, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v21.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v19.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v16.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v17.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v27
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v19.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v27
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.h, v3.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v27
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v17.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v18.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v25.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v17.h, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v18.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v25, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v15.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v15.h, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v14.h, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v15.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v14.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v13.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v25, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v14.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v14.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v25, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v13.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v13.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v13.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v12.h, v6.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v27
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v25, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v12.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v12.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v32.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v32.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v27
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v12.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v25, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v11.l, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v31.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v11.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v11.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v11.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v31.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v10.h, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v27
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v8.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v10.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v27
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v27
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v25, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v10.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.h, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v25, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v25, v9
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -16358,20 +16280,20 @@ define <40 x i8> @bitcast_v20i16_to_v40i8(<20 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -16387,17 +16309,17 @@ define <40 x i8> @bitcast_v20i16_to_v40i8(<20 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB48_2: ; %Flow
@@ -16421,17 +16343,17 @@ define <40 x i8> @bitcast_v20i16_to_v40i8(<20 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB48_4: ; %end
@@ -16439,105 +16361,93 @@ define <40 x i8> @bitcast_v20i16_to_v40i8(<20 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v20i16_to_v40i8:
@@ -22479,20 +22389,20 @@ define <40 x i8> @bitcast_v20f16_to_v40i8(<20 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -22508,17 +22418,17 @@ define <40 x i8> @bitcast_v20f16_to_v40i8(<20 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB60_2: ; %Flow
@@ -22542,17 +22452,17 @@ define <40 x i8> @bitcast_v20f16_to_v40i8(<20 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB60_4: ; %end
@@ -22560,105 +22470,93 @@ define <40 x i8> @bitcast_v20f16_to_v40i8(<20 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v20f16_to_v40i8:
@@ -28859,50 +28757,51 @@ define <5 x double> @bitcast_v40i8_to_v5f64(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:24
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.h, v29.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v27.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v25.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, v21.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v16.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v38.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v38.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(8)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v36.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v36.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(7)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v36.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v36.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(6)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(5)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v37.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(4)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v38.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v49
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB72_3
@@ -28915,245 +28814,216 @@ define <5 x double> @bitcast_v40i8_to_v5f64(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB72_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v34.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v34.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v0.h, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v1.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v2.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v3.l, v23.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v0.l, v34.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v10, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v1.h, v33.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v10, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v2.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v10, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v3.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v26.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v10, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v4.l, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v10, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v5.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v31.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v6.l, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v10.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v4.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v21.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v5.l, v19.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v19.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v6.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v18.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v7.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v11, v10
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v8.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v16.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v9.l, v16.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v7.l, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v10.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v10.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v9.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v10
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v9
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB72_2
; GFX11-TRUE16-NEXT: .LBB72_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v35.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v34.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v30.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v33.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v30.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v29.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v26.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v34.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v33.l, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v28.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v29.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v34.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v34.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v33.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v33.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v27.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v27.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v25.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v10, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v28.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v29.l, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v29.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v27.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v25.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v27.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v23.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v23.h, v3.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v11
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v25.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v10, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v25.h, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v26.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v21.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v10, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v23.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v23.h, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v21.h, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v22.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v21.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v19.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v10, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v21.h, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v22.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v10, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v19.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v19.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v19.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v18.h, v6.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v11
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v18.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v18.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v32.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v32.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v11
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v18.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v17.l, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v31.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v17.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v17.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v17.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v31.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v16.h, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v8.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v16.l, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v11
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v16.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v16.h, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v9
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -30908,20 +30778,20 @@ define <40 x i8> @bitcast_v5f64_to_v40i8(<5 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -30937,17 +30807,17 @@ define <40 x i8> @bitcast_v5f64_to_v40i8(<5 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB74_2: ; %Flow
@@ -30966,17 +30836,17 @@ define <40 x i8> @bitcast_v5f64_to_v40i8(<5 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB74_4: ; %end
@@ -30984,105 +30854,93 @@ define <40 x i8> @bitcast_v5f64_to_v40i8(<5 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v5f64_to_v40i8:
@@ -33010,50 +32868,51 @@ define <5 x i64> @bitcast_v40i8_to_v5i64(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:24
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.h, v29.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v27.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v25.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, v21.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v16.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v38.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v38.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(8)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v36.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v36.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(7)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v36.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v36.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(6)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(5)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v37.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(4)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v38.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v49
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB76_3
@@ -33066,245 +32925,216 @@ define <5 x i64> @bitcast_v40i8_to_v5i64(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB76_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v34.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v34.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v0.h, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v1.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v2.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v3.l, v23.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v0.l, v34.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v34.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v10, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v1.h, v33.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v10, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v2.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v10, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v3.l, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v26.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v10, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v4.l, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v10, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v5.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v31.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v6.l, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v10.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v4.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v21.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v5.l, v19.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v19.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v6.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v18.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v7.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v11, v10
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v8.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v16.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v9.l, v16.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v7.l, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v10.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v10.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v9.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v10
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v9
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB76_2
; GFX11-TRUE16-NEXT: .LBB76_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v35.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v34.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v30.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v33.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v30.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v29.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v26.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v34.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v33.l, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v28.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v29.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v34.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v34.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v33.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v33.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v27.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v27.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v25.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v10, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v28.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v29.l, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v29.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v27.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v25.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v27.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v23.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v23.h, v3.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v11
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v25.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v10, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v25.h, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v26.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v21.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v10, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v23.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v23.h, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v21.h, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v22.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v21.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v19.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v10, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v21.h, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v22.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v10, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v19.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v19.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v19.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v18.h, v6.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v11
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v18.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v18.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v32.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v32.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v11
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v18.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v17.l, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v31.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v17.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v17.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v17.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v31.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v16.h, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v11
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v8.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v16.l, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v11
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v16.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v16.h, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v9
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -35074,20 +34904,20 @@ define <40 x i8> @bitcast_v5i64_to_v40i8(<5 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -35103,17 +34933,17 @@ define <40 x i8> @bitcast_v5i64_to_v40i8(<5 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB78_2: ; %Flow
@@ -35140,17 +34970,17 @@ define <40 x i8> @bitcast_v5i64_to_v40i8(<5 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB78_4: ; %end
@@ -35158,105 +34988,93 @@ define <40 x i8> @bitcast_v5i64_to_v40i8(<5 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v5i64_to_v40i8:
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll
index 48c9b8775a474..8e30ee659a260 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll
@@ -2257,8 +2257,8 @@ define i32 @bitcast_v4i8_to_i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -2273,19 +2273,17 @@ define i32 @bitcast_v4i8_to_i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB22_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB22_2
; GFX11-TRUE16-NEXT: .LBB22_4: ; %cmp.true
@@ -2295,16 +2293,14 @@ define i32 @bitcast_v4i8_to_i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -4506,8 +4502,8 @@ define float @bitcast_v4i8_to_f32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -4522,19 +4518,17 @@ define float @bitcast_v4i8_to_f32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB42_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB42_2
; GFX11-TRUE16-NEXT: .LBB42_4: ; %cmp.true
@@ -4544,16 +4538,14 @@ define float @bitcast_v4i8_to_f32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -6467,8 +6459,8 @@ define <2 x i16> @bitcast_v4i8_to_v2i16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -6483,19 +6475,17 @@ define <2 x i16> @bitcast_v4i8_to_v2i16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB58_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB58_2
; GFX11-TRUE16-NEXT: .LBB58_4: ; %cmp.true
@@ -6505,16 +6495,14 @@ define <2 x i16> @bitcast_v4i8_to_v2i16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -8116,8 +8104,8 @@ define <2 x half> @bitcast_v4i8_to_v2f16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -8132,19 +8120,17 @@ define <2 x half> @bitcast_v4i8_to_v2f16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB70_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB70_2
; GFX11-TRUE16-NEXT: .LBB70_4: ; %cmp.true
@@ -8154,16 +8140,14 @@ define <2 x half> @bitcast_v4i8_to_v2f16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -9479,8 +9463,8 @@ define <2 x bfloat> @bitcast_v4i8_to_v2bf16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -9495,19 +9479,17 @@ define <2 x bfloat> @bitcast_v4i8_to_v2bf16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB78_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB78_2
; GFX11-TRUE16-NEXT: .LBB78_4: ; %cmp.true
@@ -9517,16 +9499,14 @@ define <2 x bfloat> @bitcast_v4i8_to_v2bf16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -10193,8 +10173,8 @@ define <1 x i32> @bitcast_v4i8_to_v1i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -10209,19 +10189,17 @@ define <1 x i32> @bitcast_v4i8_to_v1i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB82_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB82_2
; GFX11-TRUE16-NEXT: .LBB82_4: ; %cmp.true
@@ -10231,16 +10209,14 @@ define <1 x i32> @bitcast_v4i8_to_v1i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll
index 5aac06a7f3a2b..35d135b123969 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll
@@ -8768,32 +8768,32 @@ define <64 x i8> @bitcast_v16i32_to_v64i8(<16 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -8812,26 +8812,26 @@ define <64 x i8> @bitcast_v16i32_to_v64i8(<16 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB24_2: ; %Flow
@@ -8864,26 +8864,26 @@ define <64 x i8> @bitcast_v16i32_to_v64i8(<16 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB24_4: ; %end
@@ -8891,156 +8891,135 @@ define <64 x i8> @bitcast_v16i32_to_v64i8(<16 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -12470,15 +12449,15 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:124
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:108
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:92
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_b32 v82, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_b32 v81, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:16
@@ -12492,84 +12471,82 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:84
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:60
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:52
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v29.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.l, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v80.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v64.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v65.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v65.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v65.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(25)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v66.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(23)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v66.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(21)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v67.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(20)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v67.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(19)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(18)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v68.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(17)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v69.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(16)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v69.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v69.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(15)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v70.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(14)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v70.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(13)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(12)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v71.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(11)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v82
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v81
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB26_3
; GFX11-TRUE16-NEXT: ; %bb.1: ; %Flow
@@ -12581,384 +12558,338 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB26_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v54.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v55.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v51.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v54.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v53.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v53.l
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v0.h, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v52.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v50.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v51.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v1.h, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v2.l, v49.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v3.l, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v53.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v52.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v0.l, v54.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v55.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v54.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v49.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v64, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v1.h, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v51.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v64, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v2.l, v51.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v64, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v3.l, v50.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v64, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v4.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v64, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v5.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v64, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v6.l, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v64, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v7.l, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v64, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v8.l, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v20.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v64, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v9.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v64, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v10.l, v21.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v5.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v6.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v7.l, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v8.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v9.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v21.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v10.l, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v11.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v12.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v18.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v13.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v14.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v54
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v15.l, v16.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v64, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v11.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v16.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v64, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v12.l, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v64, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v13.l, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v64, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v14.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v64, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v15.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v54
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v64, v15
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB26_2
; GFX11-TRUE16-NEXT: .LBB26_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v54.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v55.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v51.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v53.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v53.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v52.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v52.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v53.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v50.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v52.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v55.l, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v54.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v53.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v49.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v49.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v52, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v39.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v51.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v51.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v52.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v50.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v39.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v30.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v55
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v49.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v51.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v55
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v48.l, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v27.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v49.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v27.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v52, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v52.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v50.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v50.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v29.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v28.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v52, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v26.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v39.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v48.l, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v39.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v29.h, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v25.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v29.l, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v28.h, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v52, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v30.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v52, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v27.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v27.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v26.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v52, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v25.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v25.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v38.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v25.h, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v24.l, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v23.h, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v38.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v52, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v37.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v23.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v23.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v22.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v36.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v23.l, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v37.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v21.h, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v55
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v52, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v22.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v36.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v52, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v35.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v21.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v21.h, v9.h
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v20.h, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v35.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v21.l, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v34.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v19.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v52, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v20.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v34.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v52, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v33.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v19.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v19.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v19.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v18.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v55
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v33.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v32.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v18.h, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v32.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v32.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v17.l, v13.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v31.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v17.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v17.h, v13.h
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v17.h, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v31.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v16.h, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v55
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v14.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v55
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v16, v55
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v52, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.h, v14.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v52, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v52, v15
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -23588,32 +23519,32 @@ define <64 x i8> @bitcast_v16f32_to_v64i8(<16 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -23632,26 +23563,26 @@ define <64 x i8> @bitcast_v16f32_to_v64i8(<16 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB48_2: ; %Flow
@@ -23676,26 +23607,26 @@ define <64 x i8> @bitcast_v16f32_to_v64i8(<16 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB48_4: ; %end
@@ -23703,156 +23634,135 @@ define <64 x i8> @bitcast_v16f32_to_v64i8(<16 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -27413,15 +27323,15 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:124
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:108
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:92
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_b32 v82, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_b32 v81, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:16
@@ -27435,84 +27345,82 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:84
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:60
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:52
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v29.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.l, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v80.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v64.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v65.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v65.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v65.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(25)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v66.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(23)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v66.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(21)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v67.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(20)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v67.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(19)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(18)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v68.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(17)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v69.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(16)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v69.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v69.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(15)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v70.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(14)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v70.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(13)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(12)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v71.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(11)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v82
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v81
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB50_3
; GFX11-TRUE16-NEXT: ; %bb.1: ; %Flow
@@ -27524,384 +27432,338 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB50_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v54.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v55.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v51.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v54.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v53.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v53.l
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v0.h, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v52.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v50.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v51.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v1.h, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v2.l, v49.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v3.l, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v53.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v52.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v0.l, v54.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v55.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v54.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v49.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v64, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v1.h, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v51.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v64, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v2.l, v51.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v64, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v3.l, v50.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v64, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v4.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v64, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v5.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v64, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v6.l, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v64, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v7.l, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v64, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v8.l, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v20.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v64, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v9.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v64, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v10.l, v21.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v5.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v6.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v7.l, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v8.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v9.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v21.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v10.l, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v11.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v12.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v18.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v13.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v14.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v54
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v15.l, v16.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v64, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v11.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v16.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v64, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v12.l, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v64, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v13.l, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v64, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v14.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v64, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v15.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v54
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v64, v15
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB50_2
; GFX11-TRUE16-NEXT: .LBB50_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v54.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v55.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v51.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v53.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v53.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v52.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v52.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v53.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v50.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v52.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v55.l, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v54.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v53.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v49.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v49.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v52, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v39.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v51.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v51.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v52.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v50.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v39.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v30.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v55
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v49.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v51.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v55
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v48.l, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v27.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v49.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v27.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v52, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v52.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v50.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v50.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v29.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v28.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v52, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v26.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v39.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v48.l, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v39.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v29.h, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v25.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v29.l, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v28.h, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v52, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v30.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v52, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v27.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v27.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v26.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v52, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v25.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v25.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v38.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v25.h, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v24.l, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v23.h, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v38.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v52, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v37.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v23.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v23.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v22.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v36.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v23.l, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v37.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v21.h, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v55
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v52, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v22.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v36.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v52, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v35.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v21.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v21.h, v9.h
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v20.h, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v35.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v21.l, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v34.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v19.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v52, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v20.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v34.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v52, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v33.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v19.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v19.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v19.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v18.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v55
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v33.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v32.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v18.h, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v32.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v32.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v17.l, v13.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v31.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v17.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v17.h, v13.h
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v17.h, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v31.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v16.h, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v55
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v14.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v55
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v16, v55
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v52, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.h, v14.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v52, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v52, v15
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -37916,32 +37778,32 @@ define <64 x i8> @bitcast_v8i64_to_v64i8(<8 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -37960,26 +37822,26 @@ define <64 x i8> @bitcast_v8i64_to_v64i8(<8 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB68_2: ; %Flow
@@ -38017,26 +37879,26 @@ define <64 x i8> @bitcast_v8i64_to_v64i8(<8 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB68_4: ; %end
@@ -38044,156 +37906,135 @@ define <64 x i8> @bitcast_v8i64_to_v64i8(<8 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -41628,15 +41469,15 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:124
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:108
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:92
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_b32 v82, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_b32 v81, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:16
@@ -41650,84 +41491,82 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:84
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:60
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:52
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v29.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.l, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v80.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v64.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v65.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v65.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v65.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(25)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v66.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(23)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v66.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(21)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v67.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(20)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v67.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(19)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(18)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v68.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(17)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v69.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(16)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v69.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v69.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(15)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v70.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(14)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v70.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(13)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(12)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v71.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(11)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v82
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v81
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB70_3
; GFX11-TRUE16-NEXT: ; %bb.1: ; %Flow
@@ -41739,384 +41578,338 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB70_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v54.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v55.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v51.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v54.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v53.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v53.l
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v0.h, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v52.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v50.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v51.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v1.h, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v2.l, v49.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v3.l, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v53.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v52.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v0.l, v54.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v55.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v54.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v49.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v64, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v1.h, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v51.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v64, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v2.l, v51.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v64, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v3.l, v50.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v64, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v4.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v64, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v5.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v64, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v6.l, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v64, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v7.l, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v64, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v8.l, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v20.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v64, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v9.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v64, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v10.l, v21.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v5.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v6.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v7.l, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v8.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v9.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v21.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v10.l, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v11.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v12.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v18.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v13.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v14.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v54
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v15.l, v16.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v64, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v11.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v16.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v64, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v12.l, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v64, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v13.l, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v64, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v14.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v64, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v15.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v54
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v64, v15
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB70_2
; GFX11-TRUE16-NEXT: .LBB70_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v54.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v55.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v51.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v53.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v53.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v52.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v52.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v53.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v50.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v52.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v55.l, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v54.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v53.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v49.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v49.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v52, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v39.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v51.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v51.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v52.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v50.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v39.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v30.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v55
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v49.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v51.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v55
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v48.l, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v27.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v49.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v27.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v52, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v52.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v50.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v50.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v29.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v28.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v52, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v26.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v39.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v48.l, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v39.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v29.h, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v25.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v29.l, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v28.h, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v52, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v30.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v52, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v27.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v27.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v26.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v52, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v25.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v25.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v38.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v25.h, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v24.l, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v23.h, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v38.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v52, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v37.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v23.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v23.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v22.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v36.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v23.l, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v37.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v21.h, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v55
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v52, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v22.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v36.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v52, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v35.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v21.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v21.h, v9.h
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v20.h, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v35.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v21.l, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v34.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v19.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v52, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v20.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v34.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v52, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v33.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v19.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v19.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v19.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v18.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v55
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v33.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v32.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v18.h, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v32.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v32.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v17.l, v13.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v31.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v17.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v17.h, v13.h
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v17.h, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v31.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v16.h, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v55
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v14.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v55
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v16, v55
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v52, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.h, v14.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v52, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v52, v15
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -51295,32 +51088,32 @@ define <64 x i8> @bitcast_v8f64_to_v64i8(<8 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -51339,26 +51132,26 @@ define <64 x i8> @bitcast_v8f64_to_v64i8(<8 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB84_2: ; %Flow
@@ -51383,26 +51176,26 @@ define <64 x i8> @bitcast_v8f64_to_v64i8(<8 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB84_4: ; %end
@@ -51410,156 +51203,135 @@ define <64 x i8> @bitcast_v8f64_to_v64i8(<8 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -54989,15 +54761,15 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:124
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:108
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:92
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_b32 v82, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_b32 v81, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:16
@@ -55011,84 +54783,82 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:84
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:60
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:52
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v29.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.l, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v25.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v80.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v64.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v65.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v65.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v65.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(25)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v66.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(23)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v66.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(21)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v67.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(20)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v67.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(19)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(18)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v68.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(17)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v69.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(16)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v69.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v69.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(15)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v70.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(14)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v70.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(13)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(12)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v71.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(11)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v82
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v81
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB86_3
; GFX11-TRUE16-NEXT: ; %bb.1: ; %Flow
@@ -55100,384 +54870,338 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB86_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v54.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v55.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v51.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v54.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v53.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v53.l
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v0.h, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v52.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v50.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v51.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v1.h, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v2.l, v49.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v3.l, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v53.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v52.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v0.l, v54.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v55.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v54.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v49.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v64, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v1.h, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v51.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v64, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v2.l, v51.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v64, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v3.l, v50.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v64, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v4.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v64, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v5.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v64, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v6.l, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v64, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v7.l, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v64, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v8.l, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v20.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v64, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v9.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v64.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v64, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v10.l, v21.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v5.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v6.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v7.l, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v8.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v9.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v21.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v10.l, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v11.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v12.l, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v18.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v54
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v13.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v54
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v14.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v54
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v15.l, v16.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v64, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v11.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v64.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v16.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v64, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v12.l, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v64, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v13.l, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v64, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v14.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v64, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v15.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v54
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v64, v15
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB86_2
; GFX11-TRUE16-NEXT: .LBB86_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v54.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v55.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v51.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v53.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v53.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v52.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v52.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v53.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v50.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v52.l, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v55.l, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v54.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v53.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v49.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v49.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v52, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v39.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v51.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v51.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v52.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v50.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v39.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v30.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v55
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v49.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v51.l, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v55
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v48.l, v3.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v27.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v49.l, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v27.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v52, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v52.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v50.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v50.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v29.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v28.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v52, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v26.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v39.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v48.l, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v39.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v29.h, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v25.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v4.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v29.l, v5.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v28.h, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v52, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v30.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v52, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v27.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v27.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v26.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v52, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v25.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v25.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v38.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v25.h, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v6.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v24.l, v7.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v23.h, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v38.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v52, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v37.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v23.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v23.h, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v22.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v36.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v23.l, v8.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v37.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v21.h, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v55
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v52, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v22.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v36.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v52, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v35.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v21.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v21.h, v9.h
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v20.h, v10.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v35.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v21.l, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v34.h, 3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v55
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v19.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v52, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v20.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v34.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v52, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v33.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v19.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v19.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v19.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v18.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v55
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v33.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v12.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v32.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v18.h, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v32.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v32.l, 3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v55
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v17.l, v13.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v31.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v17.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v17.h, v13.h
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v17.h, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v31.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v16.h, v14.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v55
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v14.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v17
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v55
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v16, v55
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v52, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.h, v14.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v52, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v52, v15
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -64573,32 +64297,32 @@ define <64 x i8> @bitcast_v32i16_to_v64i8(<32 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -64617,26 +64341,26 @@ define <64 x i8> @bitcast_v32i16_to_v64i8(<32 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB96_2: ; %Flow
@@ -64669,26 +64393,26 @@ define <64 x i8> @bitcast_v32i16_to_v64i8(<32 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB96_4: ; %end
@@ -64696,156 +64420,135 @@ define <64 x i8> @bitcast_v32i16_to_v64i8(<32 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -76701,32 +76404,32 @@ define <64 x i8> @bitcast_v32f16_to_v64i8(<32 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -76745,26 +76448,26 @@ define <64 x i8> @bitcast_v32f16_to_v64i8(<32 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB104_2: ; %Flow
@@ -76797,26 +76500,26 @@ define <64 x i8> @bitcast_v32f16_to_v64i8(<32 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB104_4: ; %end
@@ -76824,156 +76527,135 @@ define <64 x i8> @bitcast_v32f16_to_v64i8(<32 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -85692,59 +85374,59 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX11-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -85757,307 +85439,302 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[21:22], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[22:23], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[23:24], 24, v[3:4]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v67, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v65, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v66, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v1
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.h, v8.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v71.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v86.h, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v68.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.h, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v85.h, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v82.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.h, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.h, v8.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v70.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v85.h, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v69.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v67.h, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v87.h, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v83.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v86.h, v16.h
; GFX11-TRUE16-NEXT: .LBB108_2: ; %Flow
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB108_4
; GFX11-TRUE16-NEXT: ; %bb.3: ; %cmp.true
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff0000, v1
; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff0000, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, 16, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v17, 16, v2
; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff0000, v2
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v4, 0x40c00000, v4 :: v_dual_add_f32 v17, 0x40c00000, v17
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v2, 0x40c00000, v2
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v18, 0x40c00000, v18 :: v_dual_lshlrev_b32 v1, 16, v1
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff0000, v11
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v2, 0x40c00000, v2 :: v_dual_lshlrev_b32 v11, 16, v11
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v1, 0x40c00000, v1 :: v_dual_add_f32 v4, 0x40c00000, v4
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v17, 0x40c00000, v17
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v21, v2, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v2
; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v17, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v17
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX11-TRUE16-NEXT: v_bfe_u32 v21, v2, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v2
-; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v17, 0x7fff
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff0000, v1
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, 16, v1
; GFX11-TRUE16-NEXT: v_add3_u32 v21, v21, v2, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v27, v20, v22, vcc_lo
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v18, 0x40c00000, v18 :: v_dual_add_f32 v1, 0x40c00000, v1
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v27.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v17, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v17, v18, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v26, v20, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v1, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v26, v21, v23, vcc_lo
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v1
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v1, v1
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v2, v2
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v18
+; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v18, 0x7fff
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v26.h
; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v1, 0x7fff
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v27, v21, v23, vcc_lo
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v1
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v1, v1
; GFX11-TRUE16-NEXT: v_add_f32_e32 v19, 0x40c00000, v19
-; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v18, 0x7fff
-; GFX11-TRUE16-NEXT: v_bfi_b32 v2, 0xffff, v2, v26
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_bfi_b32 v2, 0xffff, v2, v27
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v28, v20, v21, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v18, v18
; GFX11-TRUE16-NEXT: v_bfe_u32 v18, v4, 16, 1
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff0000, v3
; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v2
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v28.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v1, v17, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v18, v18, v4, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v4
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v4, v4
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff0000, v3
; GFX11-TRUE16-NEXT: v_bfe_u32 v17, v19, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v2
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v30, v18, v22, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfi_b32 v1, 0xffff, v20, v1
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff0000, v5
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, 16, v5
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v29, v18, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff0000, v6
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, 16, v3
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v21, 0x40c00000, v21 :: v_dual_lshlrev_b32 v6, 16, v6
; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v19, 0x7fff
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v19, v19
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v21, 0x40c00000, v21 :: v_dual_lshlrev_b32 v6, 16, v6
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v3, 0x40c00000, v3
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v3, 0x40c00000, v3 :: v_dual_add_f32 v20, 0x40c00000, v20
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v30, v17, v23, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v4, v21, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_cndmask_b32 v29, v17, v23
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v21
; GFX11-TRUE16-NEXT: v_bfe_u32 v18, v3, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, 0x400000, v3
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v3, v3
; GFX11-TRUE16-NEXT: v_add3_u32 v4, v4, v21, 0x7fff
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v21
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v29.h
; GFX11-TRUE16-NEXT: v_add3_u32 v18, v18, v3, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v30.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v32, v18, v19, vcc_lo
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v5, 0x40c00000, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v31, v18, v19, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX11-TRUE16-NEXT: v_bfi_b32 v1, 0xffff, v20, v1
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v32.h
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v3, v4, v23 :: v_dual_add_f32 v18, 0x40c00000, v22
-; GFX11-TRUE16-NEXT: v_bfi_b32 v4, 0xffff, v17, v29
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v18, 0x40c00000, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v1
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v31.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v3, v4, v23, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v21, v18, 16, 1
+; GFX11-TRUE16-NEXT: v_bfi_b32 v4, 0xffff, v17, v30
; GFX11-TRUE16-NEXT: v_bfe_u32 v17, v6, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v6, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfi_b32 v3, 0xffff, v19, v3
-; GFX11-TRUE16-NEXT: v_bfe_u32 v21, v18, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v18
+; GFX11-TRUE16-NEXT: v_add3_u32 v19, v21, v18, 0x7fff
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v6
; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v6, 0x7fff
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v18
+; GFX11-TRUE16-NEXT: v_bfe_u32 v6, v20, 16, 1
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v4
-; GFX11-TRUE16-NEXT: v_add3_u32 v19, v21, v18, 0x7fff
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v33, v17, v21, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v32, v17, v21, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v18, v18
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v31, v19, v22 :: v_dual_and_b32 v20, 0xffff0000, v5
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v20, 0x40c00000, v20 :: v_dual_lshlrev_b32 v5, 16, v5
+; GFX11-TRUE16-NEXT: v_add3_u32 v6, v6, v20, 0x7fff
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v3
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v33, v19, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff0000, v8
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v5, 0x40c00000, v5 :: v_dual_lshlrev_b32 v8, 16, v8
-; GFX11-TRUE16-NEXT: v_bfe_u32 v6, v20, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v20
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v8, 0x40c00000, v8
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v8, 16, v8
; GFX11-TRUE16-NEXT: v_bfe_u32 v17, v5, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v5
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v5, v5
-; GFX11-TRUE16-NEXT: v_add3_u32 v6, v6, v20, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v20
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v8, 0x40c00000, v8
; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v5, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v17, v21, vcc_lo
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v32.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v34, v17, v21 :: v_dual_add_f32 v19, 0x40c00000, v19
; GFX11-TRUE16-NEXT: v_bfe_u32 v17, v8, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v20, v20
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v19, 0x40c00000, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v36.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v20, v6, v22, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v34.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v5, v19, 16, 1
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v20, v6, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v6, v17, v8, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, 0x400000, v8
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v8, v8
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v19
; GFX11-TRUE16-NEXT: v_add3_u32 v5, v5, v19, 0x7fff
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v19
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v35, v6, v17, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v19, v19
; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff0000, v7
-; GFX11-TRUE16-NEXT: v_bfi_b32 v6, 0xffff, v18, v31
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v34, v5, v22, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfi_b32 v6, 0xffff, v18, v33
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v5, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_bfi_b32 v5, 0xffff, v21, v20
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v20, 16, v10
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v7, 16, v7
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v35.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v6
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_dual_add_f32 v20, 0x40c00000, v20 :: v_dual_add_f32 v7, 0x40c00000, v7
-; GFX11-TRUE16-NEXT: v_bfi_b32 v8, 0xffff, v8, v34
+; GFX11-TRUE16-NEXT: v_bfi_b32 v8, 0xffff, v8, v36
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v5
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v7, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v7
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v8
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v19, v7, 0x7fff
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff0000, v10
; GFX11-TRUE16-NEXT: v_bfe_u32 v7, v20, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v49, v19, v21, vcc_lo
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v17, 0x40c00000, v23 :: v_dual_add_f32 v10, 0x40c00000, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v38, v19, v21, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v7, v7, v20, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v20
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff0000, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v17, 0x40c00000, v23 :: v_dual_add_f32 v10, 0x40c00000, v10
; GFX11-TRUE16-NEXT: v_bfe_u32 v18, v17, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v17
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v17, v17
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v10, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_add3_u32 v18, v18, v17, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v19, v10, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v17, v18, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v20, v20
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v10
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v49.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v39, v7, v21, vcc_lo
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v38.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v37, v7, v21, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v10, v10
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v10, 16, v12
; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff0000, v12
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v48, v19, v22 :: v_dual_lshlrev_b32 v7, 16, v9
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v7, 16, v9
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v39, v19, v22, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v10
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v37.h
; GFX11-TRUE16-NEXT: v_add_f32_e32 v12, 0x40c00000, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfe_u32 v22, v21, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v21
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v48, 0x400000, v21
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX11-TRUE16-NEXT: v_bfi_b32 v10, 0xffff, v19, v48
+; GFX11-TRUE16-NEXT: v_bfi_b32 v10, 0xffff, v19, v39
; GFX11-TRUE16-NEXT: v_bfe_u32 v24, v12, 16, 1
; GFX11-TRUE16-NEXT: v_add3_u32 v22, v22, v21, 0x7fff
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v7
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff0000, v9
; GFX11-TRUE16-NEXT: v_or_b32_e32 v50, 0x400000, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v10
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v49
; GFX11-TRUE16-NEXT: v_add3_u32 v24, v24, v12, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v54, v22, v37, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v52, v22, v48 :: v_dual_add_f32 v9, 0x40c00000, v23
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff0000, v14
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v7, 0x40c00000, v7 :: v_dual_lshlrev_b32 v14, 16, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v10
+; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v9, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v9
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v14, 0x40c00000, v14
; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v7, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, 0x400000, v7
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff0000, v14
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v14, 16, v14
+; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v9, 0x7fff
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v10
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v19, v7, 0x7fff
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v11
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v54.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v14, 0x40c00000, v14 :: v_dual_lshlrev_b32 v11, 16, v11
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v65, v19, v25, vcc_lo
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v52.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v55, v19, v25, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v12, v12
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff0000, v9
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: v_bfe_u32 v25, v14, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v52, v24, v50 :: v_dual_add_f32 v9, 0x40c00000, v23
; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v21, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_bfi_b32 v12, 0xffff, v7, v52
-; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v9, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v9
+; GFX11-TRUE16-NEXT: v_bfe_u32 v25, v14, 16, 1
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v53, v24, v50, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v9, v9
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_bfi_b32 v12, 0xffff, v7, v53
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v9, v20, v23, vcc_lo
; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v11
; GFX11-TRUE16-NEXT: v_add3_u32 v11, v19, v21, 0x7fff
-; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v9, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, 0x400000, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v67, 8, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v9, v20, v23, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v21, v21
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v55.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v23, v7, 16, 1
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v65.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v65, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v66, 8, v12
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v11, v11, v19, vcc_lo
; GFX11-TRUE16-NEXT: v_add_f32_e32 v19, 0x40c00000, v22
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v21, v23, v7, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v7
; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff0000, v13
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
; GFX11-TRUE16-NEXT: v_bfe_u32 v24, v19, 16, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v13, 16, v13
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v23
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v71, v21, v22, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfi_b32 v9, 0xffff, v20, v9
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v7, 0x40c00000, v23 :: v_dual_cndmask_b32 v70, v21, v22
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v21, v24, v19, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v19
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v19, v19
; GFX11-TRUE16-NEXT: v_add3_u32 v23, v25, v14, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, 0x400000, v14
; GFX11-TRUE16-NEXT: v_bfe_u32 v25, v7, 16, 1
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v13, 0x40c00000, v13 :: v_dual_cndmask_b32 v66, v21, v22
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v13, 0x40c00000, v13
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v67, v21, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v14, v14
; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, 0x400000, v7
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v14, v25, v7, 0x7fff
; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff0000, v16
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v16, 16, v16
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v68, v23, v24, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v69, v23, v24, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
; GFX11-TRUE16-NEXT: v_bfe_u32 v23, v13, 16, 1
-; GFX11-TRUE16-NEXT: v_bfi_b32 v9, 0xffff, v20, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v9
; GFX11-TRUE16-NEXT: v_add_f32_e32 v16, 0x40c00000, v16
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v68.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v69.h
; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v7, v14, v19 :: v_dual_add_f32 v14, 0x40c00000, v21
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v21, 16, v15
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v23, v13, 0x7fff
@@ -86067,42 +85744,42 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v21
; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff0000, v15
; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, 0x400000, v16
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v86, v19, v23, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v85, v19, v23, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v13, v13, v16, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v23, v21, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v16, v16
; GFX11-TRUE16-NEXT: v_bfe_u32 v24, v14, 16, 1
; GFX11-TRUE16-NEXT: v_add_f32_e32 v15, 0x40c00000, v15
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v21
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v49, 0x400000, v21
; GFX11-TRUE16-NEXT: v_add3_u32 v23, v23, v21, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v82, v13, v25, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v83, v13, v25, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v21, v21
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v24, v14, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, 0x400000, v14
-; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v15, 16, 1
+; GFX11-TRUE16-NEXT: v_bfe_u32 v48, v15, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, 0x400000, v15
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v85, v23, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v87, v23, v49, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v14, v14
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v86.h
-; GFX11-TRUE16-NEXT: v_add3_u32 v13, v37, v15, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v71.h
-; GFX11-TRUE16-NEXT: v_bfi_b32 v14, 0xffff, v22, v66
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v81, v19, v24, vcc_lo
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v85.h
+; GFX11-TRUE16-NEXT: v_add3_u32 v13, v48, v15, 0x7fff
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v70.h
+; GFX11-TRUE16-NEXT: v_bfi_b32 v14, 0xffff, v22, v67
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v86, v19, v24, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v15, v15
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v82.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v85.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v83.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v87.h
; GFX11-TRUE16-NEXT: v_bfi_b32 v11, 0xffff, v23, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 24, v14
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v13, v13, v16, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfi_b32 v16, 0xffff, v19, v81
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v9
+; GFX11-TRUE16-NEXT: v_bfi_b32 v16, 0xffff, v19, v86
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 8, v11
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfi_b32 v15, 0xffff, v15, v13
; GFX11-TRUE16-NEXT: v_bfi_b32 v13, 0xffff, v21, v7
; GFX11-TRUE16-NEXT: v_bfi_b32 v7, 0xffff, v18, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[17:18], 24, v[15:16]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[18:19], 24, v[13:14]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[19:20], 24, v[11:12]
@@ -86111,159 +85788,142 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[22:23], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[23:24], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v7
; GFX11-TRUE16-NEXT: .LBB108_4: ; %end
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v28.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v113.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v112.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v2.h, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v103.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v102.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.l, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v101.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v103.l
; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v3.l, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v8, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.h, v4.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v3.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v102.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v99.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v100.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v24
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v5.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v99.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v6.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v97.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v49.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v8, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v7.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v97.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v87.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v65.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v84.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v69.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v83.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v7.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v84.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v82.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v9.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v55.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v81.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v10.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v71.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v70.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v14, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v67.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v54.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v86.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v14, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v71.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v11.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v68.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v53.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v52.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v13.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v16, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v65.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v68.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v53.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v16, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v85.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v51.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v24
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v13.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v85.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v14.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v67.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v69.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v51.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v14.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v15.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v87.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v50.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v66.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v82.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v14.h, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v16.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v86.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v16.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v17.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll
index 6fe66655de3d6..4c485768bcbbf 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll
@@ -3065,13 +3065,12 @@ define i64 @bitcast_v8i8_to_i64(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -3085,61 +3084,53 @@ define i64 @bitcast_v8i8_to_i64(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB26_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB26_2
; GFX11-TRUE16-NEXT: .LBB26_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -6214,13 +6205,12 @@ define double @bitcast_v8i8_to_f64(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -6234,61 +6224,53 @@ define double @bitcast_v8i8_to_f64(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB50_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB50_2
; GFX11-TRUE16-NEXT: .LBB50_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -9063,13 +9045,12 @@ define <2 x i32> @bitcast_v8i8_to_v2i32(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -9083,61 +9064,53 @@ define <2 x i32> @bitcast_v8i8_to_v2i32(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB70_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB70_2
; GFX11-TRUE16-NEXT: .LBB70_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -11603,13 +11576,12 @@ define <2 x float> @bitcast_v8i8_to_v2f32(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -11623,61 +11595,53 @@ define <2 x float> @bitcast_v8i8_to_v2f32(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB86_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB86_2
; GFX11-TRUE16-NEXT: .LBB86_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -13829,13 +13793,12 @@ define <4 x i16> @bitcast_v8i8_to_v4i16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -13849,61 +13812,53 @@ define <4 x i16> @bitcast_v8i8_to_v4i16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB98_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB98_2
; GFX11-TRUE16-NEXT: .LBB98_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -15655,13 +15610,12 @@ define <4 x half> @bitcast_v8i8_to_v4f16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -15675,61 +15629,53 @@ define <4 x half> @bitcast_v8i8_to_v4f16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB106_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB106_2
; GFX11-TRUE16-NEXT: .LBB106_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -16966,13 +16912,12 @@ define <4 x bfloat> @bitcast_v8i8_to_v4bf16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -16986,61 +16931,53 @@ define <4 x bfloat> @bitcast_v8i8_to_v4bf16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB110_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB110_2
; GFX11-TRUE16-NEXT: .LBB110_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll
index e5245f7bd71d3..879e8520d8e18 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll
@@ -1102,15 +1102,16 @@ define <3 x i32> @bitcast_v12i8_to_v3i32(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v11.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v12
@@ -1125,80 +1126,74 @@ define <3 x i32> @bitcast_v12i8_to_v3i32(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB6_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v2
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v0.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v3.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v1.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v3.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v2.l, v3.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v2.l, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v7.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB6_2
; GFX11-TRUE16-NEXT: .LBB6_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v7.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v5.h, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v5.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v5.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v7.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v4.h, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v3.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v7.h
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v3.h, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v1
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -4241,15 +4236,16 @@ define <3 x float> @bitcast_v12i8_to_v3f32(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v11.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v12
@@ -4264,80 +4260,74 @@ define <3 x float> @bitcast_v12i8_to_v3f32(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB22_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v2
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v0.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v3.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v1.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v3.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v2.l, v3.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v2.l, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v7.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v7
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB22_2
; GFX11-TRUE16-NEXT: .LBB22_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v7.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v5.h, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v5.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v5.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v7.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v4.h, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v3.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v7.h
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v3.h, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v1
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v7.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -6885,16 +6875,16 @@ define <6 x bfloat> @bitcast_v12i8_to_v6bf16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v9.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v12
@@ -6909,80 +6899,74 @@ define <6 x bfloat> @bitcast_v12i8_to_v6bf16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB36_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v2
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v1.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v2.l, v4.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v4.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v2.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v7
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB36_2
; GFX11-TRUE16-NEXT: .LBB36_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v8.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v6.l, v0.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.h, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v4.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v3.h
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v4.h, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v1
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -8651,16 +8635,16 @@ define <6 x half> @bitcast_v12i8_to_v6f16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v9.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v12
@@ -8675,80 +8659,74 @@ define <6 x half> @bitcast_v12i8_to_v6f16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB40_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v2
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v1.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v2.l, v4.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v4.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v2.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v7
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB40_2
; GFX11-TRUE16-NEXT: .LBB40_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v8.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v6.l, v0.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.h, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v4.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v3.h
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v4.h, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v1
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -10065,16 +10043,16 @@ define <6 x i16> @bitcast_v12i8_to_v6i16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v9.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v12
@@ -10089,80 +10067,74 @@ define <6 x i16> @bitcast_v12i8_to_v6i16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB44_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v2
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v1.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v4.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v5
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v2.l, v4.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v0
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v4.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v2.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v7
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB44_2
; GFX11-TRUE16-NEXT: .LBB44_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v8.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v6.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v6.l, v0.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.h, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v4.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v3.h
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v4.h, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v1
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-llvm-debuginfo-analyzer.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-llvm-debuginfo-analyzer.ll
index 89fc6c062c29d..d6922bc09ff0a 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-llvm-debuginfo-analyzer.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-llvm-debuginfo-analyzer.ll
@@ -1,3 +1,4 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc %s -o %t.o -mcpu=gfx1030 -filetype=obj -O0
; RUN: llvm-debuginfo-analyzer %t.o --print=all --attribute=all | FileCheck %s
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 4cc39d93854a0..1d3368b036d0d 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -9022,13 +9022,12 @@ define amdgpu_kernel void @uniform_or_i16(ptr addrspace(1) %result, ptr addrspac
; GFX1164-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s7, v0
; GFX1164-TRUE16-NEXT: .LBB15_2:
; GFX1164-TRUE16-NEXT: s_or_b64 exec, exec, s[4:5]
-; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1164-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1164-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1164-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX1164-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1164-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1164-TRUE16-NEXT: v_cndmask_b16 v0.l, s6, 0, vcc
-; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1164-TRUE16-NEXT: v_or_b16 v0.l, s2, v0.l
; GFX1164-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1164-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
@@ -9101,13 +9100,12 @@ define amdgpu_kernel void @uniform_or_i16(ptr addrspace(1) %result, ptr addrspac
; GFX1132-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s6, v0
; GFX1132-TRUE16-NEXT: .LBB15_2:
; GFX1132-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s5
-; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1132-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1132-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1132-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX1132-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1132-TRUE16-NEXT: v_cndmask_b16 v0.l, s4, 0, vcc_lo
-; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_or_b16 v0.l, s2, v0.l
; GFX1132-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1132-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
@@ -9180,13 +9178,12 @@ define amdgpu_kernel void @uniform_or_i16(ptr addrspace(1) %result, ptr addrspac
; GFX1264-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s7, v0
; GFX1264-TRUE16-NEXT: .LBB15_2:
; GFX1264-TRUE16-NEXT: s_or_b64 exec, exec, s[4:5]
-; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1264-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1264-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1264-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX1264-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1264-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1264-TRUE16-NEXT: v_cndmask_b16 v0.l, s6, 0, vcc
-; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1264-TRUE16-NEXT: v_or_b16 v0.l, s2, v0.l
; GFX1264-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1264-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], null
@@ -9259,13 +9256,12 @@ define amdgpu_kernel void @uniform_or_i16(ptr addrspace(1) %result, ptr addrspac
; GFX1232-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s6, v0
; GFX1232-TRUE16-NEXT: .LBB15_2:
; GFX1232-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s5
-; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1232-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1232-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1232-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX1232-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1232-TRUE16-NEXT: v_cndmask_b16 v0.l, s4, 0, vcc_lo
-; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_or_b16 v0.l, s2, v0.l
; GFX1232-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1232-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], null
@@ -9662,12 +9658,11 @@ define amdgpu_kernel void @uniform_add_i16(ptr addrspace(1) %result, ptr addrspa
; GFX1164-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s11, v2
; GFX1164-TRUE16-NEXT: .LBB16_4: ; %Flow
; GFX1164-TRUE16-NEXT: s_or_b64 exec, exec, s[8:9]
-; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1164-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1164-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1164-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX1164-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1164-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
-; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1164-TRUE16-NEXT: v_mad_u16 v0.l, s10, v4.l, s2
; GFX1164-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1164-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
@@ -9789,12 +9784,11 @@ define amdgpu_kernel void @uniform_add_i16(ptr addrspace(1) %result, ptr addrspa
; GFX1132-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s2, v2
; GFX1132-TRUE16-NEXT: .LBB16_4: ; %Flow
; GFX1132-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s9
-; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1132-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1132-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1132-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX1132-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
-; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_mad_u16 v0.l, s8, v4.l, s2
; GFX1132-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1132-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
@@ -9916,13 +9910,12 @@ define amdgpu_kernel void @uniform_add_i16(ptr addrspace(1) %result, ptr addrspa
; GFX1264-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s11, v2
; GFX1264-TRUE16-NEXT: .LBB16_4: ; %Flow
; GFX1264-TRUE16-NEXT: s_or_b64 exec, exec, s[8:9]
-; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1264-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1264-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1264-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX1264-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1264-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1264-TRUE16-NEXT: s_wait_alu 0xf1ff
-; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1264-TRUE16-NEXT: v_mad_u16 v0.l, s10, v4.l, s2
; GFX1264-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1264-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], null
@@ -10048,13 +10041,12 @@ define amdgpu_kernel void @uniform_add_i16(ptr addrspace(1) %result, ptr addrspa
; GFX1232-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s2, v2
; GFX1232-TRUE16-NEXT: .LBB16_4: ; %Flow
; GFX1232-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s9
-; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1232-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1232-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1232-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX1232-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1232-TRUE16-NEXT: s_wait_alu 0xf1ff
-; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_mad_u16 v0.l, s8, v4.l, s2
; GFX1232-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1232-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], null
@@ -10734,15 +10726,15 @@ define amdgpu_kernel void @uniform_fadd_f16(ptr addrspace(1) %result, ptr addrsp
; GFX1164-TRUE16-NEXT: s_mov_b64 s[2:3], 0
; GFX1164-TRUE16-NEXT: .LBB18_1: ; %atomicrmw.start
; GFX1164-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX1164-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s9, v1
+; GFX1164-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1164-TRUE16-NEXT: v_add_f16_e32 v0.l, s8, v0.l
; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1164-TRUE16-NEXT: v_lshlrev_b32_e32 v0, s9, v0
-; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX1164-TRUE16-NEXT: v_and_or_b32 v0, v1, s10, v0
; GFX1164-TRUE16-NEXT: v_mov_b32_e32 v3, v1
+; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1164-TRUE16-NEXT: v_mov_b32_e32 v2, v0
; GFX1164-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[2:3], off, s[4:7], 0 glc
; GFX1164-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -10828,14 +10820,14 @@ define amdgpu_kernel void @uniform_fadd_f16(ptr addrspace(1) %result, ptr addrsp
; GFX1132-TRUE16-NEXT: s_mov_b32 s6, -1
; GFX1132-TRUE16-NEXT: .LBB18_1: ; %atomicrmw.start
; GFX1132-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX1132-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s2, v1
+; GFX1132-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1132-TRUE16-NEXT: v_add_f16_e32 v0.l, s8, v0.l
; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1132-TRUE16-NEXT: v_lshlrev_b32_e32 v0, s2, v0
-; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_and_or_b32 v0, v1, s3, v0
+; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_dual_mov_b32 v3, v1 :: v_dual_mov_b32 v2, v0
; GFX1132-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[2:3], off, s[4:7], 0 glc
; GFX1132-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -10920,15 +10912,15 @@ define amdgpu_kernel void @uniform_fadd_f16(ptr addrspace(1) %result, ptr addrsp
; GFX1264-TRUE16-NEXT: s_mov_b64 s[2:3], 0
; GFX1264-TRUE16-NEXT: .LBB18_1: ; %atomicrmw.start
; GFX1264-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX1264-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s9, v1
+; GFX1264-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1264-TRUE16-NEXT: v_add_f16_e32 v0.l, s8, v0.l
; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1264-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1264-TRUE16-NEXT: v_lshlrev_b32_e32 v0, s9, v0
-; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX1264-TRUE16-NEXT: v_and_or_b32 v0, v1, s10, v0
; GFX1264-TRUE16-NEXT: v_mov_b32_e32 v3, v1
+; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1264-TRUE16-NEXT: v_mov_b32_e32 v2, v0
; GFX1264-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[2:3], off, s[4:7], null th:TH_ATOMIC_RETURN scope:SCOPE_SYS
; GFX1264-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -11014,14 +11006,14 @@ define amdgpu_kernel void @uniform_fadd_f16(ptr addrspace(1) %result, ptr addrsp
; GFX1232-TRUE16-NEXT: s_mov_b32 s6, -1
; GFX1232-TRUE16-NEXT: .LBB18_1: ; %atomicrmw.start
; GFX1232-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX1232-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s2, v1
+; GFX1232-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1232-TRUE16-NEXT: v_add_f16_e32 v0.l, s8, v0.l
; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1232-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1232-TRUE16-NEXT: v_lshlrev_b32_e32 v0, s2, v0
-; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_and_or_b32 v0, v1, s3, v0
+; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_dual_mov_b32 v3, v1 :: v_dual_mov_b32 v2, v0
; GFX1232-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[2:3], off, s[4:7], null th:TH_ATOMIC_RETURN scope:SCOPE_SYS
; GFX1232-TRUE16-NEXT: s_wait_loadcnt 0x0
diff --git a/llvm/test/CodeGen/AMDGPU/bf16.ll b/llvm/test/CodeGen/AMDGPU/bf16.ll
index 505ddc8c3b575..10e523d1a0cf1 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16.ll
@@ -37774,9 +37774,10 @@ define bfloat @v_uitofp_i16_to_bf16(i16 %x) {
; GFX11TRUE16-LABEL: v_uitofp_i16_to_bf16:
; GFX11TRUE16: ; %bb.0:
; GFX11TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
; GFX11TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11TRUE16-NEXT: v_cvt_f32_u32_e32 v0, v0
+; GFX11TRUE16-NEXT: v_cvt_f32_u32_e32 v0, v1
; GFX11TRUE16-NEXT: v_bfe_u32 v1, v0, 16, 1
; GFX11TRUE16-NEXT: v_or_b32_e32 v2, 0x400000, v0
; GFX11TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v0, v0
@@ -40750,12 +40751,11 @@ define amdgpu_ps i32 @s_select_bf16(bfloat inreg %a, bfloat inreg %b, i32 %c) {
;
; GFX11TRUE16-LABEL: s_select_bf16:
; GFX11TRUE16: ; %bb.0:
+; GFX11TRUE16-NEXT: v_mov_b16_e32 v1.l, s0
; GFX11TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, 0, v0
-; GFX11TRUE16-NEXT: v_mov_b16_e32 v0.l, s0
-; GFX11TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11TRUE16-NEXT: v_cndmask_b16 v0.l, s1, v0.l, vcc_lo
-; GFX11TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11TRUE16-NEXT: v_cndmask_b16 v0.l, s1, v1.l, vcc_lo
; GFX11TRUE16-NEXT: v_readfirstlane_b32 s0, v0
; GFX11TRUE16-NEXT: ; return to shader part epilog
;
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
index f4b432dce8c8a..0ceb9019eb990 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
@@ -3443,15 +3443,14 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v1.l, v1.l, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -3569,14 +3568,13 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v1.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -3884,15 +3882,14 @@ define void @buffer_fat_ptr_agent_atomic_fadd_noret_f16__offset__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v1.l, v1.l, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -4007,14 +4004,13 @@ define void @buffer_fat_ptr_agent_atomic_fadd_noret_f16__offset__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v1.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -4328,15 +4324,14 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
; GFX12-TRUE16-NEXT: ; Child Loop BB15_4 Depth 2
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v6, v4, v7
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v6.h, 0
; GFX12-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v6.l, v6.l, v5.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v6, v4, v6
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v6, v7, v11, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v9, v7 :: v_dual_mov_b32 v8, v6
; GFX12-TRUE16-NEXT: .LBB15_4: ; Parent Loop BB15_3 Depth=1
; GFX12-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
@@ -4556,15 +4551,14 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
; GFX11-TRUE16-NEXT: ; Child Loop BB15_4 Depth 2
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v6, v4, v7
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, 0
; GFX11-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v6.l, v6.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v6, v4, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v6, v7, v11, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v9, v7 :: v_dual_mov_b32 v8, v6
; GFX11-TRUE16-NEXT: .LBB15_4: ; Parent Loop BB15_3 Depth=1
; GFX11-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
index 6f1675edbe58a..cad4c39eaf39f 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
@@ -2512,16 +2512,16 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v1.l, v1.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v1.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -2640,20 +2640,19 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: buffer_load_b32 v2, v5, s[0:3], 0 offen
; GFX11-TRUE16-NEXT: s_not_b32 s6, s5
; GFX11-TRUE16-NEXT: s_mov_b32 s5, 0
-; GFX11-TRUE16-NEXT: .p2align 6
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v1.l, v1.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v1.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -2973,16 +2972,16 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f16__offset__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v1.l, v1.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v1.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -3098,20 +3097,19 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f16__offset__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: buffer_load_b32 v2, v3, s[0:3], 0 offen
; GFX11-TRUE16-NEXT: s_not_b32 s6, s5
; GFX11-TRUE16-NEXT: s_mov_b32 s5, 0
-; GFX11-TRUE16-NEXT: .p2align 6
; GFX11-TRUE16-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v1.l, v1.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v1.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -3437,16 +3435,16 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__waterfall__amdgpu
; GFX12-TRUE16-NEXT: ; Child Loop BB12_4 Depth 2
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v9, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v4.h, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v9, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v11, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v8, v6 :: v_dual_mov_b32 v7, v5
; GFX12-TRUE16-NEXT: .LBB12_4: ; Parent Loop BB12_3 Depth=1
; GFX12-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
@@ -3672,16 +3670,16 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__waterfall__amdgpu
; GFX11-TRUE16-NEXT: ; Child Loop BB12_4 Depth 2
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v9, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v4.h, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v9, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v11, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v8, v6 :: v_dual_mov_b32 v7, v5
; GFX11-TRUE16-NEXT: .LBB12_4: ; Parent Loop BB12_3 Depth=1
; GFX11-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll
index acb27be1846b9..6275afd2c6994 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll
@@ -2512,16 +2512,16 @@ define half @buffer_fat_ptr_agent_atomic_fmin_ret_f16__offset__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v1.l, v1.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v1.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -2640,20 +2640,19 @@ define half @buffer_fat_ptr_agent_atomic_fmin_ret_f16__offset__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: buffer_load_b32 v2, v5, s[0:3], 0 offen
; GFX11-TRUE16-NEXT: s_not_b32 s6, s5
; GFX11-TRUE16-NEXT: s_mov_b32 s5, 0
-; GFX11-TRUE16-NEXT: .p2align 6
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v1.l, v1.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v1.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -2973,16 +2972,16 @@ define void @buffer_fat_ptr_agent_atomic_fmin_noret_f16__offset__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v1.l, v1.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v1.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -3098,20 +3097,19 @@ define void @buffer_fat_ptr_agent_atomic_fmin_noret_f16__offset__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: buffer_load_b32 v2, v3, s[0:3], 0 offen
; GFX11-TRUE16-NEXT: s_not_b32 s6, s5
; GFX11-TRUE16-NEXT: s_mov_b32 s5, 0
-; GFX11-TRUE16-NEXT: .p2align 6
; GFX11-TRUE16-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v1.l, v1.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v1.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -3437,16 +3435,16 @@ define half @buffer_fat_ptr_agent_atomic_fmin_ret_f16__offset__waterfall__amdgpu
; GFX12-TRUE16-NEXT: ; Child Loop BB12_4 Depth 2
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v9, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v4.h, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v9, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v11, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v8, v6 :: v_dual_mov_b32 v7, v5
; GFX12-TRUE16-NEXT: .LBB12_4: ; Parent Loop BB12_3 Depth=1
; GFX12-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
@@ -3672,16 +3670,16 @@ define half @buffer_fat_ptr_agent_atomic_fmin_ret_f16__offset__waterfall__amdgpu
; GFX11-TRUE16-NEXT: ; Child Loop BB12_4 Depth 2
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v9, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v4.h, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v9, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v11, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v8, v6 :: v_dual_mov_b32 v7, v5
; GFX11-TRUE16-NEXT: .LBB12_4: ; Parent Loop BB12_3 Depth=1
; GFX11-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
diff --git a/llvm/test/CodeGen/AMDGPU/calling-conventions.ll b/llvm/test/CodeGen/AMDGPU/calling-conventions.ll
index ff80250bfc880..2db7b28c7de97 100644
--- a/llvm/test/CodeGen/AMDGPU/calling-conventions.ll
+++ b/llvm/test/CodeGen/AMDGPU/calling-conventions.ll
@@ -2745,6 +2745,15 @@ define amdgpu_cs void @amdgpu_cs_v32i1(<32 x i1> %arg0) {
;
; GFX11-TRUE16-LABEL: amdgpu_cs_v32i1:
; GFX11-TRUE16: ; %bb.0:
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, v26.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 1, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, v24.l, 1
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, v22.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 1, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, v20.l, 1
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, v18.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 1, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, v16.l, 1
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, v10.l, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 1, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, v8.l, 1
@@ -2754,6 +2763,18 @@ define amdgpu_cs void @amdgpu_cs_v32i1(<32 x i1> %arg0) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, v2.l, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 1, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, v0.l, 1
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, v30.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 1, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, v28.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 3, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.l, 2, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v24.l, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 3, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 2, v20.h
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v20.l, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 3, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 2, v17.h
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, v14.l, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 1, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, v12.l, 1
@@ -2766,15 +2787,15 @@ define amdgpu_cs void @amdgpu_cs_v32i1(<32 x i1> %arg0) {
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 3, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 2, v1.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, v26.l, 1
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 1, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, v24.l, 1
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, v22.l, 1
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 1, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, v20.l, 1
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, v18.l, 1
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 1, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, v16.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 3, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 2, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, v22.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v22.l, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, v16.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v18.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, v16.l, 3
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 3, v15.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 2, v14.l
; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v12.l, v13.l
@@ -2784,65 +2805,42 @@ define amdgpu_cs void @amdgpu_cs_v32i1(<32 x i1> %arg0) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, v0.h, 3
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, v0.l, 3
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, v30.l, 1
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 1, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, v28.l, 1
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 3, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 2, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v24.l, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 3, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 2, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v20.l, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 3, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 2, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v28.h, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v21.h, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.l, v17.h
; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v12.h, v10.h
; GFX11-TRUE16-NEXT: v_and_b16 v3.l, v8.h, 3
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v6.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 3, v31.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 2, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v28.l, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v25.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v23.l, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, v16.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v17.h, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, v15.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, v19.l, 15
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 4, v16.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, v15.h, 15
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v3.l, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, v1.l, 15
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 4, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, v0.l, 15
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v30.h, v28.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v24.l, v22.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v14.h, v18.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 12, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v14.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v16.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 12, v2.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v24.h, v28.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, v20.h, 15
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 4, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, v1.h, 15
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v17.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 12, v23.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, 0
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.h, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v2.l, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v2
; GFX11-TRUE16-NEXT: global_store_b32 v[0:1], v0, off
; GFX11-TRUE16-NEXT: s_endpgm
;
diff --git a/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll b/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
index b9caf8e80bcdf..ccdc0b1bf43c4 100644
--- a/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
+++ b/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
@@ -1561,10 +1561,10 @@ define amdgpu_kernel void @v_no_clamp_add_src_v2f16_f16_src(ptr addrspace(1) %ou
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, 2, v1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: global_load_d16_b16 v0, v0, s[2:3]
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v0.l, 1.0, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_pk_max_f16 v0, v0, v0 clamp
; GFX11-TRUE16-NEXT: global_store_b32 v1, v0, s[0:1]
; GFX11-TRUE16-NEXT: s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll b/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
index b5bc09a1684ee..26f204f29f5a4 100644
--- a/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
+++ b/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
@@ -946,9 +946,9 @@ define double @v_uitofp_i8_to_f64(i8 %arg0) nounwind {
; GFX11-TRUE16-LABEL: v_uitofp_i8_to_f64:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_cvt_f64_u32_e32 v[0:1], v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1770,40 +1770,38 @@ define amdgpu_kernel void @load_v4i8_to_v4f32_2_uses(ptr addrspace(1) noalias %o
; GFX11-TRUE16-LABEL: load_v4i8_to_v4f32_2_uses:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_load_b64 s[0:1], s[4:5], 0x34
-; GFX11-TRUE16-NEXT: v_dual_mov_b32 v5, 0 :: v_dual_and_b32 v0, 0x3ff, v0
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0x3ff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v0, 2, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v5.h
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: global_load_b32 v4, v0, s[0:1]
; GFX11-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v4.l, 9
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff00, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff00, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 9
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff00, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff00, v4.h
; GFX11-TRUE16-NEXT: v_cvt_f32_ubyte3_e32 v3, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 9
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x900, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.h
+; GFX11-TRUE16-NEXT: v_cvt_f32_ubyte2_e32 v2, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
; GFX11-TRUE16-NEXT: v_cvt_f32_ubyte1_e32 v1, v4
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v2
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x900, v0.l
-; GFX11-TRUE16-NEXT: v_cvt_f32_ubyte2_e32 v2, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x900, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x900, v0.h
; GFX11-TRUE16-NEXT: v_cvt_f32_ubyte0_e32 v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v5, v7
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: global_store_b128 v5, v[0:3], s[0:1]
-; GFX11-TRUE16-NEXT: global_store_b32 v5, v4, s[2:3]
+; GFX11-TRUE16-NEXT: global_store_b128 v6, v[0:3], s[0:1]
+; GFX11-TRUE16-NEXT: global_store_b32 v6, v4, s[2:3]
; GFX11-TRUE16-NEXT: s_endpgm
;
; GFX11-FAKE16-LABEL: load_v4i8_to_v4f32_2_uses:
diff --git a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
index b0439b1f7968f..c5db7a33f70e0 100644
--- a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
+++ b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
@@ -2536,12 +2536,13 @@ define void @test_dynamic_stackalloc_device_divergent_non_standard_size_i16(i16
; GFX11-SDAG-LABEL: test_dynamic_stackalloc_device_divergent_non_standard_size_i16:
; GFX11-SDAG: ; %bb.0:
; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-SDAG-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-SDAG-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-SDAG-NEXT: v_mov_b16_e32 v1.l, v0.l
; GFX11-SDAG-NEXT: s_mov_b32 s4, s33
; GFX11-SDAG-NEXT: s_mov_b32 s1, exec_lo
; GFX11-SDAG-NEXT: s_mov_b32 s0, 0
; GFX11-SDAG-NEXT: s_mov_b32 s33, s32
-; GFX11-SDAG-NEXT: v_lshl_add_u32 v0, v0, 2, 15
+; GFX11-SDAG-NEXT: v_lshl_add_u32 v0, v1, 2, 15
; GFX11-SDAG-NEXT: s_add_i32 s32, s32, 16
; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-NEXT: v_and_b32_e32 v0, 0x7fff0, v0
diff --git a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll
index 8c7d5cffe39d9..22dd66118837f 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll
@@ -8410,13 +8410,12 @@ define half @flat_agent_atomic_fadd_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8529,13 +8528,12 @@ define half @flat_agent_atomic_fadd_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -8785,13 +8783,12 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8908,13 +8905,12 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -9171,13 +9167,12 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -9295,13 +9290,12 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -9557,11 +9551,11 @@ define void @flat_agent_atomic_fadd_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -9671,11 +9665,11 @@ define void @flat_agent_atomic_fadd_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -9917,11 +9911,11 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10035,11 +10029,11 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -10288,11 +10282,11 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10407,11 +10401,11 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -10651,8 +10645,8 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10735,8 +10729,8 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -10925,10 +10919,9 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -11014,10 +11007,9 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -11220,13 +11212,12 @@ define half @flat_system_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -11345,13 +11336,12 @@ define half @flat_system_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -11610,11 +11600,11 @@ define void @flat_system_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -11730,11 +11720,11 @@ define void @flat_system_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
diff --git a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll
index 56ad91dd59ffb..1dc45179c74ce 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll
@@ -6043,14 +6043,14 @@ define half @flat_agent_atomic_fmax_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6168,14 +6168,14 @@ define half @flat_agent_atomic_fmax_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -6438,14 +6438,14 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6570,14 +6570,14 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -6847,14 +6847,14 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6980,14 +6980,14 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -7254,13 +7254,12 @@ define void @flat_agent_atomic_fmax_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v3.l, v3.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7376,13 +7375,12 @@ define void @flat_agent_atomic_fmax_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v3.l, v3.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -7638,13 +7636,12 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7767,13 +7764,12 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -8036,13 +8032,12 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8166,13 +8161,12 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -8424,11 +8418,11 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8519,11 +8513,11 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -8728,10 +8722,9 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8820,10 +8813,9 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -9035,14 +9027,14 @@ define half @flat_system_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -9169,14 +9161,14 @@ define half @flat_system_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -9448,13 +9440,12 @@ define void @flat_system_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -9579,13 +9570,12 @@ define void @flat_system_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
diff --git a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll
index f0083bd23660a..5d26293e7009b 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll
@@ -6043,14 +6043,14 @@ define half @flat_agent_atomic_fmin_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6168,14 +6168,14 @@ define half @flat_agent_atomic_fmin_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -6438,14 +6438,14 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6570,14 +6570,14 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -6847,14 +6847,14 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6980,14 +6980,14 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -7254,13 +7254,12 @@ define void @flat_agent_atomic_fmin_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v3.l, v3.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7376,13 +7375,12 @@ define void @flat_agent_atomic_fmin_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v3.l, v3.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -7638,13 +7636,12 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7767,13 +7764,12 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -8036,13 +8032,12 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8166,13 +8161,12 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -8424,11 +8418,11 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8519,11 +8513,11 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -8728,10 +8722,9 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8820,10 +8813,9 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -9035,14 +9027,14 @@ define half @flat_system_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -9169,14 +9161,14 @@ define half @flat_system_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -9448,13 +9440,12 @@ define void @flat_system_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -9579,13 +9570,12 @@ define void @flat_system_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
diff --git a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll
index 3ee0bb2122abe..d12a7f9731586 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll
@@ -5855,13 +5855,12 @@ define half @flat_agent_atomic_fsub_ret_f16(ptr %ptr, half %val) #0 {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5974,13 +5973,12 @@ define half @flat_agent_atomic_fsub_ret_f16(ptr %ptr, half %val) #0 {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -6230,13 +6228,12 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_pos(ptr %ptr, half %val)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6353,13 +6350,12 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_pos(ptr %ptr, half %val)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -6616,13 +6612,12 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_neg(ptr %ptr, half %val)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6740,13 +6735,12 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_neg(ptr %ptr, half %val)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -7002,11 +6996,11 @@ define void @flat_agent_atomic_fsub_noret_f16(ptr %ptr, half %val) #0 {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7116,11 +7110,11 @@ define void @flat_agent_atomic_fsub_noret_f16(ptr %ptr, half %val) #0 {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -7362,11 +7356,11 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b_pos(ptr %ptr, half %val
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7480,11 +7474,11 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b_pos(ptr %ptr, half %val
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -7733,11 +7727,11 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b_neg(ptr %ptr, half %val
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7852,11 +7846,11 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b_neg(ptr %ptr, half %val
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -8096,10 +8090,9 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_pos__align4(ptr %ptr, hal
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8185,10 +8178,9 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_pos__align4(ptr %ptr, hal
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -8382,8 +8374,8 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b__align4_pos(ptr %ptr, h
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8466,8 +8458,8 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b__align4_pos(ptr %ptr, h
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -8665,13 +8657,12 @@ define half @flat_system_atomic_fsub_ret_f16__offset12b_pos(ptr %ptr, half %val)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -8790,13 +8781,12 @@ define half @flat_system_atomic_fsub_ret_f16__offset12b_pos(ptr %ptr, half %val)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -9055,11 +9045,11 @@ define void @flat_system_atomic_fsub_noret_f16__offset12b_pos(ptr %ptr, half %va
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -9175,11 +9165,11 @@ define void @flat_system_atomic_fsub_noret_f16__offset12b_pos(ptr %ptr, half %va
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
diff --git a/llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll b/llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll
index 9c4901eb19f37..899cc89405440 100644
--- a/llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll
+++ b/llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll
@@ -4238,7 +4238,7 @@ define amdgpu_ps i32 @s_mul_32_f16(half inreg %x, half inreg %y) {
; GFX11-GISEL-TRUE16-LABEL: s_mul_32_f16:
; GFX11-GISEL-TRUE16: ; %bb.0:
; GFX11-GISEL-TRUE16-NEXT: v_mul_f16_e64 v0.l, 0x5000, s0
-; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-GISEL-TRUE16-NEXT: v_readfirstlane_b32 s0, v0
; GFX11-GISEL-TRUE16-NEXT: ; return to shader part epilog
;
diff --git a/llvm/test/CodeGen/AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll b/llvm/test/CodeGen/AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll
index f09c25767648f..a859cc91b7fde 100644
--- a/llvm/test/CodeGen/AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll
+++ b/llvm/test/CodeGen/AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll
@@ -644,11 +644,10 @@ define double @fmul_pow_mul_max_pow2(i16 %cnt) nounwind {
; GFX11-TRUE16-LABEL: fmul_pow_mul_max_pow2:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, v0.l, 2
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: v_cvt_f64_u32_e32 v[0:1], v0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_mul_f64 v[0:1], 0x40080000, v[0:1]
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1194,13 +1193,12 @@ define double @fmul_pow_shl_cnt_safe(i16 %cnt) nounwind {
; GFX11-TRUE16-LABEL: fmul_pow_shl_cnt_safe:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, v0.l, 1
; GFX11-TRUE16-NEXT: s_mov_b32 s0, 0xff5f3992
; GFX11-TRUE16-NEXT: s_mov_b32 s1, 0x7befffff
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: v_cvt_f64_u32_e32 v[0:1], v0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_mul_f64 v[0:1], v[0:1], s[0:1]
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
index c52fb6197e3e3..40d2765395543 100644
--- a/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
@@ -4372,14 +4372,13 @@ define amdgpu_kernel void @fptrunc_f32_to_f16_zext_i32(
; GFX11-GISEL-TRUE16-LABEL: fptrunc_f32_to_f16_zext_i32:
; GFX11-GISEL-TRUE16: ; %bb.0: ; %entry
; GFX11-GISEL-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: s_load_b32 s2, s[2:3], 0x0
; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.l, s2
; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s2, -1
-; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-GISEL-TRUE16-NEXT: buffer_store_b32 v0, off, s[0:3], 0
; GFX11-GISEL-TRUE16-NEXT: s_endpgm
;
@@ -4607,14 +4606,13 @@ define amdgpu_kernel void @fptrunc_fabs_f32_to_f16_zext_i32(
; GFX11-GISEL-TRUE16-LABEL: fptrunc_fabs_f32_to_f16_zext_i32:
; GFX11-GISEL-TRUE16: ; %bb.0: ; %entry
; GFX11-GISEL-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: s_load_b32 s2, s[2:3], 0x0
; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: v_cvt_f16_f32_e64 v0.l, |s2|
; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s2, -1
-; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-GISEL-TRUE16-NEXT: buffer_store_b32 v0, off, s[0:3], 0
; GFX11-GISEL-TRUE16-NEXT: s_endpgm
;
diff --git a/llvm/test/CodeGen/AMDGPU/function-args.ll b/llvm/test/CodeGen/AMDGPU/function-args.ll
index 95e28a37f5ee1..3c41cc43a089e 100644
--- a/llvm/test/CodeGen/AMDGPU/function-args.ll
+++ b/llvm/test/CodeGen/AMDGPU/function-args.ll
@@ -1107,21 +1107,19 @@ define void @void_func_v4i8(<4 x i8> %arg0) #0 {
; GFX11-TRUE16-LABEL: void_func_v4i8:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, 0
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 0
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.h, v0.h
; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v2.l
; GFX11-TRUE16-NEXT: s_mov_b32 s2, -1
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v2
; GFX11-TRUE16-NEXT: buffer_store_b32 v0, off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1190,22 +1188,20 @@ define void @void_func_v5i8(<5 x i8> %arg0) #0 {
; GFX11-TRUE16-LABEL: void_func_v5i8:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 4
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, v2.l
; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v1.l
; GFX11-TRUE16-NEXT: s_mov_b32 s2, -1
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
; GFX11-TRUE16-NEXT: buffer_store_b8 v4, off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 0
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v2
; GFX11-TRUE16-NEXT: buffer_store_b32 v0, off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1285,29 +1281,27 @@ define void @void_func_v8i8(<8 x i8> %arg0) #0 {
; GFX11-TRUE16-LABEL: void_func_v8i8:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v5.h, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v6.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v4.h
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 0
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v0.h, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v6.l
; GFX11-TRUE16-NEXT: s_mov_b32 s2, -1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v0, v6
; GFX11-TRUE16-NEXT: buffer_store_b64 v[1:2], off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1422,47 +1416,44 @@ define void @void_func_v16i8(<16 x i8> %arg0) #0 {
; GFX11-TRUE16-LABEL: void_func_v16i8:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.h, v12.h
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v8.h, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, 0
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v10.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v10.h, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v5.h, v4.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v14.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v4.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v9, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v10.l, v8.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v11
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v8, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 0
; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v4, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v4, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v14.l
; GFX11-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v0, v12
-; GFX11-TRUE16-NEXT: buffer_store_b128 v[6:9], off, s[0:3], 0
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v0, v14
+; GFX11-TRUE16-NEXT: buffer_store_b128 v[5:8], off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: void_func_v16i8:
@@ -1658,83 +1649,77 @@ define void @void_func_v32i8(<32 x i8> %arg0) #0 {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: scratch_load_d16_u8 v31, off, s32
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, 0
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v15.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v3.h, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, 0
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v32.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v7.h, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v3.h, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v32.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v7.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v13, v32
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v6.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v12, v32
; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v5.h, v4.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v9.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v12, v32
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v4.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v32
; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v6.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v11.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v5
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v13, v32
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v8.h, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v0.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v10, v32
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v0.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v32.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v6.h, v5.h
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v32
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.h, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.h, v32.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v7.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v16.l
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 16
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.h, v5.h
; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
; GFX11-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v31.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v9.l, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v7, v32
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v10.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v32
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v4.h, v5.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v11, v32
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v4.l, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v31.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v4.h, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v13, v32
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v5.l, v4.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v14, v32
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v4.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v5.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v32
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v4.h, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v9, v32
; GFX11-TRUE16-NEXT: buffer_store_b128 v[4:7], off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 0
diff --git a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
index 2fdc1a8854863..919464a936740 100644
--- a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
+++ b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
@@ -4896,23 +4896,22 @@ define amdgpu_gfx void @test_call_external_void_func_v4i8_ret() #0 {
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 16, v0
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, 24, v0
; GFX11-TRUE16-NEXT: s_swappc_b64 s[30:31], s[0:1]
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v2.l
; GFX11-TRUE16-NEXT: v_readlane_b32 s31, v42, 1
; GFX11-TRUE16-NEXT: v_readlane_b32 s30, v42, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v2
; GFX11-TRUE16-NEXT: s_mov_b32 s32, s33
; GFX11-TRUE16-NEXT: v_readlane_b32 s0, v42, 2
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: global_store_b32 v[40:41], v0, off
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_b32 v41, off, s33
@@ -5156,30 +5155,29 @@ define amdgpu_gfx void @test_call_external_void_func_v5i8_ret() #0 {
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v0, v5
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v6
; GFX11-TRUE16-NEXT: s_swappc_b64 s[30:31], s[0:1]
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v0, 4
-; GFX11-TRUE16-NEXT: v_readlane_b32 s31, v42, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_mov_b32 v1, 0 :: v_dual_and_b32 v2, 0xffff, v2
-; GFX11-TRUE16-NEXT: v_readlane_b32 s30, v42, 0
-; GFX11-TRUE16-NEXT: s_mov_b32 s32, s33
-; GFX11-TRUE16-NEXT: v_readlane_b32 s0, v42, 2
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v2, v3
+; GFX11-TRUE16-NEXT: v_mov_b32_e32 v1, 0
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: global_store_b8 v[0:1], v4, off
; GFX11-TRUE16-NEXT: global_store_b32 v[40:41], v2, off
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_b32 v41, off, s33
; GFX11-TRUE16-NEXT: scratch_load_b32 v40, off, s33 offset:4
+; GFX11-TRUE16-NEXT: v_readlane_b32 s31, v42, 1
+; GFX11-TRUE16-NEXT: v_readlane_b32 s30, v42, 0
+; GFX11-TRUE16-NEXT: s_mov_b32 s32, s33
+; GFX11-TRUE16-NEXT: v_readlane_b32 s0, v42, 2
; GFX11-TRUE16-NEXT: s_or_saveexec_b32 s1, -1
; GFX11-TRUE16-NEXT: scratch_load_b32 v42, off, s33 offset:8 ; 4-byte Folded Reload
; GFX11-TRUE16-NEXT: s_mov_b32 exec_lo, s1
@@ -5441,36 +5439,34 @@ define amdgpu_gfx void @test_call_external_void_func_v8i8_ret() #0 {
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v7, 24, v1
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v1, v8
; GFX11-TRUE16-NEXT: s_swappc_b64 s[30:31], s[0:1]
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v5.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v7.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v7.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.h, v0.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v1.h, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v4.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v0, v4
; GFX11-TRUE16-NEXT: v_readlane_b32 s31, v42, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v0, v5
; GFX11-TRUE16-NEXT: v_readlane_b32 s30, v42, 0
-; GFX11-TRUE16-NEXT: s_mov_b32 s32, s33
-; GFX11-TRUE16-NEXT: v_readlane_b32 s0, v42, 2
; GFX11-TRUE16-NEXT: global_store_b64 v[40:41], v[1:2], off
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_b32 v41, off, s33
; GFX11-TRUE16-NEXT: scratch_load_b32 v40, off, s33 offset:4
+; GFX11-TRUE16-NEXT: s_mov_b32 s32, s33
+; GFX11-TRUE16-NEXT: v_readlane_b32 s0, v42, 2
; GFX11-TRUE16-NEXT: s_or_saveexec_b32 s1, -1
; GFX11-TRUE16-NEXT: scratch_load_b32 v42, off, s33 offset:8 ; 4-byte Folded Reload
; GFX11-TRUE16-NEXT: s_mov_b32 exec_lo, s1
@@ -5910,85 +5906,77 @@ define amdgpu_gfx void @test_call_external_void_func_v32i8_ret() #0 {
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v17, v32 :: v_dual_mov_b32 v18, v33
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v19, v34
; GFX11-TRUE16-NEXT: s_swappc_b64 s[30:31], s[0:1]
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, 0
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v0.h, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v3.h, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.h, v0.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v3.h, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v12.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v9, v13
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v3.h, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v13, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.h, v0.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v3.h, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v8, v13
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.h, v0.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v3.h, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v12.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v4, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v31.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v2, v12
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v1.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v26.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v2, v13
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v6.l, v0.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v2, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v1.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v21.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v6, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v7, v13
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v6.l, v0.h
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v6, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v1.l, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v10
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v0, v13
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v2, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v19.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v1, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v4.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v12
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: global_store_b128 v[42:43], v[6:9], off
-; GFX11-TRUE16-NEXT: global_store_b128 v[40:41], v[2:5], off
+; GFX11-TRUE16-NEXT: global_store_b128 v[42:43], v[0:3], off
+; GFX11-TRUE16-NEXT: global_store_b128 v[40:41], v[5:8], off
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_load_b32 v43, off, s33
; GFX11-TRUE16-NEXT: scratch_load_b32 v42, off, s33 offset:4
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll
index 1f74fbdc46e98..9c1f9d21b9da3 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll
@@ -8275,13 +8275,12 @@ define half @global_agent_atomic_fadd_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8394,13 +8393,12 @@ define half @global_agent_atomic_fadd_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -8700,13 +8698,12 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8823,13 +8820,12 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -9138,13 +9134,12 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -9262,13 +9257,12 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -9576,11 +9570,11 @@ define void @global_agent_atomic_fadd_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -9690,11 +9684,11 @@ define void @global_agent_atomic_fadd_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -9985,11 +9979,11 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10103,11 +10097,11 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -10406,11 +10400,11 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10525,11 +10519,11 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -10819,10 +10813,9 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10908,10 +10901,9 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -11144,8 +11136,8 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -11228,8 +11220,8 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -11464,13 +11456,12 @@ define half @global_system_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -11589,13 +11580,12 @@ define half @global_system_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -11906,11 +11896,11 @@ define void @global_system_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -12026,11 +12016,11 @@ define void @global_system_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll
index faa74fef2be2f..f7cc0709109f9 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll
@@ -4467,14 +4467,14 @@ define half @global_agent_atomic_fmax_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -4592,14 +4592,14 @@ define half @global_agent_atomic_fmax_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -4912,14 +4912,14 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5044,14 +5044,14 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -5373,14 +5373,14 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5506,14 +5506,14 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -5832,13 +5832,12 @@ define void @global_agent_atomic_fmax_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v3.l, v3.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5954,13 +5953,12 @@ define void @global_agent_atomic_fmax_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v3.l, v3.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -6265,13 +6263,12 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6394,13 +6391,12 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -6713,13 +6709,12 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6843,13 +6838,12 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -7151,11 +7145,11 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7246,11 +7240,11 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -7494,10 +7488,9 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7586,10 +7579,9 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -7838,14 +7830,14 @@ define half @global_system_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -7972,14 +7964,14 @@ define half @global_system_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -8303,13 +8295,12 @@ define void @global_system_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -8434,13 +8425,12 @@ define void @global_system_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll
index a46b0129b79e6..b81af1fc9233d 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll
@@ -4467,14 +4467,14 @@ define half @global_agent_atomic_fmin_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -4592,14 +4592,14 @@ define half @global_agent_atomic_fmin_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -4912,14 +4912,14 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5044,14 +5044,14 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -5373,14 +5373,14 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5506,14 +5506,14 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -5832,13 +5832,12 @@ define void @global_agent_atomic_fmin_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v3.l, v3.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5954,13 +5953,12 @@ define void @global_agent_atomic_fmin_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v3.l, v3.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -6265,13 +6263,12 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6394,13 +6391,12 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -6713,13 +6709,12 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6843,13 +6838,12 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -7151,11 +7145,11 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7246,11 +7240,11 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -7494,10 +7488,9 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7586,10 +7579,9 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -7838,14 +7830,14 @@ define half @global_system_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -7972,14 +7964,14 @@ define half @global_system_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -8303,13 +8295,12 @@ define void @global_system_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -8434,13 +8425,12 @@ define void @global_system_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll
index 053efdcb76261..b8762d13e1327 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll
@@ -5221,13 +5221,12 @@ define half @global_agent_atomic_fsub_ret_f16(ptr addrspace(1) %ptr, half %val)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5340,13 +5339,12 @@ define half @global_agent_atomic_fsub_ret_f16(ptr addrspace(1) %ptr, half %val)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -5646,13 +5644,12 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_pos(ptr addrspace(1) %p
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5769,13 +5766,12 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_pos(ptr addrspace(1) %p
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -6084,13 +6080,12 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_neg(ptr addrspace(1) %p
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6208,13 +6203,12 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_neg(ptr addrspace(1) %p
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -6522,11 +6516,11 @@ define void @global_agent_atomic_fsub_noret_f16(ptr addrspace(1) %ptr, half %val
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6636,11 +6630,11 @@ define void @global_agent_atomic_fsub_noret_f16(ptr addrspace(1) %ptr, half %val
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -6931,11 +6925,11 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b_pos(ptr addrspace(1)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7049,11 +7043,11 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b_pos(ptr addrspace(1)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -7352,11 +7346,11 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b_neg(ptr addrspace(1)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7471,11 +7465,11 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b_neg(ptr addrspace(1)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -7765,10 +7759,9 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_pos__align4(ptr addrspa
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7854,10 +7847,9 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_pos__align4(ptr addrspa
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -8090,8 +8082,8 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b__align4_pos(ptr addrs
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8174,8 +8166,8 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b__align4_pos(ptr addrs
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -8410,13 +8402,12 @@ define half @global_system_atomic_fsub_ret_f16__offset12b_pos(ptr addrspace(1) %
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -8535,13 +8526,12 @@ define half @global_system_atomic_fsub_ret_f16__offset12b_pos(ptr addrspace(1) %
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -8852,11 +8842,11 @@ define void @global_system_atomic_fsub_noret_f16__offset12b_pos(ptr addrspace(1)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -8972,11 +8962,11 @@ define void @global_system_atomic_fsub_noret_f16__offset12b_pos(ptr addrspace(1)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
diff --git a/llvm/test/CodeGen/AMDGPU/idot4u.ll b/llvm/test/CodeGen/AMDGPU/idot4u.ll
index 7ebd69204d87f..305461ed6b208 100644
--- a/llvm/test/CodeGen/AMDGPU/idot4u.ll
+++ b/llvm/test/CodeGen/AMDGPU/idot4u.ll
@@ -1693,12 +1693,11 @@ define amdgpu_kernel void @notdot4_mixedtypes(ptr addrspace(1) %src1,
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v3.l, v7.l
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v1.l, v0.l
+; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-DL-TRUE16-NEXT: v_perm_b32 v1, v5, v5, 0xc0c0302
-; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v2.l, v3.l, v0.l
; GFX11-DL-TRUE16-NEXT: v_perm_b32 v2, v4, v4, 0xc0c0302
-; GFX11-DL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_dot4_u32_u8 v0, v2, v1, v0
; GFX11-DL-TRUE16-NEXT: global_store_b16 v6, v0, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_endpgm
@@ -2724,32 +2723,32 @@ define amdgpu_kernel void @udot4_acc8_vecMul(ptr addrspace(1) %src1,
; GFX11-DL-TRUE16-NEXT: global_load_b32 v4, v0, s[2:3]
; GFX11-DL-TRUE16-NEXT: global_load_d16_u8 v0, v5, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(2)
-; GFX11-DL-TRUE16-NEXT: v_lshrrev_b16 v0.h, 8, v3.l
-; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(1)
-; GFX11-DL-TRUE16-NEXT: v_lshrrev_b16 v1.l, 8, v4.l
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 24, v3
+; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(1)
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v6, 24, v4
+; GFX11-DL-TRUE16-NEXT: v_lshrrev_b16 v0.h, 8, v3.l
+; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v1.l, v3.h, v4.h
+; GFX11-DL-TRUE16-NEXT: v_lshrrev_b16 v1.h, 8, v4.l
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v3.l, v4.l, v0.l
-; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v0.h, v0.h, v1.l
-; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v1.l, v3.h, v4.h
-; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v1.h, v2.l, v6.l
+; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v2.l, v2.l, v6.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v6.l, 0
+; GFX11-DL-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
+; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v0.h, v0.h, v1.h
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-DL-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v0.h
-; GFX11-DL-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v1.l
-; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-DL-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.h
-; GFX11-DL-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
-; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-DL-TRUE16-NEXT: v_or_b16 v6.h, v0.h, v1.l
-; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
-; GFX11-DL-TRUE16-NEXT: v_or_b32_e32 v2, v2, v6
-; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 8, v2
-; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v2.l
+; GFX11-DL-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v2.l
+; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v7.h, v6.l
+; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-DL-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v0.h
+; GFX11-DL-TRUE16-NEXT: v_or_b16 v6.h, v1.l, v2.l
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-DL-TRUE16-NEXT: v_or_b32_e32 v1, v7, v6
+; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
+; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
+; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v3.h, v4.h, v0.l
+; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX11-DL-TRUE16-NEXT: global_store_b8 v5, v0, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
index 742d87f099ce4..31b6b533866d4 100644
--- a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
+++ b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
@@ -1715,9 +1715,9 @@ define zeroext i16 @clpeak_umad_pat_i16(i16 zeroext %x, i16 zeroext %y) {
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.h, v0.l, v1.l, v0.l
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v0.l, v0.h
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v0.h, v0.l
-; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-SDAG-FAKE16-LABEL: clpeak_umad_pat_i16:
@@ -1745,8 +1745,7 @@ define zeroext i16 @clpeak_umad_pat_i16(i16 zeroext %x, i16 zeroext %y) {
; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
; GFX11-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v0.h
-; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-GISEL-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-GISEL-FAKE16-LABEL: clpeak_umad_pat_i16:
@@ -1777,9 +1776,9 @@ define zeroext i16 @clpeak_umad_pat_i16(i16 zeroext %x, i16 zeroext %y) {
; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.h, v0.l, v1.l, v0.l
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v0.l, v0.h
-; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v0.h, v0.l
-; GFX1200-SDAG-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1200-SDAG-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1200-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX1200-SDAG-FAKE16-LABEL: clpeak_umad_pat_i16:
@@ -1815,8 +1814,7 @@ define zeroext i16 @clpeak_umad_pat_i16(i16 zeroext %x, i16 zeroext %y) {
; GFX1200-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1200-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
; GFX1200-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v0.h
-; GFX1200-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX1200-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1200-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1200-GISEL-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX1200-GISEL-FAKE16-LABEL: clpeak_umad_pat_i16:
@@ -9363,9 +9361,9 @@ define zeroext i16 @clpeak_umad_pat_i16_x2(i16 zeroext %x, i16 zeroext %y) {
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.h, v0.l, v0.h, v0.l
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v0.l, v0.h
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v0.h, v0.l
-; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-SDAG-FAKE16-LABEL: clpeak_umad_pat_i16_x2:
@@ -9409,8 +9407,7 @@ define zeroext i16 @clpeak_umad_pat_i16_x2(i16 zeroext %x, i16 zeroext %y) {
; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
; GFX11-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v0.h
-; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-GISEL-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-GISEL-FAKE16-LABEL: clpeak_umad_pat_i16_x2:
@@ -9457,9 +9454,9 @@ define zeroext i16 @clpeak_umad_pat_i16_x2(i16 zeroext %x, i16 zeroext %y) {
; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.h, v0.l, v0.h, v0.l
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v0.l, v0.h
-; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v0.h, v0.l
-; GFX1200-SDAG-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1200-SDAG-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1200-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX1200-SDAG-FAKE16-LABEL: clpeak_umad_pat_i16_x2:
@@ -9511,8 +9508,7 @@ define zeroext i16 @clpeak_umad_pat_i16_x2(i16 zeroext %x, i16 zeroext %y) {
; GFX1200-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1200-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
; GFX1200-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v0.h
-; GFX1200-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX1200-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1200-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1200-GISEL-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX1200-GISEL-FAKE16-LABEL: clpeak_umad_pat_i16_x2:
diff --git a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll
index a42c71c4849bd..c1a32aafbc71e 100644
--- a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll
@@ -1259,13 +1259,12 @@ define half @local_atomic_fadd_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, 4.0, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -1371,13 +1370,12 @@ define half @local_atomic_fadd_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, 4.0, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -1646,13 +1644,12 @@ define half @local_atomic_fadd_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, 4.0, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -1763,13 +1760,12 @@ define half @local_atomic_fadd_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, 4.0, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -2044,13 +2040,12 @@ define void @local_atomic_fadd_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v4.l, 4.0, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -2153,13 +2148,12 @@ define void @local_atomic_fadd_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v4.l, 4.0, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -2419,11 +2413,11 @@ define void @local_atomic_fadd_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v4.l, 4.0, v4.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2531,11 +2525,11 @@ define void @local_atomic_fadd_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v4.l, 4.0, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2795,10 +2789,9 @@ define half @local_atomic_fadd_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v1.l, 4.0, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -2882,10 +2875,9 @@ define half @local_atomic_fadd_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v1.l, 4.0, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -3095,8 +3087,8 @@ define void @local_atomic_fadd_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v2.l, 4.0, v1.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
@@ -3177,8 +3169,8 @@ define void @local_atomic_fadd_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v2.l, 4.0, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
diff --git a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll
index 8351d28057564..739e86d1928b1 100644
--- a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll
@@ -803,14 +803,14 @@ define half @local_atomic_fmax_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v3.l, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, 4.0, v3.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -918,14 +918,14 @@ define half @local_atomic_fmax_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v3.l, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, 4.0, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -1199,14 +1199,14 @@ define half @local_atomic_fmax_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v3.l, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, 4.0, v3.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -1319,14 +1319,14 @@ define half @local_atomic_fmax_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v3.l, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, 4.0, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -1606,14 +1606,14 @@ define void @local_atomic_fmax_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, v4.l, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, 4.0, v4.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -1718,14 +1718,14 @@ define void @local_atomic_fmax_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, v4.l, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, 4.0, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -1990,13 +1990,12 @@ define void @local_atomic_fmax_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, 4.0, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2107,13 +2106,12 @@ define void @local_atomic_fmax_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, 4.0, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2379,11 +2377,11 @@ define half @local_atomic_fmax_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v1.l, v2.l, v2.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v1.l, 4.0, v1.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -2469,11 +2467,11 @@ define half @local_atomic_fmax_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v1.l, v2.l, v2.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v1.l, 4.0, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -2688,10 +2686,9 @@ define void @local_atomic_fmax_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.l, v1.l, v1.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.l, 4.0, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
@@ -2775,10 +2772,9 @@ define void @local_atomic_fmax_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.l, v1.l, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.l, 4.0, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
diff --git a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll
index 0c4aca88b3781..6da80262951e5 100644
--- a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll
@@ -803,14 +803,14 @@ define half @local_atomic_fmin_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v3.l, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, 4.0, v3.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -918,14 +918,14 @@ define half @local_atomic_fmin_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v3.l, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, 4.0, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -1199,14 +1199,14 @@ define half @local_atomic_fmin_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v3.l, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, 4.0, v3.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -1319,14 +1319,14 @@ define half @local_atomic_fmin_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v3.l, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, 4.0, v3.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -1606,14 +1606,14 @@ define void @local_atomic_fmin_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, v4.l, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v4.l, 4.0, v4.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -1718,14 +1718,14 @@ define void @local_atomic_fmin_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, v4.l, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v4.l, 4.0, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -1990,13 +1990,12 @@ define void @local_atomic_fmin_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v4.l, 4.0, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2107,13 +2106,12 @@ define void @local_atomic_fmin_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v4.l, 4.0, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2379,11 +2377,11 @@ define half @local_atomic_fmin_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v1.l, v2.l, v2.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v1.l, 4.0, v1.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -2469,11 +2467,11 @@ define half @local_atomic_fmin_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v1.l, v2.l, v2.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v1.l, 4.0, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -2688,10 +2686,9 @@ define void @local_atomic_fmin_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.l, v1.l, v1.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v2.l, 4.0, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
@@ -2775,10 +2772,9 @@ define void @local_atomic_fmin_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.l, v1.l, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v2.l, 4.0, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
diff --git a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll
index 37310b614c0db..786989cc9fb57 100644
--- a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll
@@ -1721,13 +1721,12 @@ define half @local_atomic_fsub_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, -4.0, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -1833,13 +1832,12 @@ define half @local_atomic_fsub_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, -4.0, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -2108,13 +2106,12 @@ define half @local_atomic_fsub_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, -4.0, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -2225,13 +2222,12 @@ define half @local_atomic_fsub_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, -4.0, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -2506,13 +2502,12 @@ define void @local_atomic_fsub_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v4.l, -4.0, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -2615,13 +2610,12 @@ define void @local_atomic_fsub_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v4.l, -4.0, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -2881,11 +2875,11 @@ define void @local_atomic_fsub_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v4.l, -4.0, v4.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2993,11 +2987,11 @@ define void @local_atomic_fsub_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v4.l, -4.0, v4.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -3257,10 +3251,9 @@ define half @local_atomic_fsub_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v1.l, -4.0, v2.l
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -3344,10 +3337,9 @@ define half @local_atomic_fsub_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v1.l, -4.0, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -3557,8 +3549,8 @@ define void @local_atomic_fsub_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v2.l, -4.0, v1.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX12-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
@@ -3639,8 +3631,8 @@ define void @local_atomic_fsub_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v2.l, -4.0, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
diff --git a/llvm/test/CodeGen/AMDGPU/mad-mix-lo.ll b/llvm/test/CodeGen/AMDGPU/mad-mix-lo.ll
index 811e25587d3d5..eab92668c536b 100644
--- a/llvm/test/CodeGen/AMDGPU/mad-mix-lo.ll
+++ b/llvm/test/CodeGen/AMDGPU/mad-mix-lo.ll
@@ -2382,13 +2382,22 @@ define <4 x half> @v_mad_mix_v4f32_clamp_precvt(<4 x half> %src0, <4 x half> %sr
}
define i32 @mixlo_zext(float %src0, float %src1, float %src2) #0 {
-; GFX1100-LABEL: mixlo_zext:
-; GFX1100: ; %bb.0:
-; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1100-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2
-; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX1100-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX1100-NEXT: s_setpc_b64 s[30:31]
+; SDAG-GFX1100-TRUE16-LABEL: mixlo_zext:
+; SDAG-GFX1100-TRUE16: ; %bb.0:
+; SDAG-GFX1100-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; SDAG-GFX1100-TRUE16-NEXT: v_fma_mixlo_f16 v1, v0, v1, v2
+; SDAG-GFX1100-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; SDAG-GFX1100-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; SDAG-GFX1100-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.l
+; SDAG-GFX1100-TRUE16-NEXT: s_setpc_b64 s[30:31]
+;
+; SDAG-GFX1100-FAKE16-LABEL: mixlo_zext:
+; SDAG-GFX1100-FAKE16: ; %bb.0:
+; SDAG-GFX1100-FAKE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; SDAG-GFX1100-FAKE16-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2
+; SDAG-GFX1100-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; SDAG-GFX1100-FAKE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; SDAG-GFX1100-FAKE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX900-LABEL: mixlo_zext:
; GFX900: ; %bb.0:
@@ -2418,6 +2427,14 @@ define i32 @mixlo_zext(float %src0, float %src1, float %src2) #0 {
; SDAG-CI-NEXT: v_cvt_f16_f32_e32 v0, v2
; SDAG-CI-NEXT: s_setpc_b64 s[30:31]
;
+; GISEL-GFX1100-LABEL: mixlo_zext:
+; GISEL-GFX1100: ; %bb.0:
+; GISEL-GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GISEL-GFX1100-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2
+; GISEL-GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GISEL-GFX1100-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GISEL-GFX1100-NEXT: s_setpc_b64 s[30:31]
+;
; GISEL-CI-LABEL: mixlo_zext:
; GISEL-CI: ; %bb.0:
; GISEL-CI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/mad.u16.ll b/llvm/test/CodeGen/AMDGPU/mad.u16.ll
index ef80323a98ec0..fbf8011fd40c9 100644
--- a/llvm/test/CodeGen/AMDGPU/mad.u16.ll
+++ b/llvm/test/CodeGen/AMDGPU/mad.u16.ll
@@ -179,8 +179,7 @@ define i32 @v_mad_u16_zext(i16 %arg0, i16 %arg1, i16 %arg2) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v1.l, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: v_mad_u16_zext:
@@ -222,9 +221,9 @@ define i64 @v_mad_u16_zext64(i16 %arg0, i16 %arg1, i16 %arg2) {
; GFX11-TRUE16-LABEL: v_mad_u16_zext64:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v1.l, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_dual_mov_b32 v1, 0 :: v_dual_and_b32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b32_e32 v1, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: v_mad_u16_zext64:
diff --git a/llvm/test/CodeGen/AMDGPU/preserve-hi16.ll b/llvm/test/CodeGen/AMDGPU/preserve-hi16.ll
index 3ce09475c0949..79910af5c0434 100644
--- a/llvm/test/CodeGen/AMDGPU/preserve-hi16.ll
+++ b/llvm/test/CodeGen/AMDGPU/preserve-hi16.ll
@@ -374,7 +374,7 @@ define i32 @shl_i16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: shl_i16_zext_i32:
@@ -412,7 +412,7 @@ define i32 @lshr_i16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: lshr_i16_zext_i32:
@@ -450,7 +450,7 @@ define i32 @ashr_i16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_ashrrev_i16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: ashr_i16_zext_i32:
@@ -488,7 +488,7 @@ define i32 @add_u16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: add_u16_zext_i32:
@@ -526,7 +526,7 @@ define i32 @sub_u16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_sub_nc_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: sub_u16_zext_i32:
@@ -564,7 +564,7 @@ define i32 @mul_lo_u16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: mul_lo_u16_zext_i32:
@@ -602,7 +602,7 @@ define i32 @min_u16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: min_u16_zext_i32:
@@ -641,7 +641,7 @@ define i32 @min_i16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_min_i16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: min_i16_zext_i32:
@@ -680,7 +680,7 @@ define i32 @max_u16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: max_u16_zext_i32:
@@ -719,7 +719,7 @@ define i32 @max_i16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_i16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: max_i16_zext_i32:
@@ -758,7 +758,7 @@ define i32 @zext_fadd_f16(half %x, half %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: zext_fadd_f16:
@@ -797,8 +797,10 @@ define i32 @zext_fma_f16(half %x, half %y, half %z) {
; GFX11-TRUE16-LABEL: zext_fma_f16:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_fmac_f16_e32 v2.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.l
+; GFX11-TRUE16-NEXT: v_fmac_f16_e32 v0.l, v0.h, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: zext_fma_f16:
@@ -838,7 +840,7 @@ define i32 @zext_div_fixup_f16(half %x, half %y, half %z) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_div_fixup_f16 v0.l, v0.l, v1.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: zext_div_fixup_f16:
@@ -880,7 +882,7 @@ define i32 @zext_fptrunc_f16(float %x) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.l, v0
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: zext_fptrunc_f16:
@@ -924,12 +926,20 @@ define i32 @zext_fptrunc_fma_f16(float %x, float %y, float %z) {
; GFX10-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX10-NEXT: s_setpc_b64 s[30:31]
;
-; GFX11-LABEL: zext_fptrunc_fma_f16:
-; GFX11: ; %bb.0:
-; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2
-; GFX11-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11-NEXT: s_setpc_b64 s[30:31]
+; GFX11-TRUE16-LABEL: zext_fptrunc_fma_f16:
+; GFX11-TRUE16: ; %bb.0:
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-TRUE16-NEXT: v_fma_mixlo_f16 v1, v0, v1, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.l
+; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX11-FAKE16-LABEL: zext_fptrunc_fma_f16:
+; GFX11-FAKE16: ; %bb.0:
+; GFX11-FAKE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-FAKE16-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2
+; GFX11-FAKE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-FAKE16-NEXT: s_setpc_b64 s[30:31]
%fma = call float @llvm.fma.f32(float %x, float %y, float %z)
%fptrunc = fptrunc float %fma to half
%cast = bitcast half %fptrunc to i16
@@ -940,3 +950,5 @@ define i32 @zext_fptrunc_fma_f16(float %x, float %y, float %z) {
declare half @llvm.amdgcn.div.fixup.f16(half, half, half)
declare half @llvm.fma.f16(half, half, half)
declare float @llvm.fma.f32(float, float, float)
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; GFX11: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll b/llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll
index 21aa40d69998e..91c88ec5e718c 100644
--- a/llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll
+++ b/llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll
@@ -1528,10 +1528,9 @@ define amdgpu_kernel void @v_test_i16_x_sub_64_zext_to_i32(ptr addrspace(1) %out
; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b32_e32 v1, 2, v1
; GFX11-SDAG-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-SDAG-TRUE16-NEXT: global_load_d16_b16 v0, v0, s[2:3]
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-SDAG-TRUE16-NEXT: v_sub_nc_u16 v0.l, v0.l, 64
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-SDAG-TRUE16-NEXT: global_store_b32 v1, v0, s[0:1]
; GFX11-SDAG-TRUE16-NEXT: s_endpgm
;
@@ -1560,10 +1559,9 @@ define amdgpu_kernel void @v_test_i16_x_sub_64_zext_to_i32(ptr addrspace(1) %out
; GFX11-GISEL-TRUE16-NEXT: v_lshlrev_b32_e32 v1, 2, v1
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: global_load_d16_b16 v0, v0, s[2:3]
+; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: v_add_nc_u16 v0.l, 0xffc0, v0.l
-; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-GISEL-TRUE16-NEXT: global_store_b32 v1, v0, s[0:1]
; GFX11-GISEL-TRUE16-NEXT: s_endpgm
;
diff --git a/llvm/test/CodeGen/AMDGPU/vector-reduce-add.ll b/llvm/test/CodeGen/AMDGPU/vector-reduce-add.ll
index 30ed6ae5484c6..334215125f58a 100644
--- a/llvm/test/CodeGen/AMDGPU/vector-reduce-add.ll
+++ b/llvm/test/CodeGen/AMDGPU/vector-reduce-add.ll
@@ -300,17 +300,15 @@ define i8 @test_vector_reduce_add_v4i8(<4 x i8> %v) {
; GFX11-SDAG-TRUE16-LABEL: test_vector_reduce_add_v4i8:
; GFX11-SDAG-TRUE16: ; %bb.0: ; %entry
; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v3.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v3.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v2.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v0.h
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v2.l, v1.l
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
-; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
+; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -348,17 +346,15 @@ define i8 @test_vector_reduce_add_v4i8(<4 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: s_wait_samplecnt 0x0
; GFX12-SDAG-TRUE16-NEXT: s_wait_bvhcnt 0x0
; GFX12-SDAG-TRUE16-NEXT: s_wait_kmcnt 0x0
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v3.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v3.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v2.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v0.h
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v2.l, v1.l
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
-; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
+; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -518,21 +514,19 @@ define i8 @test_vector_reduce_add_v8i8(<8 x i8> %v) {
; GFX11-SDAG-TRUE16-LABEL: test_vector_reduce_add_v8i8:
; GFX11-SDAG-TRUE16: ; %bb.0: ; %entry
; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v2.l, v6.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v3.l, v7.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v5.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v4.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v1.h
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v2.l, v6.l
-; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v0.h
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v2.l
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v1.h
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v0.h
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v2.l
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
@@ -581,21 +575,19 @@ define i8 @test_vector_reduce_add_v8i8(<8 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: s_wait_samplecnt 0x0
; GFX12-SDAG-TRUE16-NEXT: s_wait_bvhcnt 0x0
; GFX12-SDAG-TRUE16-NEXT: s_wait_kmcnt 0x0
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v2.l, v6.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v3.l, v7.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v5.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v4.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v1.h
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v2.l, v6.l
-; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v0.h
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v2.l
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v1.h
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v0.h
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v2.l
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
@@ -832,28 +824,25 @@ define i8 @test_vector_reduce_add_v16i8(<16 x i8> %v) {
; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v5.l, v13.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v9.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.h, v7.l, v15.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.l, v3.l, v11.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.h, v6.l, v14.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v5.l, v7.l, v15.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v6.l, v14.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.l, v2.l, v10.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v1.h
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, v12.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v3.l, v3.h
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.h, v3.l, v11.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.l, v4.l, v12.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v8.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v0.h
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v2.l, v2.h
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.h
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v1.h
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v2.l, v0.h
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.l, v2.h, v5.l
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v3.l
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v0.h
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v2.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v0.h
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v2.l
+; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v2.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -922,28 +911,25 @@ define i8 @test_vector_reduce_add_v16i8(<16 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v5.l, v13.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v9.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.h, v7.l, v15.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.l, v3.l, v11.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.h, v6.l, v14.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v5.l, v7.l, v15.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v6.l, v14.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.l, v2.l, v10.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v1.h
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, v12.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v3.l, v3.h
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.h, v3.l, v11.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.l, v4.l, v12.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v8.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v0.h
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v2.l, v2.h
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.h
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v1.h
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v2.l, v0.h
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.l, v2.h, v5.l
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v3.l
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v0.h
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v2.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v0.h
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v2.l
+; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
-; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v2.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/vector-reduce-umin.ll b/llvm/test/CodeGen/AMDGPU/vector-reduce-umin.ll
index aab0e76410ccb..1d3b42ee43b0f 100644
--- a/llvm/test/CodeGen/AMDGPU/vector-reduce-umin.ll
+++ b/llvm/test/CodeGen/AMDGPU/vector-reduce-umin.ll
@@ -374,13 +374,12 @@ define i8 @test_vector_reduce_umin_v4i8(<4 x i8> %v) {
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v1.l, v1.l, v1.h
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v0.h, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -427,13 +426,12 @@ define i8 @test_vector_reduce_umin_v4i8(<4 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v1.l, v1.l, v1.h
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v0.h, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -624,22 +622,20 @@ define i8 @test_vector_reduce_umin_v8i8(<8 x i8> %v) {
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v7.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v1.l, v1.l, v1.h
-; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
+; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v0.h, v1.l, v1.h
+; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v4.l
+; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v0.h
-; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v1.l, v1.l, v3.l, v3.h
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v3.l, v1.l
-; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
-; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v1.h, v1.l
+; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v0.h, v0.h, v3.l, v3.h
+; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v0.h
+; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v2.l, v1.h
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
+; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v3
; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -703,22 +699,20 @@ define i8 @test_vector_reduce_umin_v8i8(<8 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v7.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v1.l, v1.l, v1.h
-; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
+; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
+; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v0.h, v1.l, v1.h
+; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v4.l
+; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v0.h
-; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v1.l, v1.l, v3.l, v3.h
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v3.l, v1.l
-; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
-; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v1.h, v1.l
+; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v0.h, v0.h, v3.l, v3.h
+; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v0.h
+; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v2.l, v1.h
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
+; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v3
; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1047,14 +1041,12 @@ define i8 @test_vector_reduce_umin_v16i8(<16 x i8> %v) {
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v1.h, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v0.h
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
-; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v1.l, v0.l, v0.h
+; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.l
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1176,14 +1168,12 @@ define i8 @test_vector_reduce_umin_v16i8(<16 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v1.h, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v0.h
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
-; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v1.l, v0.l, v0.h
+; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.l
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
>From 4ab14685a0b96b48f8fd77ead55c1816668cac18 Mon Sep 17 00:00:00 2001
From: Tiger Ding <38360323+zerogtiger at users.noreply.github.com>
Date: Mon, 18 Aug 2025 11:04:27 -0400
Subject: [PATCH 035/112] [AMDGPU] Narrow only on store to pow of 2 mem
location (#150093)
GlobalISel lowering for AMDGPU previously always narrowed to i32 on a
truncating store, regardless of memory size or scalar size. This caused
issues with types like i65, which is first extended to i128 and then stored
as i64 + i8 to i128 locations. Narrowing only on stores to power-of-2 memory
locations ensures that narrowing to the memory size happens only near the
end of legalization.
This LLVM defect was identified via the AMD Fuzzing project.
---
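As a rough illustration of the rule change described in the commit message, the following standalone C++ sketch models only the narrowing decision, not the legalizer itself. `StoreQuery`, `isWideTruncStore`, `narrowTargetOld`, and `narrowTargetNew` are hypothetical names, and the body of `isWideTruncStore` is an assumption about what `isWideScalarExtLoadTruncStore` checks (a scalar wider than 32 bits whose memory size is smaller than the value size). A return value of 0 means "this rule does not fire".

```cpp
#include <cstdint>

// Hypothetical stand-in for the two sizes a legality query exposes
// for a scalar store.
struct StoreQuery {
  unsigned ValueBits; // size of the register value being stored
  unsigned MemBits;   // size of the memory access
};

inline bool isPowerOf2(std::uint64_t V) { return V != 0 && (V & (V - 1)) == 0; }

// Assumed shape of isWideScalarExtLoadTruncStore: the value is wider than
// 32 bits and wider than the memory access.
inline bool isWideTruncStore(StoreQuery Q) {
  return Q.ValueBits > 32 && Q.MemBits < Q.ValueBits;
}

// Old rule: any wide truncating store is narrowed to 32 bits.
inline unsigned narrowTargetOld(StoreQuery Q) {
  return isWideTruncStore(Q) ? 32u : 0u;
}

// New rule: narrow only when the memory size is a power of 2, and narrow
// to that memory size. Other cases are left to the later rules.
inline unsigned narrowTargetNew(StoreQuery Q) {
  return (isWideTruncStore(Q) && isPowerOf2(Q.MemBits)) ? Q.MemBits : 0u;
}
```

For an i65 store widened to a 128-bit value, `narrowTargetOld({128, 65})` picks 32, while `narrowTargetNew({128, 65})` does not fire, so the store can first be split into power-of-2 pieces by the remaining rules. Later rules such as `minScalar(0, S32)` may still widen the narrowed value again; the sketch does not model that.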
.../lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | 30 ++-
.../GlobalISel/legalize-store-global.mir | 84 ++++---
.../AMDGPU/GlobalISel/legalize-store.mir | 8 +-
.../AMDGPU/GlobalISel/store-weird-size.ll | 224 ++++++++++++++++++
4 files changed, 299 insertions(+), 47 deletions(-)
create mode 100644 llvm/test/CodeGen/AMDGPU/GlobalISel/store-weird-size.ll
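The expected stores in store-weird-size.ll appear consistent with a greedy split of the byte-rounded store size into power-of-2 pieces, largest first (for example i65 becomes s64 + s8, and i97 becomes s64 + s32 + s8). The sketch below reproduces only that arithmetic; `splitStoreBits` is a hypothetical helper, not part of the patch, and it does not model cases the target stores directly as a vector, such as the `<3 x s32>` store checked for i96.

```cpp
#include <vector>

// Hypothetical helper: round Bits up to whole bytes, then peel off the
// largest power-of-2 chunk until nothing is left. Returns piece sizes in bits.
inline std::vector<unsigned> splitStoreBits(unsigned Bits) {
  std::vector<unsigned> Pieces;
  unsigned Bytes = (Bits + 7) / 8;
  while (Bytes != 0) {
    unsigned Chunk = 1;
    while (Chunk * 2 <= Bytes)
      Chunk *= 2; // largest power of 2 not exceeding the remaining bytes
    Pieces.push_back(Chunk * 8);
    Bytes -= Chunk;
  }
  return Pieces;
}
```

Under these assumptions, `splitStoreBits(55)` yields pieces of 32, 16, and 8 bits, which matches the three `G_STORE` memory sizes checked for `store_i55`.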
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index b88891ac4894b..600a13096f55d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -26,6 +26,7 @@
#include "llvm/ADT/ScopeExit.h"
#include "llvm/CodeGen/GlobalISel/GenericMachineInstrs.h"
#include "llvm/CodeGen/GlobalISel/LegalizerHelper.h"
+#include "llvm/CodeGen/GlobalISel/LegalizerInfo.h"
#include "llvm/CodeGen/GlobalISel/MIPatternMatch.h"
#include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
#include "llvm/CodeGen/GlobalISel/Utils.h"
@@ -137,6 +138,14 @@ static LegalizeMutation moreEltsToNext32Bit(unsigned TypeIdx) {
};
}
+// Retrieves the scalar type that's the same size as the mem desc
+static LegalizeMutation getScalarTypeFromMemDesc(unsigned TypeIdx) {
+ return [=](const LegalityQuery &Query) {
+ unsigned MemSize = Query.MMODescrs[0].MemoryTy.getSizeInBits();
+ return std::make_pair(TypeIdx, LLT::scalar(MemSize));
+ };
+}
+
// Increase the number of vector elements to reach the next legal RegClass.
static LegalizeMutation moreElementsToNextExistingRegClass(unsigned TypeIdx) {
return [=](const LegalityQuery &Query) {
@@ -384,6 +393,16 @@ static LegalityPredicate isWideScalarExtLoadTruncStore(unsigned TypeIdx) {
};
}
+// True for a truncating store or an extending load whose data size is larger
+// than 32 bits and whose memory size is a power of 2.
+static LegalityPredicate isTruncStoreToSizePowerOf2(unsigned TypeIdx) {
+ return [=](const LegalityQuery &Query) {
+ unsigned MemSize = Query.MMODescrs[0].MemoryTy.getSizeInBits();
+ return isWideScalarExtLoadTruncStore(TypeIdx)(Query) &&
+ isPowerOf2_64(MemSize);
+ };
+}
+
// TODO: Should load to s16 be legal? Most loads extend to 32-bits, but we
// handle some operations by just promoting the register during
// selection. There are also d16 loads on GFX9+ which preserve the high bits.
@@ -1635,11 +1654,12 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
// May need relegalization for the scalars.
return std::pair(0, EltTy);
})
- .minScalar(0, S32)
- .narrowScalarIf(isWideScalarExtLoadTruncStore(0), changeTo(0, S32))
- .widenScalarToNextPow2(0)
- .moreElementsIf(vectorSmallerThan(0, 32), moreEltsToNext32Bit(0))
- .lower();
+ .minScalar(0, S32)
+ .narrowScalarIf(isTruncStoreToSizePowerOf2(0),
+ getScalarTypeFromMemDesc(0))
+ .widenScalarToNextPow2(0)
+ .moreElementsIf(vectorSmallerThan(0, 32), moreEltsToNext32Bit(0))
+ .lower();
}
// FIXME: Unaligned accesses not lowered.
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-store-global.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-store-global.mir
index 2b84c6bcba7b5..acbcb098e8367 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-store-global.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-store-global.mir
@@ -886,33 +886,34 @@ body: |
; SI-NEXT: {{ $}}
; SI-NEXT: [[COPY:%[0-9]+]]:_(p1) = COPY $vgpr0_vgpr1
; SI-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY $vgpr2_vgpr3
- ; SI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY1]](s64)
+ ; SI-NEXT: [[COPY2:%[0-9]+]]:_(s64) = COPY [[COPY1]](s64)
; SI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
- ; SI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY1]], [[C]](s32)
+ ; SI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY2]], [[C]](s32)
; SI-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
; SI-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C1]](s64)
- ; SI-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY [[TRUNC]](s32)
+ ; SI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY2]](s64)
+ ; SI-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY [[TRUNC]](s32)
; SI-NEXT: [[C2:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
- ; SI-NEXT: [[LSHR1:%[0-9]+]]:_(s32) = G_LSHR [[COPY2]], [[C2]](s32)
+ ; SI-NEXT: [[LSHR1:%[0-9]+]]:_(s32) = G_LSHR [[COPY3]], [[C2]](s32)
; SI-NEXT: [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
; SI-NEXT: [[PTR_ADD1:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C3]](s64)
; SI-NEXT: [[C4:%[0-9]+]]:_(s32) = G_CONSTANT i32 8
; SI-NEXT: [[C5:%[0-9]+]]:_(s32) = G_CONSTANT i32 65535
- ; SI-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[COPY2]], [[C5]]
+ ; SI-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[COPY3]], [[C5]]
; SI-NEXT: [[LSHR2:%[0-9]+]]:_(s32) = G_LSHR [[AND]], [[C4]](s32)
; SI-NEXT: [[C6:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
; SI-NEXT: [[PTR_ADD2:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C6]](s64)
- ; SI-NEXT: G_STORE [[COPY2]](s32), [[COPY]](p1) :: (store (s8), addrspace 1)
+ ; SI-NEXT: G_STORE [[COPY3]](s32), [[COPY]](p1) :: (store (s8), addrspace 1)
; SI-NEXT: G_STORE [[LSHR2]](s32), [[PTR_ADD2]](p1) :: (store (s8) into unknown-address + 1, addrspace 1)
- ; SI-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY [[C4]](s32)
- ; SI-NEXT: [[LSHR3:%[0-9]+]]:_(s32) = G_LSHR [[LSHR1]], [[COPY3]](s32)
+ ; SI-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY [[C4]](s32)
+ ; SI-NEXT: [[LSHR3:%[0-9]+]]:_(s32) = G_LSHR [[LSHR1]], [[COPY4]](s32)
; SI-NEXT: [[PTR_ADD3:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[PTR_ADD1]], [[C6]](s64)
; SI-NEXT: G_STORE [[LSHR1]](s32), [[PTR_ADD1]](p1) :: (store (s8) into unknown-address + 2, addrspace 1)
; SI-NEXT: G_STORE [[LSHR3]](s32), [[PTR_ADD3]](p1) :: (store (s8) into unknown-address + 3, addrspace 1)
; SI-NEXT: [[TRUNC1:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
- ; SI-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY [[C4]](s32)
+ ; SI-NEXT: [[COPY5:%[0-9]+]]:_(s32) = COPY [[C4]](s32)
; SI-NEXT: [[AND1:%[0-9]+]]:_(s32) = G_AND [[TRUNC1]], [[C5]]
- ; SI-NEXT: [[LSHR4:%[0-9]+]]:_(s32) = G_LSHR [[AND1]], [[COPY4]](s32)
+ ; SI-NEXT: [[LSHR4:%[0-9]+]]:_(s32) = G_LSHR [[AND1]], [[COPY5]](s32)
; SI-NEXT: [[PTR_ADD4:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[PTR_ADD]], [[C6]](s64)
; SI-NEXT: G_STORE [[TRUNC1]](s32), [[PTR_ADD]](p1) :: (store (s8) into unknown-address + 4, addrspace 1)
; SI-NEXT: G_STORE [[LSHR4]](s32), [[PTR_ADD4]](p1) :: (store (s8) into unknown-address + 5, addrspace 1)
@@ -922,11 +923,12 @@ body: |
; CI-NEXT: {{ $}}
; CI-NEXT: [[COPY:%[0-9]+]]:_(p1) = COPY $vgpr0_vgpr1
; CI-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY $vgpr2_vgpr3
- ; CI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY1]](s64)
+ ; CI-NEXT: [[COPY2:%[0-9]+]]:_(s64) = COPY [[COPY1]](s64)
; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
- ; CI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY1]], [[C]](s32)
+ ; CI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY2]], [[C]](s32)
; CI-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
; CI-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C1]](s64)
+ ; CI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY2]](s64)
; CI-NEXT: G_STORE [[TRUNC]](s32), [[COPY]](p1) :: (store (s32), align 1, addrspace 1)
; CI-NEXT: [[TRUNC1:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
; CI-NEXT: G_STORE [[TRUNC1]](s32), [[PTR_ADD]](p1) :: (store (s16) into unknown-address + 4, align 1, addrspace 1)
@@ -936,22 +938,23 @@ body: |
; VI-NEXT: {{ $}}
; VI-NEXT: [[COPY:%[0-9]+]]:_(p1) = COPY $vgpr0_vgpr1
; VI-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY $vgpr2_vgpr3
- ; VI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY1]](s64)
+ ; VI-NEXT: [[COPY2:%[0-9]+]]:_(s64) = COPY [[COPY1]](s64)
; VI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
- ; VI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY1]], [[C]](s32)
+ ; VI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY2]], [[C]](s32)
; VI-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
; VI-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C1]](s64)
- ; VI-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY [[TRUNC]](s32)
+ ; VI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY2]](s64)
+ ; VI-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY [[TRUNC]](s32)
; VI-NEXT: [[C2:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
- ; VI-NEXT: [[LSHR1:%[0-9]+]]:_(s32) = G_LSHR [[COPY2]], [[C2]](s32)
+ ; VI-NEXT: [[LSHR1:%[0-9]+]]:_(s32) = G_LSHR [[COPY3]], [[C2]](s32)
; VI-NEXT: [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
; VI-NEXT: [[PTR_ADD1:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C3]](s64)
- ; VI-NEXT: [[TRUNC1:%[0-9]+]]:_(s16) = G_TRUNC [[COPY1]](s64)
+ ; VI-NEXT: [[TRUNC1:%[0-9]+]]:_(s16) = G_TRUNC [[COPY2]](s64)
; VI-NEXT: [[C4:%[0-9]+]]:_(s16) = G_CONSTANT i16 8
; VI-NEXT: [[LSHR2:%[0-9]+]]:_(s16) = G_LSHR [[TRUNC1]], [[C4]](s16)
; VI-NEXT: [[C5:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
; VI-NEXT: [[PTR_ADD2:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C5]](s64)
- ; VI-NEXT: G_STORE [[COPY2]](s32), [[COPY]](p1) :: (store (s8), addrspace 1)
+ ; VI-NEXT: G_STORE [[COPY3]](s32), [[COPY]](p1) :: (store (s8), addrspace 1)
; VI-NEXT: [[ANYEXT:%[0-9]+]]:_(s32) = G_ANYEXT [[LSHR2]](s16)
; VI-NEXT: G_STORE [[ANYEXT]](s32), [[PTR_ADD2]](p1) :: (store (s8) into unknown-address + 1, addrspace 1)
; VI-NEXT: [[TRUNC2:%[0-9]+]]:_(s16) = G_TRUNC [[LSHR1]](s32)
@@ -960,11 +963,11 @@ body: |
; VI-NEXT: G_STORE [[LSHR1]](s32), [[PTR_ADD1]](p1) :: (store (s8) into unknown-address + 2, addrspace 1)
; VI-NEXT: [[ANYEXT1:%[0-9]+]]:_(s32) = G_ANYEXT [[LSHR3]](s16)
; VI-NEXT: G_STORE [[ANYEXT1]](s32), [[PTR_ADD3]](p1) :: (store (s8) into unknown-address + 3, addrspace 1)
- ; VI-NEXT: [[TRUNC3:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
- ; VI-NEXT: [[TRUNC4:%[0-9]+]]:_(s16) = G_TRUNC [[LSHR]](s64)
- ; VI-NEXT: [[LSHR4:%[0-9]+]]:_(s16) = G_LSHR [[TRUNC4]], [[C4]](s16)
+ ; VI-NEXT: [[TRUNC3:%[0-9]+]]:_(s16) = G_TRUNC [[LSHR]](s64)
+ ; VI-NEXT: [[TRUNC4:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
+ ; VI-NEXT: [[LSHR4:%[0-9]+]]:_(s16) = G_LSHR [[TRUNC3]], [[C4]](s16)
; VI-NEXT: [[PTR_ADD4:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[PTR_ADD]], [[C5]](s64)
- ; VI-NEXT: G_STORE [[TRUNC3]](s32), [[PTR_ADD]](p1) :: (store (s8) into unknown-address + 4, addrspace 1)
+ ; VI-NEXT: G_STORE [[TRUNC4]](s32), [[PTR_ADD]](p1) :: (store (s8) into unknown-address + 4, addrspace 1)
; VI-NEXT: [[ANYEXT2:%[0-9]+]]:_(s32) = G_ANYEXT [[LSHR4]](s16)
; VI-NEXT: G_STORE [[ANYEXT2]](s32), [[PTR_ADD4]](p1) :: (store (s8) into unknown-address + 5, addrspace 1)
;
@@ -973,11 +976,12 @@ body: |
; GFX9-NEXT: {{ $}}
; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p1) = COPY $vgpr0_vgpr1
; GFX9-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY $vgpr2_vgpr3
- ; GFX9-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY1]](s64)
+ ; GFX9-NEXT: [[COPY2:%[0-9]+]]:_(s64) = COPY [[COPY1]](s64)
; GFX9-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
- ; GFX9-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY1]], [[C]](s32)
+ ; GFX9-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY2]], [[C]](s32)
; GFX9-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
; GFX9-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C1]](s64)
+ ; GFX9-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY2]](s64)
; GFX9-NEXT: G_STORE [[TRUNC]](s32), [[COPY]](p1) :: (store (s32), align 1, addrspace 1)
; GFX9-NEXT: [[TRUNC1:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
; GFX9-NEXT: G_STORE [[TRUNC1]](s32), [[PTR_ADD]](p1) :: (store (s16) into unknown-address + 4, align 1, addrspace 1)
@@ -998,17 +1002,18 @@ body: |
; SI-NEXT: {{ $}}
; SI-NEXT: [[COPY:%[0-9]+]]:_(p1) = COPY $vgpr0_vgpr1
; SI-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY $vgpr2_vgpr3
- ; SI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY1]](s64)
+ ; SI-NEXT: [[COPY2:%[0-9]+]]:_(s64) = COPY [[COPY1]](s64)
; SI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
- ; SI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY1]], [[C]](s32)
+ ; SI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY2]], [[C]](s32)
; SI-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
; SI-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C1]](s64)
- ; SI-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY [[TRUNC]](s32)
+ ; SI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY2]](s64)
+ ; SI-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY [[TRUNC]](s32)
; SI-NEXT: [[C2:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
- ; SI-NEXT: [[LSHR1:%[0-9]+]]:_(s32) = G_LSHR [[COPY2]], [[C2]](s32)
+ ; SI-NEXT: [[LSHR1:%[0-9]+]]:_(s32) = G_LSHR [[COPY3]], [[C2]](s32)
; SI-NEXT: [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
; SI-NEXT: [[PTR_ADD1:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C3]](s64)
- ; SI-NEXT: G_STORE [[COPY2]](s32), [[COPY]](p1) :: (store (s16), addrspace 1)
+ ; SI-NEXT: G_STORE [[COPY3]](s32), [[COPY]](p1) :: (store (s16), addrspace 1)
; SI-NEXT: G_STORE [[LSHR1]](s32), [[PTR_ADD1]](p1) :: (store (s16) into unknown-address + 2, addrspace 1)
; SI-NEXT: [[TRUNC1:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
; SI-NEXT: G_STORE [[TRUNC1]](s32), [[PTR_ADD]](p1) :: (store (s16) into unknown-address + 4, addrspace 1)
@@ -1018,11 +1023,12 @@ body: |
; CI-NEXT: {{ $}}
; CI-NEXT: [[COPY:%[0-9]+]]:_(p1) = COPY $vgpr0_vgpr1
; CI-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY $vgpr2_vgpr3
- ; CI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY1]](s64)
+ ; CI-NEXT: [[COPY2:%[0-9]+]]:_(s64) = COPY [[COPY1]](s64)
; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
- ; CI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY1]], [[C]](s32)
+ ; CI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY2]], [[C]](s32)
; CI-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
; CI-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C1]](s64)
+ ; CI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY2]](s64)
; CI-NEXT: G_STORE [[TRUNC]](s32), [[COPY]](p1) :: (store (s32), align 2, addrspace 1)
; CI-NEXT: [[TRUNC1:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
; CI-NEXT: G_STORE [[TRUNC1]](s32), [[PTR_ADD]](p1) :: (store (s16) into unknown-address + 4, addrspace 1)
@@ -1032,17 +1038,18 @@ body: |
; VI-NEXT: {{ $}}
; VI-NEXT: [[COPY:%[0-9]+]]:_(p1) = COPY $vgpr0_vgpr1
; VI-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY $vgpr2_vgpr3
- ; VI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY1]](s64)
+ ; VI-NEXT: [[COPY2:%[0-9]+]]:_(s64) = COPY [[COPY1]](s64)
; VI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
- ; VI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY1]], [[C]](s32)
+ ; VI-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY2]], [[C]](s32)
; VI-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
; VI-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C1]](s64)
- ; VI-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY [[TRUNC]](s32)
+ ; VI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY2]](s64)
+ ; VI-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY [[TRUNC]](s32)
; VI-NEXT: [[C2:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
- ; VI-NEXT: [[LSHR1:%[0-9]+]]:_(s32) = G_LSHR [[COPY2]], [[C2]](s32)
+ ; VI-NEXT: [[LSHR1:%[0-9]+]]:_(s32) = G_LSHR [[COPY3]], [[C2]](s32)
; VI-NEXT: [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
; VI-NEXT: [[PTR_ADD1:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C3]](s64)
- ; VI-NEXT: G_STORE [[COPY2]](s32), [[COPY]](p1) :: (store (s16), addrspace 1)
+ ; VI-NEXT: G_STORE [[COPY3]](s32), [[COPY]](p1) :: (store (s16), addrspace 1)
; VI-NEXT: G_STORE [[LSHR1]](s32), [[PTR_ADD1]](p1) :: (store (s16) into unknown-address + 2, addrspace 1)
; VI-NEXT: [[TRUNC1:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
; VI-NEXT: G_STORE [[TRUNC1]](s32), [[PTR_ADD]](p1) :: (store (s16) into unknown-address + 4, addrspace 1)
@@ -1052,11 +1059,12 @@ body: |
; GFX9-NEXT: {{ $}}
; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p1) = COPY $vgpr0_vgpr1
; GFX9-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY $vgpr2_vgpr3
- ; GFX9-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY1]](s64)
+ ; GFX9-NEXT: [[COPY2:%[0-9]+]]:_(s64) = COPY [[COPY1]](s64)
; GFX9-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
- ; GFX9-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY1]], [[C]](s32)
+ ; GFX9-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY2]], [[C]](s32)
; GFX9-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
; GFX9-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C1]](s64)
+ ; GFX9-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY2]](s64)
; GFX9-NEXT: G_STORE [[TRUNC]](s32), [[COPY]](p1) :: (store (s32), align 2, addrspace 1)
; GFX9-NEXT: [[TRUNC1:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
; GFX9-NEXT: G_STORE [[TRUNC1]](s32), [[PTR_ADD]](p1) :: (store (s16) into unknown-address + 4, addrspace 1)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-store.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-store.mir
index a931c6366c403..7fd23197a5dd6 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-store.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-store.mir
@@ -285,13 +285,13 @@ body: |
; VI-NEXT: {{ $}}
; VI-NEXT: [[COPY:%[0-9]+]]:_(p1) = COPY $vgpr0_vgpr1
; VI-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY $vgpr2_vgpr3
- ; VI-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY1]](s64)
- ; VI-NEXT: [[TRUNC1:%[0-9]+]]:_(s16) = G_TRUNC [[COPY1]](s64)
+ ; VI-NEXT: [[TRUNC:%[0-9]+]]:_(s16) = G_TRUNC [[COPY1]](s64)
+ ; VI-NEXT: [[TRUNC1:%[0-9]+]]:_(s32) = G_TRUNC [[COPY1]](s64)
; VI-NEXT: [[C:%[0-9]+]]:_(s16) = G_CONSTANT i16 8
- ; VI-NEXT: [[LSHR:%[0-9]+]]:_(s16) = G_LSHR [[TRUNC1]], [[C]](s16)
+ ; VI-NEXT: [[LSHR:%[0-9]+]]:_(s16) = G_LSHR [[TRUNC]], [[C]](s16)
; VI-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
; VI-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[COPY]], [[C1]](s64)
- ; VI-NEXT: G_STORE [[TRUNC]](s32), [[COPY]](p1) :: (store (s8), addrspace 1)
+ ; VI-NEXT: G_STORE [[TRUNC1]](s32), [[COPY]](p1) :: (store (s8), addrspace 1)
; VI-NEXT: [[ANYEXT:%[0-9]+]]:_(s32) = G_ANYEXT [[LSHR]](s16)
; VI-NEXT: G_STORE [[ANYEXT]](s32), [[PTR_ADD]](p1) :: (store (s8) into unknown-address + 1, addrspace 1)
%0:_(p1) = COPY $vgpr0_vgpr1
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/store-weird-size.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/store-weird-size.ll
new file mode 100644
index 0000000000000..0aa08cc2b1d6f
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/store-weird-size.ll
@@ -0,0 +1,224 @@
+; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx90a -O0 -global-isel=true -stop-after=legalizer -o - %s | FileCheck -check-prefix=UNPACKED %s
+
+define void @store_i48(ptr addrspace(1) %ptr, i48 %arg) #0 {
+ ; UNPACKED-LABEL: name: store_i48
+ ; UNPACKED: bb.1 (%ir-block.0):
+ ; UNPACKED-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+ ; UNPACKED-NEXT: {{ $}}
+ ; UNPACKED-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+ ; UNPACKED-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+ ; UNPACKED-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY]](s32), [[COPY1]](s32)
+ ; UNPACKED-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr2
+ ; UNPACKED-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr3
+ ; UNPACKED-NEXT: [[MV1:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+ ; UNPACKED-NEXT: [[COPY4:%[0-9]+]]:_(s64) = COPY [[MV1]](s64)
+ ; UNPACKED-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
+ ; UNPACKED-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY4]], [[C]](s32)
+ ; UNPACKED-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
+ ; UNPACKED-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[MV]], [[C1]](s64)
+ ; UNPACKED-NEXT: G_STORE [[COPY2]](s32), [[MV]](p1) :: (store (s32) into %ir.ptr, addrspace 1)
+ ; UNPACKED-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
+ ; UNPACKED-NEXT: G_STORE [[TRUNC]](s32), [[PTR_ADD]](p1) :: (store (s16) into %ir.ptr + 4, align 4, addrspace 1)
+ ; UNPACKED-NEXT: SI_RETURN
+ store i48 %arg, ptr addrspace(1) %ptr, align 4
+ ret void
+}
+
+define void @store_i55(ptr addrspace(1) %ptr, i55 %arg) #0 {
+ ; UNPACKED-LABEL: name: store_i55
+ ; UNPACKED: bb.1 (%ir-block.0):
+ ; UNPACKED-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+ ; UNPACKED-NEXT: {{ $}}
+ ; UNPACKED-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+ ; UNPACKED-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+ ; UNPACKED-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY]](s32), [[COPY1]](s32)
+ ; UNPACKED-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr2
+ ; UNPACKED-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr3
+ ; UNPACKED-NEXT: [[MV1:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+ ; UNPACKED-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36028797018963967
+ ; UNPACKED-NEXT: [[AND:%[0-9]+]]:_(s64) = G_AND [[MV1]], [[C]]
+ ; UNPACKED-NEXT: [[COPY4:%[0-9]+]]:_(s64) = COPY [[AND]](s64)
+ ; UNPACKED-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
+ ; UNPACKED-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY4]], [[C1]](s32)
+ ; UNPACKED-NEXT: [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
+ ; UNPACKED-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[MV]], [[C2]](s64)
+ ; UNPACKED-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[COPY4]](s64)
+ ; UNPACKED-NEXT: G_STORE [[TRUNC]](s32), [[MV]](p1) :: (store (s32) into %ir.ptr, addrspace 1)
+ ; UNPACKED-NEXT: [[TRUNC1:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
+ ; UNPACKED-NEXT: [[C3:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
+ ; UNPACKED-NEXT: [[LSHR1:%[0-9]+]]:_(s32) = G_LSHR [[TRUNC1]], [[C3]](s32)
+ ; UNPACKED-NEXT: [[C4:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
+ ; UNPACKED-NEXT: [[PTR_ADD1:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[PTR_ADD]], [[C4]](s64)
+ ; UNPACKED-NEXT: G_STORE [[TRUNC1]](s32), [[PTR_ADD]](p1) :: (store (s16) into %ir.ptr + 4, align 4, addrspace 1)
+ ; UNPACKED-NEXT: G_STORE [[LSHR1]](s32), [[PTR_ADD1]](p1) :: (store (s8) into %ir.ptr + 6, align 2, basealign 4, addrspace 1)
+ ; UNPACKED-NEXT: SI_RETURN
+ store i55 %arg, ptr addrspace(1) %ptr, align 4
+ ret void
+}
+
+define void @store_i56(ptr addrspace(1) %ptr, i56 %arg) #0 {
+ ; UNPACKED-LABEL: name: store_i56
+ ; UNPACKED: bb.1 (%ir-block.0):
+ ; UNPACKED-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+ ; UNPACKED-NEXT: {{ $}}
+ ; UNPACKED-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+ ; UNPACKED-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+ ; UNPACKED-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY]](s32), [[COPY1]](s32)
+ ; UNPACKED-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr2
+ ; UNPACKED-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr3
+ ; UNPACKED-NEXT: [[MV1:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+ ; UNPACKED-NEXT: [[COPY4:%[0-9]+]]:_(s64) = COPY [[MV1]](s64)
+ ; UNPACKED-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
+ ; UNPACKED-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[COPY4]], [[C]](s32)
+ ; UNPACKED-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
+ ; UNPACKED-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[MV]], [[C1]](s64)
+ ; UNPACKED-NEXT: G_STORE [[COPY2]](s32), [[MV]](p1) :: (store (s32) into %ir.ptr, addrspace 1)
+ ; UNPACKED-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
+ ; UNPACKED-NEXT: [[C2:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
+ ; UNPACKED-NEXT: [[LSHR1:%[0-9]+]]:_(s32) = G_LSHR [[TRUNC]], [[C2]](s32)
+ ; UNPACKED-NEXT: [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
+ ; UNPACKED-NEXT: [[PTR_ADD1:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[PTR_ADD]], [[C3]](s64)
+ ; UNPACKED-NEXT: G_STORE [[TRUNC]](s32), [[PTR_ADD]](p1) :: (store (s16) into %ir.ptr + 4, align 4, addrspace 1)
+ ; UNPACKED-NEXT: G_STORE [[LSHR1]](s32), [[PTR_ADD1]](p1) :: (store (s8) into %ir.ptr + 6, align 2, basealign 4, addrspace 1)
+ ; UNPACKED-NEXT: SI_RETURN
+ store i56 %arg, ptr addrspace(1) %ptr, align 4
+ ret void
+}
+
+define void @store_i65(ptr addrspace(1) %ptr, i65 %arg) #0 {
+ ; UNPACKED-LABEL: name: store_i65
+ ; UNPACKED: bb.1 (%ir-block.0):
+ ; UNPACKED-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4
+ ; UNPACKED-NEXT: {{ $}}
+ ; UNPACKED-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+ ; UNPACKED-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+ ; UNPACKED-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY]](s32), [[COPY1]](s32)
+ ; UNPACKED-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr2
+ ; UNPACKED-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr3
+ ; UNPACKED-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $vgpr4
+ ; UNPACKED-NEXT: [[DEF:%[0-9]+]]:_(s32) = G_IMPLICIT_DEF
+ ; UNPACKED-NEXT: [[MV1:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+ ; UNPACKED-NEXT: [[MV2:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY4]](s32), [[DEF]](s32)
+ ; UNPACKED-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
+ ; UNPACKED-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
+ ; UNPACKED-NEXT: [[AND:%[0-9]+]]:_(s64) = G_AND [[MV1]], [[C]]
+ ; UNPACKED-NEXT: [[AND1:%[0-9]+]]:_(s64) = G_AND [[MV2]], [[C1]]
+ ; UNPACKED-NEXT: [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
+ ; UNPACKED-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[MV]], [[C2]](s64)
+ ; UNPACKED-NEXT: G_STORE [[AND]](s64), [[MV]](p1) :: (store (s64) into %ir.ptr, align 4, addrspace 1)
+ ; UNPACKED-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[AND1]](s64)
+ ; UNPACKED-NEXT: G_STORE [[TRUNC]](s32), [[PTR_ADD]](p1) :: (store (s8) into %ir.ptr + 8, align 4, addrspace 1)
+ ; UNPACKED-NEXT: SI_RETURN
+ store i65 %arg, ptr addrspace(1) %ptr, align 4
+ ret void
+}
+
+define void @store_i95(ptr addrspace(1) %ptr, i95 %arg) #0 {
+ ; UNPACKED-LABEL: name: store_i95
+ ; UNPACKED: bb.1 (%ir-block.0):
+ ; UNPACKED-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4
+ ; UNPACKED-NEXT: {{ $}}
+ ; UNPACKED-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+ ; UNPACKED-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+ ; UNPACKED-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY]](s32), [[COPY1]](s32)
+ ; UNPACKED-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr2
+ ; UNPACKED-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr3
+ ; UNPACKED-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $vgpr4
+ ; UNPACKED-NEXT: [[DEF:%[0-9]+]]:_(s32) = G_IMPLICIT_DEF
+ ; UNPACKED-NEXT: [[MV1:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+ ; UNPACKED-NEXT: [[MV2:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY4]](s32), [[DEF]](s32)
+ ; UNPACKED-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
+ ; UNPACKED-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 2147483647
+ ; UNPACKED-NEXT: [[AND:%[0-9]+]]:_(s64) = G_AND [[MV1]], [[C]]
+ ; UNPACKED-NEXT: [[AND1:%[0-9]+]]:_(s64) = G_AND [[MV2]], [[C1]]
+ ; UNPACKED-NEXT: [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
+ ; UNPACKED-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[MV]], [[C2]](s64)
+ ; UNPACKED-NEXT: G_STORE [[AND]](s64), [[MV]](p1) :: (store (s64) into %ir.ptr, align 4, addrspace 1)
+ ; UNPACKED-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[AND1]](s64)
+ ; UNPACKED-NEXT: G_STORE [[TRUNC]](s32), [[PTR_ADD]](p1) :: (store (s32) into %ir.ptr + 8, addrspace 1)
+ ; UNPACKED-NEXT: SI_RETURN
+ store i95 %arg, ptr addrspace(1) %ptr, align 4
+ ret void
+}
+
+define void @store_i96(ptr addrspace(1) %ptr, i96 %arg) #0 {
+ ; UNPACKED-LABEL: name: store_i96
+ ; UNPACKED: bb.1 (%ir-block.0):
+ ; UNPACKED-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4
+ ; UNPACKED-NEXT: {{ $}}
+ ; UNPACKED-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+ ; UNPACKED-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+ ; UNPACKED-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY]](s32), [[COPY1]](s32)
+ ; UNPACKED-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr2
+ ; UNPACKED-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr3
+ ; UNPACKED-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $vgpr4
+ ; UNPACKED-NEXT: [[MV1:%[0-9]+]]:_(s96) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32), [[COPY4]](s32)
+ ; UNPACKED-NEXT: [[BITCAST:%[0-9]+]]:_(<3 x s32>) = G_BITCAST [[MV1]](s96)
+ ; UNPACKED-NEXT: G_STORE [[BITCAST]](<3 x s32>), [[MV]](p1) :: (store (<3 x s32>) into %ir.ptr, align 4, addrspace 1)
+ ; UNPACKED-NEXT: SI_RETURN
+ store i96 %arg, ptr addrspace(1) %ptr, align 4
+ ret void
+}
+
+define void @store_i97(ptr addrspace(1) %ptr, i97 %arg) #0 {
+ ; UNPACKED-LABEL: name: store_i97
+ ; UNPACKED: bb.1 (%ir-block.0):
+ ; UNPACKED-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
+ ; UNPACKED-NEXT: {{ $}}
+ ; UNPACKED-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+ ; UNPACKED-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+ ; UNPACKED-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY]](s32), [[COPY1]](s32)
+ ; UNPACKED-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr2
+ ; UNPACKED-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr3
+ ; UNPACKED-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $vgpr4
+ ; UNPACKED-NEXT: [[COPY5:%[0-9]+]]:_(s32) = COPY $vgpr5
+ ; UNPACKED-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
+ ; UNPACKED-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8589934591
+ ; UNPACKED-NEXT: [[MV1:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+ ; UNPACKED-NEXT: [[MV2:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY4]](s32), [[COPY5]](s32)
+ ; UNPACKED-NEXT: [[AND:%[0-9]+]]:_(s64) = G_AND [[MV1]], [[C]]
+ ; UNPACKED-NEXT: [[AND1:%[0-9]+]]:_(s64) = G_AND [[MV2]], [[C1]]
+ ; UNPACKED-NEXT: [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
+ ; UNPACKED-NEXT: [[PTR_ADD:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[MV]], [[C2]](s64)
+ ; UNPACKED-NEXT: G_STORE [[AND]](s64), [[MV]](p1) :: (store (s64) into %ir.ptr, align 4, addrspace 1)
+ ; UNPACKED-NEXT: [[C3:%[0-9]+]]:_(s32) = G_CONSTANT i32 32
+ ; UNPACKED-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[AND1]], [[C3]](s32)
+ ; UNPACKED-NEXT: [[C4:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
+ ; UNPACKED-NEXT: [[PTR_ADD1:%[0-9]+]]:_(p1) = nuw inbounds G_PTR_ADD [[PTR_ADD]], [[C4]](s64)
+ ; UNPACKED-NEXT: [[TRUNC:%[0-9]+]]:_(s32) = G_TRUNC [[AND1]](s64)
+ ; UNPACKED-NEXT: G_STORE [[TRUNC]](s32), [[PTR_ADD]](p1) :: (store (s32) into %ir.ptr + 8, addrspace 1)
+ ; UNPACKED-NEXT: [[TRUNC1:%[0-9]+]]:_(s32) = G_TRUNC [[LSHR]](s64)
+ ; UNPACKED-NEXT: G_STORE [[TRUNC1]](s32), [[PTR_ADD1]](p1) :: (store (s8) into %ir.ptr + 12, align 4, addrspace 1)
+ ; UNPACKED-NEXT: SI_RETURN
+ store i97 %arg, ptr addrspace(1) %ptr, align 4
+ ret void
+}
+
+define void @store_i127(ptr addrspace(1) %ptr, i127 %arg) #0 {
+ ; UNPACKED-LABEL: name: store_i127
+ ; UNPACKED: bb.1 (%ir-block.0):
+ ; UNPACKED-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
+ ; UNPACKED-NEXT: {{ $}}
+ ; UNPACKED-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+ ; UNPACKED-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+ ; UNPACKED-NEXT: [[MV:%[0-9]+]]:_(p1) = G_MERGE_VALUES [[COPY]](s32), [[COPY1]](s32)
+ ; UNPACKED-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr2
+ ; UNPACKED-NEXT: [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr3
+ ; UNPACKED-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY $vgpr4
+ ; UNPACKED-NEXT: [[COPY5:%[0-9]+]]:_(s32) = COPY $vgpr5
+ ; UNPACKED-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
+ ; UNPACKED-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 9223372036854775807
+ ; UNPACKED-NEXT: [[MV1:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+ ; UNPACKED-NEXT: [[MV2:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY4]](s32), [[COPY5]](s32)
+ ; UNPACKED-NEXT: [[AND:%[0-9]+]]:_(s64) = G_AND [[MV1]], [[C]]
+ ; UNPACKED-NEXT: [[AND1:%[0-9]+]]:_(s64) = G_AND [[MV2]], [[C1]]
+ ; UNPACKED-NEXT: [[MV3:%[0-9]+]]:_(s128) = G_MERGE_VALUES [[AND]](s64), [[AND1]](s64)
+ ; UNPACKED-NEXT: [[BITCAST:%[0-9]+]]:_(<4 x s32>) = G_BITCAST [[MV3]](s128)
+ ; UNPACKED-NEXT: G_STORE [[BITCAST]](<4 x s32>), [[MV]](p1) :: (store (<4 x s32>) into %ir.ptr, align 4, addrspace 1)
+ ; UNPACKED-NEXT: SI_RETURN
+ store i127 %arg, ptr addrspace(1) %ptr, align 4
+ ret void
+}
+
+attributes #0 = { nounwind }
>From 1c5188692036c51123ae78e9208d5a375d28f74a Mon Sep 17 00:00:00 2001
From: William Tran-Viet <wtranviet at proton.me>
Date: Mon, 18 Aug 2025 11:04:45 -0400
Subject: [PATCH 036/112] [libc++] Implement P3168R2: Give optional range
support (#149441)
Resolves #105430
- Implement all required pieces of P3168R2
- Leverage existing `wrap_iter` and `bounded_iter` classes to implement
the regular and hardened `optional` iterator types, respectively
- Update documentation to match
---
...-hardening-mode-fast-with-abi-breaks.cmake | 1 +
libcxx/docs/FeatureTestMacroTable.rst | 2 +-
libcxx/docs/ReleaseNotes/22.rst | 1 +
libcxx/docs/Status/Cxx2cPapers.csv | 2 +-
libcxx/include/__iterator/wrap_iter.h | 2 +
libcxx/include/optional | 68 +++++++++++++
libcxx/include/version | 2 +-
libcxx/modules/std/optional.inc | 11 ++-
.../iterator.compile.pass.cpp | 30 ++++++
.../optional.version.compile.pass.cpp | 16 +--
.../version.version.compile.pass.cpp | 16 +--
.../optional/optional.iterator/begin.pass.cpp | 64 ++++++++++++
.../optional/optional.iterator/end.pass.cpp | 74 ++++++++++++++
.../optional.iterator/iterator.pass.cpp | 98 +++++++++++++++++++
.../generate_feature_test_macro_components.py | 1 -
15 files changed, 361 insertions(+), 27 deletions(-)
create mode 100644 libcxx/test/libcxx/utilities/optional/optional.iterator/iterator.compile.pass.cpp
create mode 100644 libcxx/test/std/utilities/optional/optional.iterator/begin.pass.cpp
create mode 100644 libcxx/test/std/utilities/optional/optional.iterator/end.pass.cpp
create mode 100644 libcxx/test/std/utilities/optional/optional.iterator/iterator.pass.cpp
diff --git a/libcxx/cmake/caches/Generic-hardening-mode-fast-with-abi-breaks.cmake b/libcxx/cmake/caches/Generic-hardening-mode-fast-with-abi-breaks.cmake
index 699d3f8866861..d4ce32ce5b17f 100644
--- a/libcxx/cmake/caches/Generic-hardening-mode-fast-with-abi-breaks.cmake
+++ b/libcxx/cmake/caches/Generic-hardening-mode-fast-with-abi-breaks.cmake
@@ -5,5 +5,6 @@ set(_defines
_LIBCPP_ABI_BOUNDED_ITERATORS_IN_VECTOR
_LIBCPP_ABI_BOUNDED_UNIQUE_PTR
_LIBCPP_ABI_BOUNDED_ITERATORS_IN_STD_ARRAY
+ _LIBCPP_ABI_BOUNDED_ITERATORS_IN_OPTIONAL
)
set(LIBCXX_ABI_DEFINES "${_defines}" CACHE STRING "")
diff --git a/libcxx/docs/FeatureTestMacroTable.rst b/libcxx/docs/FeatureTestMacroTable.rst
index a36848ebd24b4..358889d8dbc37 100644
--- a/libcxx/docs/FeatureTestMacroTable.rst
+++ b/libcxx/docs/FeatureTestMacroTable.rst
@@ -480,7 +480,7 @@ Status
---------------------------------------------------------- -----------------
``__cpp_lib_not_fn`` ``202306L``
---------------------------------------------------------- -----------------
- ``__cpp_lib_optional_range_support`` *unimplemented*
+ ``__cpp_lib_optional_range_support`` ``202406L``
---------------------------------------------------------- -----------------
``__cpp_lib_out_ptr`` ``202311L``
---------------------------------------------------------- -----------------
diff --git a/libcxx/docs/ReleaseNotes/22.rst b/libcxx/docs/ReleaseNotes/22.rst
index 191dab6b77564..f28babf548fe4 100644
--- a/libcxx/docs/ReleaseNotes/22.rst
+++ b/libcxx/docs/ReleaseNotes/22.rst
@@ -39,6 +39,7 @@ Implemented Papers
------------------
- P2321R2: ``zip`` (`Github <https://github.com/llvm/llvm-project/issues/105169>`__) (The paper is partially implemented. ``zip_transform_view`` is implemented in this release)
+- P3168R2: Give ``std::optional`` Range Support (`Github <https://github.com/llvm/llvm-project/issues/105430>`__)
Improvements and New Features
-----------------------------
diff --git a/libcxx/docs/Status/Cxx2cPapers.csv b/libcxx/docs/Status/Cxx2cPapers.csv
index e8b0c9559f40b..3b8b2b7ad0b3f 100644
--- a/libcxx/docs/Status/Cxx2cPapers.csv
+++ b/libcxx/docs/Status/Cxx2cPapers.csv
@@ -66,7 +66,7 @@
"`P2747R2 <https://wg21.link/P2747R2>`__","``constexpr`` placement new","2024-06 (St. Louis)","|Complete|","20",""
"`P2997R1 <https://wg21.link/P2997R1>`__","Removing the common reference requirement from the indirectly invocable concepts","2024-06 (St. Louis)","|Complete|","19","Implemented as a DR against C++20. (MSVC STL and libstdc++ will do the same.)"
"`P2389R2 <https://wg21.link/P2389R2>`__","``dextents`` Index Type Parameter","2024-06 (St. Louis)","|Complete|","19",""
-"`P3168R2 <https://wg21.link/P3168R2>`__","Give ``std::optional`` Range Support","2024-06 (St. Louis)","","",""
+"`P3168R2 <https://wg21.link/P3168R2>`__","Give ``std::optional`` Range Support","2024-06 (St. Louis)","|Complete|","22",""
"`P3217R0 <https://wg21.link/P3217R0>`__","Adjoints to 'Enabling list-initialization for algorithms': find_last","2024-06 (St. Louis)","","",""
"`P2985R0 <https://wg21.link/P2985R0>`__","A type trait for detecting virtual base classes","2024-06 (St. Louis)","|Complete|","20",""
"`P0843R14 <https://wg21.link/P0843R14>`__","``inplace_vector``","2024-06 (St. Louis)","","",""
diff --git a/libcxx/include/__iterator/wrap_iter.h b/libcxx/include/__iterator/wrap_iter.h
index 2b5bc489dd44c..7610586ddecbb 100644
--- a/libcxx/include/__iterator/wrap_iter.h
+++ b/libcxx/include/__iterator/wrap_iter.h
@@ -117,6 +117,8 @@ class __wrap_iter {
friend class span;
template <class _Tp, size_t _Size>
friend struct array;
+ template <class _Tp>
+ friend class optional;
};
template <class _Iter1>
diff --git a/libcxx/include/optional b/libcxx/include/optional
index e81bff50daad6..39fcaa2c2ec18 100644
--- a/libcxx/include/optional
+++ b/libcxx/include/optional
@@ -20,6 +20,11 @@ namespace std {
template <class T>
class optional;
+ template<class T>
+ constexpr bool ranges::enable_view<optional<T>> = true;
+ template<class T>
+ constexpr auto format_kind<optional<T>> = range_format::disabled;
+
template<class T>
concept is-derived-from-optional = requires(const T& t) { // exposition only
[]<class U>(const optional<U>&){ }(t);
@@ -102,6 +107,8 @@ namespace std {
class optional {
public:
using value_type = T;
+ using iterator = implementation-defined; // see [optional.iterators]
+ using const_iterator = implementation-defined; // see [optional.iterators]
// [optional.ctor], constructors
constexpr optional() noexcept;
@@ -135,6 +142,12 @@ namespace std {
// [optional.swap], swap
void swap(optional &) noexcept(see below ); // constexpr in C++20
+ // [optional.iterators], iterator support
+ constexpr iterator begin() noexcept;
+ constexpr const_iterator begin() const noexcept;
+ constexpr iterator end() noexcept;
+ constexpr const_iterator end() const noexcept;
+
// [optional.observe], observers
constexpr T const *operator->() const noexcept;
constexpr T *operator->() noexcept;
@@ -186,13 +199,18 @@ namespace std {
# include <__compare/three_way_comparable.h>
# include <__concepts/invocable.h>
# include <__config>
+# include <__cstddef/ptrdiff_t.h>
# include <__exception/exception.h>
+# include <__format/range_format.h>
# include <__functional/hash.h>
# include <__functional/invoke.h>
# include <__functional/unary_function.h>
# include <__fwd/functional.h>
+# include <__iterator/bounded_iter.h>
+# include <__iterator/wrap_iter.h>
# include <__memory/addressof.h>
# include <__memory/construct_at.h>
+# include <__ranges/enable_view.h>
# include <__tuple/sfinae_helpers.h>
# include <__type_traits/add_pointer.h>
# include <__type_traits/conditional.h>
@@ -207,6 +225,7 @@ namespace std {
# include <__type_traits/is_convertible.h>
# include <__type_traits/is_core_convertible.h>
# include <__type_traits/is_destructible.h>
+# include <__type_traits/is_function.h>
# include <__type_traits/is_nothrow_assignable.h>
# include <__type_traits/is_nothrow_constructible.h>
# include <__type_traits/is_object.h>
@@ -219,6 +238,7 @@ namespace std {
# include <__type_traits/is_trivially_constructible.h>
# include <__type_traits/is_trivially_destructible.h>
# include <__type_traits/is_trivially_relocatable.h>
+# include <__type_traits/is_unbounded_array.h>
# include <__type_traits/negation.h>
# include <__type_traits/remove_const.h>
# include <__type_traits/remove_cv.h>
@@ -567,6 +587,14 @@ using __optional_sfinae_assign_base_t _LIBCPP_NODEBUG =
template <class _Tp>
class optional;
+# if _LIBCPP_STD_VER >= 26
+template <class _Tp>
+constexpr bool ranges::enable_view<optional<_Tp>> = true;
+
+template <class _Tp>
+constexpr range_format format_kind<optional<_Tp>> = range_format::disabled;
+# endif
+
# if _LIBCPP_STD_VER >= 20
template <class _Tp>
@@ -586,9 +614,21 @@ class _LIBCPP_DECLSPEC_EMPTY_BASES optional
private __optional_sfinae_assign_base_t<_Tp> {
using __base _LIBCPP_NODEBUG = __optional_move_assign_base<_Tp>;
+ using __pointer _LIBCPP_NODEBUG = std::add_pointer_t<_Tp>;
+ using __const_pointer _LIBCPP_NODEBUG = std::add_pointer_t<const _Tp>;
+
public:
using value_type = _Tp;
+# if _LIBCPP_STD_VER >= 26
+# ifdef _LIBCPP_ABI_BOUNDED_ITERATORS_IN_OPTIONAL
+ using iterator = __bounded_iter<__wrap_iter<__pointer>>;
+ using const_iterator = __bounded_iter<__wrap_iter<__const_pointer>>;
+# else
+ using iterator = __wrap_iter<__pointer>;
+ using const_iterator = __wrap_iter<__const_pointer>;
+# endif
+# endif
using __trivially_relocatable _LIBCPP_NODEBUG =
conditional_t<__libcpp_is_trivially_relocatable<_Tp>::value, optional, void>;
using __replaceable _LIBCPP_NODEBUG = conditional_t<__is_replaceable_v<_Tp>, optional, void>;
@@ -792,6 +832,34 @@ public:
}
}
+# if _LIBCPP_STD_VER >= 26
+ // [optional.iterators], iterator support
+ _LIBCPP_HIDE_FROM_ABI constexpr iterator begin() noexcept {
+# ifdef _LIBCPP_ABI_BOUNDED_ITERATORS_IN_OPTIONAL
+ return std::__make_bounded_iter(
+ std::__wrap_iter<__pointer>(std::addressof(this->__get())),
+ std::__wrap_iter<__pointer>(std::addressof(this->__get())),
+ std::__wrap_iter<__pointer>(std::addressof(this->__get()) + (this->has_value() ? 1 : 0)));
+# else
+ return iterator(std::addressof(this->__get()));
+# endif
+ }
+
+ _LIBCPP_HIDE_FROM_ABI constexpr const_iterator begin() const noexcept {
+# ifdef _LIBCPP_ABI_BOUNDED_ITERATORS_IN_OPTIONAL
+ return std::__make_bounded_iter(
+ std::__wrap_iter<__const_pointer>(std::addressof(this->__get())),
+ std::__wrap_iter<__const_pointer>(std::addressof(this->__get())),
+ std::__wrap_iter<__const_pointer>(std::addressof(this->__get()) + (this->has_value() ? 1 : 0)));
+# else
+ return const_iterator(std::addressof(this->__get()));
+# endif
+ }
+
+ _LIBCPP_HIDE_FROM_ABI constexpr iterator end() noexcept { return begin() + (this->has_value() ? 1 : 0); }
+ _LIBCPP_HIDE_FROM_ABI constexpr const_iterator end() const noexcept { return begin() + (this->has_value() ? 1 : 0); }
+# endif
+
_LIBCPP_HIDE_FROM_ABI constexpr add_pointer_t<value_type const> operator->() const noexcept {
_LIBCPP_ASSERT_VALID_ELEMENT_ACCESS(this->has_value(), "optional operator-> called on a disengaged value");
return std::addressof(this->__get());
diff --git a/libcxx/include/version b/libcxx/include/version
index aae9277a7dfc6..16917a3bd9ddd 100644
--- a/libcxx/include/version
+++ b/libcxx/include/version
@@ -585,7 +585,7 @@ __cpp_lib_void_t 201411L <type_traits>
# define __cpp_lib_mdspan 202406L
# undef __cpp_lib_not_fn
# define __cpp_lib_not_fn 202306L
-// # define __cpp_lib_optional_range_support 202406L
+# define __cpp_lib_optional_range_support 202406L
# undef __cpp_lib_out_ptr
# define __cpp_lib_out_ptr 202311L
// # define __cpp_lib_philox_engine 202406L
diff --git a/libcxx/modules/std/optional.inc b/libcxx/modules/std/optional.inc
index 0f812bc0e24a4..9ee51117277ce 100644
--- a/libcxx/modules/std/optional.inc
+++ b/libcxx/modules/std/optional.inc
@@ -10,7 +10,12 @@
export namespace std {
// [optional.optional], class template optional
using std::optional;
-
+#if _LIBCPP_STD_VER >= 26
+ // [optional.iterators], iterator support
+ namespace ranges {
+ using std::ranges::enable_view;
+ }
+#endif
// [optional.nullopt], no-value state indicator
using std::nullopt;
using std::nullopt_t;
@@ -18,6 +23,10 @@ export namespace std {
// [optional.bad.access], class bad_optional_access
using std::bad_optional_access;
+#if _LIBCPP_STD_VER >= 26
+ using std::format_kind;
+#endif
+
// [optional.relops], relational operators
using std::operator==;
using std::operator!=;
diff --git a/libcxx/test/libcxx/utilities/optional/optional.iterator/iterator.compile.pass.cpp b/libcxx/test/libcxx/utilities/optional/optional.iterator/iterator.compile.pass.cpp
new file mode 100644
index 0000000000000..3cdd7553e2e5d
--- /dev/null
+++ b/libcxx/test/libcxx/utilities/optional/optional.iterator/iterator.compile.pass.cpp
@@ -0,0 +1,30 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// REQUIRES: std-at-least-c++26
+
+// <optional>
+
+// template <class T> class optional::iterator;
+// template <class T> class optional::const_iterator;
+
+#include <optional>
+
+template <typename T>
+concept has_iterator_aliases = requires {
+ typename T::iterator;
+ typename T::const_iterator;
+};
+
+static_assert(has_iterator_aliases<std::optional<int>>);
+static_assert(has_iterator_aliases<std::optional<const int>>);
+
+// TODO: Uncomment these once P2988R12 is implemented, as they would be testing optional<T&>
+
+// static_assert(!has_iterator_aliases<std::optional<int (&)[]>>);
+// static_assert(!has_iterator_aliases<std::optional<void (&)(int, char)>>);
diff --git a/libcxx/test/std/language.support/support.limits/support.limits.general/optional.version.compile.pass.cpp b/libcxx/test/std/language.support/support.limits/support.limits.general/optional.version.compile.pass.cpp
index ccdb1a8c11a0b..aca6290f5a4bf 100644
--- a/libcxx/test/std/language.support/support.limits/support.limits.general/optional.version.compile.pass.cpp
+++ b/libcxx/test/std/language.support/support.limits/support.limits.general/optional.version.compile.pass.cpp
@@ -146,17 +146,11 @@
# error "__cpp_lib_optional should have the value 202110L in c++26"
# endif
-# if !defined(_LIBCPP_VERSION)
-# ifndef __cpp_lib_optional_range_support
-# error "__cpp_lib_optional_range_support should be defined in c++26"
-# endif
-# if __cpp_lib_optional_range_support != 202406L
-# error "__cpp_lib_optional_range_support should have the value 202406L in c++26"
-# endif
-# else
-# ifdef __cpp_lib_optional_range_support
-# error "__cpp_lib_optional_range_support should not be defined because it is unimplemented in libc++!"
-# endif
+# ifndef __cpp_lib_optional_range_support
+# error "__cpp_lib_optional_range_support should be defined in c++26"
+# endif
+# if __cpp_lib_optional_range_support != 202406L
+# error "__cpp_lib_optional_range_support should have the value 202406L in c++26"
# endif
#endif // TEST_STD_VER > 23
diff --git a/libcxx/test/std/language.support/support.limits/support.limits.general/version.version.compile.pass.cpp b/libcxx/test/std/language.support/support.limits/support.limits.general/version.version.compile.pass.cpp
index 7bd8e8979e6f3..cde2f258b7732 100644
--- a/libcxx/test/std/language.support/support.limits/support.limits.general/version.version.compile.pass.cpp
+++ b/libcxx/test/std/language.support/support.limits/support.limits.general/version.version.compile.pass.cpp
@@ -7437,17 +7437,11 @@
# error "__cpp_lib_optional should have the value 202110L in c++26"
# endif
-# if !defined(_LIBCPP_VERSION)
-# ifndef __cpp_lib_optional_range_support
-# error "__cpp_lib_optional_range_support should be defined in c++26"
-# endif
-# if __cpp_lib_optional_range_support != 202406L
-# error "__cpp_lib_optional_range_support should have the value 202406L in c++26"
-# endif
-# else
-# ifdef __cpp_lib_optional_range_support
-# error "__cpp_lib_optional_range_support should not be defined because it is unimplemented in libc++!"
-# endif
+# ifndef __cpp_lib_optional_range_support
+# error "__cpp_lib_optional_range_support should be defined in c++26"
+# endif
+# if __cpp_lib_optional_range_support != 202406L
+# error "__cpp_lib_optional_range_support should have the value 202406L in c++26"
# endif
# ifndef __cpp_lib_out_ptr
diff --git a/libcxx/test/std/utilities/optional/optional.iterator/begin.pass.cpp b/libcxx/test/std/utilities/optional/optional.iterator/begin.pass.cpp
new file mode 100644
index 0000000000000..df95a8df3793f
--- /dev/null
+++ b/libcxx/test/std/utilities/optional/optional.iterator/begin.pass.cpp
@@ -0,0 +1,64 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// REQUIRES: std-at-least-c++26
+
+// <optional>
+
+// constexpr iterator optional::begin() noexcept;
+// constexpr const_iterator optional::begin() const noexcept;
+
+#include <cassert>
+#include <iterator>
+#include <optional>
+#include <type_traits>
+#include <utility>
+
+template <typename T>
+constexpr bool test() {
+ std::optional<T> opt{T{}};
+
+ { // begin() is marked noexcept
+ static_assert(noexcept(opt.begin()));
+ static_assert(noexcept(std::as_const(opt).begin()));
+ }
+
+ { // Dereferencing the begin() iterator is the same as indexing the 0th element, and calling begin() again returns the same iterator.
+ auto iter1 = opt.begin();
+ auto iter2 = std::as_const(opt).begin();
+ assert(*iter1 == iter1[0]);
+ assert(*iter2 == iter2[0]);
+ assert(iter1 == opt.begin());
+ assert(iter2 == std::as_const(opt).begin());
+ }
+
+ { // Calling begin() multiple times on a disengaged optional returns the same iterator.
+ std::optional<T> disengaged{std::nullopt};
+ auto iter1 = disengaged.begin();
+ auto iter2 = std::as_const(disengaged).begin();
+ assert(iter1 == disengaged.begin());
+ assert(iter2 == std::as_const(disengaged).begin());
+ }
+
+ return true;
+}
+
+constexpr bool tests() {
+ assert(test<int>());
+ assert(test<char>());
+ assert(test<const int>());
+ assert(test<const char>());
+ return true;
+}
+
+int main(int, char**) {
+ assert(tests());
+ static_assert(tests());
+
+ return 0;
+}
diff --git a/libcxx/test/std/utilities/optional/optional.iterator/end.pass.cpp b/libcxx/test/std/utilities/optional/optional.iterator/end.pass.cpp
new file mode 100644
index 0000000000000..966c3e7441880
--- /dev/null
+++ b/libcxx/test/std/utilities/optional/optional.iterator/end.pass.cpp
@@ -0,0 +1,74 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// REQUIRES: std-at-least-c++26
+
+// <optional>
+
+// constexpr iterator optional::end() noexcept;
+// constexpr const_iterator optional::end() const noexcept;
+
+#include <cassert>
+#include <iterator>
+#include <optional>
+#include <ranges>
+#include <utility>
+
+template <typename T>
+constexpr bool test() {
+ std::optional<T> disengaged{std::nullopt};
+
+ { // end() is marked noexcept
+ static_assert(noexcept(disengaged.end()));
+ static_assert(noexcept(std::as_const(disengaged).end()));
+ }
+
+ { // end() == begin() and end() == end() if the optional is disengaged
+ auto it = disengaged.end();
+ auto it2 = std::as_const(disengaged).end();
+
+ assert(it == disengaged.begin());
+ assert(disengaged.begin() == it);
+ assert(it == disengaged.end());
+
+ assert(it2 == std::as_const(disengaged).begin());
+ assert(std::as_const(disengaged).begin() == it2);
+ assert(it2 == std::as_const(disengaged).end());
+ }
+
+ std::optional<T> engaged{T{}};
+
+ { // end() != begin() if the optional is engaged
+ auto it = engaged.end();
+ auto it2 = std::as_const(engaged).end();
+
+ assert(it != engaged.begin());
+ assert(engaged.begin() != it);
+
+ assert(it2 != std::as_const(engaged).begin());
+ assert(std::as_const(engaged).begin() != it2);
+ }
+
+ return true;
+}
+
+constexpr bool tests() {
+ assert(test<int>());
+ assert(test<char>());
+ assert(test<const int>());
+ assert(test<const char>());
+
+ return true;
+}
+
+int main(int, char**) {
+ assert(tests());
+ static_assert(tests());
+
+ return 0;
+}
diff --git a/libcxx/test/std/utilities/optional/optional.iterator/iterator.pass.cpp b/libcxx/test/std/utilities/optional/optional.iterator/iterator.pass.cpp
new file mode 100644
index 0000000000000..1203290a0290a
--- /dev/null
+++ b/libcxx/test/std/utilities/optional/optional.iterator/iterator.pass.cpp
@@ -0,0 +1,98 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// REQUIRES: std-at-least-c++26
+
+// <optional>
+
+// template <class T> class optional::iterator;
+// template <class T> class optional::const_iterator;
+
+#include <cassert>
+#include <iterator>
+#include <optional>
+#include <ranges>
+#include <type_traits>
+#include <utility>
+
+template <typename T, T __val>
+constexpr bool test() {
+ std::optional<T> opt{__val};
+
+ { // Dereferencing an iterator of an engaged optional will return the same value that the optional holds.
+ auto it = opt.begin();
+ auto it2 = std::as_const(opt).begin();
+ assert(*it == *opt);
+ assert(*it2 == *std::as_const(opt));
+ }
+
+ { // optional::iterator and optional::const_iterator satisfy the Cpp17RandomAccessIterator and contiguous iterator.
+ auto it = opt.begin();
+ auto it2 = std::as_const(opt).begin();
+ assert(std::contiguous_iterator<decltype(it)>);
+ assert(std::contiguous_iterator<decltype(it2)>);
+
+ assert(std::random_access_iterator<decltype(it)>);
+ assert(std::random_access_iterator<decltype(it2)>);
+ }
+
+ { // const_iterator::value_type == std::remove_cv_t<T>, const_iterator::reference == const T&, iterator::value_type = std::remove_cv_t<T>, iterator::reference == T&
+ auto it = opt.begin();
+ auto it2 = std::as_const(opt).begin();
+ assert((std::is_same_v<typename decltype(it)::value_type, std::remove_cv_t<T>>));
+ assert((std::is_same_v<typename decltype(it)::reference, T&>));
+ assert((std::is_same_v<typename decltype(it2)::value_type, std::remove_cv_t<T>>));
+ assert((std::is_same_v<typename decltype(it2)::reference, const T&>));
+ }
+
+ { // std::ranges::size for an engaged optional<T> == 1, disengaged optional<T> == 0
+ const std::optional<T> disengaged{std::nullopt};
+ std::optional<T> disengaged2{std::nullopt};
+ assert(std::ranges::size(opt) == 1);
+ assert(std::ranges::size(std::as_const(opt)) == 1);
+
+ assert(std::ranges::size(disengaged) == 0);
+ assert(std::ranges::size(disengaged2) == 0);
+ }
+
+ { // std::ranges::enable_view<optional<T>> == true, and std::format_kind<optional<T>> == std::range_format::disabled
+ static_assert(std::ranges::enable_view<std::optional<T>> == true);
+ static_assert(std::format_kind<std::optional<T>> == std::range_format::disabled);
+ }
+
+ // An optional with value that is reset will have a begin() == end(), then when it is reassigned a value,
+ // begin() != end(), and *begin() will contain the new value.
+ {
+ std::optional<T> val{__val};
+ assert(val.begin() != val.end());
+ val.reset();
+ assert(val.begin() == val.end());
+ val.emplace(__val);
+ assert(val.begin() != val.end());
+ assert(*(val.begin()) == __val);
+ }
+
+ return true;
+}
+
+constexpr bool tests() {
+ assert((test<int, 1>()));
+ assert((test<char, 'a'>()));
+ assert((test<bool, true>()));
+ assert((test<const int, 2>()));
+ assert((test<const char, 'b'>()));
+
+ return true;
+}
+
+int main(int, char**) {
+ assert(tests());
+ static_assert(tests());
+
+ return 0;
+}
diff --git a/libcxx/utils/generate_feature_test_macro_components.py b/libcxx/utils/generate_feature_test_macro_components.py
index d9317e00e3f4a..8d57a07b8836b 100644
--- a/libcxx/utils/generate_feature_test_macro_components.py
+++ b/libcxx/utils/generate_feature_test_macro_components.py
@@ -1012,7 +1012,6 @@ def add_version_header(tc):
"name": "__cpp_lib_optional_range_support",
"values": {"c++26": 202406}, # P3168R2 Give std::optional Range Support
"headers": ["optional"],
- "unimplemented": True,
},
{
"name": "__cpp_lib_out_ptr",
>From 08a140add86081932515188bd9120fd5e69f3ac3 Mon Sep 17 00:00:00 2001
From: AZero13 <gfunni234 at gmail.com>
Date: Mon, 18 Aug 2025 11:12:07 -0400
Subject: [PATCH 037/112] [AArch64] Fix build-bot assertion error in AArch64
(#154124)
Fixes a build-bot assertion failure.
I forgot to include the logic, to be added in a future PR, that handles
C == -1 correctly. For now, skip the adjustment in that case, as we used to.
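
To see why the all-ones constant needs special handling: rewriting
`x ule c` as `x ult c + 1` only holds while `c + 1` does not wrap. The
sketch below is a standalone model of that guard, not the LLVM code.
`adjustUleConstant` is a hypothetical name, and the `size` parameter
stands in for the comparison's bit width.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper: returns the constant for the rewritten comparison
// `x < c + 1`, or std::nullopt when c is the largest unsigned value for the
// given width, since c + 1 would wrap to 0 and change the result.
inline std::optional<uint64_t> adjustUleConstant(uint64_t c, unsigned size) {
  if ((size == 32 && static_cast<uint32_t>(c) == UINT32_MAX) ||
      (size == 64 && c == UINT64_MAX))
    return std::nullopt;
  return c + 1;
}
```

For the maximum value, `x ule c` is always true while `x ult 0` is never
true, so declining the rewrite is the conservative choice.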
---
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | 13 +++++++------
.../AArch64/GISel/AArch64PostLegalizerLowering.cpp | 6 ++++--
2 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index c27bf82157393..63a85faf344c4 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -4035,12 +4035,13 @@ static SDValue getAArch64Cmp(SDValue LHS, SDValue RHS, ISD::CondCode CC,
break;
case ISD::SETULE:
case ISD::SETUGT: {
- assert(!C.isAllOnes() && "C should not be -1 here");
- APInt CPlusOne = C + 1;
- if (isLegalCmpImmed(CPlusOne) ||
- (NumImmForC > numberOfInstrToLoadImm(CPlusOne))) {
- CC = (CC == ISD::SETULE) ? ISD::SETULT : ISD::SETUGE;
- RHS = DAG.getConstant(CPlusOne, DL, VT);
+ if (!C.isAllOnes()) {
+ APInt CPlusOne = C + 1;
+ if (isLegalCmpImmed(CPlusOne) ||
+ (NumImmForC > numberOfInstrToLoadImm(CPlusOne))) {
+ CC = (CC == ISD::SETULE) ? ISD::SETULT : ISD::SETUGE;
+ RHS = DAG.getConstant(CPlusOne, DL, VT);
+ }
}
break;
}
diff --git a/llvm/lib/Target/AArch64/GISel/AArch64PostLegalizerLowering.cpp b/llvm/lib/Target/AArch64/GISel/AArch64PostLegalizerLowering.cpp
index 2abe0dd0bbdc2..6025f1c9f5f4e 100644
--- a/llvm/lib/Target/AArch64/GISel/AArch64PostLegalizerLowering.cpp
+++ b/llvm/lib/Target/AArch64/GISel/AArch64PostLegalizerLowering.cpp
@@ -639,8 +639,10 @@ tryAdjustICmpImmAndPred(Register RHS, CmpInst::Predicate P,
// x ule c => x ult c + 1
// x ugt c => s uge c + 1
//
- assert(C != (Size == 32 ? UINT32_MAX : UINT64_MAX) &&
- "C should not be -1 here!");
+ // When c is not the largest possible unsigned integer.
+ if ((Size == 32 && static_cast<uint32_t>(C) == UINT32_MAX) ||
+ (Size == 64 && C == UINT64_MAX))
+ return std::nullopt;
P = (P == CmpInst::ICMP_ULE) ? CmpInst::ICMP_ULT : CmpInst::ICMP_UGE;
C += 1;
break;
>From 6ce13ae1c20515e7c4554cde028e3a0990786075 Mon Sep 17 00:00:00 2001
From: Timm Baeder <tbaeder at redhat.com>
Date: Mon, 18 Aug 2025 17:15:31 +0200
Subject: [PATCH 038/112] [clang][bytecode] Always track item types in
InterpStack (#151088)
This has been a long-standing problem: we never called the destructors
of items on the stack unless we explicitly `pop()`ed or `discard()`ed
them.
When interpretation was interrupted midway (because something failed),
we left `Pointer`s on the stack. Since all `Block`s track what
`Pointer`s point to them (via a doubly-linked list in the `Pointer`),
that meant we potentially left deallocated pointers in that list. We
used to work around this by removing the `Pointer` from the list before
deallocating the block.
However, we now want to track pointers to global blocks as well, which
poses a problem since the blocks are never deallocated and thus those
pointers are always left dangling.
I've tried a few different approaches to fixing this but in the end I
just gave up on the idea of never knowing what items are in the stack.
We already have an `ItemTypes` vector that we use for debugging
assertions. This patch simply enables this vector unconditionally and
uses it in the abort case to properly `discard()` all elements from the
stack. That's a little sad IMO but I don't know of another way of
solving this problem.
As expected, this is a slight hit to compile times:
https://llvm-compile-time-tracker.com/compare.php?from=574d0a92060bf4808776b7a0239ffe91a092b15d&to=0317105f559093cfb909bfb01857a6b837991940&stat=instructions:u
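
The idea can be sketched in isolation, assuming a much simpler stack
than `InterpStack`: items live in untyped storage, and a parallel vector
of type tags lets the abort path run the right destructor for each one.
`TypedStack`, `ItemType` and the two supported element types are
hypothetical and chosen only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <utility>
#include <vector>

enum class ItemType : uint8_t { Int, Ptr };

// Hypothetical miniature of a stack that always records item types.
class TypedStack {
public:
  TypedStack() = default;
  TypedStack(const TypedStack&) = delete;
  TypedStack& operator=(const TypedStack&) = delete;
  ~TypedStack() { clear(); }

  void push(int v) { emplace<int>(ItemType::Int, v); }
  void push(std::shared_ptr<int> p) {
    emplace<std::shared_ptr<int>>(ItemType::Ptr, std::move(p));
  }

  // Destroys the top item; the caller must name its static type.
  template <class T> void discard() {
    assert(!types_.empty());
    top_ -= slotSize<T>();
    std::launder(reinterpret_cast<T*>(buf_ + top_))->~T();
    types_.pop_back();
  }

  // Abort path: use the recorded tags, top down, to discard every item.
  void clear() {
    while (!types_.empty()) {
      switch (types_.back()) {
      case ItemType::Int: discard<int>(); break;
      case ItemType::Ptr: discard<std::shared_ptr<int>>(); break;
      }
    }
  }

  bool empty() const { return top_ == 0; }

private:
  // Round each slot up so every item starts at a maximally aligned offset.
  template <class T> static constexpr std::size_t slotSize() {
    constexpr std::size_t A = alignof(std::max_align_t);
    return (sizeof(T) + A - 1) / A * A;
  }

  template <class T, class... Args> void emplace(ItemType tag, Args&&... args) {
    assert(top_ + slotSize<T>() <= sizeof(buf_));
    ::new (static_cast<void*>(buf_ + top_)) T(std::forward<Args>(args)...);
    top_ += slotSize<T>();
    types_.push_back(tag);
  }

  alignas(std::max_align_t) unsigned char buf_[256];
  std::size_t top_ = 0;
  std::vector<ItemType> types_;
};
```

Without the tag vector, `clear()` could only reset `top_`, and the
`shared_ptr` destructor (standing in for `Pointer` here) would never run.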
---
clang/lib/AST/ByteCode/Compiler.cpp | 25 ++++++-----
clang/lib/AST/ByteCode/Context.cpp | 14 ++-----
clang/lib/AST/ByteCode/Context.h | 2 +-
clang/lib/AST/ByteCode/Descriptor.h | 2 +-
clang/lib/AST/ByteCode/Function.h | 2 +-
clang/lib/AST/ByteCode/InterpBlock.cpp | 12 ------
clang/lib/AST/ByteCode/InterpBlock.h | 2 +-
clang/lib/AST/ByteCode/InterpStack.cpp | 57 +++++++++-----------------
clang/lib/AST/ByteCode/InterpStack.h | 24 ++++-------
clang/lib/AST/ByteCode/InterpState.cpp | 12 +++---
clang/lib/AST/ByteCode/Pointer.h | 2 +-
clang/lib/AST/ByteCode/PrimType.h | 7 ++--
clang/lib/AST/ByteCode/Program.h | 4 ++
clang/test/AST/ByteCode/c.c | 9 ++++
14 files changed, 76 insertions(+), 98 deletions(-)
diff --git a/clang/lib/AST/ByteCode/Compiler.cpp b/clang/lib/AST/ByteCode/Compiler.cpp
index 5c416474d3bcf..f2ce69a62838e 100644
--- a/clang/lib/AST/ByteCode/Compiler.cpp
+++ b/clang/lib/AST/ByteCode/Compiler.cpp
@@ -2980,20 +2980,25 @@ bool Compiler<Emitter>::VisitCompoundLiteralExpr(const CompoundLiteralExpr *E) {
if (T && !E->isLValue())
return this->delegate(Init);
- if (std::optional<unsigned> GlobalIndex = P.createGlobal(E)) {
- if (!this->emitGetPtrGlobal(*GlobalIndex, E))
- return false;
+ std::optional<unsigned> GlobalIndex = P.createGlobal(E);
+ if (!GlobalIndex)
+ return false;
- if (T) {
- if (!this->visit(Init))
- return false;
- return this->emitInitGlobal(*T, *GlobalIndex, E);
- }
+ if (!this->emitGetPtrGlobal(*GlobalIndex, E))
+ return false;
+
+ // Since this is a global variable, we might have already seen and
+ // initialized it; don't initialize it again.
+ if (P.isGlobalInitialized(*GlobalIndex))
+ return true;
- return this->visitInitializer(Init) && this->emitFinishInit(E);
+ if (T) {
+ if (!this->visit(Init))
+ return false;
+ return this->emitInitGlobal(*T, *GlobalIndex, E);
}
- return false;
+ return this->visitInitializer(Init) && this->emitFinishInit(E);
}
// Otherwise, use a local variable.
diff --git a/clang/lib/AST/ByteCode/Context.cpp b/clang/lib/AST/ByteCode/Context.cpp
index 6343b2af313f1..36eb7607e70bf 100644
--- a/clang/lib/AST/ByteCode/Context.cpp
+++ b/clang/lib/AST/ByteCode/Context.cpp
@@ -398,17 +398,11 @@ const llvm::fltSemantics &Context::getFloatSemantics(QualType T) const {
}
bool Context::Run(State &Parent, const Function *Func) {
-
- {
- InterpState State(Parent, *P, Stk, *this, Func);
- if (Interpret(State)) {
- assert(Stk.empty());
- return true;
- }
- // State gets destroyed here, so the Stk.clear() below doesn't accidentally
- // remove values the State's destructor might access.
+ InterpState State(Parent, *P, Stk, *this, Func);
+ if (Interpret(State)) {
+ assert(Stk.empty());
+ return true;
}
-
Stk.clear();
return false;
}
diff --git a/clang/lib/AST/ByteCode/Context.h b/clang/lib/AST/ByteCode/Context.h
index a6d90bb385067..fa98498dbe8fa 100644
--- a/clang/lib/AST/ByteCode/Context.h
+++ b/clang/lib/AST/ByteCode/Context.h
@@ -30,7 +30,7 @@ namespace interp {
class Function;
class Program;
class State;
-enum PrimType : unsigned;
+enum PrimType : uint8_t;
struct ParamOffset {
unsigned Offset;
diff --git a/clang/lib/AST/ByteCode/Descriptor.h b/clang/lib/AST/ByteCode/Descriptor.h
index 4a808c0a2d216..90dc2b4aa3111 100644
--- a/clang/lib/AST/ByteCode/Descriptor.h
+++ b/clang/lib/AST/ByteCode/Descriptor.h
@@ -24,7 +24,7 @@ class Record;
class SourceInfo;
struct InitMap;
struct Descriptor;
-enum PrimType : unsigned;
+enum PrimType : uint8_t;
using DeclTy = llvm::PointerUnion<const Decl *, const Expr *>;
using InitMapPtr = std::optional<std::pair<bool, std::shared_ptr<InitMap>>>;
diff --git a/clang/lib/AST/ByteCode/Function.h b/clang/lib/AST/ByteCode/Function.h
index 92363b62c85d4..af429b7849e88 100644
--- a/clang/lib/AST/ByteCode/Function.h
+++ b/clang/lib/AST/ByteCode/Function.h
@@ -28,7 +28,7 @@ namespace interp {
class Program;
class ByteCodeEmitter;
class Pointer;
-enum PrimType : uint32_t;
+enum PrimType : uint8_t;
/// Describes a scope block.
///
diff --git a/clang/lib/AST/ByteCode/InterpBlock.cpp b/clang/lib/AST/ByteCode/InterpBlock.cpp
index 69221d85d6715..b7fd324594c82 100644
--- a/clang/lib/AST/ByteCode/InterpBlock.cpp
+++ b/clang/lib/AST/ByteCode/InterpBlock.cpp
@@ -18,10 +18,6 @@ using namespace clang::interp;
void Block::addPointer(Pointer *P) {
assert(P);
- if (IsStatic) {
- assert(!Pointers);
- return;
- }
#ifndef NDEBUG
assert(!hasPointer(P));
@@ -39,10 +35,6 @@ void Block::addPointer(Pointer *P) {
void Block::removePointer(Pointer *P) {
assert(P->isBlockPointer());
assert(P);
- if (IsStatic) {
- assert(!Pointers);
- return;
- }
#ifndef NDEBUG
assert(hasPointer(P));
@@ -74,10 +66,6 @@ void Block::replacePointer(Pointer *Old, Pointer *New) {
assert(New);
assert(New->isBlockPointer());
assert(Old != New);
- if (IsStatic) {
- assert(!Pointers);
- return;
- }
#ifndef NDEBUG
assert(hasPointer(Old));
#endif
diff --git a/clang/lib/AST/ByteCode/InterpBlock.h b/clang/lib/AST/ByteCode/InterpBlock.h
index 8f30a6ece74ee..778ac8fdb085c 100644
--- a/clang/lib/AST/ByteCode/InterpBlock.h
+++ b/clang/lib/AST/ByteCode/InterpBlock.h
@@ -22,7 +22,7 @@ class Block;
class DeadBlock;
class InterpState;
class Pointer;
-enum PrimType : unsigned;
+enum PrimType : uint8_t;
/// A memory block, either on the stack or in the heap.
///
diff --git a/clang/lib/AST/ByteCode/InterpStack.cpp b/clang/lib/AST/ByteCode/InterpStack.cpp
index 6b748d62b83bd..7920378f365f9 100644
--- a/clang/lib/AST/ByteCode/InterpStack.cpp
+++ b/clang/lib/AST/ByteCode/InterpStack.cpp
@@ -26,33 +26,33 @@ InterpStack::~InterpStack() {
std::free(Chunk);
Chunk = nullptr;
StackSize = 0;
-#ifndef NDEBUG
ItemTypes.clear();
-#endif
}
// We keep the last chunk around to reuse.
void InterpStack::clear() {
- if (!Chunk)
- return;
-
- if (Chunk->Next)
- std::free(Chunk->Next);
-
- assert(Chunk);
- StackSize = 0;
-#ifndef NDEBUG
- ItemTypes.clear();
-#endif
+ for (PrimType Item : llvm::reverse(ItemTypes)) {
+ TYPE_SWITCH(Item, { this->discard<T>(); });
+ }
+ assert(ItemTypes.empty());
+ assert(empty());
}
void InterpStack::clearTo(size_t NewSize) {
- assert(NewSize <= size());
- size_t ToShrink = size() - NewSize;
- if (ToShrink == 0)
+ if (NewSize == 0)
+ return clear();
+ if (NewSize == size())
return;
- shrink(ToShrink);
+ assert(NewSize <= size());
+ for (PrimType Item : llvm::reverse(ItemTypes)) {
+ TYPE_SWITCH(Item, { this->discard<T>(); });
+
+ if (size() == NewSize)
+ break;
+ }
+
+ // Note: discard() above already removed the types from ItemTypes.
assert(size() == NewSize);
}
@@ -105,25 +105,9 @@ void InterpStack::shrink(size_t Size) {
Chunk->End -= Size;
StackSize -= Size;
-
-#ifndef NDEBUG
- size_t TypesSize = 0;
- for (PrimType T : ItemTypes)
- TYPE_SWITCH(T, { TypesSize += aligned_size<T>(); });
-
- size_t StackSize = size();
- while (TypesSize > StackSize) {
- TYPE_SWITCH(ItemTypes.back(), {
- TypesSize -= aligned_size<T>();
- ItemTypes.pop_back();
- });
- }
- assert(TypesSize == StackSize);
-#endif
}
void InterpStack::dump() const {
-#ifndef NDEBUG
llvm::errs() << "Items: " << ItemTypes.size() << ". Size: " << size() << '\n';
if (ItemTypes.empty())
return;
@@ -133,11 +117,11 @@ void InterpStack::dump() const {
// The type of the item on the top of the stack is inserted to the back
// of the vector, so the iteration has to happen backwards.
- for (auto TyIt = ItemTypes.rbegin(); TyIt != ItemTypes.rend(); ++TyIt) {
- Offset += align(primSize(*TyIt));
+ for (PrimType Item : llvm::reverse(ItemTypes)) {
+ Offset += align(primSize(Item));
llvm::errs() << Index << '/' << Offset << ": ";
- TYPE_SWITCH(*TyIt, {
+ TYPE_SWITCH(Item, {
const T &V = peek<T>(Offset);
llvm::errs() << V;
});
@@ -145,5 +129,4 @@ void InterpStack::dump() const {
++Index;
}
-#endif
}
diff --git a/clang/lib/AST/ByteCode/InterpStack.h b/clang/lib/AST/ByteCode/InterpStack.h
index 580494eb2347c..b0f9f6e225682 100644
--- a/clang/lib/AST/ByteCode/InterpStack.h
+++ b/clang/lib/AST/ByteCode/InterpStack.h
@@ -17,7 +17,6 @@
#include "IntegralAP.h"
#include "MemberPointer.h"
#include "PrimType.h"
-#include <vector>
namespace clang {
namespace interp {
@@ -33,18 +32,14 @@ class InterpStack final {
/// Constructs a value in place on the top of the stack.
template <typename T, typename... Tys> void push(Tys &&...Args) {
new (grow(aligned_size<T>())) T(std::forward<Tys>(Args)...);
-#ifndef NDEBUG
ItemTypes.push_back(toPrimType<T>());
-#endif
}
/// Returns the value from the top of the stack and removes it.
template <typename T> T pop() {
-#ifndef NDEBUG
assert(!ItemTypes.empty());
assert(ItemTypes.back() == toPrimType<T>());
ItemTypes.pop_back();
-#endif
T *Ptr = &peekInternal<T>();
T Value = std::move(*Ptr);
shrink(aligned_size<T>());
@@ -53,22 +48,20 @@ class InterpStack final {
/// Discards the top value from the stack.
template <typename T> void discard() {
-#ifndef NDEBUG
assert(!ItemTypes.empty());
assert(ItemTypes.back() == toPrimType<T>());
ItemTypes.pop_back();
-#endif
T *Ptr = &peekInternal<T>();
- Ptr->~T();
+ if constexpr (!std::is_trivially_destructible_v<T>) {
+ Ptr->~T();
+ }
shrink(aligned_size<T>());
}
/// Returns a reference to the value on the top of the stack.
template <typename T> T &peek() const {
-#ifndef NDEBUG
assert(!ItemTypes.empty());
assert(ItemTypes.back() == toPrimType<T>());
-#endif
return peekInternal<T>();
}
@@ -83,7 +76,7 @@ class InterpStack final {
/// Returns the size of the stack in bytes.
size_t size() const { return StackSize; }
- /// Clears the stack without calling any destructors.
+ /// Clears the stack.
void clear();
void clearTo(size_t NewSize);
@@ -146,9 +139,11 @@ class InterpStack final {
/// Total size of the stack.
size_t StackSize = 0;
-#ifndef NDEBUG
- /// vector recording the type of data we pushed into the stack.
- std::vector<PrimType> ItemTypes;
+ /// SmallVector recording the type of data we pushed into the stack.
+ /// We don't usually need this during normal code interpretation but
+ /// when aborting, we need type information to call the destructors
+ /// for what's left on the stack.
+ llvm::SmallVector<PrimType> ItemTypes;
template <typename T> static constexpr PrimType toPrimType() {
if constexpr (std::is_same_v<T, Pointer>)
@@ -192,7 +187,6 @@ class InterpStack final {
llvm_unreachable("unknown type push()'ed into InterpStack");
}
-#endif
};
} // namespace interp
diff --git a/clang/lib/AST/ByteCode/InterpState.cpp b/clang/lib/AST/ByteCode/InterpState.cpp
index b5f0f9a44f344..f89967759ff9b 100644
--- a/clang/lib/AST/ByteCode/InterpState.cpp
+++ b/clang/lib/AST/ByteCode/InterpState.cpp
@@ -45,6 +45,12 @@ InterpState::~InterpState() {
while (DeadBlocks) {
DeadBlock *Next = DeadBlocks->Next;
+
+ // There might be a pointer in a global structure pointing to the dead
+ // block.
+ for (Pointer *P = DeadBlocks->B.Pointers; P; P = P->asBlockPointer().Next)
+ DeadBlocks->B.removePointer(P);
+
std::free(DeadBlocks);
DeadBlocks = Next;
}
@@ -53,12 +59,6 @@ InterpState::~InterpState() {
void InterpState::cleanup() {
// As a last resort, make sure all pointers still pointing to a dead block
// don't point to it anymore.
- for (DeadBlock *DB = DeadBlocks; DB; DB = DB->Next) {
- for (Pointer *P = DB->B.Pointers; P; P = P->asBlockPointer().Next) {
- P->PointeeStorage.BS.Pointee = nullptr;
- }
- }
-
Alloc.cleanup();
}
diff --git a/clang/lib/AST/ByteCode/Pointer.h b/clang/lib/AST/ByteCode/Pointer.h
index 1f6f1cbce5391..1dcdc0424801d 100644
--- a/clang/lib/AST/ByteCode/Pointer.h
+++ b/clang/lib/AST/ByteCode/Pointer.h
@@ -29,7 +29,7 @@ class DeadBlock;
class Pointer;
class Context;
template <unsigned A, bool B> class Integral;
-enum PrimType : unsigned;
+enum PrimType : uint8_t;
class Pointer;
inline llvm::raw_ostream &operator<<(llvm::raw_ostream &OS, const Pointer &P);
diff --git a/clang/lib/AST/ByteCode/PrimType.h b/clang/lib/AST/ByteCode/PrimType.h
index 724da93ca1ef6..093084a8aad7b 100644
--- a/clang/lib/AST/ByteCode/PrimType.h
+++ b/clang/lib/AST/ByteCode/PrimType.h
@@ -31,7 +31,7 @@ template <bool Signed> class IntegralAP;
template <unsigned Bits, bool Signed> class Integral;
/// Enumeration of the primitive types of the VM.
-enum PrimType : unsigned {
+enum PrimType : uint8_t {
PT_Sint8 = 0,
PT_Uint8 = 1,
PT_Sint16 = 2,
@@ -51,14 +51,15 @@ enum PrimType : unsigned {
// Like std::optional<PrimType>, but only sizeof(PrimType).
class OptPrimType final {
- unsigned V = ~0u;
+ static constexpr uint8_t None = 0xFF;
+ uint8_t V = None;
public:
OptPrimType() = default;
OptPrimType(std::nullopt_t) {}
OptPrimType(PrimType T) : V(static_cast<unsigned>(T)) {}
- explicit constexpr operator bool() const { return V != ~0u; }
+ explicit constexpr operator bool() const { return V != None; }
PrimType operator*() const {
assert(operator bool());
return static_cast<PrimType>(V);
diff --git a/clang/lib/AST/ByteCode/Program.h b/clang/lib/AST/ByteCode/Program.h
index b63a70ed8113a..9c4e63a14d448 100644
--- a/clang/lib/AST/ByteCode/Program.h
+++ b/clang/lib/AST/ByteCode/Program.h
@@ -73,6 +73,10 @@ class Program final {
return Globals[Idx]->block();
}
+ bool isGlobalInitialized(unsigned Index) const {
+ return getPtrGlobal(Index).isInitialized();
+ }
+
/// Finds a global's index.
std::optional<unsigned> getGlobal(const ValueDecl *VD);
std::optional<unsigned> getGlobal(const Expr *E);
diff --git a/clang/test/AST/ByteCode/c.c b/clang/test/AST/ByteCode/c.c
index a7b1fe07f6d84..654b3da2b7d66 100644
--- a/clang/test/AST/ByteCode/c.c
+++ b/clang/test/AST/ByteCode/c.c
@@ -329,3 +329,12 @@ void foo3 (void)
void* x = 0;
void* y = &*x;
}
+
+static void *FooTable[1] = {
+ [0] = (void *[1]) { // 1
+ [0] = (void *[1]) { // 2
+ [0] = (void *[1]) {} // pedantic-warning {{use of an empty initializer}}
+ },
+ }
+};
+
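Note on the InterpStack part of the patch above: ItemTypes is now kept in all builds so that clear() and clearTo() can run destructors for whatever is still on the stack when interpretation aborts. The following is a minimal standalone sketch of that idea, not the clang code: MiniStack, Tag, and the fixed-size slot layout are hypothetical names and simplifications chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for PrimType: one tag per storable type.
enum class Tag : std::uint8_t { Int, Shared };

// Hypothetical miniature of a type-tracking stack. Every value occupies one
// fixed-size slot, which is simpler than the chunked layout in the patch.
class MiniStack {
public:
  using Shared = std::shared_ptr<int>;

  ~MiniStack() { clear(); }

  // Constructs a value in place on top of the stack and records its type.
  template <typename T, typename... Args> void push(Args &&...A) {
    assert(Top + SlotSize <= sizeof(Buffer));
    new (Buffer + Top) T(std::forward<Args>(A)...);
    Top += SlotSize;
    Types.push_back(tagOf<T>());
  }

  // Drops the top value, running its destructor only when that is needed.
  template <typename T> void discard() {
    assert(!Types.empty() && Types.back() == tagOf<T>());
    Types.pop_back();
    Top -= SlotSize;
    T *Ptr = reinterpret_cast<T *>(Buffer + Top);
    if constexpr (!std::is_trivially_destructible_v<T>)
      Ptr->~T();
  }

  // Uses the recorded tags to destroy everything left on the stack.
  void clear() {
    while (!Types.empty()) {
      switch (Types.back()) {
      case Tag::Int:
        discard<int>();
        break;
      case Tag::Shared:
        discard<Shared>();
        break;
      }
    }
    assert(empty());
  }

  bool empty() const { return Top == 0; }

private:
  template <typename T> static constexpr Tag tagOf() {
    if constexpr (std::is_same_v<T, int>)
      return Tag::Int;
    else
      return Tag::Shared;
  }

  static constexpr std::size_t SlotSize =
      sizeof(Shared) > sizeof(int) ? sizeof(Shared) : sizeof(int);
  alignas(std::max_align_t) unsigned char Buffer[8 * SlotSize];
  std::size_t Top = 0;
  std::vector<Tag> Types; // plays the role of ItemTypes
};
```

Pushing a std::shared_ptr and then calling clear() should drop the reference the stack held. Without the recorded tags the stack could only forget the bytes, and a non-trivially-destructible value would leak.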
>From f15c6ff6cb15acf67ee5bd73ca6442c6abd0f063 Mon Sep 17 00:00:00 2001
From: Jay Foad <jay.foad at amd.com>
Date: Mon, 18 Aug 2025 16:18:46 +0100
Subject: [PATCH 039/112] [AMDGPU] Make use of SIInstrInfo::isWaitcnt. NFC.
(#154087)
---
.../lib/Target/AMDGPU/GCNHazardRecognizer.cpp | 36 +++----------------
llvm/lib/Target/AMDGPU/SIInstrInfo.h | 2 +-
2 files changed, 5 insertions(+), 33 deletions(-)
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index 49a681efc79c7..a3b64aee297b2 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -1357,17 +1357,10 @@ bool GCNHazardRecognizer::fixSMEMtoVectorWriteHazards(MachineInstr *MI) {
// DsCnt corresponds to LGKMCnt here.
return (Decoded.DsCnt == 0);
}
- case AMDGPU::S_WAIT_STORECNT:
- case AMDGPU::S_WAIT_STORECNT_DSCNT:
- case AMDGPU::S_WAIT_LOADCNT:
- case AMDGPU::S_WAIT_LOADCNT_DSCNT:
- case AMDGPU::S_WAIT_SAMPLECNT:
- case AMDGPU::S_WAIT_BVHCNT:
- case AMDGPU::S_WAIT_DSCNT:
- case AMDGPU::S_WAIT_EXPCNT:
- case AMDGPU::S_WAIT_KMCNT:
- llvm_unreachable("unexpected wait count instruction");
default:
+ assert((!SIInstrInfo::isWaitcnt(MI.getOpcode()) ||
+ MI.getOpcode() == AMDGPU::S_WAIT_IDLE) &&
+ "unexpected wait count instruction");
// SOPP instructions cannot mitigate the hazard.
if (TII->isSOPP(MI))
return false;
@@ -2257,28 +2250,7 @@ int GCNHazardRecognizer::checkFPAtomicToDenormModeHazard(MachineInstr *MI) {
if (WaitStates >= 3 || SIInstrInfo::isVALU(MI))
return true;
- switch (MI.getOpcode()) {
- case AMDGPU::S_WAITCNT:
- case AMDGPU::S_WAITCNT_VSCNT:
- case AMDGPU::S_WAITCNT_VMCNT:
- case AMDGPU::S_WAITCNT_EXPCNT:
- case AMDGPU::S_WAITCNT_LGKMCNT:
- case AMDGPU::S_WAIT_IDLE:
- case AMDGPU::S_WAIT_LOADCNT:
- case AMDGPU::S_WAIT_LOADCNT_DSCNT:
- case AMDGPU::S_WAIT_SAMPLECNT:
- case AMDGPU::S_WAIT_BVHCNT:
- case AMDGPU::S_WAIT_STORECNT:
- case AMDGPU::S_WAIT_STORECNT_DSCNT:
- case AMDGPU::S_WAIT_EXPCNT:
- case AMDGPU::S_WAIT_DSCNT:
- case AMDGPU::S_WAIT_KMCNT:
- return true;
- default:
- break;
- }
-
- return false;
+ return SIInstrInfo::isWaitcnt(MI.getOpcode());
};
return FPAtomicToDenormModeWaitStates -
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index 18f0e5b9b56bc..5cbf6f5ab0459 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -1056,7 +1056,7 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
}
}
- bool isWaitcnt(unsigned Opcode) const {
+ static bool isWaitcnt(unsigned Opcode) {
switch (getNonSoftWaitcntOpcode(Opcode)) {
case AMDGPU::S_WAITCNT:
case AMDGPU::S_WAITCNT_VSCNT:
>From 4bf33958dac30facec505e7410e4be8cea567a2e Mon Sep 17 00:00:00 2001
From: Jacques Pienaar <jpienaar at google.com>
Date: Mon, 18 Aug 2025 08:19:34 -0700
Subject: [PATCH 040/112] [mlir] Update builders to use new form. (#154132)
Mechanically applied using clang-tidy.
---
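Note: every hunk below rewrites builder.create<OpTy>(loc, args...) into OpTy::create(builder, loc, args...). The sketch below shows only the shape of that rewrite using mock types. MockBuilder and MockConstantOp are hypothetical and are not MLIR classes; the old spelling is modelled here as a thin forwarder to the new one, which is an assumption made for the illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an op builder; it only logs what was created.
struct MockBuilder {
  std::vector<std::string> Log;

  // Old spelling: Builder.create<OpTy>(Loc, Args...).
  template <typename OpTy, typename... Args>
  OpTy create(int Loc, Args &&...A) {
    return OpTy::create(*this, Loc, std::forward<Args>(A)...);
  }
};

// Hypothetical stand-in for an operation class.
struct MockConstantOp {
  int Loc;
  long Value;

  // New spelling: OpTy::create(Builder, Loc, Args...).
  static MockConstantOp create(MockBuilder &B, int Loc, long Value) {
    B.Log.push_back("constant");
    return MockConstantOp{Loc, Value};
  }
};
```

In this mock both spellings produce the same op and the same builder side effect, so call sites can be converted one at a time, which is what makes a mechanical clang-tidy rewrite practical.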
.../MemRefToEmitC/MemRefToEmitC.cpp | 26 +++++++++-------
.../MemRefToEmitC/MemRefToEmitCPass.cpp | 4 +--
.../EmitC/Transforms/WrapFuncInClass.cpp | 4 +--
.../Linalg/Transforms/DropUnitDims.cpp | 6 ++--
.../Linalg/Transforms/TransposeMatmul.cpp | 14 ++++-----
mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp | 4 +--
.../Transforms/XeGPUWgToSgDistribute.cpp | 4 +--
mlir/lib/Target/Wasm/TranslateFromWasm.cpp | 31 ++++++++++---------
.../lib/Dialect/XeGPU/TestXeGPUTransforms.cpp | 2 +-
9 files changed, 49 insertions(+), 46 deletions(-)
diff --git a/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitC.cpp b/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitC.cpp
index a1f38c95935ad..2b7bdc9a7b7f8 100644
--- a/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitC.cpp
+++ b/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitC.cpp
@@ -156,19 +156,21 @@ struct ConvertAlloc final : public OpConversionPattern<memref::AllocOp> {
Type sizeTType = emitc::SizeTType::get(rewriter.getContext());
Type elementType = memrefType.getElementType();
IndexType indexType = rewriter.getIndexType();
- emitc::CallOpaqueOp sizeofElementOp = rewriter.create<emitc::CallOpaqueOp>(
- loc, sizeTType, rewriter.getStringAttr("sizeof"), ValueRange{},
+ emitc::CallOpaqueOp sizeofElementOp = emitc::CallOpaqueOp::create(
+ rewriter, loc, sizeTType, rewriter.getStringAttr("sizeof"),
+ ValueRange{},
ArrayAttr::get(rewriter.getContext(), {TypeAttr::get(elementType)}));
int64_t numElements = 1;
for (int64_t dimSize : memrefType.getShape()) {
numElements *= dimSize;
}
- Value numElementsValue = rewriter.create<emitc::ConstantOp>(
- loc, indexType, rewriter.getIndexAttr(numElements));
+ Value numElementsValue = emitc::ConstantOp::create(
+ rewriter, loc, indexType, rewriter.getIndexAttr(numElements));
- Value totalSizeBytes = rewriter.create<emitc::MulOp>(
- loc, sizeTType, sizeofElementOp.getResult(0), numElementsValue);
+ Value totalSizeBytes =
+ emitc::MulOp::create(rewriter, loc, sizeTType,
+ sizeofElementOp.getResult(0), numElementsValue);
emitc::CallOpaqueOp allocCall;
StringAttr allocFunctionName;
@@ -176,8 +178,8 @@ struct ConvertAlloc final : public OpConversionPattern<memref::AllocOp> {
SmallVector<Value, 2> argsVec;
if (allocOp.getAlignment()) {
allocFunctionName = rewriter.getStringAttr(alignedAllocFunctionName);
- alignmentValue = rewriter.create<emitc::ConstantOp>(
- loc, sizeTType,
+ alignmentValue = emitc::ConstantOp::create(
+ rewriter, loc, sizeTType,
rewriter.getIntegerAttr(indexType,
allocOp.getAlignment().value_or(0)));
argsVec.push_back(alignmentValue);
@@ -188,15 +190,15 @@ struct ConvertAlloc final : public OpConversionPattern<memref::AllocOp> {
argsVec.push_back(totalSizeBytes);
ValueRange args(argsVec);
- allocCall = rewriter.create<emitc::CallOpaqueOp>(
- loc,
+ allocCall = emitc::CallOpaqueOp::create(
+ rewriter, loc,
emitc::PointerType::get(
emitc::OpaqueType::get(rewriter.getContext(), "void")),
allocFunctionName, args);
emitc::PointerType targetPointerType = emitc::PointerType::get(elementType);
- emitc::CastOp castOp = rewriter.create<emitc::CastOp>(
- loc, targetPointerType, allocCall.getResult(0));
+ emitc::CastOp castOp = emitc::CastOp::create(
+ rewriter, loc, targetPointerType, allocCall.getResult(0));
rewriter.replaceOp(allocOp, castOp);
return success();
diff --git a/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitCPass.cpp b/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitCPass.cpp
index a51890248271f..a073a9acf752f 100644
--- a/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitCPass.cpp
+++ b/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitCPass.cpp
@@ -33,8 +33,8 @@ namespace {
emitc::IncludeOp addStandardHeader(OpBuilder &builder, ModuleOp module,
StringRef headerName) {
StringAttr includeAttr = builder.getStringAttr(headerName);
- return builder.create<emitc::IncludeOp>(
- module.getLoc(), includeAttr,
+ return emitc::IncludeOp::create(
+ builder, module.getLoc(), includeAttr,
/*is_standard_include=*/builder.getUnitAttr());
}
diff --git a/mlir/lib/Dialect/EmitC/Transforms/WrapFuncInClass.cpp b/mlir/lib/Dialect/EmitC/Transforms/WrapFuncInClass.cpp
index c55e26e722f33..06d7e07005f8a 100644
--- a/mlir/lib/Dialect/EmitC/Transforms/WrapFuncInClass.cpp
+++ b/mlir/lib/Dialect/EmitC/Transforms/WrapFuncInClass.cpp
@@ -64,8 +64,8 @@ class WrapFuncInClass : public OpRewritePattern<emitc::FuncOp> {
TypeAttr typeAttr = TypeAttr::get(val.getType());
fields.push_back({fieldName, typeAttr});
- FieldOp fieldop = rewriter.create<emitc::FieldOp>(
- funcOp->getLoc(), fieldName, typeAttr, nullptr);
+ FieldOp fieldop = emitc::FieldOp::create(rewriter, funcOp->getLoc(),
+ fieldName, typeAttr, nullptr);
if (argAttrs && idx < argAttrs->size()) {
fieldop->setDiscardableAttrs(funcOp.getArgAttrDict(idx));
diff --git a/mlir/lib/Dialect/Linalg/Transforms/DropUnitDims.cpp b/mlir/lib/Dialect/Linalg/Transforms/DropUnitDims.cpp
index d56506969662b..22690daa4f9e1 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/DropUnitDims.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/DropUnitDims.cpp
@@ -691,9 +691,9 @@ struct DropPadUnitDims : public OpRewritePattern<tensor::PadOp> {
auto newResultType = RankedTensorType::get(
newResultShape, padOp.getResultType().getElementType());
- auto newPadOp = rewriter.create<tensor::PadOp>(
- padOp.getLoc(), /*result=*/newResultType, collapsedSource, newLowPad,
- newHighPad, paddingVal, padOp.getNofold());
+ auto newPadOp = tensor::PadOp::create(
+ rewriter, padOp.getLoc(), /*result=*/newResultType, collapsedSource,
+ newLowPad, newHighPad, paddingVal, padOp.getNofold());
Value dest = padOp.getResult();
if (options.rankReductionStrategy ==
diff --git a/mlir/lib/Dialect/Linalg/Transforms/TransposeMatmul.cpp b/mlir/lib/Dialect/Linalg/Transforms/TransposeMatmul.cpp
index 9ec4af6d4581c..2650488c17993 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/TransposeMatmul.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/TransposeMatmul.cpp
@@ -52,11 +52,11 @@ FailureOr<Operation *> mlir::linalg::transposeMatmul(RewriterBase &rewriter,
dynamicDims.push_back(tensor::DimOp::create(rewriter, loc, input, 0));
ArrayRef<int64_t> shape = type.getShape();
- Value empty = rewriter.create<tensor::EmptyOp>(
- loc, ArrayRef<int64_t>{shape[1], shape[0]}, type.getElementType(),
- dynamicDims);
- auto transposeOp = rewriter.create<linalg::TransposeOp>(
- loc, input, empty, ArrayRef<int64_t>{1, 0});
+ Value empty = tensor::EmptyOp::create(rewriter, loc,
+ ArrayRef<int64_t>{shape[1], shape[0]},
+ type.getElementType(), dynamicDims);
+ auto transposeOp = linalg::TransposeOp::create(rewriter, loc, input, empty,
+ ArrayRef<int64_t>{1, 0});
Operation *newMatmulOp;
if (transposeLHS) {
newMatmulOp = MatmulTransposeAOp::create(
@@ -112,8 +112,8 @@ mlir::linalg::transposeBatchMatmul(RewriterBase &rewriter,
Value empty = tensor::EmptyOp::create(
rewriter, loc, ArrayRef<int64_t>{shape[0], shape[2], shape[1]},
type.getElementType(), dynamicDims);
- auto transposeOp = rewriter.create<linalg::TransposeOp>(
- loc, input, empty, ArrayRef<int64_t>{0, 2, 1});
+ auto transposeOp = linalg::TransposeOp::create(rewriter, loc, input, empty,
+ ArrayRef<int64_t>{0, 2, 1});
Operation *newMatmulOp;
if (transposeLHS) {
newMatmulOp = BatchMatmulTransposeAOp::create(
diff --git a/mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp b/mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp
index 1b26542ff65a3..8ea8cb1f45972 100644
--- a/mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp
+++ b/mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp
@@ -67,7 +67,7 @@ genOffsetsComputingInsts(OpBuilder &builder, Location loc,
StaticTileOffsetRange(sizePerWg, distUnit)) {
SmallVector<Value> base =
llvm::map_to_vector(unitOffs, [&](int64_t d) -> Value {
- return builder.create<arith::ConstantIndexOp>(loc, d);
+ return arith::ConstantIndexOp::create(builder, loc, d);
});
SmallVector<Value> adds = llvm::map_to_vector(
@@ -80,7 +80,7 @@ genOffsetsComputingInsts(OpBuilder &builder, Location loc,
llvm::zip_equal(adds, sizePerWg), [&](const auto &t) -> Value {
return builder.createOrFold<index::RemUOp>(
loc, std::get<0>(t),
- builder.create<arith::ConstantIndexOp>(loc, std::get<1>(t)));
+ arith::ConstantIndexOp::create(builder, loc, std::get<1>(t)));
});
offsets.push_back(mods);
diff --git a/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp b/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp
index 46ff03745a220..ecec186fe3fc9 100644
--- a/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp
+++ b/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp
@@ -166,7 +166,7 @@ struct WgToSgCreateNdOp : public OpConversionPattern<xegpu::CreateNdDescOp> {
// Subtract startOfRange from the original subgroup id to get
// the adjusted sg id
Value startOfRangeVal =
- rewriter.create<arith::ConstantIndexOp>(loc, startOfRange);
+ arith::ConstantIndexOp::create(rewriter, loc, startOfRange);
linearSgId =
rewriter.createOrFold<index::SubOp>(loc, linearSgId, startOfRangeVal);
}
@@ -675,7 +675,7 @@ struct WgToSgArithConstantOp : public OpConversionPattern<arith::ConstantOp> {
auto newType = VectorType::get(sgShape, vecType.getElementType());
auto sgAttr = DenseElementsAttr::get(newType, singleVal);
auto cstOp =
- rewriter.create<arith::ConstantOp>(op.getLoc(), newType, sgAttr);
+ arith::ConstantOp::create(rewriter, op.getLoc(), newType, sgAttr);
if (auto newLayout = layout.dropSgLayoutAndData())
xegpu::setLayoutAttr(cstOp->getResult(0), newLayout);
SmallVector<Value> newConsts(count, cstOp);
diff --git a/mlir/lib/Target/Wasm/TranslateFromWasm.cpp b/mlir/lib/Target/Wasm/TranslateFromWasm.cpp
index da811ba0954c2..8d450520629eb 100644
--- a/mlir/lib/Target/Wasm/TranslateFromWasm.cpp
+++ b/mlir/lib/Target/Wasm/TranslateFromWasm.cpp
@@ -780,8 +780,9 @@ parsed_inst_t ExpressionParser::parseConstInst(
auto parsedConstant = parser.parseLiteral<valueT>();
if (failed(parsedConstant))
return failure();
- auto constOp = builder.create<ConstOp>(
- *currentOpLoc, buildLiteralAttr<valueT>(builder, *parsedConstant));
+ auto constOp =
+ ConstOp::create(builder, *currentOpLoc,
+ buildLiteralAttr<valueT>(builder, *parsedConstant));
return {{constOp.getResult()}};
}
@@ -929,8 +930,8 @@ class WasmBinaryParser {
<< " type registration.";
FunctionType type = symbols.moduleFuncTypes[tid.id];
std::string symbol = symbols.getNewFuncSymbolName();
- auto funcOp =
- builder.create<FuncImportOp>(loc, symbol, moduleName, importName, type);
+ auto funcOp = FuncImportOp::create(builder, loc, symbol, moduleName,
+ importName, type);
symbols.funcSymbols.push_back({{FlatSymbolRefAttr::get(funcOp)}, type});
return funcOp.verify();
}
@@ -939,8 +940,8 @@ class WasmBinaryParser {
LogicalResult visitImport(Location loc, StringRef moduleName,
StringRef importName, LimitType limitType) {
std::string symbol = symbols.getNewMemorySymbolName();
- auto memOp = builder.create<MemImportOp>(loc, symbol, moduleName,
- importName, limitType);
+ auto memOp = MemImportOp::create(builder, loc, symbol, moduleName,
+ importName, limitType);
symbols.memSymbols.push_back({FlatSymbolRefAttr::get(memOp)});
return memOp.verify();
}
@@ -949,8 +950,8 @@ class WasmBinaryParser {
LogicalResult visitImport(Location loc, StringRef moduleName,
StringRef importName, TableType tableType) {
std::string symbol = symbols.getNewTableSymbolName();
- auto tableOp = builder.create<TableImportOp>(loc, symbol, moduleName,
- importName, tableType);
+ auto tableOp = TableImportOp::create(builder, loc, symbol, moduleName,
+ importName, tableType);
symbols.tableSymbols.push_back({FlatSymbolRefAttr::get(tableOp)});
return tableOp.verify();
}
@@ -960,8 +961,8 @@ class WasmBinaryParser {
StringRef importName, GlobalTypeRecord globalType) {
std::string symbol = symbols.getNewGlobalSymbolName();
auto giOp =
- builder.create<GlobalImportOp>(loc, symbol, moduleName, importName,
- globalType.type, globalType.isMutable);
+ GlobalImportOp::create(builder, loc, symbol, moduleName, importName,
+ globalType.type, globalType.isMutable);
symbols.globalSymbols.push_back(
{{FlatSymbolRefAttr::get(giOp)}, giOp.getType()});
return giOp.verify();
@@ -1012,7 +1013,7 @@ class WasmBinaryParser {
if (failed(fillRegistry))
return;
- mOp = builder.create<ModuleOp>(getLocation());
+ mOp = ModuleOp::create(builder, getLocation());
builder.setInsertionPointToStart(&mOp.getBodyRegion().front());
LogicalResult parsingTypes = parseSection<WasmSectionType::TYPE>();
if (failed(parsingTypes))
@@ -1172,7 +1173,7 @@ WasmBinaryParser::parseSectionItem<WasmSectionType::TABLE>(ParserHead &ph,
LDBG() << " Parsed table description: " << *tableType;
StringAttr symbol = builder.getStringAttr(symbols.getNewTableSymbolName());
auto tableOp =
- builder.create<TableOp>(opLocation, symbol.strref(), *tableType);
+ TableOp::create(builder, opLocation, symbol.strref(), *tableType);
symbols.tableSymbols.push_back({SymbolRefAttr::get(tableOp)});
return success();
}
@@ -1190,11 +1191,11 @@ WasmBinaryParser::parseSectionItem<WasmSectionType::FUNCTION>(ParserHead &ph,
return emitError(getLocation(), "invalid type index: ") << typeIdx;
std::string symbol = symbols.getNewFuncSymbolName();
auto funcOp =
- builder.create<FuncOp>(opLoc, symbol, symbols.moduleFuncTypes[typeIdx]);
+ FuncOp::create(builder, opLoc, symbol, symbols.moduleFuncTypes[typeIdx]);
Block *block = funcOp.addEntryBlock();
auto ip = builder.saveInsertionPoint();
builder.setInsertionPointToEnd(block);
- builder.create<ReturnOp>(opLoc);
+ ReturnOp::create(builder, opLoc);
builder.restoreInsertionPoint(ip);
symbols.funcSymbols.push_back(
{{FlatSymbolRefAttr::get(funcOp.getSymNameAttr())},
@@ -1225,7 +1226,7 @@ WasmBinaryParser::parseSectionItem<WasmSectionType::MEMORY>(ParserHead &ph,
LDBG() << " Registering memory " << *memory;
std::string symbol = symbols.getNewMemorySymbolName();
- auto memOp = builder.create<MemOp>(opLocation, symbol, *memory);
+ auto memOp = MemOp::create(builder, opLocation, symbol, *memory);
symbols.memSymbols.push_back({SymbolRefAttr::get(memOp)});
return success();
}
diff --git a/mlir/test/lib/Dialect/XeGPU/TestXeGPUTransforms.cpp b/mlir/test/lib/Dialect/XeGPU/TestXeGPUTransforms.cpp
index 3bea8efcdb0ae..58962714b7864 100644
--- a/mlir/test/lib/Dialect/XeGPU/TestXeGPUTransforms.cpp
+++ b/mlir/test/lib/Dialect/XeGPU/TestXeGPUTransforms.cpp
@@ -228,7 +228,7 @@ struct TestXeGPULayoutInterface
auto materializeCast = [&](mlir::OpBuilder &builder, mlir::Type type,
mlir::ValueRange inputs,
mlir::Location loc) -> mlir::Value {
- return builder.create<UnrealizedConversionCastOp>(loc, type, inputs)
+ return UnrealizedConversionCastOp::create(builder, loc, type, inputs)
.getResult(0);
};
typeConverter.addSourceMaterialization(materializeCast);
>From 60aa0d4bfc13c3d8c9967e083bb7134ecb4f254b Mon Sep 17 00:00:00 2001
From: Craig Topper <craig.topper at sifive.com>
Date: Mon, 18 Aug 2025 08:23:14 -0700
Subject: [PATCH 041/112] [RISCV] Add P-ext MC support for pli.dh, pli.db, and
plui.dh. (#153972)
Refactor the pli.b/h/w and plui.h/w tablegen classes.
---
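Note: the bit layout that RVPPairLoadImm_i, PLI_DH and PLUI_DH describe in the TableGen below can be read as the following encoder sketch. The function names are hypothetical and this is not the generated MC code emitter. It assumes OPC_OP_IMM_32 has the standard OP-IMM-32 value 0b0011011, and for plui.dh it takes the raw 10-bit field, since the accepted range of simm10_unsigned is defined outside this hunk.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value of OPC_OP_IMM_32 (standard RISC-V OP-IMM-32 major opcode).
constexpr std::uint32_t OpcOpImm32 = 0b0011011;

// Fields shared by the pair load-immediate forms (RVPPairLoadImm_i).
// Rd is the number of the even register that names the pair.
inline std::uint32_t encodePairBase(std::uint32_t Funct7, unsigned Rd) {
  assert(Rd < 32 && Rd % 2 == 0 && "register must be even");
  std::uint32_t Inst = 0;
  Inst |= Funct7 << 25;            // Inst{31-25} = funct7
  Inst |= 0b010u << 12;            // Inst{14-12} = 0b010
  Inst |= ((Rd >> 1) & 0xFu) << 8; // Inst{11-8} = rd{4-1}, Inst{7} = 0
  Inst |= OpcOpImm32;              // Inst{6-0}
  return Inst;
}

// pli.dh: Inst{24-16} = imm10{8-0}, Inst{15} = imm10{9}.
inline std::uint32_t encodePliDh(unsigned Rd, int Imm) {
  assert(Imm >= -512 && Imm <= 511);
  std::uint32_t Imm10 = static_cast<std::uint32_t>(Imm) & 0x3FFu;
  std::uint32_t Inst = encodePairBase(0b0011000u, Rd);
  Inst |= (Imm10 & 0x1FFu) << 16;
  Inst |= ((Imm10 >> 9) & 1u) << 15;
  return Inst;
}

// plui.dh: Inst{24} = imm10{0}, Inst{23-15} = imm10{9-1}.
// Imm10 is the raw 10-bit field value.
inline std::uint32_t encodePluiDh(unsigned Rd, std::uint32_t Imm10) {
  assert(Imm10 <= 0x3FFu);
  std::uint32_t Inst = encodePairBase(0b0111000u, Rd);
  Inst |= (Imm10 & 1u) << 24;
  Inst |= ((Imm10 >> 1) & 0x1FFu) << 15;
  return Inst;
}
```

The even-register assertion mirrors the "register must be even" diagnostics added to rv32p-invalid.s: only rd{4-1} is encoded and Inst{7} is fixed to zero, so an odd register number cannot be represented.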
llvm/lib/Target/RISCV/RISCVInstrInfoP.td | 111 ++++++++++++++++-------
llvm/test/MC/RISCV/rv32p-invalid.s | 8 ++
llvm/test/MC/RISCV/rv32p-valid.s | 16 ++++
llvm/test/MC/RISCV/rv64p-invalid.s | 5 +
4 files changed, 107 insertions(+), 33 deletions(-)
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoP.td b/llvm/lib/Target/RISCV/RISCVInstrInfoP.td
index 157bad8034072..1e22c2d355108 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoP.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoP.td
@@ -25,7 +25,7 @@ def SImm8UnsignedAsmOperand : SImmAsmOperand<8, "Unsigned"> {
}
// A 8-bit signed immediate allowing range [-128, 255]
-// but represented as [-128, 127].
+// but represented as [-128, 255].
def simm8_unsigned : RISCVOp {
let ParserMatchClass = SImm8UnsignedAsmOperand;
let EncoderMethod = "getImmOpValue";
@@ -62,49 +62,40 @@ def simm10_unsigned : RISCVOp {
// Instruction class templates
//===----------------------------------------------------------------------===//
-let hasSideEffects = 0, mayLoad = 0, mayStore = 0 in
-class PLI_i<bits<7> funct7, string opcodestr>
- : RVInst<(outs GPR:$rd), (ins simm10:$imm10), opcodestr, "$rd, $imm10", [],
+// Common base for pli.b/h/w and plui.h/w
+class RVPLoadImm_i<bits<7> funct7, dag ins, string opcodestr,
+ string argstr>
+ : RVInst<(outs GPR:$rd), ins, opcodestr, argstr, [],
InstFormatOther> {
- bits<10> imm10;
bits<5> rd;
let Inst{31-25} = funct7;
- let Inst{24-16} = imm10{8-0};
- let Inst{15} = imm10{9};
let Inst{14-12} = 0b010;
let Inst{11-7} = rd;
let Inst{6-0} = OPC_OP_IMM_32.Value;
+
+ let hasSideEffects = 0;
+ let mayLoad = 0;
+ let mayStore = 0;
}
-let hasSideEffects = 0, mayLoad = 0, mayStore = 0 in
-class PLUI_i<bits<7> funct7, string opcodestr>
- : RVInst<(outs GPR:$rd), (ins simm10_unsigned:$imm10), opcodestr,
- "$rd, $imm10", [], InstFormatOther> {
+// Base for pli.h/w.
+class PLI_i<bits<7> funct7, string opcodestr>
+ : RVPLoadImm_i<funct7, (ins simm10:$imm10), opcodestr, "$rd, $imm10"> {
bits<10> imm10;
- bits<5> rd;
- let Inst{31-25} = funct7;
- let Inst{24} = imm10{0};
- let Inst{23-15} = imm10{9-1};
- let Inst{14-12} = 0b010;
- let Inst{11-7} = rd;
- let Inst{6-0} = OPC_OP_IMM_32.Value;
+ let Inst{24-16} = imm10{8-0};
+ let Inst{15} = imm10{9};
}
-let hasSideEffects = 0, mayLoad = 0, mayStore = 0 in
-class PLI_B_i<bits<8> funct8, string opcodestr>
- : RVInst<(outs GPR:$rd), (ins simm8_unsigned:$imm8), opcodestr,
- "$rd, $imm8", [], InstFormatOther> {
- bits<8> imm8;
- bits<5> rd;
+// Base for plui.h/w.
+class PLUI_i<bits<7> funct7, string opcodestr>
+ : RVPLoadImm_i<funct7, (ins simm10_unsigned:$imm10), opcodestr,
+ "$rd, $imm10"> {
+ bits<10> imm10;
- let Inst{31-24} = funct8;
- let Inst{23-16} = imm8;
- let Inst{15} = 0b0;
- let Inst{14-12} = 0b010;
- let Inst{11-7} = rd;
- let Inst{6-0} = OPC_OP_IMM_32.Value;
+ let Inst{24} = imm10{0};
+ let Inst{23-15} = imm10{9-1};
}
let hasSideEffects = 0, mayLoad = 0, mayStore = 0 in
@@ -180,7 +171,8 @@ class RVPBinary_rr<bits<4> f, bits<2> w, bits<3> funct3, string opcodestr>
let hasSideEffects = 0, mayLoad = 0, mayStore = 0 in
class RVPTernary_rrr<bits<4> f, bits<2> w, bits<3> funct3, string opcodestr>
: RVInstRBase<funct3, OPC_OP_32, (outs GPR:$rd_wb),
- (ins GPR:$rd, GPR:$rs1, GPR:$rs2), opcodestr, "$rd, $rs1, $rs2"> {
+ (ins GPR:$rd, GPR:$rs1, GPR:$rs2), opcodestr,
+ "$rd, $rs1, $rs2"> {
let Inst{31} = 0b1;
let Inst{30-27} = f;
let Inst{26-25} = w;
@@ -188,6 +180,24 @@ class RVPTernary_rrr<bits<4> f, bits<2> w, bits<3> funct3, string opcodestr>
let Constraints = "$rd = $rd_wb";
}
+// Common base for pli.db/h/w and plui.dh/w
+class RVPPairLoadImm_i<bits<7> funct7, dag ins, string opcodestr,
+ string argstr>
+ : RVInst<(outs GPRPairRV32:$rd), ins, opcodestr, argstr, [],
+ InstFormatOther> {
+ bits<5> rd;
+
+ let Inst{31-25} = funct7;
+ let Inst{14-12} = 0b010;
+ let Inst{11-8} = rd{4-1};
+ let Inst{7} = 0b0;
+ let Inst{6-0} = OPC_OP_IMM_32.Value;
+
+ let hasSideEffects = 0;
+ let mayLoad = 0;
+ let mayStore = 0;
+}
+
//===----------------------------------------------------------------------===//
// Instructions
//===----------------------------------------------------------------------===//
@@ -229,8 +239,16 @@ let Predicates = [HasStdExtP] in
def PLI_H : PLI_i<0b1011000, "pli.h">;
let Predicates = [HasStdExtP, IsRV64] in
def PLI_W : PLI_i<0b1011001, "pli.w">;
-let Predicates = [HasStdExtP] in
-def PLI_B : PLI_B_i<0b10110100, "pli.b">;
+let Predicates = [HasStdExtP] in {
+ def PLI_B : RVPLoadImm_i<0b1011010, (ins simm8_unsigned:$imm8), "pli.b",
+ "$rd, $imm8"> {
+ bits<8> imm8;
+
+ let Inst{24} = 0b0;
+ let Inst{23-16} = imm8;
+ let Inst{15} = 0b0;
+ }
+}
let Predicates = [HasStdExtP] in {
def PSEXT_H_B : RVPUnary_ri<0b00, 0b00100, "psext.h.b">;
@@ -578,3 +596,30 @@ let Predicates = [HasStdExtP, IsRV64] in {
def PPACKT_W : RVPBinary_rr<0b0110, 0b01, 0b100, "ppackt.w">;
def PACKT_RV64 : RVPBinary_rr<0b0110, 0b11, 0b100, "packt">;
} // Predicates = [HasStdExtP, IsRV64]
+
+let Predicates = [HasStdExtP, IsRV32] in {
+ def PLI_DH : RVPPairLoadImm_i<0b0011000, (ins simm10:$imm10), "pli.dh",
+ "$rd, $imm10"> {
+ bits<10> imm10;
+
+ let Inst{24-16} = imm10{8-0};
+ let Inst{15} = imm10{9};
+ }
+
+ def PLI_DB : RVPPairLoadImm_i<0b0011010, (ins simm8_unsigned:$imm8), "pli.db",
+ "$rd, $imm8"> {
+ bits<8> imm8;
+
+ let Inst{24} = 0b0;
+ let Inst{23-16} = imm8;
+ let Inst{15} = 0b0;
+ }
+
+ def PLUI_DH : RVPPairLoadImm_i<0b0111000, (ins simm10_unsigned:$imm10),
+ "plui.dh", "$rd, $imm10"> {
+ bits<10> imm10;
+
+ let Inst{24} = imm10{0};
+ let Inst{23-15} = imm10{9-1};
+ }
+}
diff --git a/llvm/test/MC/RISCV/rv32p-invalid.s b/llvm/test/MC/RISCV/rv32p-invalid.s
index 7184241477d69..b00c39b8811dc 100644
--- a/llvm/test/MC/RISCV/rv32p-invalid.s
+++ b/llvm/test/MC/RISCV/rv32p-invalid.s
@@ -106,3 +106,11 @@ ppack.w t5, a2, a4 # CHECK: :[[@LINE]]:1: error: instruction requires the follow
ppackbt.w t5, s0, t5 # CHECK: :[[@LINE]]:1: error: instruction requires the following: RV64I Base Instruction Set
ppacktb.w t5, t1, t1 # CHECK: :[[@LINE]]:1: error: instruction requires the following: RV64I Base Instruction Set
ppackt.w t3, a0, s2 # CHECK: :[[@LINE]]:1: error: instruction requires the following: RV64I Base Instruction Set
+
+pli.dh a1, 1 # CHECK: :[[@LINE]]:8: error: register must be even
+pli.db s1, 1 # CHECK: :[[@LINE]]:8: error: register must be even
+plui.dh t2, 1 # CHECK: :[[@LINE]]:9: error: register must be even
+
+pli.dh a0, 0x400 # CHECK: :[[@LINE]]:12: error: immediate must be an integer in the range [-512, 511]
+pli.db a0, 0x200 # CHECK: :[[@LINE]]:12: error: immediate must be an integer in the range [-128, 255]
+plui.dh a0, 0x400 # CHECK: :[[@LINE]]:13: error: immediate must be an integer in the range [-512, 1023]
diff --git a/llvm/test/MC/RISCV/rv32p-valid.s b/llvm/test/MC/RISCV/rv32p-valid.s
index d5e8299131f10..bc7ec6587c5fc 100644
--- a/llvm/test/MC/RISCV/rv32p-valid.s
+++ b/llvm/test/MC/RISCV/rv32p-valid.s
@@ -376,3 +376,19 @@ ppackt.h t3, s0, s0
# CHECK-ASM-AND-OBJ: packt a2, t3, t1
# CHECK-ASM: encoding: [0x3b,0x46,0x6e,0xb2]
packt a2, t3, t1
+
+# CHECK-ASM-AND-OBJ: pli.dh a4, 16
+# CHECK-ASM: encoding: [0x1b,0x27,0x10,0x30]
+pli.dh a4, 16
+# CHECK-ASM-AND-OBJ: pli.db a6, 16
+# CHECK-ASM: encoding: [0x1b,0x28,0x10,0x34]
+pli.db a6, 16
+# CHECK-ASM-AND-OBJ: pli.db a6, -128
+# CHECK-ASM: encoding: [0x1b,0x28,0x80,0x34]
+pli.db a6, -128
+# CHECK-ASM-AND-OBJ: plui.dh tp, 32
+# CHECK-ASM: encoding: [0x1b,0x22,0x08,0x70]
+plui.dh tp, 32
+# CHECK-ASM-AND-OBJ: plui.dh tp, -412
+# CHECK-ASM: encoding: [0x1b,0x22,0x99,0x70]
+plui.dh tp, 612
diff --git a/llvm/test/MC/RISCV/rv64p-invalid.s b/llvm/test/MC/RISCV/rv64p-invalid.s
index 58f5dfb822dea..e18c9ec0e29ea 100644
--- a/llvm/test/MC/RISCV/rv64p-invalid.s
+++ b/llvm/test/MC/RISCV/rv64p-invalid.s
@@ -65,3 +65,8 @@ mulsu.h00 a4, s4, s6 # CHECK: :[[@LINE]]:1: error: instruction requires the foll
maccsu.h00 s4, s4, s0 # CHECK: :[[@LINE]]:1: error: instruction requires the following: RV32I Base Instruction Set
mulsu.h11 s8, s4, s0 # CHECK: :[[@LINE]]:1: error: instruction requires the following: RV32I Base Instruction Set
maccsu.h11 s0, a2, s6 # CHECK: :[[@LINE]]:1: error: instruction requires the following: RV32I Base Instruction Set
+
+# FIXME: This error doesn't make sense. Should say that we need RV32I.
+pli.dh a0, 1 # CHECK: :[[@LINE]]:8: error: invalid operand for instruction
+pli.db s0, 1 # CHECK: :[[@LINE]]:8: error: invalid operand for instruction
+plui.dh t1, 1 # CHECK: :[[@LINE]]:9: error: invalid operand for instruction
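
The bit layouts in `RVPPairLoadImm_i` and the three RV32 pair instructions above can be checked by hand against the encodings in `rv32p-valid.s`. The following is a minimal C++ sketch of that arithmetic, not code from the patch: the function names (`pairBase`, `encodePliDH`, `encodePliDB`, `encodePluiDH`, `toBytes`) are hypothetical, and the range and even-register checks simply mirror the diagnostics in `rv32p-invalid.s`.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helpers for illustration only; none of these exist in LLVM.
constexpr uint32_t kOpcOpImm32 = 0x1b; // OPC_OP_IMM_32 (0b0011011)

// Fields shared by every RVPPairLoadImm_i instruction.
constexpr uint32_t pairBase(uint32_t funct7, uint32_t rd) {
  return (funct7 << 25)              // Inst{31-25} = funct7
         | (0x2u << 12)              // Inst{14-12} = 0b010
         | (((rd >> 1) & 0xfu) << 8) // Inst{11-8} = rd{4-1}, Inst{7} = 0
         | kOpcOpImm32;              // Inst{6-0}
}

// The pair register must be an even GPR number.
constexpr bool validPairReg(uint32_t rd) { return rd < 32 && (rd & 1) == 0; }

// pli.dh: simm10 in [-512, 511]; Inst{24-16} = imm{8-0}, Inst{15} = imm{9}.
bool encodePliDH(uint32_t rd, int32_t imm, uint32_t &out) {
  if (!validPairReg(rd) || imm < -512 || imm > 511)
    return false;
  uint32_t u = static_cast<uint32_t>(imm) & 0x3ffu;
  out = pairBase(0x18, rd) | ((u & 0x1ffu) << 16) | ((u >> 9) << 15);
  return true;
}

// pli.db: simm8_unsigned in [-128, 255]; Inst{23-16} = imm, Inst{24,15} = 0.
bool encodePliDB(uint32_t rd, int32_t imm, uint32_t &out) {
  if (!validPairReg(rd) || imm < -128 || imm > 255)
    return false;
  uint32_t u = static_cast<uint32_t>(imm) & 0xffu;
  out = pairBase(0x1a, rd) | (u << 16);
  return true;
}

// plui.dh: simm10_unsigned in [-512, 1023]; Inst{24} = imm{0},
// Inst{23-15} = imm{9-1}.
bool encodePluiDH(uint32_t rd, int32_t imm, uint32_t &out) {
  if (!validPairReg(rd) || imm < -512 || imm > 1023)
    return false;
  uint32_t u = static_cast<uint32_t>(imm) & 0x3ffu;
  out = pairBase(0x38, rd) | ((u & 1u) << 24) | ((u >> 1) << 15);
  return true;
}

// Little-endian byte order, as printed in the `encoding:` CHECK lines.
std::array<uint8_t, 4> toBytes(uint32_t word) {
  return {static_cast<uint8_t>(word), static_cast<uint8_t>(word >> 8),
          static_cast<uint8_t>(word >> 16), static_cast<uint8_t>(word >> 24)};
}
```

With these helpers, `pli.dh a4, 16` (a4 is x14) encodes to `0x3010271b`, whose little-endian bytes are `[0x1b,0x27,0x10,0x30]` as in the test. It also shows why `plui.dh tp, 612` prints as `-412`: both values share the same low 10 bits.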
>From 916218ccbd72164071e74a0b145c17fd7db03667 Mon Sep 17 00:00:00 2001
From: Andres-Salamanca <andrealebarbaritos at gmail.com>
Date: Mon, 18 Aug 2025 10:25:40 -0500
Subject: [PATCH 042/112] [CIR] Upstream GotoOp (#153701)
This PR upstreams `GotoOp`. It moves some tests from the `goto` test
file to the `label` test file, and adds verify logic to `FuncOp`. The
gotosSolver, required for lowering, will be implemented in a future PR.
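
The verify logic added to `FuncOp` amounts to a set check: every label named by a `cir.goto` must also be defined by a `cir.label` in the same function, while a label with no matching goto is accepted. A minimal standalone sketch of that rule follows, assuming a hypothetical `Op` struct in place of real MLIR operations. It is not the verifier itself, which walks the function and uses `llvm::set_difference`.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an operation found while walking a function.
struct Op {
  enum Kind { Label, Goto, Other } kind;
  std::string label; // only meaningful for Label and Goto
};

// Returns true when every goto target names a label in the same function.
// Labels that no goto refers to are allowed.
bool gotosMatchLabels(const std::vector<Op> &ops) {
  std::set<std::string> labels;
  std::set<std::string> gotos;
  for (const Op &op : ops) {
    if (op.kind == Op::Label)
      labels.insert(op.label);
    else if (op.kind == Op::Goto)
      gotos.insert(op.label);
  }
  // Equivalent to checking that (gotos - labels) is empty.
  for (const std::string &target : gotos)
    if (labels.count(target) == 0)
      return false;
  return true;
}
```

Under this sketch, the `bad_goto` case from `invalid-goto.cir` (a goto to "somewhere" with only "label" defined) is rejected, and `labelWithoutMatch` from `label.c` is accepted.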
---
.../CIR/Dialect/Builder/CIRBaseBuilder.h | 3 +-
clang/include/clang/CIR/Dialect/IR/CIROps.td | 56 +++++
clang/lib/CIR/CodeGen/CIRGenFunction.h | 2 +
clang/lib/CIR/CodeGen/CIRGenStmt.cpp | 20 ++
clang/lib/CIR/Dialect/IR/CIRDialect.cpp | 27 ++-
clang/test/CIR/CodeGen/goto.cpp | 210 ++++++++++++++++++
clang/test/CIR/CodeGen/label.c | 36 +++
clang/test/CIR/IR/invalid-goto.cir | 9 +
8 files changed, 358 insertions(+), 5 deletions(-)
create mode 100644 clang/test/CIR/CodeGen/goto.cpp
create mode 100644 clang/test/CIR/IR/invalid-goto.cir
diff --git a/clang/include/clang/CIR/Dialect/Builder/CIRBaseBuilder.h b/clang/include/clang/CIR/Dialect/Builder/CIRBaseBuilder.h
index 0bf3cb26be850..6244d34300263 100644
--- a/clang/include/clang/CIR/Dialect/Builder/CIRBaseBuilder.h
+++ b/clang/include/clang/CIR/Dialect/Builder/CIRBaseBuilder.h
@@ -504,8 +504,7 @@ class CIRBaseBuilderTy : public mlir::OpBuilder {
static OpBuilder::InsertPoint getBestAllocaInsertPoint(mlir::Block *block) {
auto last =
std::find_if(block->rbegin(), block->rend(), [](mlir::Operation &op) {
- // TODO: Add LabelOp missing feature here
- return mlir::isa<cir::AllocaOp>(&op);
+ return mlir::isa<cir::AllocaOp, cir::LabelOp>(&op);
});
if (last != block->rend())
diff --git a/clang/include/clang/CIR/Dialect/IR/CIROps.td b/clang/include/clang/CIR/Dialect/IR/CIROps.td
index 3bfa29b9c3472..129a6760c935a 100644
--- a/clang/include/clang/CIR/Dialect/IR/CIROps.td
+++ b/clang/include/clang/CIR/Dialect/IR/CIROps.td
@@ -1060,6 +1060,62 @@ def CIR_BrOp : CIR_Op<"br",[
}];
}
+//===----------------------------------------------------------------------===//
+// GotoOp
+//===----------------------------------------------------------------------===//
+
+def CIR_GotoOp : CIR_Op<"goto", [Terminator]> {
+ let description = [{
+
+ Transfers control to the specified `label`. This requires a corresponding
+ `cir.label` to exist and is used to represent source-level `goto`s
+ that jump across region boundaries. Alternatively, `cir.br` is used to
+ construct `goto`s that don't violate such boundaries.
+
+ `cir.goto` is completely symbolic (i.e. it "jumps" to a label that isn't
+ yet materialized) and should be taken into account by passes and analyses
+ when deciding if it's safe to make some assumptions about a given region
+ or basic block.
+
+ Example:
+ ```C++
+ int test(int x) {
+ if (x)
+ goto label;
+ {
+ x = 10;
+ label:
+ return x;
+ }
+ }
+ ```
+
+ ```mlir
+ cir.scope { // REGION #1
+ %2 = cir.load %0 : !cir.ptr<!s32i>, !s32i
+ %3 = cir.cast(int_to_bool, %2 : !s32i), !cir.bool
+ cir.if %3 {
+ cir.goto "label"
+ }
+ }
+ cir.scope { // REGION #2
+ %2 = cir.const #cir.int<10> : !s32i
+ cir.store %2, %0 : !s32i, !cir.ptr<!s32i>
+ cir.br ^bb1
+ ^bb1: // pred: ^bb0
+ cir.label "label"
+ %3 = cir.load %0 : !cir.ptr<!s32i>, !s32i
+ cir.store %3, %1 : !s32i, !cir.ptr<!s32i>
+ %4 = cir.load %1 : !cir.ptr<!s32i>, !s32i
+ cir.return %4 : !s32i
+ }
+ cir.unreachable
+ ```
+ }];
+ let arguments = (ins StrAttr:$label);
+ let assemblyFormat = [{ $label attr-dict }];
+}
+
//===----------------------------------------------------------------------===//
// LabelOp
//===----------------------------------------------------------------------===//
diff --git a/clang/lib/CIR/CodeGen/CIRGenFunction.h b/clang/lib/CIR/CodeGen/CIRGenFunction.h
index 9a887ec047f86..554e46414c9a7 100644
--- a/clang/lib/CIR/CodeGen/CIRGenFunction.h
+++ b/clang/lib/CIR/CodeGen/CIRGenFunction.h
@@ -1120,6 +1120,8 @@ class CIRGenFunction : public CIRGenTypeCache {
mlir::LogicalResult emitFunctionBody(const clang::Stmt *body);
+ mlir::LogicalResult emitGotoStmt(const clang::GotoStmt &s);
+
void emitImplicitAssignmentOperatorBody(FunctionArgList &args);
void emitInitializerForField(clang::FieldDecl *field, LValue lhs,
diff --git a/clang/lib/CIR/CodeGen/CIRGenStmt.cpp b/clang/lib/CIR/CodeGen/CIRGenStmt.cpp
index d1e4a14824011..d83018e513707 100644
--- a/clang/lib/CIR/CodeGen/CIRGenStmt.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenStmt.cpp
@@ -252,6 +252,8 @@ mlir::LogicalResult CIRGenFunction::emitSimpleStmt(const Stmt *s,
else
emitCompoundStmt(cast<CompoundStmt>(*s));
break;
+ case Stmt::GotoStmtClass:
+ return emitGotoStmt(cast<GotoStmt>(*s));
case Stmt::ContinueStmtClass:
return emitContinueStmt(cast<ContinueStmt>(*s));
@@ -435,6 +437,24 @@ mlir::LogicalResult CIRGenFunction::emitReturnStmt(const ReturnStmt &s) {
return mlir::success();
}
+mlir::LogicalResult CIRGenFunction::emitGotoStmt(const clang::GotoStmt &s) {
+ // FIXME: LLVM codegen emits a stop point here for the sake of debug
+ // info when the insertion point is available, but doesn't do
+ // anything special when there isn't one. We haven't implemented debug
+ // info support yet; look at this again once we have it.
+ assert(!cir::MissingFeatures::generateDebugInfo());
+
+ cir::GotoOp::create(builder, getLoc(s.getSourceRange()),
+ s.getLabel()->getName());
+
+ // A goto marks the end of a block. Create a new block so that codegen
+ // after emitGotoStmt can resume building there.
+ // Insert the new block to continue codegen after the goto.
+ builder.createBlock(builder.getBlock()->getParent());
+
+ return mlir::success();
+}
+
mlir::LogicalResult
CIRGenFunction::emitContinueStmt(const clang::ContinueStmt &s) {
builder.createContinue(getLoc(s.getContinueLoc()));
diff --git a/clang/lib/CIR/Dialect/IR/CIRDialect.cpp b/clang/lib/CIR/Dialect/IR/CIRDialect.cpp
index 50246007b1072..220927601f74e 100644
--- a/clang/lib/CIR/Dialect/IR/CIRDialect.cpp
+++ b/clang/lib/CIR/Dialect/IR/CIRDialect.cpp
@@ -22,6 +22,8 @@
#include "clang/CIR/Dialect/IR/CIROpsDialect.cpp.inc"
#include "clang/CIR/Dialect/IR/CIROpsEnums.cpp.inc"
#include "clang/CIR/MissingFeatures.h"
+#include "llvm/ADT/SetOperations.h"
+#include "llvm/ADT/SmallSet.h"
#include "llvm/Support/LogicalResult.h"
#include <numeric>
@@ -1647,9 +1649,28 @@ void cir::FuncOp::print(OpAsmPrinter &p) {
}
}
-// TODO(CIR): The properties of functions that require verification haven't
-// been implemented yet.
-mlir::LogicalResult cir::FuncOp::verify() { return success(); }
+mlir::LogicalResult cir::FuncOp::verify() {
+
+ llvm::SmallSet<llvm::StringRef, 16> labels;
+ llvm::SmallSet<llvm::StringRef, 16> gotos;
+
+ getOperation()->walk([&](mlir::Operation *op) {
+ if (auto lab = dyn_cast<cir::LabelOp>(op)) {
+ labels.insert(lab.getLabel());
+ } else if (auto goTo = dyn_cast<cir::GotoOp>(op)) {
+ gotos.insert(goTo.getLabel());
+ }
+ });
+
+ if (!labels.empty() || !gotos.empty()) {
+ llvm::SmallSet<llvm::StringRef, 16> mismatched =
+ llvm::set_difference(gotos, labels);
+
+ if (!mismatched.empty())
+ return emitOpError() << "goto/label mismatch";
+ }
+ return success();
+}
//===----------------------------------------------------------------------===//
// BinOp
diff --git a/clang/test/CIR/CodeGen/goto.cpp b/clang/test/CIR/CodeGen/goto.cpp
new file mode 100644
index 0000000000000..13ca65344a150
--- /dev/null
+++ b/clang/test/CIR/CodeGen/goto.cpp
@@ -0,0 +1,210 @@
+// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-cir %s -o %t.cir
+// RUN: FileCheck --input-file=%t.cir %s -check-prefix=CIR
+// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -emit-llvm %s -o %t.ll
+// RUN: FileCheck --input-file=%t.ll %s --check-prefix=OGCG
+
+int shouldNotGenBranchRet(int x) {
+ if (x > 5)
+ goto err;
+ return 0;
+err:
+ return -1;
+}
+// CIR: cir.func dso_local @_Z21shouldNotGenBranchReti
+// CIR: cir.if {{.*}} {
+// CIR: cir.goto "err"
+// CIR: }
+// CIR: [[ZERO:%.*]] = cir.const #cir.int<0> : !s32i
+// CIR: cir.store [[ZERO]], [[RETVAL:%.*]] : !s32i, !cir.ptr<!s32i>
+// CIR: cir.br ^bb1
+// CIR: ^bb1:
+// CIR: [[RET:%.*]] = cir.load [[RETVAL]] : !cir.ptr<!s32i>, !s32i
+// CIR: cir.return [[RET]] : !s32i
+// CIR: ^bb2:
+// CIR: cir.label "err"
+// CIR: [[ONE:%.*]] = cir.const #cir.int<1> : !s32i
+// CIR: [[MINUS:%.*]] = cir.unary(minus, [[ONE]]) nsw : !s32i, !s32i
+// CIR: cir.store [[MINUS]], [[RETVAL]] : !s32i, !cir.ptr<!s32i>
+// CIR: cir.br ^bb1
+
+// OGCG: define dso_local noundef i32 @_Z21shouldNotGenBranchReti
+// OGCG: if.then:
+// OGCG: br label %err
+// OGCG: if.end:
+// OGCG: br label %return
+// OGCG: err:
+// OGCG: br label %return
+// OGCG: return:
+
+int shouldGenBranch(int x) {
+ if (x > 5)
+ goto err;
+ x++;
+err:
+ return -1;
+}
+// CIR: cir.func dso_local @_Z15shouldGenBranchi
+// CIR: cir.if {{.*}} {
+// CIR: cir.goto "err"
+// CIR: }
+// CIR: cir.br ^bb1
+// CIR: ^bb1:
+// CIR: cir.label "err"
+
+// OGCG: define dso_local noundef i32 @_Z15shouldGenBranchi
+// OGCG: if.then:
+// OGCG: br label %err
+// OGCG: if.end:
+// OGCG: br label %err
+// OGCG: err:
+// OGCG: ret
+
+void severalLabelsInARow(int a) {
+ int b = a;
+ goto end1;
+ b = b + 1;
+ goto end2;
+end1:
+end2:
+ b = b + 2;
+}
+// CIR: cir.func dso_local @_Z19severalLabelsInARowi
+// CIR: cir.goto "end1"
+// CIR: ^bb[[#BLK1:]]
+// CIR: cir.goto "end2"
+// CIR: ^bb[[#BLK2:]]:
+// CIR: cir.label "end1"
+// CIR: cir.br ^bb[[#BLK3:]]
+// CIR: ^bb[[#BLK3]]:
+// CIR: cir.label "end2"
+
+// OGCG: define dso_local void @_Z19severalLabelsInARowi
+// OGCG: br label %end1
+// OGCG: end1:
+// OGCG: br label %end2
+// OGCG: end2:
+// OGCG: ret
+
+void severalGotosInARow(int a) {
+ int b = a;
+ goto end;
+ goto end;
+end:
+ b = b + 2;
+}
+// CIR: cir.func dso_local @_Z18severalGotosInARowi
+// CIR: cir.goto "end"
+// CIR: ^bb[[#BLK1:]]:
+// CIR: cir.goto "end"
+// CIR: ^bb[[#BLK2:]]:
+// CIR: cir.label "end"
+
+// OGCG: define dso_local void @_Z18severalGotosInARowi(i32 noundef %a) #0 {
+// OGCG: br label %end
+// OGCG: end:
+// OGCG: ret void
+
+extern "C" void action1();
+extern "C" void action2();
+extern "C" void multiple_non_case(int v) {
+ switch (v) {
+ default:
+ action1();
+ l2:
+ action2();
+ break;
+ }
+}
+
+// CIR: cir.func dso_local @multiple_non_case
+// CIR: cir.switch
+// CIR: cir.case(default, []) {
+// CIR: cir.call @action1()
+// CIR: cir.br ^[[BB1:[a-zA-Z0-9]+]]
+// CIR: ^[[BB1]]:
+// CIR: cir.label
+// CIR: cir.call @action2()
+// CIR: cir.break
+
+// OGCG: define dso_local void @multiple_non_case
+// OGCG: sw.default:
+// OGCG: call void @action1()
+// OGCG: br label %l2
+// OGCG: l2:
+// OGCG: call void @action2()
+// OGCG: br label [[BREAK:%.*]]
+
+extern "C" void case_follow_label(int v) {
+ switch (v) {
+ case 1:
+ label:
+ case 2:
+ action1();
+ break;
+ default:
+ action2();
+ goto label;
+ }
+}
+
+// CIR: cir.func dso_local @case_follow_label
+// CIR: cir.switch
+// CIR: cir.case(equal, [#cir.int<1> : !s32i]) {
+// CIR: cir.label "label"
+// CIR: cir.case(equal, [#cir.int<2> : !s32i]) {
+// CIR: cir.call @action1()
+// CIR: cir.break
+// CIR: cir.case(default, []) {
+// CIR: cir.call @action2()
+// CIR: cir.goto "label"
+
+// OGCG: define dso_local void @case_follow_label
+// OGCG: sw.bb:
+// OGCG: br label %label
+// OGCG: label:
+// OGCG: br label %sw.bb1
+// OGCG: sw.bb1:
+// OGCG: call void @action1()
+// OGCG: br label %sw.epilog
+// OGCG: sw.default:
+// OGCG: call void @action2()
+// OGCG: br label %label
+// OGCG: sw.epilog:
+// OGCG: ret void
+
+extern "C" void default_follow_label(int v) {
+ switch (v) {
+ case 1:
+ case 2:
+ action1();
+ break;
+ label:
+ default:
+ action2();
+ goto label;
+ }
+}
+
+// CIR: cir.func dso_local @default_follow_label
+// CIR: cir.switch
+// CIR: cir.case(equal, [#cir.int<1> : !s32i]) {
+// CIR: cir.yield
+// CIR: cir.case(equal, [#cir.int<2> : !s32i]) {
+// CIR: cir.call @action1()
+// CIR: cir.break
+// CIR: cir.label "label"
+// CIR: cir.case(default, []) {
+// CIR: cir.call @action2()
+// CIR: cir.goto "label"
+
+// OGCG: define dso_local void @default_follow_label
+// OGCG: sw.bb:
+// OGCG: call void @action1()
+// OGCG: br label %sw.epilog
+// OGCG: label:
+// OGCG: br label %sw.default
+// OGCG: sw.default:
+// OGCG: call void @action2()
+// OGCG: br label %label
+// OGCG: sw.epilog:
+// OGCG: ret void
diff --git a/clang/test/CIR/CodeGen/label.c b/clang/test/CIR/CodeGen/label.c
index 2a515fc4046e8..797c44475a621 100644
--- a/clang/test/CIR/CodeGen/label.c
+++ b/clang/test/CIR/CodeGen/label.c
@@ -101,3 +101,39 @@ void after_unreachable() {
// OGCG: unreachable
// OGCG: label:
// OGCG: ret void
+
+void labelWithoutMatch() {
+end:
+ return;
+}
+// CIR: cir.func no_proto dso_local @labelWithoutMatch
+// CIR: cir.label "end"
+// CIR: cir.return
+// CIR: }
+
+// OGCG: define dso_local void @labelWithoutMatch
+// OGCG: br label %end
+// OGCG: end:
+// OGCG: ret void
+
+struct S {};
+struct S get();
+void bar(struct S);
+
+void foo() {
+ {
+ label:
+ bar(get());
+ }
+}
+
+// CIR: cir.func no_proto dso_local @foo
+// CIR: cir.scope {
+// CIR: cir.label "label"
+// CIR: %0 = cir.alloca !rec_S, !cir.ptr<!rec_S>, ["agg.tmp0"]
+
+// OGCG: define dso_local void @foo()
+// OGCG: %agg.tmp = alloca %struct.S, align 1
+// OGCG: %undef.agg.tmp = alloca %struct.S, align 1
+// OGCG: br label %label
+// OGCG: label:
diff --git a/clang/test/CIR/IR/invalid-goto.cir b/clang/test/CIR/IR/invalid-goto.cir
new file mode 100644
index 0000000000000..9f58bac92fa3f
--- /dev/null
+++ b/clang/test/CIR/IR/invalid-goto.cir
@@ -0,0 +1,9 @@
+// RUN: cir-opt %s -verify-diagnostics -split-input-file
+
+// expected-error@+1 {{goto/label mismatch}}
+cir.func @bad_goto() -> () {
+ cir.goto "somewhere"
+^bb1:
+ cir.label "label"
+ cir.return
+}
>From d12f58ff11baaff8cc5599f1016aa63ca4de9428 Mon Sep 17 00:00:00 2001
From: Alex MacLean <amaclean at nvidia.com>
Date: Mon, 18 Aug 2025 08:33:23 -0700
Subject: [PATCH 043/112] [NVVM] Add various intrinsic attrs, cleanup and
consolidate td (#153436)
- llvm.nvvm.reflect - Use a PureIntrinsic (adding speculatable); this
will be replaced by a constant prior to lowering, so speculation is
fine.
- llvm.nvvm.tex.* - Add [IntrNoCallback, IntrNoFree, IntrWillReturn]
- llvm.nvvm.suld.* - Add [IntrNoCallback, IntrNoFree] and
[IntrWillReturn] when not using "clamp" mode
- llvm.nvvm.sust.* - Add [IntrNoCallback, IntrNoFree, IntrWriteMem] and
[IntrWillReturn] when not using "clamp" mode
- llvm.nvvm.[suq|txq|istypep].* - Use DefaultAttrsIntrinsic
- llvm.nvvm.read.ptx.sreg.* - Add [IntrNoFree, IntrWillReturn] to
non-constant reads as well.
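
The `PureIntrinsic` class this patch introduces appends `IntrNoMem` and `IntrSpeculatable` to whatever properties the caller supplies, which is why per-definition lists such as `[IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<2>>]` shrink to `[ImmArg<ArgIndex<2>>]`. A rough C++ model of that list concatenation is sketched below. The property names are plain strings and `pureIntrinsicProps` is a hypothetical function used only for illustration; it is not an LLVM API.

```cpp
#include <string>
#include <vector>

using Props = std::vector<std::string>;

// Models `intr_properties # [IntrNoMem, IntrSpeculatable]`: the caller's
// properties come first, followed by the two properties every pure
// intrinsic receives.
Props pureIntrinsicProps(Props callerProps = {}) {
  callerProps.push_back("IntrNoMem");
  callerProps.push_back("IntrSpeculatable");
  return callerProps;
}
```

Calling `pureIntrinsicProps({"ImmArg<ArgIndex<2>>"})` yields the same three properties that `int_nvvm_idp2a_*` listed explicitly before the change, just in a different order.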
---
llvm/include/llvm/IR/IntrinsicsNVVM.td | 940 ++++++++++++-------------
1 file changed, 453 insertions(+), 487 deletions(-)
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 1bcc442a3f77f..77ef79debac1a 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -128,12 +128,12 @@
// * llvm.nvvm.swap.lo.hi.b64 --> llvm.fshl(x, x, 32)
// * llvm.nvvm.atomic.load.inc.32 --> atomicrmw uinc_wrap
// * llvm.nvvm.atomic.load.dec.32 --> atomicrmw udec_wrap
-// * llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
-// * llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
-// * llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
-// * llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned(x, y)
-// * llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
-// * llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync(x, y)
+// * llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
+// * llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
+// * llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
+// * llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned(x, y)
+// * llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
+// * llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync(x, y)
def llvm_global_ptr_ty : LLVMQualPointerType<1>; // (global)ptr
def llvm_shared_ptr_ty : LLVMQualPointerType<3>; // (shared)ptr
@@ -793,38 +793,49 @@ class NVVMBuiltin :
"NVVMBuiltin must be a NVVM intrinsic starting with 'int_nvvm_'";
}
+class PureIntrinsic<list<LLVMType> ret_types,
+ list<LLVMType> param_types = [],
+ list<IntrinsicProperty> intr_properties = [],
+ string name = ""> :
+ DefaultAttrsIntrinsic<ret_types, param_types,
+ intr_properties # [IntrNoMem, IntrSpeculatable], name> {}
+
let TargetPrefix = "nvvm" in {
+ //
// PRMT - permute
-
- let IntrProperties = [IntrNoMem, IntrSpeculatable] in {
- def int_nvvm_prmt : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty]>;
-
- foreach mode = ["f4e", "b4e"] in
- def int_nvvm_prmt_ # mode :
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty]>;
-
- // Note: these variants also have 2 source operands but only one will ever
- // be used so we eliminate the other operand in the IR (0 is used as the
- // placeholder in the backend).
- foreach mode = ["rc8", "ecl", "ecr", "rc16"] in
- def int_nvvm_prmt_ # mode :
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty]>;
- }
-
+ //
+ def int_nvvm_prmt : NVVMBuiltin,
+ PureIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty]>;
+
+ foreach mode = ["f4e", "b4e"] in
+ def int_nvvm_prmt_ # mode :
+ PureIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty]>;
+
+ // Note: these variants also have 2 source operands but only one will ever
+ // be used so we eliminate the other operand in the IR (0 is used as the
+ // placeholder in the backend).
+ foreach mode = ["rc8", "ecl", "ecr", "rc16"] in
+ def int_nvvm_prmt_ # mode :
+ PureIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty]>;
+
+ //
+ // Nanosleep
+ //
def int_nvvm_nanosleep : NVVMBuiltin,
DefaultAttrsIntrinsic<[], [llvm_i32_ty],
[IntrConvergent, IntrNoMem, IntrHasSideEffects]>;
+ //
// Performance Monitor Events (pm events) intrinsics
+ //
def int_nvvm_pm_event_mask : NVVMBuiltin,
DefaultAttrsIntrinsic<[], [llvm_i16_ty],
[IntrConvergent, IntrNoMem, IntrHasSideEffects,
ImmArg<ArgIndex<0>>]>;
-//
-// Min Max
-//
+ //
+ // Min Max
+ //
let IntrProperties = [IntrNoMem, IntrSpeculatable, Commutative] in {
foreach operation = ["min", "max"] in {
def int_nvvm_f # operation # _d : NVVMBuiltin,
@@ -853,9 +864,9 @@ let TargetPrefix = "nvvm" in {
} // operation
}
-//
-// Multiplication
-//
+ //
+ // Multiplication
+ //
let IntrProperties = [IntrNoMem, IntrSpeculatable, Commutative] in {
foreach sign = ["", "u"] in {
def int_nvvm_mulhi_ # sign # s : NVVMBuiltin,
@@ -881,9 +892,9 @@ let TargetPrefix = "nvvm" in {
}
}
-//
-// Div
-//
+ //
+ // Div
+ //
let IntrProperties = [IntrNoMem] in {
foreach ftz = ["", "_ftz"] in {
def int_nvvm_div_approx # ftz # _f : NVVMBuiltin,
@@ -903,90 +914,79 @@ let TargetPrefix = "nvvm" in {
}
}
-//
-// Sad
-//
- let IntrProperties = [IntrNoMem, IntrSpeculatable] in {
- foreach sign = ["", "u"] in {
- def int_nvvm_sad_ # sign # s : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_i16_ty, llvm_i16_ty, llvm_i16_ty]>;
+ //
+ // Sad - Sum of Absolute Differences
+ //
+ foreach sign = ["", "u"] in {
+ def int_nvvm_sad_ # sign # s : NVVMBuiltin,
+ PureIntrinsic<[llvm_i16_ty], [llvm_i16_ty, llvm_i16_ty, llvm_i16_ty]>;
- def int_nvvm_sad_ # sign # i : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty]>;
+ def int_nvvm_sad_ # sign # i : NVVMBuiltin,
+ PureIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty]>;
- def int_nvvm_sad_ # sign # ll : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i64_ty], [llvm_i64_ty, llvm_i64_ty, llvm_i64_ty]>;
- }
+ def int_nvvm_sad_ # sign # ll : NVVMBuiltin,
+ PureIntrinsic<[llvm_i64_ty], [llvm_i64_ty, llvm_i64_ty, llvm_i64_ty]>;
}
-//
-// Floor Ceil
-//
- let IntrProperties = [IntrNoMem, IntrSpeculatable] in {
- foreach op = ["floor", "ceil"] in {
- foreach ftz = ["", "_ftz"] in
- def int_nvvm_ # op # ftz # _f : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_float_ty], [llvm_float_ty]>;
- def int_nvvm_ # op # _d : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_double_ty]>;
- }
+ //
+ // Floor Ceil
+ //
+ foreach op = ["floor", "ceil"] in {
+ foreach ftz = ["", "_ftz"] in
+ def int_nvvm_ # op # ftz # _f : NVVMBuiltin,
+ PureIntrinsic<[llvm_float_ty], [llvm_float_ty]>;
+ def int_nvvm_ # op # _d : NVVMBuiltin,
+ PureIntrinsic<[llvm_double_ty], [llvm_double_ty]>;
}
-//
-// Abs
-//
+ //
+ // Abs
+ //
foreach ftz = ["", "_ftz"] in
def int_nvvm_fabs # ftz :
- DefaultAttrsIntrinsic<[llvm_anyfloat_ty], [LLVMMatchType<0>],
- [IntrNoMem, IntrSpeculatable]>;
+ PureIntrinsic<[llvm_anyfloat_ty], [LLVMMatchType<0>]>;
-//
-// Abs, Neg bf16, bf16x2
-//
+ //
+ // Neg bf16, bf16x2
+ //
def int_nvvm_neg_bf16 : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_bfloat_ty], [llvm_bfloat_ty], [IntrNoMem]>;
+ PureIntrinsic<[llvm_bfloat_ty], [llvm_bfloat_ty]>;
def int_nvvm_neg_bf16x2 : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_v2bf16_ty], [llvm_v2bf16_ty], [IntrNoMem]>;
+ PureIntrinsic<[llvm_v2bf16_ty], [llvm_v2bf16_ty]>;
-//
-// Round
-//
- let IntrProperties = [IntrNoMem, IntrSpeculatable] in {
- foreach ftz = ["", "_ftz"] in
- def int_nvvm_round # ftz # _f : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_float_ty], [llvm_float_ty]>;
+ //
+ // Round
+ //
+ foreach ftz = ["", "_ftz"] in
+ def int_nvvm_round # ftz # _f : NVVMBuiltin,
+ PureIntrinsic<[llvm_float_ty], [llvm_float_ty]>;
- def int_nvvm_round_d : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_double_ty]>;
- }
+ def int_nvvm_round_d : NVVMBuiltin,
+ PureIntrinsic<[llvm_double_ty], [llvm_double_ty]>;
-//
-// Trunc
-//
- let IntrProperties = [IntrNoMem, IntrSpeculatable] in {
- foreach ftz = ["", "_ftz"] in
- def int_nvvm_trunc # ftz # _f : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_float_ty], [llvm_float_ty]>;
+ //
+ // Trunc
+ //
+ foreach ftz = ["", "_ftz"] in
+ def int_nvvm_trunc # ftz # _f : NVVMBuiltin,
+ PureIntrinsic<[llvm_float_ty], [llvm_float_ty]>;
- def int_nvvm_trunc_d : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_double_ty]>;
- }
+ def int_nvvm_trunc_d : NVVMBuiltin,
+ PureIntrinsic<[llvm_double_ty], [llvm_double_ty]>;
-//
-// Saturate
-//
- let IntrProperties = [IntrNoMem, IntrSpeculatable] in {
- foreach ftz = ["", "_ftz"] in
- def int_nvvm_saturate # ftz # _f : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_float_ty], [llvm_float_ty]>;
+ //
+ // Saturate
+ //
+ foreach ftz = ["", "_ftz"] in
+ def int_nvvm_saturate # ftz # _f : NVVMBuiltin,
+ PureIntrinsic<[llvm_float_ty], [llvm_float_ty]>;
- def int_nvvm_saturate_d : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_double_ty]>;
- }
+ def int_nvvm_saturate_d : NVVMBuiltin,
+ PureIntrinsic<[llvm_double_ty], [llvm_double_ty]>;
-//
-// Exp2 Log2
-//
+ //
+ // Exp2 Log2
+ //
let IntrProperties = [IntrNoMem] in {
foreach ftz = ["", "_ftz"] in
def int_nvvm_ex2_approx # ftz # _f : NVVMBuiltin,
@@ -1007,53 +1007,51 @@ let TargetPrefix = "nvvm" in {
DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_double_ty]>;
}
-//
-// Sin Cos
-//
+ //
+ // Sin Cos
+ //
foreach op = ["sin", "cos"] in
foreach ftz = ["", "_ftz"] in
def int_nvvm_ # op # _approx # ftz # _f : NVVMBuiltin,
DefaultAttrsIntrinsic<[llvm_float_ty], [llvm_float_ty], [IntrNoMem]>;
-//
-// Fma
-//
- let IntrProperties = [IntrNoMem, IntrSpeculatable] in {
- foreach variant = ["", "_sat", "_relu"] in {
- foreach ftz = ["", "_ftz"] in {
- def int_nvvm_fma_rn # ftz # variant # _f16 :
- DefaultAttrsIntrinsic<[llvm_half_ty],
- [llvm_half_ty, llvm_half_ty, llvm_half_ty]>;
-
- def int_nvvm_fma_rn # ftz # variant # _f16x2 :
- DefaultAttrsIntrinsic<[llvm_v2f16_ty],
- [llvm_v2f16_ty, llvm_v2f16_ty, llvm_v2f16_ty]>;
-
- def int_nvvm_fma_rn # ftz # variant # _bf16 : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_bfloat_ty],
- [llvm_bfloat_ty, llvm_bfloat_ty, llvm_bfloat_ty]>;
-
- def int_nvvm_fma_rn # ftz # variant # _bf16x2 : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_v2bf16_ty],
- [llvm_v2bf16_ty, llvm_v2bf16_ty, llvm_v2bf16_ty]>;
- } // ftz
- } // variant
+ //
+ // Fma
+ //
+ foreach variant = ["", "_sat", "_relu"] in {
+ foreach ftz = ["", "_ftz"] in {
+ def int_nvvm_fma_rn # ftz # variant # _f16 :
+ PureIntrinsic<[llvm_half_ty],
+ [llvm_half_ty, llvm_half_ty, llvm_half_ty]>;
- foreach rnd = ["rn", "rz", "rm", "rp"] in {
- foreach ftz = ["", "_ftz"] in
- def int_nvvm_fma_ # rnd # ftz # _f : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_float_ty],
- [llvm_float_ty, llvm_float_ty, llvm_float_ty]>;
+ def int_nvvm_fma_rn # ftz # variant # _f16x2 :
+ PureIntrinsic<[llvm_v2f16_ty],
+ [llvm_v2f16_ty, llvm_v2f16_ty, llvm_v2f16_ty]>;
- def int_nvvm_fma_ # rnd # _d : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_double_ty],
- [llvm_double_ty, llvm_double_ty, llvm_double_ty]>;
- }
+ def int_nvvm_fma_rn # ftz # variant # _bf16 : NVVMBuiltin,
+ PureIntrinsic<[llvm_bfloat_ty],
+ [llvm_bfloat_ty, llvm_bfloat_ty, llvm_bfloat_ty]>;
+
+ def int_nvvm_fma_rn # ftz # variant # _bf16x2 : NVVMBuiltin,
+ PureIntrinsic<[llvm_v2bf16_ty],
+ [llvm_v2bf16_ty, llvm_v2bf16_ty, llvm_v2bf16_ty]>;
+ } // ftz
+ } // variant
+
+ foreach rnd = ["rn", "rz", "rm", "rp"] in {
+ foreach ftz = ["", "_ftz"] in
+ def int_nvvm_fma_ # rnd # ftz # _f : NVVMBuiltin,
+ PureIntrinsic<[llvm_float_ty],
+ [llvm_float_ty, llvm_float_ty, llvm_float_ty]>;
+
+ def int_nvvm_fma_ # rnd # _d : NVVMBuiltin,
+ PureIntrinsic<[llvm_double_ty],
+ [llvm_double_ty, llvm_double_ty, llvm_double_ty]>;
}
-//
-// Rcp
-//
+ //
+ // Rcp
+ //
let IntrProperties = [IntrNoMem] in {
foreach rnd = ["rn", "rz", "rm", "rp"] in {
foreach ftz = ["", "_ftz"] in
@@ -1070,9 +1068,9 @@ let TargetPrefix = "nvvm" in {
DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_double_ty]>;
}
-//
-// Sqrt
-//
+ //
+ // Sqrt
+ //
let IntrProperties = [IntrNoMem] in {
foreach rnd = ["rn", "rz", "rm", "rp"] in {
foreach ftz = ["", "_ftz"] in
@@ -1091,9 +1089,9 @@ let TargetPrefix = "nvvm" in {
DefaultAttrsIntrinsic<[llvm_float_ty], [llvm_float_ty]>;
}
-//
-// Rsqrt
-//
+ //
+ // Rsqrt
+ //
let IntrProperties = [IntrNoMem] in {
foreach ftz = ["", "_ftz"] in {
def int_nvvm_rsqrt_approx # ftz # _f : NVVMBuiltin,
@@ -1103,208 +1101,206 @@ let TargetPrefix = "nvvm" in {
}
}
-//
-// Add
-//
+ //
+ // Add
+ //
let IntrProperties = [IntrNoMem, IntrSpeculatable, Commutative] in {
foreach rnd = ["rn", "rz", "rm", "rp"] in {
foreach ftz = ["", "_ftz"] in
def int_nvvm_add_ # rnd # ftz # _f : NVVMBuiltin,
DefaultAttrsIntrinsic<[llvm_float_ty], [llvm_float_ty, llvm_float_ty]>;
- def int_nvvm_add_ # rnd # _d : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_double_ty, llvm_double_ty]>;
+ def int_nvvm_add_ # rnd # _d : NVVMBuiltin,
+ DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_double_ty, llvm_double_ty]>;
}
}
-//
-// Dot Product
-//
+ //
+ // Dot Product
+ //
foreach a_type = ["s", "u"] in {
foreach b_type = ["s", "u"] in {
def int_nvvm_idp4a_ # a_type # _ # b_type :
- DefaultAttrsIntrinsic<[llvm_i32_ty],
- [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
- [IntrNoMem, IntrSpeculatable]>;
+ PureIntrinsic<[llvm_i32_ty],
+ [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty]>;
def int_nvvm_idp2a_ # a_type # _ # b_type :
- DefaultAttrsIntrinsic<[llvm_i32_ty],
+ PureIntrinsic<[llvm_i32_ty],
[llvm_i32_ty, llvm_i32_ty, llvm_i1_ty, llvm_i32_ty],
- [IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<2>>]>;
+ [ImmArg<ArgIndex<2>>]>;
}
}
-//
-// Funnel-shift
-//
+ //
+ // Funnel-shift
+ //
foreach direction = ["l", "r"] in
def int_nvvm_fsh # direction # _clamp :
- DefaultAttrsIntrinsic<[llvm_anyint_ty],
- [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>],
- [IntrNoMem, IntrSpeculatable]>;
+ PureIntrinsic<[llvm_anyint_ty],
+ [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>]>;
-//
-// FLO - Find Leading One
-//
+ //
+ // FLO - Find Leading One
+ //
foreach sign = ["s", "u"] in
def int_nvvm_flo_ # sign :
- DefaultAttrsIntrinsic<[llvm_i32_ty],
- [llvm_anyint_ty, llvm_i1_ty],
- [IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<1>>]>;
+ PureIntrinsic<[llvm_i32_ty], [llvm_anyint_ty, llvm_i1_ty],
+ [ImmArg<ArgIndex<1>>]>;
-//
-// szext
-//
+ //
+ // szext
+ //
foreach ext = ["sext", "zext"] in
foreach mode = ["wrap", "clamp"] in
def int_nvvm_ # ext # _ # mode :
- DefaultAttrsIntrinsic<[llvm_i32_ty],
- [llvm_i32_ty, llvm_i32_ty],
- [IntrNoMem, IntrSpeculatable]>;
+ PureIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty]>;
-//
-// BMSK - bit mask
-//
+ //
+ // BMSK - bit mask
+ //
foreach mode = ["wrap", "clamp"] in
def int_nvvm_bmsk_ # mode :
- DefaultAttrsIntrinsic<[llvm_i32_ty],
- [llvm_i32_ty, llvm_i32_ty],
- [IntrNoMem, IntrSpeculatable]>;
-
-//
-// Convert
-//
- let IntrProperties = [IntrNoMem, IntrSpeculatable] in {
- def int_nvvm_lohi_i2d : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_i32_ty, llvm_i32_ty]>;
-
- def int_nvvm_d2i_lo : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_double_ty]>;
- def int_nvvm_d2i_hi : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_double_ty]>;
+ PureIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty]>;
- foreach rnd = ["rn", "rz", "rm", "rp"] in {
- foreach ftz = ["", "_ftz"] in
- def int_nvvm_d2f_ # rnd # ftz : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_float_ty], [llvm_double_ty]>;
+ //
+ // FNS - Find the n-th set bit
+ //
+ def int_nvvm_fns : NVVMBuiltin,
+ PureIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty]>;
+
+ //
+ // Convert
+ //
+ // TODO: All these intrinsics are defined as PureIntrinsic, this attaches the
+ // IntrSpeculatable property to them. Consider if some of these should
+ // have this attribute removed as they may be too expensive.
+ //
+ def int_nvvm_lohi_i2d : NVVMBuiltin,
+ PureIntrinsic<[llvm_double_ty], [llvm_i32_ty, llvm_i32_ty]>;
+
+ def int_nvvm_d2i_lo : NVVMBuiltin,
+ PureIntrinsic<[llvm_i32_ty], [llvm_double_ty]>;
+ def int_nvvm_d2i_hi : NVVMBuiltin,
+ PureIntrinsic<[llvm_i32_ty], [llvm_double_ty]>;
+
+ foreach rnd = ["rn", "rz", "rm", "rp"] in {
+ foreach ftz = ["", "_ftz"] in
+ def int_nvvm_d2f_ # rnd # ftz : NVVMBuiltin,
+ PureIntrinsic<[llvm_float_ty], [llvm_double_ty]>;
- foreach sign = ["", "u"] in {
+ foreach sign = ["", "u"] in {
- def int_nvvm_d2 # sign # i_ # rnd : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_double_ty]>;
+ def int_nvvm_d2 # sign # i_ # rnd : NVVMBuiltin,
+ PureIntrinsic<[llvm_i32_ty], [llvm_double_ty]>;
- def int_nvvm_ # sign # i2d_ # rnd : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_i32_ty]>;
+ def int_nvvm_ # sign # i2d_ # rnd : NVVMBuiltin,
+ PureIntrinsic<[llvm_double_ty], [llvm_i32_ty]>;
- foreach ftz = ["", "_ftz"] in
- def int_nvvm_f2 # sign # i_ # rnd # ftz : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_float_ty]>;
+ foreach ftz = ["", "_ftz"] in
+ def int_nvvm_f2 # sign # i_ # rnd # ftz : NVVMBuiltin,
+ PureIntrinsic<[llvm_i32_ty], [llvm_float_ty]>;
- def int_nvvm_ # sign # i2f_ # rnd : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_float_ty], [llvm_i32_ty]>;
+ def int_nvvm_ # sign # i2f_ # rnd : NVVMBuiltin,
+ PureIntrinsic<[llvm_float_ty], [llvm_i32_ty]>;
- foreach ftz = ["", "_ftz"] in
- def int_nvvm_f2 # sign # ll_ # rnd # ftz : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i64_ty], [llvm_float_ty]>;
+ foreach ftz = ["", "_ftz"] in
+ def int_nvvm_f2 # sign # ll_ # rnd # ftz : NVVMBuiltin,
+ PureIntrinsic<[llvm_i64_ty], [llvm_float_ty]>;
- def int_nvvm_d2 # sign # ll_ # rnd : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i64_ty], [llvm_double_ty]>;
+ def int_nvvm_d2 # sign # ll_ # rnd : NVVMBuiltin,
+ PureIntrinsic<[llvm_i64_ty], [llvm_double_ty]>;
- def int_nvvm_ # sign # ll2f_ # rnd : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_float_ty], [llvm_i64_ty]>;
+ def int_nvvm_ # sign # ll2f_ # rnd : NVVMBuiltin,
+ PureIntrinsic<[llvm_float_ty], [llvm_i64_ty]>;
- def int_nvvm_ # sign # ll2d_ # rnd : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_double_ty], [llvm_i64_ty]>;
+ def int_nvvm_ # sign # ll2d_ # rnd : NVVMBuiltin,
+ PureIntrinsic<[llvm_double_ty], [llvm_i64_ty]>;
- } // sign
- } // rnd
+ } // sign
+ } // rnd
- foreach ftz = ["", "_ftz"] in {
- def int_nvvm_f2h_rn # ftz : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_float_ty]>;
+ foreach ftz = ["", "_ftz"] in {
+ def int_nvvm_f2h_rn # ftz : NVVMBuiltin,
+ PureIntrinsic<[llvm_i16_ty], [llvm_float_ty]>;
- def int_nvvm_bf2h_rn # ftz : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_bfloat_ty]>;
- }
+ def int_nvvm_bf2h_rn # ftz : NVVMBuiltin,
+ PureIntrinsic<[llvm_i16_ty], [llvm_bfloat_ty]>;
+ }
- foreach rnd = ["rn", "rz"] in {
- foreach relu = ["", "_relu"] in {
- def int_nvvm_ff2bf16x2_ # rnd # relu : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_v2bf16_ty], [llvm_float_ty, llvm_float_ty]>;
+ foreach rnd = ["rn", "rz"] in {
+ foreach relu = ["", "_relu"] in {
+ def int_nvvm_ff2bf16x2_ # rnd # relu : NVVMBuiltin,
+ PureIntrinsic<[llvm_v2bf16_ty], [llvm_float_ty, llvm_float_ty]>;
- def int_nvvm_ff2f16x2_ # rnd # relu : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_v2f16_ty], [llvm_float_ty, llvm_float_ty]>;
+ def int_nvvm_ff2f16x2_ # rnd # relu : NVVMBuiltin,
+ PureIntrinsic<[llvm_v2f16_ty], [llvm_float_ty, llvm_float_ty]>;
- def int_nvvm_f2bf16_ # rnd # relu : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_bfloat_ty], [llvm_float_ty]>;
- }
+ def int_nvvm_f2bf16_ # rnd # relu : NVVMBuiltin,
+ PureIntrinsic<[llvm_bfloat_ty], [llvm_float_ty]>;
}
+ }
- foreach satfinite = ["", "_satfinite"] in {
- def int_nvvm_f2tf32_rna # satfinite : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_float_ty]>;
+ foreach satfinite = ["", "_satfinite"] in {
+ def int_nvvm_f2tf32_rna # satfinite : NVVMBuiltin,
+ PureIntrinsic<[llvm_i32_ty], [llvm_float_ty]>;
- foreach rnd = ["rn", "rz"] in
- foreach relu = ["", "_relu"] in
- def int_nvvm_f2tf32_ # rnd # relu # satfinite : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_float_ty]>;
- }
+ foreach rnd = ["rn", "rz"] in
+ foreach relu = ["", "_relu"] in
+ def int_nvvm_f2tf32_ # rnd # relu # satfinite : NVVMBuiltin,
+ PureIntrinsic<[llvm_i32_ty], [llvm_float_ty]>;
+ }
- foreach type = ["e4m3x2", "e5m2x2"] in {
- foreach relu = ["", "_relu"] in {
- def int_nvvm_ff_to_ # type # _rn # relu : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_float_ty, llvm_float_ty]>;
+ foreach type = ["e4m3x2", "e5m2x2"] in {
+ foreach relu = ["", "_relu"] in {
+ def int_nvvm_ff_to_ # type # _rn # relu : NVVMBuiltin,
+ PureIntrinsic<[llvm_i16_ty], [llvm_float_ty, llvm_float_ty]>;
- def int_nvvm_f16x2_to_ # type # _rn # relu : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_v2f16_ty]>;
+ def int_nvvm_f16x2_to_ # type # _rn # relu : NVVMBuiltin,
+ PureIntrinsic<[llvm_i16_ty], [llvm_v2f16_ty]>;
- def int_nvvm_ # type # _to_f16x2_rn # relu : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_v2f16_ty], [llvm_i16_ty]>;
- }
+ def int_nvvm_ # type # _to_f16x2_rn # relu : NVVMBuiltin,
+ PureIntrinsic<[llvm_v2f16_ty], [llvm_i16_ty]>;
}
+ }
- // FP4 conversions.
- foreach relu = ["", "_relu"] in {
- def int_nvvm_ff_to_e2m1x2_rn # relu # _satfinite : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_float_ty, llvm_float_ty]>;
+ // FP4 conversions.
+ foreach relu = ["", "_relu"] in {
+ def int_nvvm_ff_to_e2m1x2_rn # relu # _satfinite : NVVMBuiltin,
+ PureIntrinsic<[llvm_i16_ty], [llvm_float_ty, llvm_float_ty]>;
- def int_nvvm_e2m1x2_to_f16x2_rn # relu : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_v2f16_ty], [llvm_i16_ty]>;
- }
+ def int_nvvm_e2m1x2_to_f16x2_rn # relu : NVVMBuiltin,
+ PureIntrinsic<[llvm_v2f16_ty], [llvm_i16_ty]>;
+ }
- // FP6 conversions.
- foreach type = ["e2m3x2", "e3m2x2"] in {
- foreach relu = ["", "_relu"] in {
- def int_nvvm_ff_to_ # type # _rn # relu # _satfinite : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_float_ty, llvm_float_ty]>;
+ // FP6 conversions.
+ foreach type = ["e2m3x2", "e3m2x2"] in {
+ foreach relu = ["", "_relu"] in {
+ def int_nvvm_ff_to_ # type # _rn # relu # _satfinite : NVVMBuiltin,
+ PureIntrinsic<[llvm_i16_ty], [llvm_float_ty, llvm_float_ty]>;
- def int_nvvm_ # type # _to_f16x2_rn # relu : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_v2f16_ty], [llvm_i16_ty]>;
- }
+ def int_nvvm_ # type # _to_f16x2_rn # relu : NVVMBuiltin,
+ PureIntrinsic<[llvm_v2f16_ty], [llvm_i16_ty]>;
}
+ }
- // UE8M0x2 conversions.
- foreach rmode = ["_rz", "_rp"] in {
- foreach satmode = ["", "_satfinite"] in {
- defvar suffix = rmode # satmode;
- def int_nvvm_ff_to_ue8m0x2 # suffix : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_float_ty, llvm_float_ty]>;
+ // UE8M0x2 conversions.
+ foreach rmode = ["_rz", "_rp"] in {
+ foreach satmode = ["", "_satfinite"] in {
+ defvar suffix = rmode # satmode;
+ def int_nvvm_ff_to_ue8m0x2 # suffix : NVVMBuiltin,
+ PureIntrinsic<[llvm_i16_ty], [llvm_float_ty, llvm_float_ty]>;
- def int_nvvm_bf16x2_to_ue8m0x2 # suffix : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_v2bf16_ty]>;
+ def int_nvvm_bf16x2_to_ue8m0x2 # suffix : NVVMBuiltin,
+ PureIntrinsic<[llvm_i16_ty], [llvm_v2bf16_ty]>;
- }
}
+ }
- def int_nvvm_ue8m0x2_to_bf16x2 : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_v2bf16_ty], [llvm_i16_ty]>;
-
- } // IntrProperties = [IntrNoMem, IntrSpeculatable]
-
-// FNS
- def int_nvvm_fns : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
- [IntrNoMem]>;
+ def int_nvvm_ue8m0x2_to_bf16x2 : NVVMBuiltin,
+ PureIntrinsic<[llvm_v2bf16_ty], [llvm_i16_ty]>;
+ //
+ // Atomic operations
+ //
class SCOPED_ATOMIC2_impl<LLVMType elty>
: Intrinsic<[elty],
[llvm_anyptr_ty, LLVMMatchType<0>],
@@ -1337,7 +1333,9 @@ let TargetPrefix = "nvvm" in {
defm int_nvvm_atomic_and_gen_i : PTXAtomicWithScope2<llvm_anyint_ty>;
defm int_nvvm_atomic_cas_gen_i : PTXAtomicWithScope3<llvm_anyint_ty>;
-// Bar.Sync
+ //
+ // Bar.Sync
+ //
def int_nvvm_barrier0_popc : ClangBuiltin<"__nvvm_bar0_popc">,
Intrinsic<[llvm_i32_ty], [llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
def int_nvvm_barrier0_and : ClangBuiltin<"__nvvm_bar0_and">,
@@ -1361,62 +1359,65 @@ let TargetPrefix = "nvvm" in {
}
}
- // barrier.cluster.[wait, arrive, arrive.relaxed]
- def int_nvvm_barrier_cluster_arrive :
- Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
- def int_nvvm_barrier_cluster_arrive_relaxed :
- Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
- def int_nvvm_barrier_cluster_wait :
- Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
-
- // 'aligned' versions of the above barrier.cluster.* intrinsics
- def int_nvvm_barrier_cluster_arrive_aligned :
- Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
- def int_nvvm_barrier_cluster_arrive_relaxed_aligned :
- Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
- def int_nvvm_barrier_cluster_wait_aligned :
- Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
+ let IntrProperties = [IntrConvergent, IntrNoCallback] in {
+ // barrier.cluster.[wait, arrive, arrive.relaxed]
+ def int_nvvm_barrier_cluster_arrive : Intrinsic<[]>;
+ def int_nvvm_barrier_cluster_arrive_relaxed : Intrinsic<[]>;
+ def int_nvvm_barrier_cluster_wait : Intrinsic<[]>;
+
+ // 'aligned' versions of the above barrier.cluster.* intrinsics
+ def int_nvvm_barrier_cluster_arrive_aligned : Intrinsic<[]>;
+ def int_nvvm_barrier_cluster_arrive_relaxed_aligned : Intrinsic<[]>;
+ def int_nvvm_barrier_cluster_wait_aligned : Intrinsic<[]>;
+ }
+ //
// Membar
- def int_nvvm_membar_cta : NVVMBuiltin, Intrinsic<[], [], [IntrNoCallback]>;
- def int_nvvm_membar_gl : NVVMBuiltin, Intrinsic<[], [], [IntrNoCallback]>;
- def int_nvvm_membar_sys : NVVMBuiltin, Intrinsic<[], [], [IntrNoCallback]>;
- def int_nvvm_fence_sc_cluster : Intrinsic<[], [], [IntrNoCallback]>;
-
-// Proxy fence (uni-directional)
-foreach scope = ["cta", "cluster", "gpu", "sys"] in {
-
- def int_nvvm_fence_proxy_tensormap_generic_release_ # scope :
- Intrinsic<[], [], [IntrNoCallback],
- "llvm.nvvm.fence.proxy.tensormap_generic.release." # scope>;
-
- // The imm-arg 'size' can only be 128.
- def int_nvvm_fence_proxy_tensormap_generic_acquire_ # scope :
- Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
- [IntrNoCallback, IntrArgMemOnly, ImmArg<ArgIndex<1>>,
- Range<ArgIndex<1>, 128, 129>],
- "llvm.nvvm.fence.proxy.tensormap_generic.acquire." # scope>;
-}
+ //
+ let IntrProperties = [IntrNoCallback] in {
+ def int_nvvm_membar_cta : NVVMBuiltin, Intrinsic<[]>;
+ def int_nvvm_membar_gl : NVVMBuiltin, Intrinsic<[]>;
+ def int_nvvm_membar_sys : NVVMBuiltin, Intrinsic<[]>;
+ def int_nvvm_fence_sc_cluster : Intrinsic<[]>;
+ }
+ //
+ // Proxy fence (uni-directional)
+ //
+ foreach scope = ["cta", "cluster", "gpu", "sys"] in {
+
+ def int_nvvm_fence_proxy_tensormap_generic_release_ # scope :
+ Intrinsic<[], [], [IntrNoCallback],
+ "llvm.nvvm.fence.proxy.tensormap_generic.release." # scope>;
+
+ // The imm-arg 'size' can only be 128.
+ def int_nvvm_fence_proxy_tensormap_generic_acquire_ # scope :
+ Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
+ [IntrNoCallback, IntrArgMemOnly, ImmArg<ArgIndex<1>>,
+ Range<ArgIndex<1>, 128, 129>],
+ "llvm.nvvm.fence.proxy.tensormap_generic.acquire." # scope>;
+ }
+
+//
// Async Copy
+//
let IntrProperties = [IntrConvergent, IntrNoCallback] in {
def int_nvvm_cp_async_mbarrier_arrive : NVVMBuiltin,
- Intrinsic<[],[llvm_ptr_ty]>;
+ Intrinsic<[], [llvm_ptr_ty]>;
def int_nvvm_cp_async_mbarrier_arrive_shared : NVVMBuiltin,
- Intrinsic<[],[llvm_shared_ptr_ty]>;
+ Intrinsic<[], [llvm_shared_ptr_ty]>;
def int_nvvm_cp_async_mbarrier_arrive_noinc : NVVMBuiltin,
- Intrinsic<[],[llvm_ptr_ty]>;
+ Intrinsic<[], [llvm_ptr_ty]>;
def int_nvvm_cp_async_mbarrier_arrive_noinc_shared : NVVMBuiltin,
- Intrinsic<[],[llvm_shared_ptr_ty]>;
+ Intrinsic<[], [llvm_shared_ptr_ty]>;
}
multiclass CP_ASYNC_SHARED_GLOBAL {
- def NAME : Intrinsic<[], [llvm_shared_ptr_ty, llvm_global_ptr_ty],
- [IntrArgMemOnly, IntrNoCallback, NoAlias<ArgIndex<0>>, NoAlias<ArgIndex<1>>,
- WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>]>;
- def _s : Intrinsic<[], [llvm_shared_ptr_ty, llvm_global_ptr_ty, llvm_i32_ty],
- [IntrArgMemOnly, IntrNoCallback, NoAlias<ArgIndex<0>>, NoAlias<ArgIndex<1>>,
- WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>]>;
+ let IntrProperties = [IntrArgMemOnly, IntrNoCallback, NoAlias<ArgIndex<0>>,
+ NoAlias<ArgIndex<1>>, WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>] in {
+ def NAME : Intrinsic<[], [llvm_shared_ptr_ty, llvm_global_ptr_ty]>;
+ def _s : Intrinsic<[], [llvm_shared_ptr_ty, llvm_global_ptr_ty, llvm_i32_ty]>;
+ }
}
defm int_nvvm_cp_async_ca_shared_global_4 : CP_ASYNC_SHARED_GLOBAL;
@@ -1424,17 +1425,15 @@ defm int_nvvm_cp_async_ca_shared_global_8 : CP_ASYNC_SHARED_GLOBAL;
defm int_nvvm_cp_async_ca_shared_global_16 : CP_ASYNC_SHARED_GLOBAL;
defm int_nvvm_cp_async_cg_shared_global_16 : CP_ASYNC_SHARED_GLOBAL;
-def int_nvvm_cp_async_commit_group : NVVMBuiltin, Intrinsic<[], [], []>;
+def int_nvvm_cp_async_commit_group : NVVMBuiltin, Intrinsic<[]>;
def int_nvvm_cp_async_wait_group : NVVMBuiltin,
Intrinsic<[], [llvm_i32_ty], [ImmArg<ArgIndex<0>>]>;
-def int_nvvm_cp_async_wait_all : NVVMBuiltin,
- Intrinsic<[], [], []>;
+def int_nvvm_cp_async_wait_all : NVVMBuiltin, Intrinsic<[]>;
// cp.async.bulk variants of the commit/wait group
-def int_nvvm_cp_async_bulk_commit_group :
- Intrinsic<[], [], []>;
+def int_nvvm_cp_async_bulk_commit_group : Intrinsic<[]>;
def int_nvvm_cp_async_bulk_wait_group :
Intrinsic<[], [llvm_i32_ty], [ImmArg<ArgIndex<0>>]>;
@@ -1457,29 +1456,30 @@ def int_nvvm_mbarrier_inval_shared : NVVMBuiltin,
[IntrConvergent, IntrWriteMem, IntrArgMemOnly, IntrNoCallback,
WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>]>;
-def int_nvvm_mbarrier_arrive : NVVMBuiltin,
- Intrinsic<[llvm_i64_ty], [llvm_ptr_ty], [IntrConvergent, IntrNoCallback]>;
-def int_nvvm_mbarrier_arrive_shared : NVVMBuiltin,
- Intrinsic<[llvm_i64_ty], [llvm_shared_ptr_ty], [IntrConvergent, IntrNoCallback]>;
-def int_nvvm_mbarrier_arrive_noComplete : NVVMBuiltin,
- Intrinsic<[llvm_i64_ty], [llvm_ptr_ty, llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
-def int_nvvm_mbarrier_arrive_noComplete_shared : NVVMBuiltin,
- Intrinsic<[llvm_i64_ty], [llvm_shared_ptr_ty,
- llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
-
-def int_nvvm_mbarrier_arrive_drop : NVVMBuiltin,
- Intrinsic<[llvm_i64_ty], [llvm_ptr_ty], [IntrConvergent, IntrNoCallback]>;
-def int_nvvm_mbarrier_arrive_drop_shared : NVVMBuiltin,
- Intrinsic<[llvm_i64_ty], [llvm_shared_ptr_ty], [IntrConvergent, IntrNoCallback]>;
-def int_nvvm_mbarrier_arrive_drop_noComplete : NVVMBuiltin,
- Intrinsic<[llvm_i64_ty], [llvm_ptr_ty, llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
-def int_nvvm_mbarrier_arrive_drop_noComplete_shared : NVVMBuiltin,
- Intrinsic<[llvm_i64_ty], [llvm_shared_ptr_ty, llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
-
-def int_nvvm_mbarrier_test_wait : NVVMBuiltin,
- Intrinsic<[llvm_i1_ty], [llvm_ptr_ty, llvm_i64_ty], [IntrConvergent, IntrNoCallback]>;
-def int_nvvm_mbarrier_test_wait_shared : NVVMBuiltin,
- Intrinsic<[llvm_i1_ty], [llvm_shared_ptr_ty, llvm_i64_ty], [IntrConvergent, IntrNoCallback]>;
+let IntrProperties = [IntrConvergent, IntrNoCallback] in {
+ def int_nvvm_mbarrier_arrive : NVVMBuiltin,
+ Intrinsic<[llvm_i64_ty], [llvm_ptr_ty]>;
+ def int_nvvm_mbarrier_arrive_shared : NVVMBuiltin,
+ Intrinsic<[llvm_i64_ty], [llvm_shared_ptr_ty]>;
+ def int_nvvm_mbarrier_arrive_noComplete : NVVMBuiltin,
+ Intrinsic<[llvm_i64_ty], [llvm_ptr_ty, llvm_i32_ty]>;
+ def int_nvvm_mbarrier_arrive_noComplete_shared : NVVMBuiltin,
+ Intrinsic<[llvm_i64_ty], [llvm_shared_ptr_ty, llvm_i32_ty]>;
+
+ def int_nvvm_mbarrier_arrive_drop : NVVMBuiltin,
+ Intrinsic<[llvm_i64_ty], [llvm_ptr_ty]>;
+ def int_nvvm_mbarrier_arrive_drop_shared : NVVMBuiltin,
+ Intrinsic<[llvm_i64_ty], [llvm_shared_ptr_ty]>;
+ def int_nvvm_mbarrier_arrive_drop_noComplete : NVVMBuiltin,
+ Intrinsic<[llvm_i64_ty], [llvm_ptr_ty, llvm_i32_ty]>;
+ def int_nvvm_mbarrier_arrive_drop_noComplete_shared : NVVMBuiltin,
+ Intrinsic<[llvm_i64_ty], [llvm_shared_ptr_ty, llvm_i32_ty]>;
+
+ def int_nvvm_mbarrier_test_wait : NVVMBuiltin,
+ Intrinsic<[llvm_i1_ty], [llvm_ptr_ty, llvm_i64_ty]>;
+ def int_nvvm_mbarrier_test_wait_shared : NVVMBuiltin,
+ Intrinsic<[llvm_i1_ty], [llvm_shared_ptr_ty, llvm_i64_ty]>;
+}
def int_nvvm_mbarrier_pending_count : NVVMBuiltin,
Intrinsic<[llvm_i32_ty], [llvm_i64_ty], [IntrNoMem, IntrConvergent, IntrNoCallback]>;
@@ -1504,9 +1504,8 @@ let IntrProperties = [IntrReadMem, IntrArgMemOnly, IntrNoCallback, IntrWillRetur
// space when lowered during ISel.
//
def int_nvvm_internal_addrspace_wrap :
- DefaultAttrsIntrinsic<[llvm_anyptr_ty], [llvm_anyptr_ty],
- [IntrNoMem, IntrSpeculatable, NoUndef<ArgIndex<0>>,
- NoUndef<RetIndex>]>;
+ PureIntrinsic<[llvm_anyptr_ty], [llvm_anyptr_ty],
+ [NoUndef<ArgIndex<0>>, NoUndef<RetIndex>]>;
// Move intrinsics, used in nvvm internally
@@ -1520,36 +1519,26 @@ let IntrProperties = [IntrNoMem] in {
}
// For getting the handle from a texture or surface variable
-let IntrProperties = [IntrNoMem, IntrSpeculatable] in {
- def int_nvvm_texsurf_handle
- : DefaultAttrsIntrinsic<[llvm_i64_ty], [llvm_metadata_ty, llvm_anyptr_ty]>;
- def int_nvvm_texsurf_handle_internal
- : DefaultAttrsIntrinsic<[llvm_i64_ty], [llvm_anyptr_ty]>;
-}
+def int_nvvm_texsurf_handle
+ : PureIntrinsic<[llvm_i64_ty], [llvm_metadata_ty, llvm_anyptr_ty]>;
+def int_nvvm_texsurf_handle_internal
+ : PureIntrinsic<[llvm_i64_ty], [llvm_anyptr_ty]>;
/// Error / Warn
def int_nvvm_compiler_error : Intrinsic<[], [llvm_anyptr_ty]>;
def int_nvvm_compiler_warn : Intrinsic<[], [llvm_anyptr_ty]>;
-def int_nvvm_reflect : NVVMBuiltin,
- Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem]>;
+def int_nvvm_reflect : NVVMBuiltin, PureIntrinsic<[llvm_i32_ty], [llvm_ptr_ty]>;
// isspacep.{const, global, local, shared}
foreach space = ["const", "global", "local", "shared", "shared_cluster"] in
def int_nvvm_isspacep_ # space : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_ptr_ty],
- [IntrNoMem, IntrSpeculatable, NoCapture<ArgIndex<0>>]>;
-
-// Environment register read
-foreach i = 0...31 in
- def int_nvvm_read_ptx_sreg_envreg # i : NVVMBuiltin,
- DefaultAttrsIntrinsic<[llvm_i32_ty], [],
- [IntrNoMem, IntrSpeculatable, NoUndef<RetIndex>]>;
+ PureIntrinsic<[llvm_i1_ty], [llvm_ptr_ty], [NoCapture<ArgIndex<0>>]>;
//
// Texture Fetch
//
-let IntrProperties = [IntrReadMem] in {
+let IntrProperties = [IntrReadMem, IntrNoCallback, IntrNoFree, IntrWillReturn] in {
foreach is_unified = [true, false] in {
defvar mode = !if(is_unified, "_unified", "");
defvar addr_args = !if(is_unified, [llvm_i64_ty], [llvm_i64_ty, llvm_i64_ty]);
@@ -1558,76 +1547,63 @@ let IntrProperties = [IntrReadMem] in {
foreach is_array = [true, false] in {
defvar array = !if(is_array, "_array", "");
defvar array_args = !if(is_array, [llvm_i32_ty], []<LLVMType>);
+ defvar base_args = !listconcat(addr_args, array_args);
def int_nvvm_tex # mode # _1d # array # _ # vec.Name # _s32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_i32_ty, 1))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_i32_ty, 1)>;
def int_nvvm_tex # mode # _1d # array # _ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_float_ty, 1))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 1)>;
def int_nvvm_tex # mode # _1d # array # _level_ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_float_ty, 2))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 2)>;
def int_nvvm_tex # mode # _1d # array # _grad_ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_float_ty, 3))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 3)>;
def int_nvvm_tex # mode # _2d # array # _ # vec.Name # _s32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_i32_ty, 2))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_i32_ty, 2)>;
def int_nvvm_tex # mode # _2d # array # _ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_float_ty, 2))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 2)>;
def int_nvvm_tex # mode # _2d # array # _level_ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_float_ty, 3))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 3)>;
def int_nvvm_tex # mode # _2d # array # _grad_ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_float_ty, 6))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 6)>;
if !not(is_array) then {
def int_nvvm_tex # mode # _3d_ # vec.Name # _s32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, !listsplat(llvm_i32_ty, 3))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_i32_ty, 3)>;
def int_nvvm_tex # mode # _3d_ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, !listsplat(llvm_float_ty, 3))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 3)>;
def int_nvvm_tex # mode # _3d_level_ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, !listsplat(llvm_float_ty, 4))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 4)>;
def int_nvvm_tex # mode # _3d_grad_ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, !listsplat(llvm_float_ty, 9))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 9)>;
}
def int_nvvm_tex # mode # _cube # array # _ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_float_ty, 3))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 3)>;
def int_nvvm_tex # mode # _cube # array # _level_ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_float_ty, 4))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 4)>;
if is_unified then
def int_nvvm_tex # mode # _cube # array # _grad_ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, array_args, !listsplat(llvm_float_ty, 9))>;
+ : Intrinsic<vec.Types, base_args # !listsplat(llvm_float_ty, 9)>;
} // is_array
foreach comp = ["r", "g", "b", "a"] in {
def int_nvvm_tld4 # mode # _ # comp # _2d_ # vec.Name # _f32
- : Intrinsic<vec.Types,
- !listconcat(addr_args, !listsplat(llvm_float_ty, 2))>;
+ : Intrinsic<vec.Types, addr_args # !listsplat(llvm_float_ty, 2)>;
} // comp
} // vec
} // is_unified
} // IntrProperties = [IntrReadMem]
//=== Surface Load
-let IntrProperties = [IntrReadMem] in {
- foreach clamp = ["clamp", "trap", "zero"] in {
- foreach vec = [TV_I8, TV_I16, TV_I32, TV_I64,
- TV_V2I8, TV_V2I16, TV_V2I32, TV_V2I64,
- TV_V4I8, TV_V4I16, TV_V4I32] in {
+foreach clamp = ["clamp", "trap", "zero"] in {
+ foreach vec = [TV_I8, TV_I16, TV_I32, TV_I64,
+ TV_V2I8, TV_V2I16, TV_V2I32, TV_V2I64,
+ TV_V4I8, TV_V4I16, TV_V4I32] in {
+
+ let IntrProperties = [IntrNoCallback, IntrNoFree, IntrReadMem]
+ # !if(!ne(clamp, "trap"), [IntrWillReturn], []<IntrinsicProperty>) in {
def int_nvvm_suld_1d_ # vec.Name # _ # clamp
: Intrinsic<vec.Types,
@@ -1648,47 +1624,50 @@ let IntrProperties = [IntrReadMem] in {
def int_nvvm_suld_3d_ # vec.Name # _ # clamp
: Intrinsic<vec.Types,
[llvm_i64_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty]>;
- } // vec
- } // clamp
-} // IntrProperties = [IntrReadMem]
+ }
+ } // vec
+} // clamp
//===- Texture Query ------------------------------------------------------===//
foreach query = ["channel_order", "channel_data_type", "width", "height",
"depth", "array_size", "num_samples", "num_mipmap_levels"] in
def int_nvvm_txq_ # query : NVVMBuiltin,
- Intrinsic<[llvm_i32_ty], [llvm_i64_ty], [IntrNoMem]>;
+ DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i64_ty], [IntrNoMem]>;
//===- Surface Query ------------------------------------------------------===//
foreach query = ["channel_order", "channel_data_type", "width", "height",
"depth", "array_size"] in
def int_nvvm_suq_ # query : NVVMBuiltin,
- Intrinsic<[llvm_i32_ty], [llvm_i64_ty], [IntrNoMem]>;
+ DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i64_ty], [IntrNoMem]>;
//===- Handle Query -------------------------------------------------------===//
foreach type = ["sampler", "surface", "texture"] in
def int_nvvm_istypep_ # type : NVVMBuiltin,
- Intrinsic<[llvm_i1_ty], [llvm_i64_ty], [IntrNoMem]>;
+ DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_i64_ty], [IntrNoMem]>;
//===- Surface Stores -----------------------------------------------------===//
multiclass SurfaceStoreIntrinsics<string clamp, TexVector vec> {
- def _1d_ # vec.Name # _ # clamp : NVVMBuiltin,
- Intrinsic<[], !listconcat([llvm_i64_ty, llvm_i32_ty], vec.Types)>;
+ let IntrProperties = [IntrNoCallback, IntrNoFree, IntrWriteMem] #
+ !if(!ne(clamp, "trap"), [IntrWillReturn], []<IntrinsicProperty>) in {
+ def _1d_ # vec.Name # _ # clamp : NVVMBuiltin,
+ Intrinsic<[], [llvm_i64_ty, llvm_i32_ty] # vec.Types>;
- def _1d_array_ # vec.Name # _ # clamp : NVVMBuiltin,
- Intrinsic<[], !listconcat([llvm_i64_ty, llvm_i32_ty, llvm_i32_ty], vec.Types)>;
+ def _1d_array_ # vec.Name # _ # clamp : NVVMBuiltin,
+ Intrinsic<[], [llvm_i64_ty, llvm_i32_ty, llvm_i32_ty] # vec.Types>;
- def _2d_ # vec.Name # _ # clamp : NVVMBuiltin,
- Intrinsic<[], !listconcat([llvm_i64_ty, llvm_i32_ty, llvm_i32_ty], vec.Types)>;
+ def _2d_ # vec.Name # _ # clamp : NVVMBuiltin,
+ Intrinsic<[], [llvm_i64_ty, llvm_i32_ty, llvm_i32_ty] # vec.Types>;
- def _2d_array_ # vec.Name # _ # clamp : NVVMBuiltin,
- Intrinsic<[], !listconcat([llvm_i64_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty], vec.Types)>;
+ def _2d_array_ # vec.Name # _ # clamp : NVVMBuiltin,
+ Intrinsic<[], [llvm_i64_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty] # vec.Types>;
- def _3d_ # vec.Name # _ # clamp : NVVMBuiltin,
- Intrinsic<[], !listconcat([llvm_i64_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty], vec.Types)>;
+ def _3d_ # vec.Name # _ # clamp : NVVMBuiltin,
+ Intrinsic<[], [llvm_i64_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty] # vec.Types>;
+ }
}
// Unformatted
@@ -1704,23 +1683,17 @@ foreach vec = [TV_I8, TV_I16, TV_I32,
TV_V4I8, TV_V4I16, TV_V4I32] in
defm int_nvvm_sust_p : SurfaceStoreIntrinsics<"trap", vec>;
+//
// Accessing special registers.
-
+//
class PTXReadSRegIntrinsicNB_r32<list<IntrinsicProperty> properties = []>
- : DefaultAttrsIntrinsic<[llvm_i32_ty], [],
- !listconcat([IntrNoMem, IntrSpeculatable, NoUndef<RetIndex>], properties)>;
+ : PureIntrinsic<[llvm_i32_ty], [], [NoUndef<RetIndex>] # properties>;
class PTXReadSRegIntrinsic_r32<list<IntrinsicProperty> properties = []>
- : PTXReadSRegIntrinsicNB_r32<properties>,
- NVVMBuiltin;
+ : PTXReadSRegIntrinsicNB_r32<properties>, NVVMBuiltin;
multiclass PTXReadSRegIntrinsic_v4i32<list<list<IntrinsicProperty>> properties = [[], [], [], []]> {
assert !eq(!size(properties), 4), "properties must be a list of 4 lists";
-// FIXME: Do we need the 128-bit integer type version?
-// def _r64 : Intrinsic<[llvm_i128_ty], [], [IntrNoMem, IntrSpeculatable]>;
-
-// FIXME: Enable this once v4i32 support is enabled in back-end.
-// def _v4i16 : Intrinsic<[llvm_v4i32_ty], [], [IntrNoMem, IntrSpeculatable]>;
defvar suffixes = ["_x", "_y", "_z", "_w"];
foreach i = !range(suffixes) in
def suffixes[i] : PTXReadSRegIntrinsic_r32<properties[i]>;
@@ -1737,30 +1710,20 @@ multiclass PTXReadSRegIntrinsicNB_v4i32<list<list<IntrinsicProperty>> properties
// Intrinsics to read registers with non-constant values. E.g. the values that
// do change over the kernel lifetime. Such reads should not be CSE'd.
-class PTXReadNCSRegIntrinsic_r32
- : Intrinsic<[llvm_i32_ty], [], [IntrInaccessibleMemOnly, IntrNoCallback, NoUndef<RetIndex>]>,
- NVVMBuiltin;
-class PTXReadNCSRegIntrinsic_r64
- : Intrinsic<[llvm_i64_ty], [], [IntrInaccessibleMemOnly, IntrNoCallback, NoUndef<RetIndex>]>,
+class PTXReadNCSRegIntrinsic<LLVMType ty>
+ : Intrinsic<[ty], [], [IntrInaccessibleMemOnly, IntrNoCallback,
+ IntrNoFree, IntrWillReturn, NoUndef<RetIndex>]>,
NVVMBuiltin;
-defm int_nvvm_read_ptx_sreg_tid
- : PTXReadSRegIntrinsic_v4i32<[[Range<RetIndex, 0, MAX_BLOCK_SIZE_X>],
- [Range<RetIndex, 0, MAX_BLOCK_SIZE_Y>],
- [Range<RetIndex, 0, MAX_BLOCK_SIZE_Z>],
- [Range<RetIndex, 0, 1>]]>;
-
-defm int_nvvm_read_ptx_sreg_ntid
- : PTXReadSRegIntrinsic_v4i32<[[Range<RetIndex, 1, !add(MAX_BLOCK_SIZE_X, 1)>],
- [Range<RetIndex, 1, !add(MAX_BLOCK_SIZE_Y, 1)>],
- [Range<RetIndex, 1, !add(MAX_BLOCK_SIZE_Z, 1)>],
- [Range<RetIndex, 0, 1>]]>;
-
-def int_nvvm_read_ptx_sreg_laneid
- : PTXReadSRegIntrinsic_r32<[Range<RetIndex, 0, WARP_SIZE>]>;
+defvar MAX_BLOCK_ID_RANGE = [[Range<RetIndex, 0, MAX_BLOCK_SIZE_X>],
+ [Range<RetIndex, 0, MAX_BLOCK_SIZE_Y>],
+ [Range<RetIndex, 0, MAX_BLOCK_SIZE_Z>],
+ [Range<RetIndex, 0, 1>]];
-def int_nvvm_read_ptx_sreg_warpid : PTXReadSRegIntrinsic_r32;
-def int_nvvm_read_ptx_sreg_nwarpid : PTXReadSRegIntrinsic_r32;
+defvar MAX_BLOCK_NID_RANGE = [[Range<RetIndex, 1, !add(MAX_BLOCK_SIZE_X, 1)>],
+ [Range<RetIndex, 1, !add(MAX_BLOCK_SIZE_Y, 1)>],
+ [Range<RetIndex, 1, !add(MAX_BLOCK_SIZE_Z, 1)>],
+ [Range<RetIndex, 0, 1>]];
defvar MAX_GRID_ID_RANGE = [[Range<RetIndex, 0, MAX_GRID_SIZE_X>],
[Range<RetIndex, 0, MAX_GRID_SIZE_Y>],
@@ -1772,11 +1735,17 @@ defvar MAX_GRID_NID_RANGE = [[Range<RetIndex, 1, !add(MAX_GRID_SIZE_X, 1)>],
[Range<RetIndex, 1, !add(MAX_GRID_SIZE_Z, 1)>],
[Range<RetIndex, 0, 1>]];
-defm int_nvvm_read_ptx_sreg_ctaid
- : PTXReadSRegIntrinsic_v4i32<MAX_GRID_ID_RANGE>;
+defm int_nvvm_read_ptx_sreg_tid : PTXReadSRegIntrinsic_v4i32<MAX_BLOCK_ID_RANGE>;
+defm int_nvvm_read_ptx_sreg_ntid : PTXReadSRegIntrinsic_v4i32<MAX_BLOCK_NID_RANGE>;
+
+def int_nvvm_read_ptx_sreg_laneid
+ : PTXReadSRegIntrinsic_r32<[Range<RetIndex, 0, WARP_SIZE>]>;
+
+def int_nvvm_read_ptx_sreg_warpid : PTXReadSRegIntrinsic_r32;
+def int_nvvm_read_ptx_sreg_nwarpid : PTXReadSRegIntrinsic_r32;
-defm int_nvvm_read_ptx_sreg_nctaid
- : PTXReadSRegIntrinsic_v4i32<MAX_GRID_NID_RANGE>;
+defm int_nvvm_read_ptx_sreg_ctaid : PTXReadSRegIntrinsic_v4i32<MAX_GRID_ID_RANGE>;
+defm int_nvvm_read_ptx_sreg_nctaid : PTXReadSRegIntrinsic_v4i32<MAX_GRID_NID_RANGE>;
def int_nvvm_read_ptx_sreg_smid : PTXReadSRegIntrinsic_r32;
def int_nvvm_read_ptx_sreg_nsmid : PTXReadSRegIntrinsic_r32;
@@ -1788,19 +1757,22 @@ def int_nvvm_read_ptx_sreg_lanemask_lt : PTXReadSRegIntrinsic_r32;
def int_nvvm_read_ptx_sreg_lanemask_ge : PTXReadSRegIntrinsic_r32;
def int_nvvm_read_ptx_sreg_lanemask_gt : PTXReadSRegIntrinsic_r32;
-def int_nvvm_read_ptx_sreg_clock : PTXReadNCSRegIntrinsic_r32;
-def int_nvvm_read_ptx_sreg_clock64 : PTXReadNCSRegIntrinsic_r64;
+def int_nvvm_read_ptx_sreg_clock : PTXReadNCSRegIntrinsic<llvm_i32_ty>;
+def int_nvvm_read_ptx_sreg_clock64 : PTXReadNCSRegIntrinsic<llvm_i64_ty>;
-def int_nvvm_read_ptx_sreg_globaltimer : PTXReadNCSRegIntrinsic_r64;
+def int_nvvm_read_ptx_sreg_globaltimer : PTXReadNCSRegIntrinsic<llvm_i64_ty>;
-def int_nvvm_read_ptx_sreg_pm0 : PTXReadNCSRegIntrinsic_r32;
-def int_nvvm_read_ptx_sreg_pm1 : PTXReadNCSRegIntrinsic_r32;
-def int_nvvm_read_ptx_sreg_pm2 : PTXReadNCSRegIntrinsic_r32;
-def int_nvvm_read_ptx_sreg_pm3 : PTXReadNCSRegIntrinsic_r32;
+def int_nvvm_read_ptx_sreg_pm0 : PTXReadNCSRegIntrinsic<llvm_i32_ty>;
+def int_nvvm_read_ptx_sreg_pm1 : PTXReadNCSRegIntrinsic<llvm_i32_ty>;
+def int_nvvm_read_ptx_sreg_pm2 : PTXReadNCSRegIntrinsic<llvm_i32_ty>;
+def int_nvvm_read_ptx_sreg_pm3 : PTXReadNCSRegIntrinsic<llvm_i32_ty>;
def int_nvvm_read_ptx_sreg_warpsize
: PTXReadSRegIntrinsic_r32<[Range<RetIndex, WARP_SIZE, !add(WARP_SIZE, 1)>]>;
+foreach i = 0...31 in
+ def int_nvvm_read_ptx_sreg_envreg # i : PTXReadSRegIntrinsic_r32;
+
// sm90+, PTX7.8+
// Note: Since clusters are subdivisions of the grid, we conservatively use the
@@ -1808,14 +1780,10 @@ def int_nvvm_read_ptx_sreg_warpsize
// practice, the clusterid will likely be much smaller. The CUDA programming
// guide recommends 8 as a maximum portable value and H100s support 16.
-defm int_nvvm_read_ptx_sreg_clusterid
- : PTXReadSRegIntrinsicNB_v4i32<MAX_GRID_ID_RANGE>;
-defm int_nvvm_read_ptx_sreg_nclusterid
- : PTXReadSRegIntrinsicNB_v4i32<MAX_GRID_NID_RANGE>;
-defm int_nvvm_read_ptx_sreg_cluster_ctaid
- : PTXReadSRegIntrinsicNB_v4i32<MAX_GRID_ID_RANGE>;
-defm int_nvvm_read_ptx_sreg_cluster_nctaid
- : PTXReadSRegIntrinsicNB_v4i32<MAX_GRID_NID_RANGE>;
+defm int_nvvm_read_ptx_sreg_clusterid : PTXReadSRegIntrinsicNB_v4i32<MAX_GRID_ID_RANGE>;
+defm int_nvvm_read_ptx_sreg_nclusterid : PTXReadSRegIntrinsicNB_v4i32<MAX_GRID_NID_RANGE>;
+defm int_nvvm_read_ptx_sreg_cluster_ctaid : PTXReadSRegIntrinsicNB_v4i32<MAX_GRID_ID_RANGE>;
+defm int_nvvm_read_ptx_sreg_cluster_nctaid : PTXReadSRegIntrinsicNB_v4i32<MAX_GRID_NID_RANGE>;
def int_nvvm_read_ptx_sreg_cluster_ctarank : PTXReadSRegIntrinsicNB_r32;
def int_nvvm_read_ptx_sreg_cluster_nctarank : PTXReadSRegIntrinsicNB_r32;
@@ -1843,13 +1811,13 @@ let IntrProperties = [IntrInaccessibleMemOnly, IntrConvergent, IntrNoCallback] i
//
// VOTE
//
-
let IntrProperties = [IntrInaccessibleMemOnly, IntrConvergent, IntrNoCallback] in {
def int_nvvm_vote_all : NVVMBuiltin, Intrinsic<[llvm_i1_ty], [llvm_i1_ty]>;
def int_nvvm_vote_any : NVVMBuiltin, Intrinsic<[llvm_i1_ty], [llvm_i1_ty]>;
def int_nvvm_vote_uni : NVVMBuiltin, Intrinsic<[llvm_i1_ty], [llvm_i1_ty]>;
def int_nvvm_vote_ballot : NVVMBuiltin, Intrinsic<[llvm_i32_ty], [llvm_i1_ty]>;
}
+
//
// VOTE.SYNC
//
@@ -2052,8 +2020,7 @@ let IntrProperties = [IntrNoMem, IntrSpeculatable, NoCapture<ArgIndex<0>>] in {
}
def int_nvvm_is_explicit_cluster
- : DefaultAttrsIntrinsic<[llvm_i1_ty], [],
- [IntrNoMem, IntrSpeculatable, NoUndef<RetIndex>],
+ : PureIntrinsic<[llvm_i1_ty], [], [NoUndef<RetIndex>],
"llvm.nvvm.is_explicit_cluster">;
// Setmaxnreg inc/dec intrinsics
@@ -2458,13 +2425,12 @@ def int_nvvm_clusterlaunchcontrol_try_cancel_async_multicast_shared
// clusterlaunchcontrol.query_cancel.is_canceled
def int_nvvm_clusterlaunchcontrol_query_cancel_is_canceled
- : DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_i128_ty], [IntrNoMem, IntrSpeculatable],
- "llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled">;
+ : PureIntrinsic<[llvm_i1_ty], [llvm_i128_ty], [],
+ "llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled">;
-foreach dim = ["x", "y", "z"] in {
-def int_nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_ # dim
- : DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i128_ty], [IntrNoMem, IntrSpeculatable],
- "llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid." # dim>;
-}
+foreach dim = ["x", "y", "z"] in
+ def int_nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_ # dim
+ : PureIntrinsic<[llvm_i32_ty], [llvm_i128_ty], [],
+ "llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid." # dim>;
} // let TargetPrefix = "nvvm"
>From 07738545758be942cb674254ed4bc6d12db48563 Mon Sep 17 00:00:00 2001
From: 黃國庭 <we3223 at gmail.com>
Date: Mon, 18 Aug 2025 23:36:26 +0800
Subject: [PATCH 044/112] [DAG] Fold trunc(avg(x,y)) for avgceil/floor u/s
nodes if they have sufficient leading zero/sign bits (#152273)
avgceil version : https://alive2.llvm.org/ce/z/2CKrRh
Fixes #147773
---------
Co-authored-by: Simon Pilgrim <llvm-dev at redking.me.uk>
---
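A minimal sketch of why the fold in this patch is sound, assuming the usual definitions avgfloor(x, y) = floor((x + y) / 2) and avgceil(x, y) = floor((x + y + 1) / 2). It models the i16 to i8 case from the tests outside of LLVM; the helper names (`interpret`, `avg`, `foldMatches`) are hypothetical and do not exist in the DAGCombiner. Inputs are restricted to values with zero upper bits (unsigned) or at least SrcBits - DstBits + 1 sign bits (signed), which is the precondition the patch checks.

```cpp
#include <cassert>
#include <cstdint>

// Interpret the low Bits bits of Pattern as an unsigned or signed integer.
int64_t interpret(uint32_t Pattern, unsigned Bits, bool Signed) {
  uint32_t Mask = (1u << Bits) - 1;
  int64_t V = Pattern & Mask;
  if (Signed && (V >> (Bits - 1)) != 0)
    V -= int64_t(1) << Bits;
  return V;
}

// Model of avgfloor/avgceil at a given bit width; returns a bit pattern.
uint32_t avg(uint32_t X, uint32_t Y, unsigned Bits, bool Signed, bool Ceil) {
  int64_t Sum = interpret(X, Bits, Signed) + interpret(Y, Bits, Signed) +
                (Ceil ? 1 : 0);
  int64_t Half = (Sum - (Sum < 0 ? 1 : 0)) / 2; // floor(Sum / 2)
  return uint32_t(Half) & ((1u << Bits) - 1);
}

// True if trunc(avg16(X, Y)) equals avg8(trunc(X), trunc(Y)).
bool foldMatches(uint32_t X, uint32_t Y, bool Signed, bool Ceil) {
  uint32_t Wide = avg(X, Y, 16, Signed, Ceil) & 0xFF;
  uint32_t Narrow = avg(X & 0xFF, Y & 0xFF, 8, Signed, Ceil);
  return Wide == Narrow;
}

// Checks every pair of 16-bit inputs that satisfies the precondition:
// zero-extended i8 values (unsigned) or sign-extended i8 values (signed).
bool truncAvgFoldHoldsUnderPrecondition() {
  for (int Signed = 0; Signed < 2; ++Signed)
    for (int Ceil = 0; Ceil < 2; ++Ceil)
      for (uint32_t A = 0; A < 256; ++A)
        for (uint32_t B = 0; B < 256; ++B) {
          uint32_t X = A | ((Signed && (A & 0x80)) ? 0xFF00u : 0u);
          uint32_t Y = B | ((Signed && (B & 0x80)) ? 0xFF00u : 0u);
          if (!foldMatches(X, Y, Signed != 0, Ceil != 0))
            return false;
        }
  return true;
}

// Without the precondition the fold is wrong: avgflooru(0x0100, 0) is 0x80
// at 16 bits, but the truncated operands are both 0.
bool foldBreaksWithoutPrecondition() {
  return !foldMatches(0x0100, 0, /*Signed=*/false, /*Ceil=*/false);
}
```

Calling `truncAvgFoldHoldsUnderPrecondition()` from a small `main` covers all 4 x 65536 qualifying input pairs, and `foldBreaksWithoutPrecondition()` shows a single input pair for which the narrowed average differs once the upper bits are not known to be zero.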
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | 34 ++++++++++++
llvm/test/CodeGen/AArch64/trunc-avg-fold.ll | 53 +++++++++++++++++++
2 files changed, 87 insertions(+)
create mode 100644 llvm/test/CodeGen/AArch64/trunc-avg-fold.ll
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 43d4138df8b49..c16ccaf926bc7 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -16279,6 +16279,40 @@ SDValue DAGCombiner::visitTRUNCATE(SDNode *N) {
// because targets may prefer a wider type during later combines and invert
// this transform.
switch (N0.getOpcode()) {
+ case ISD::AVGCEILU:
+ case ISD::AVGFLOORU:
+ if (!LegalOperations && N0.hasOneUse() &&
+ TLI.isOperationLegal(N0.getOpcode(), VT)) {
+ SDValue X = N0.getOperand(0);
+ SDValue Y = N0.getOperand(1);
+ unsigned SrcBits = X.getScalarValueSizeInBits();
+ unsigned DstBits = VT.getScalarSizeInBits();
+ APInt UpperBits = APInt::getBitsSetFrom(SrcBits, DstBits);
+ if (DAG.MaskedValueIsZero(X, UpperBits) &&
+ DAG.MaskedValueIsZero(Y, UpperBits)) {
+ SDValue Tx = DAG.getNode(ISD::TRUNCATE, DL, VT, X);
+ SDValue Ty = DAG.getNode(ISD::TRUNCATE, DL, VT, Y);
+ return DAG.getNode(N0.getOpcode(), DL, VT, Tx, Ty);
+ }
+ }
+ break;
+ case ISD::AVGCEILS:
+ case ISD::AVGFLOORS:
+ if (!LegalOperations && N0.hasOneUse() &&
+ TLI.isOperationLegal(N0.getOpcode(), VT)) {
+ SDValue X = N0.getOperand(0);
+ SDValue Y = N0.getOperand(1);
+ unsigned SrcBits = X.getScalarValueSizeInBits();
+ unsigned DstBits = VT.getScalarSizeInBits();
+ unsigned NeededSignBits = SrcBits - DstBits + 1;
+ if (DAG.ComputeNumSignBits(X) >= NeededSignBits &&
+ DAG.ComputeNumSignBits(Y) >= NeededSignBits) {
+ SDValue Tx = DAG.getNode(ISD::TRUNCATE, DL, VT, X);
+ SDValue Ty = DAG.getNode(ISD::TRUNCATE, DL, VT, Y);
+ return DAG.getNode(N0.getOpcode(), DL, VT, Tx, Ty);
+ }
+ }
+ break;
case ISD::ADD:
case ISD::SUB:
case ISD::MUL:
diff --git a/llvm/test/CodeGen/AArch64/trunc-avg-fold.ll b/llvm/test/CodeGen/AArch64/trunc-avg-fold.ll
new file mode 100644
index 0000000000000..54fcae4ba28b7
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/trunc-avg-fold.ll
@@ -0,0 +1,53 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-- -O2 -mattr=+neon < %s | FileCheck %s
+
+define <8 x i8> @avgceil_u_i8_to_i16(<8 x i8> %a, <8 x i8> %b) {
+; CHECK-LABEL: avgceil_u_i8_to_i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: urhadd v0.8b, v0.8b, v1.8b
+; CHECK-NEXT: ret
+ %a16 = zext <8 x i8> %a to <8 x i16>
+ %b16 = zext <8 x i8> %b to <8 x i16>
+ %avg16 = call <8 x i16> @llvm.aarch64.neon.urhadd.v8i16(<8 x i16> %a16, <8 x i16> %b16)
+ %r = trunc <8 x i16> %avg16 to <8 x i8>
+ ret <8 x i8> %r
+}
+
+
+define <8 x i8> @test_avgceil_s(<8 x i8> %a, <8 x i8> %b) {
+; CHECK-LABEL: test_avgceil_s:
+; CHECK: // %bb.0:
+; CHECK-NEXT: srhadd v0.8b, v0.8b, v1.8b
+; CHECK-NEXT: ret
+ %a16 = sext <8 x i8> %a to <8 x i16>
+ %b16 = sext <8 x i8> %b to <8 x i16>
+ %avg16 = call <8 x i16> @llvm.aarch64.neon.srhadd.v8i16(<8 x i16> %a16, <8 x i16> %b16)
+ %res = trunc <8 x i16> %avg16 to <8 x i8>
+ ret <8 x i8> %res
+}
+
+define <8 x i8> @avgfloor_u_i8_to_i16(<8 x i8> %a, <8 x i8> %b) {
+; CHECK-LABEL: avgfloor_u_i8_to_i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: uhadd v0.8b, v0.8b, v1.8b
+; CHECK-NEXT: ret
+ %a16 = zext <8 x i8> %a to <8 x i16>
+ %b16 = zext <8 x i8> %b to <8 x i16>
+ %avg16 = call <8 x i16> @llvm.aarch64.neon.uhadd.v8i16(<8 x i16> %a16, <8 x i16> %b16)
+ %res = trunc <8 x i16> %avg16 to <8 x i8>
+ ret <8 x i8> %res
+}
+
+define <8 x i8> @test_avgfloor_s(<8 x i8> %a, <8 x i8> %b) {
+; CHECK-LABEL: test_avgfloor_s:
+; CHECK: // %bb.0:
+; CHECK-NEXT: shadd v0.8b, v0.8b, v1.8b
+; CHECK-NEXT: ret
+ %a16 = sext <8 x i8> %a to <8 x i16>
+ %b16 = sext <8 x i8> %b to <8 x i16>
+ %avg16 = call <8 x i16> @llvm.aarch64.neon.shadd.v8i16(<8 x i16> %a16, <8 x i16> %b16)
+ %res = trunc <8 x i16> %avg16 to <8 x i8>
+ ret <8 x i8> %res
+}
+
+
>From 17f5f5ba55972d1078ca24861d12ea8ffbeef9e2 Mon Sep 17 00:00:00 2001
From: Aiden Grossman <aidengrossman at google.com>
Date: Mon, 18 Aug 2025 15:35:53 +0000
Subject: [PATCH 045/112] [X86] Avoid Register implicit int conversion
PushedRegisters in this patch needs to be of type int64_t because it is
grabbing registers from immediate operands of pseudo instructions.
However, we then compare to an actual register type later, which relies
on the implicit conversion within Register to int, which can result in
build failures in some configurations.
---
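A minimal sketch of the comparison this patch makes explicit. `RegLike` is a hypothetical stand-in written for this illustration, not llvm::Register; it only assumes a wrapper that exposes its numeric value through `id()` and offers no implicit conversion to an integer. `poppedMatchesPushed` is likewise an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a register wrapper; it is NOT llvm::Register.
// It deliberately has no implicit conversion to an integer type.
class RegLike {
  unsigned RegId;

public:
  constexpr explicit RegLike(unsigned Id) : RegId(Id) {}
  constexpr unsigned id() const { return RegId; }
};

// Pushed registers are stored as int64_t because they come from immediate
// operands; the popped register is compared through an explicit id() call.
bool poppedMatchesPushed(const std::vector<int64_t> &PushedRegs,
                         std::size_t PoppedRegCount, RegLike Reg) {
  if (PoppedRegCount == 0 || PoppedRegCount > PushedRegs.size())
    return false;
  return PushedRegs[PushedRegs.size() - PoppedRegCount] ==
         static_cast<int64_t>(Reg.id());
}
```

With the explicit `id()` call, the comparison no longer depends on which implicit conversions the wrapper type happens to provide in a given build configuration.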
llvm/lib/Target/X86/X86WinEHUnwindV2.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/llvm/lib/Target/X86/X86WinEHUnwindV2.cpp b/llvm/lib/Target/X86/X86WinEHUnwindV2.cpp
index 7640d7090949c..7fa77ee8204a9 100644
--- a/llvm/lib/Target/X86/X86WinEHUnwindV2.cpp
+++ b/llvm/lib/Target/X86/X86WinEHUnwindV2.cpp
@@ -235,7 +235,7 @@ bool X86WinEHUnwindV2::runOnMachineFunction(MachineFunction &MF) {
MF, Mode,
"The epilog is popping more registers than the prolog "
"pushed");
- if (PushedRegs[PushedRegs.size() - PoppedRegCount] != Reg)
+ if (PushedRegs[PushedRegs.size() - PoppedRegCount] != Reg.id())
return rejectCurrentFunctionInternalError(
MF, Mode,
"The epilog is popping a registers in "
>From 33761df961627f9d057fa049509fc8ba8baaaf78 Mon Sep 17 00:00:00 2001
From: Antonio Frighetto <me at antoniofrighetto.com>
Date: Mon, 18 Aug 2025 17:40:08 +0200
Subject: [PATCH 046/112] Revert "[SimpleLoopUnswitch] Record loops from
 unswitching non-trivial conditions"
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This reverts commit e9de32fd159d30cfd6fcc861b57b7e99ec2742ab due to
multiple performance regressions observed across downstream Numba
benchmarks (https://github.com/llvm/llvm-project/issues/138509#issuecomment-3193855772).
While avoiding non-trivial unswitches on newly-cloned loops helps
mitigate the pathological case reported in https://github.com/llvm/llvm-project/issues/138509,
it may also make the IR less friendly to vectorization and loop
canonicalization (in the reported test, the new specialized loops
previously contained no select with a loop-carried dependence),
which is why the approach is being reconsidered.
---
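A toy model of the growth that the updated test expectations in this revert encode (for example, 2^5 == 32 loop copies for five invariant conditions when the cost multiplier is disabled or relaxed). This is only a counting sketch under the assumption that every invariant condition is unswitched in every clone; `countLoopCopies` is a hypothetical helper and does not reflect the pass's cost model.

```cpp
#include <cassert>
#include <cstdint>

// Toy model: number of loop copies after fully unswitching a loop that
// contains NumInvariantConds independent loop-invariant conditions.
uint64_t countLoopCopies(unsigned NumInvariantConds) {
  if (NumInvariantConds == 0)
    return 1; // Nothing left to unswitch: a single loop remains.
  // Unswitching one condition clones the loop into a "true" and a "false"
  // version, each of which still carries the remaining conditions.
  return 2 * countLoopCopies(NumInvariantConds - 1);
}
```

In this model `countLoopCopies(5)` yields 32, matching the LOOP32 expectations restored in exponential-nontrivial-unswitch.ll.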
.../Transforms/Scalar/SimpleLoopUnswitch.cpp | 49 +++--
.../LICM/PR116813-memoryssa-outdated.ll | 2 +-
.../AArch64/block_scaling_decompr_8bit.ll | 6 +-
.../exponential-nontrivial-unswitch-nested.ll | 24 +--
...exponential-nontrivial-unswitch-nested2.ll | 6 +-
.../exponential-nontrivial-unswitch.ll | 33 +--
.../exponential-switch-unswitch.ll | 10 +-
.../Transforms/SimpleLoopUnswitch/guards.ll | 34 +--
.../inject-invariant-conditions.ll | 91 ++++-----
.../invalidate-block-and-loop-dispositions.ll | 26 ++-
.../nontrivial-unswitch-freeze.ll | 52 ++---
.../nontrivial-unswitch-select.ll | 115 +++++++----
.../SimpleLoopUnswitch/nontrivial-unswitch.ll | 193 +++++++++++-------
...al-unswitch-loop-and-block-dispositions.ll | 100 ++++++---
.../SimpleLoopUnswitch/partial-unswitch.ll | 74 ++++---
.../Transforms/SimpleLoopUnswitch/pr138509.ll | 49 -----
.../SimpleLoopUnswitch/update-scev-3.ll | 76 +++++--
17 files changed, 528 insertions(+), 412 deletions(-)
delete mode 100644 llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll
diff --git a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
index f6959ca209fd7..9b40fc03da6bb 100644
--- a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
+++ b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
@@ -2144,23 +2144,9 @@ void visitDomSubTree(DominatorTree &DT, BasicBlock *BB, CallableT Callable) {
void postUnswitch(Loop &L, LPMUpdater &U, StringRef LoopName,
bool CurrentLoopValid, bool PartiallyInvariant,
bool InjectedCondition, ArrayRef<Loop *> NewLoops) {
- auto RecordLoopAsUnswitched = [&](Loop *TargetLoop, StringRef Tag,
- StringRef DisableTag) {
- auto &Ctx = TargetLoop->getHeader()->getContext();
- MDNode *DisableMD = MDNode::get(Ctx, MDString::get(Ctx, DisableTag));
- MDNode *NewLoopID = makePostTransformationMetadata(
- Ctx, TargetLoop->getLoopID(), {Tag}, {DisableMD});
- TargetLoop->setLoopID(NewLoopID);
- };
-
- // If we performed a non-trivial unswitch, we have added new cloned loops.
- // Mark such newly-created loops as visited.
- if (!NewLoops.empty()) {
- for (Loop *NL : NewLoops)
- RecordLoopAsUnswitched(NL, "llvm.loop.unswitch.nontrivial",
- "llvm.loop.unswitch.nontrivial.disable");
+ // If we did a non-trivial unswitch, we have added new (cloned) loops.
+ if (!NewLoops.empty())
U.addSiblingLoops(NewLoops);
- }
// If the current loop remains valid, we should revisit it to catch any
// other unswitch opportunities. Otherwise, we need to mark it as deleted.
@@ -2168,12 +2154,24 @@ void postUnswitch(Loop &L, LPMUpdater &U, StringRef LoopName,
if (PartiallyInvariant) {
// Mark the new loop as partially unswitched, to avoid unswitching on
// the same condition again.
- RecordLoopAsUnswitched(&L, "llvm.loop.unswitch.partial",
- "llvm.loop.unswitch.partial.disable");
+ auto &Context = L.getHeader()->getContext();
+ MDNode *DisableUnswitchMD = MDNode::get(
+ Context,
+ MDString::get(Context, "llvm.loop.unswitch.partial.disable"));
+ MDNode *NewLoopID = makePostTransformationMetadata(
+ Context, L.getLoopID(), {"llvm.loop.unswitch.partial"},
+ {DisableUnswitchMD});
+ L.setLoopID(NewLoopID);
} else if (InjectedCondition) {
// Do the same for injection of invariant conditions.
- RecordLoopAsUnswitched(&L, "llvm.loop.unswitch.injection",
- "llvm.loop.unswitch.injection.disable");
+ auto &Context = L.getHeader()->getContext();
+ MDNode *DisableUnswitchMD = MDNode::get(
+ Context,
+ MDString::get(Context, "llvm.loop.unswitch.injection.disable"));
+ MDNode *NewLoopID = makePostTransformationMetadata(
+ Context, L.getLoopID(), {"llvm.loop.unswitch.injection"},
+ {DisableUnswitchMD});
+ L.setLoopID(NewLoopID);
} else
U.revisitCurrentLoop();
} else
@@ -2811,9 +2809,9 @@ static BranchInst *turnGuardIntoBranch(IntrinsicInst *GI, Loop &L,
}
/// Cost multiplier is a way to limit potentially exponential behavior
-/// of loop-unswitch. Cost is multiplied in proportion of 2^number of unswitch
-/// candidates available. Also consider the number of "sibling" loops with
-/// the idea of accounting for previous unswitches that already happened on this
+/// of loop-unswitch. Cost is multipied in proportion of 2^number of unswitch
+/// candidates available. Also accounting for the number of "sibling" loops with
+/// the idea to account for previous unswitches that already happened on this
/// cluster of loops. There was an attempt to keep this formula simple,
/// just enough to limit the worst case behavior. Even if it is not that simple
/// now it is still not an attempt to provide a detailed heuristic size
@@ -3509,9 +3507,8 @@ static bool unswitchBestCondition(Loop &L, DominatorTree &DT, LoopInfo &LI,
SmallVector<NonTrivialUnswitchCandidate, 4> UnswitchCandidates;
IVConditionInfo PartialIVInfo;
Instruction *PartialIVCondBranch = nullptr;
- if (!findOptionMDForLoop(&L, "llvm.loop.unswitch.nontrivial.disable"))
- collectUnswitchCandidates(UnswitchCandidates, PartialIVInfo,
- PartialIVCondBranch, L, LI, AA, MSSAU);
+ collectUnswitchCandidates(UnswitchCandidates, PartialIVInfo,
+ PartialIVCondBranch, L, LI, AA, MSSAU);
if (!findOptionMDForLoop(&L, "llvm.loop.unswitch.injection.disable"))
collectUnswitchCandidatesWithInjections(UnswitchCandidates, PartialIVInfo,
PartialIVCondBranch, L, DT, LI, AA,
diff --git a/llvm/test/Transforms/LICM/PR116813-memoryssa-outdated.ll b/llvm/test/Transforms/LICM/PR116813-memoryssa-outdated.ll
index 562701420f806..a040c3cc6947c 100644
--- a/llvm/test/Transforms/LICM/PR116813-memoryssa-outdated.ll
+++ b/llvm/test/Transforms/LICM/PR116813-memoryssa-outdated.ll
@@ -18,7 +18,7 @@ define i32 @foo(i1 %arg, ptr %arg1) {
; CHECK: [[BB1]]:
; CHECK-NEXT: [[UNSWITCHED_SELECT_US:%.*]] = phi ptr [ [[ARG1]], %[[BB0]] ]
; CHECK-NEXT: [[I3_US:%.*]] = call i32 [[UNSWITCHED_SELECT_US]]()
-; CHECK-NEXT: br i1 true, label %[[LOOP_US]], label %[[RET_SPLIT_US:.*]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: br i1 true, label %[[LOOP_US]], label %[[RET_SPLIT_US:.*]]
; CHECK: [[RET_SPLIT_US]]:
; CHECK-NEXT: [[I3_LCSSA_US:%.*]] = phi i32 [ [[I3_US]], %[[BB1]] ]
; CHECK-NEXT: br label %[[RET:.*]]
diff --git a/llvm/test/Transforms/PhaseOrdering/AArch64/block_scaling_decompr_8bit.ll b/llvm/test/Transforms/PhaseOrdering/AArch64/block_scaling_decompr_8bit.ll
index 05674b9efc39d..7175816963ed1 100644
--- a/llvm/test/Transforms/PhaseOrdering/AArch64/block_scaling_decompr_8bit.ll
+++ b/llvm/test/Transforms/PhaseOrdering/AArch64/block_scaling_decompr_8bit.ll
@@ -94,7 +94,7 @@ define dso_local noundef i32 @_Z33block_scaling_decompr_8bitjPK27compressed_data
; CHECK-NEXT: [[DST_ADDR_1]] = getelementptr inbounds nuw i8, ptr [[DST_ADDR_052]], i64 48
; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT58]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_END]], label %[[FOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_END]], label %[[FOR_BODY]], !llvm.loop [[LOOP4]]
; CHECK: [[FOR_END]]:
; CHECK-NEXT: ret i32 0
;
@@ -801,8 +801,6 @@ attributes #2 = { nocallback nofree nosync nounwind willreturn memory(none) }
!4 = distinct !{!4, !5}
!5 = !{!"llvm.loop.mustprogress"}
;.
-; CHECK: [[LOOP4]] = distinct !{[[LOOP4]], [[META5:![0-9]+]], [[META6:![0-9]+]]}
+; CHECK: [[LOOP4]] = distinct !{[[LOOP4]], [[META5:![0-9]+]]}
; CHECK: [[META5]] = !{!"llvm.loop.mustprogress"}
-; CHECK: [[META6]] = !{!"llvm.loop.unswitch.nontrivial.disable"}
-; CHECK: [[LOOP7]] = distinct !{[[LOOP7]], [[META5]]}
;.
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch-nested.ll b/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch-nested.ll
index 6f2833b4f4e76..f82d7309f6d07 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch-nested.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch-nested.ll
@@ -45,7 +45,7 @@
;
; RUN: opt < %s -enable-unswitch-cost-multiplier=false \
; RUN: -passes='loop-mssa(licm,simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | \
-; RUN: sort -b -k 1 | FileCheck %s --check-prefixes=LOOP6
+; RUN: sort -b -k 1 | FileCheck %s --check-prefixes=LOOP32
;
; Single loop nest, not unswitched
; LOOP1: Loop at depth 1 containing:
@@ -55,23 +55,23 @@
;
; Half unswitched loop nests, with unscaled4 and div1 it gets less depth1 loops unswitched
; since they have more cost.
-; LOOP-UNSCALE4-DIV1-COUNT-4: Loop at depth 1 containing:
-; LOOP-UNSCALE4-DIV1-COUNT-4: Loop at depth 2 containing:
-; LOOP-UNSCALE4-DIV1-COUNT-4: Loop at depth 3 containing:
+; LOOP-UNSCALE4-DIV1-COUNT-6: Loop at depth 1 containing:
+; LOOP-UNSCALE4-DIV1-COUNT-19: Loop at depth 2 containing:
+; LOOP-UNSCALE4-DIV1-COUNT-29: Loop at depth 3 containing:
; LOOP-UNSCALE4-DIV1-NOT: Loop at depth {{[0-9]+}} containing:
;
; Half unswitched loop nests, with unscaled4 and div2 it gets more depth1 loops unswitched
; as div2 kicks in.
-; LOOP-UNSCALE4-DIV2-COUNT-4: Loop at depth 1 containing:
-; LOOP-UNSCALE4-DIV2-COUNT-4: Loop at depth 2 containing:
-; LOOP-UNSCALE4-DIV2-COUNT-4: Loop at depth 3 containing:
+; LOOP-UNSCALE4-DIV2-COUNT-11: Loop at depth 1 containing:
+; LOOP-UNSCALE4-DIV2-COUNT-22: Loop at depth 2 containing:
+; LOOP-UNSCALE4-DIV2-COUNT-29: Loop at depth 3 containing:
; LOOP-UNSCALE4-DIV2-NOT: Loop at depth {{[0-9]+}} containing:
;
-; 6 loop nests, fully unswitched
-; LOOP6-COUNT-6: Loop at depth 1 containing:
-; LOOP6-COUNT-6: Loop at depth 2 containing:
-; LOOP6-COUNT-6: Loop at depth 3 containing:
-; LOOP6-NOT: Loop at depth {{[0-9]+}} containing:
+; 32 loop nests, fully unswitched
+; LOOP32-COUNT-32: Loop at depth 1 containing:
+; LOOP32-COUNT-32: Loop at depth 2 containing:
+; LOOP32-COUNT-32: Loop at depth 3 containing:
+; LOOP32-NOT: Loop at depth {{[0-9]+}} containing:
declare void @bar()
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch-nested2.ll b/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch-nested2.ll
index ab3b3d26d9975..63d2789da5a82 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch-nested2.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch-nested2.ll
@@ -60,7 +60,7 @@
;
; Half unswitched loop nests, with unscaled3 and div1 it gets less depth1 loops unswitched
; since they have more cost.
-; LOOP-UNSCALE3-DIV1-COUNT-2: Loop at depth 1 containing:
+; LOOP-UNSCALE3-DIV1-COUNT-4: Loop at depth 1 containing:
; LOOP-UNSCALE3-DIV1-NOT: Loop at depth 1 containing:
; LOOP-UNSCALE3-DIV1-COUNT-1: Loop at depth 2 containing:
; LOOP-UNSCALE3-DIV1-NOT: Loop at depth 2 containing:
@@ -69,7 +69,7 @@
;
; Half unswitched loop nests, with unscaled3 and div2 it gets more depth1 loops unswitched
; as div2 kicks in.
-; LOOP-UNSCALE3-DIV2-COUNT-2: Loop at depth 1 containing:
+; LOOP-UNSCALE3-DIV2-COUNT-6: Loop at depth 1 containing:
; LOOP-UNSCALE3-DIV2-NOT: Loop at depth 1 containing:
; LOOP-UNSCALE3-DIV2-COUNT-1: Loop at depth 2 containing:
; LOOP-UNSCALE3-DIV2-NOT: Loop at depth 2 containing:
@@ -77,7 +77,7 @@
; LOOP-UNSCALE3-DIV2-NOT: Loop at depth 3 containing:
;
; Maximally unswitched (copy of the outer loop per each condition)
-; LOOP-MAX-COUNT-2: Loop at depth 1 containing:
+; LOOP-MAX-COUNT-6: Loop at depth 1 containing:
; LOOP-MAX-NOT: Loop at depth 1 containing:
; LOOP-MAX-COUNT-1: Loop at depth 2 containing:
; LOOP-MAX-NOT: Loop at depth 2 containing:
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch.ll b/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch.ll
index 7515cbbcbf1df..a2a745f46bca7 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/exponential-nontrivial-unswitch.ll
@@ -25,37 +25,46 @@
;
; RUN: opt < %s -enable-unswitch-cost-multiplier=true \
; RUN: -unswitch-num-initial-unscaled-candidates=8 -unswitch-siblings-toplevel-div=1 \
-; RUN: -passes='loop(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP4
+; RUN: -passes='loop(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP5
;
; RUN: opt < %s -enable-unswitch-cost-multiplier=true \
; RUN: -unswitch-num-initial-unscaled-candidates=8 -unswitch-siblings-toplevel-div=1 \
-; RUN: -passes='loop-mssa(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP4
+; RUN: -passes='loop-mssa(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP5
+;
+; With relaxed candidates multiplier (unscaled candidates == 8) and with relaxed
+; siblings multiplier for top-level loops (toplevel-div == 8) we should get
+; 2^(num conds) == 2^5 == 32
+; copies of the loop:
;
; RUN: opt < %s -enable-unswitch-cost-multiplier=true \
; RUN: -unswitch-num-initial-unscaled-candidates=8 -unswitch-siblings-toplevel-div=8 \
-; RUN: -passes='loop(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP6
+; RUN: -passes='loop(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP32
;
; RUN: opt < %s -enable-unswitch-cost-multiplier=true \
; RUN: -unswitch-num-initial-unscaled-candidates=8 -unswitch-siblings-toplevel-div=8 \
-; RUN: -passes='loop-mssa(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP6
+; RUN: -passes='loop-mssa(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP32
+;
+; Similarly get
+; 2^(num conds) == 2^5 == 32
+; copies of the loop when cost multiplier is disabled:
;
; RUN: opt < %s -enable-unswitch-cost-multiplier=false \
-; RUN: -passes='loop(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP6
+; RUN: -passes='loop(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP32
;
; RUN: opt < %s -enable-unswitch-cost-multiplier=false \
-; RUN: -passes='loop-mssa(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP6
+; RUN: -passes='loop-mssa(simple-loop-unswitch<nontrivial>),print<loops>' -disable-output 2>&1 | FileCheck %s --check-prefixes=LOOP32
;
; Single loop, not unswitched
; LOOP1: Loop at depth 1 containing:
; LOOP1-NOT: Loop at depth 1 containing:
-; 4 loops, unswitched 4 times
-; LOOP4-COUNT-4: Loop at depth 1 containing:
-; LOOP4-NOT: Loop at depth 1 containing:
+; 5 loops, unswitched 4 times
+; LOOP5-COUNT-5: Loop at depth 1 containing:
+; LOOP5-NOT: Loop at depth 1 containing:
-; 6 loops, fully unswitched
-; LOOP6-COUNT-6: Loop at depth 1 containing:
-; LOOP6-NOT: Loop at depth 1 containing:
+; 32 loops, fully unswitched
+; LOOP32-COUNT-32: Loop at depth 1 containing:
+; LOOP32-NOT: Loop at depth 1 containing:
define void @loop_simple5(ptr %addr, i1 %c1, i1 %c2, i1 %c3, i1 %c4, i1 %c5) {
entry:
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/exponential-switch-unswitch.ll b/llvm/test/Transforms/SimpleLoopUnswitch/exponential-switch-unswitch.ll
index 846a7793b6c37..96fe899d69c3b 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/exponential-switch-unswitch.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/exponential-switch-unswitch.ll
@@ -61,19 +61,19 @@
; Somewhat relaxed restrictions on candidates:
; LOOP-RELAX-COUNT-5: Loop at depth 1 containing:
; LOOP-RELAX-NOT: Loop at depth 1 containing:
-; LOOP-RELAX-COUNT-5: Loop at depth 2 containing:
+; LOOP-RELAX-COUNT-32: Loop at depth 2 containing:
; LOOP-RELAX-NOT: Loop at depth 2 containing:
;
; Even more relaxed restrictions on candidates and siblings.
-; LOOP-RELAX2-COUNT-5: Loop at depth 1 containing:
+; LOOP-RELAX2-COUNT-11: Loop at depth 1 containing:
; LOOP-RELAX2-NOT: Loop at depth 1 containing:
-; LOOP-RELAX2-COUNT-5: Loop at depth 2 containing:
+; LOOP-RELAX2-COUNT-40: Loop at depth 2 containing:
; LOOP-RELAX-NOT: Loop at depth 2 containing:
;
; Unswitched as much as it could (with multiplier disabled).
-; LOOP-MAX-COUNT-6: Loop at depth 1 containing:
+; LOOP-MAX-COUNT-56: Loop at depth 1 containing:
; LOOP-MAX-NOT: Loop at depth 1 containing:
-; LOOP-MAX-COUNT-11: Loop at depth 2 containing:
+; LOOP-MAX-COUNT-111: Loop at depth 2 containing:
; LOOP-MAX-NOT: Loop at depth 2 containing:
define i32 @loop_switch(ptr %addr, i32 %c1, i32 %c2) {
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/guards.ll b/llvm/test/Transforms/SimpleLoopUnswitch/guards.ll
index c77e7cce77a9c..533b1f691f5ad 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/guards.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/guards.ll
@@ -38,25 +38,25 @@ exit:
}
define void @test_two_guards(i1 %cond1, i1 %cond2, i32 %N) {
-; CHECK-LABEL: define void @test_two_guards(i1 %cond1, i1 %cond2, i32 %N) {
+; CHECK-LABEL: @test_two_guards(
; CHECK-NEXT: entry:
-; CHECK-NEXT: br i1 %cond1, label %entry.split.us, label %entry.split
+; CHECK-NEXT: br i1 [[COND1:%.*]], label [[ENTRY_SPLIT_US:%.*]], label [[ENTRY_SPLIT:%.*]]
; CHECK: entry.split.us:
-; CHECK-NEXT: br label %loop.us
-; CHECK: loop.us:
-; CHECK-NEXT: %iv.us = phi i32 [ 0, %entry.split.us ], [ %iv.next.us, %guarded.us ]
-; CHECK-NEXT: br label %guarded.us
-; CHECK: guarded.us:
-; CHECK-NEXT: call void (i1, ...) @llvm.experimental.guard(i1 %cond2) [ "deopt"() ]
-; CHECK-NEXT: %iv.next.us = add i32 %iv.us, 1
-; CHECK-NEXT: %loop.cond.us = icmp slt i32 %iv.next.us, %N
-; CHECK-NEXT: br i1 %loop.cond.us, label %loop.us, label %exit.split.us, !llvm.loop !2
-; CHECK: exit.split.us:
-; CHECK-NEXT: br label %exit
-; CHECK: entry.split:
-; CHECK-NEXT: br label %loop
-; CHECK: loop:
-; CHECK-NEXT: br label %deopt
+; CHECK-NEXT: br i1 [[COND2:%.*]], label [[ENTRY_SPLIT_US_SPLIT_US:%.*]], label [[ENTRY_SPLIT_US_SPLIT:%.*]]
+; CHECK: entry.split.us.split.us:
+; CHECK-NEXT: br label [[LOOP_US_US:%.*]]
+; CHECK: loop.us.us:
+; CHECK-NEXT: [[IV_US_US:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_US_SPLIT_US]] ], [ [[IV_NEXT_US_US:%.*]], [[GUARDED_US2:%.*]] ]
+; CHECK-NEXT: br label [[GUARDED_US_US:%.*]]
+; CHECK: guarded.us.us:
+; CHECK-NEXT: br label [[GUARDED_US2]]
+; CHECK: guarded.us2:
+; CHECK-NEXT: [[IV_NEXT_US_US]] = add i32 [[IV_US_US]], 1
+; CHECK-NEXT: [[LOOP_COND_US_US:%.*]] = icmp slt i32 [[IV_NEXT_US_US]], [[N:%.*]]
+; CHECK-NEXT: br i1 [[LOOP_COND_US_US]], label [[LOOP_US_US]], label [[EXIT_SPLIT_US_SPLIT_US:%.*]]
+; CHECK: deopt1:
+; CHECK-NEXT: call void (i1, ...) @llvm.experimental.guard(i1 false) [ "deopt"() ]
+; CHECK-NEXT: unreachable
; CHECK: deopt:
; CHECK-NEXT: call void (i1, ...) @llvm.experimental.guard(i1 false) [ "deopt"() ]
; CHECK-NEXT: unreachable
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/inject-invariant-conditions.ll b/llvm/test/Transforms/SimpleLoopUnswitch/inject-invariant-conditions.ll
index 3dc83203f1490..536e0c6a0e74a 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/inject-invariant-conditions.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/inject-invariant-conditions.ll
@@ -5,7 +5,7 @@
define i32 @test_01(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noundef %arr, ptr noundef %x_p) {
; CHECK-LABEL: @test_01(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef [[META0:![0-9]+]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef !0
; CHECK-NEXT: [[INJECTED_COND:%.*]] = icmp ule i32 [[LIMIT:%.*]], [[X]]
; CHECK-NEXT: br i1 [[INJECTED_COND]], label [[LOOP_US:%.*]], label [[LOOP:%.*]]
; CHECK: loop.us:
@@ -20,7 +20,7 @@ define i32 @test_01(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noun
; CHECK-NEXT: store i32 [[IV_US]], ptr [[ARR_PTR_US]], align 4
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
; CHECK-NEXT: [[LOOP_COND_US:%.*]] = icmp slt i32 [[IV_NEXT_US]], [[N:%.*]]
-; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ], [ 0, [[ENTRY]] ]
; CHECK-NEXT: [[EL_PTR:%.*]] = getelementptr i32, ptr [[P]], i32 [[IV]]
@@ -35,7 +35,7 @@ define i32 @test_01(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noun
; CHECK-NEXT: store i32 [[IV]], ptr [[ARR_PTR]], align 4
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
; CHECK-NEXT: [[LOOP_COND:%.*]] = icmp slt i32 [[IV_NEXT]], [[N]]
-; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP2:![0-9]+]]
; CHECK: common.ret:
; CHECK-NEXT: [[COMMON_RET_OP:%.*]] = phi i32 [ 0, [[BACKEDGE]] ], [ 0, [[GUARDED_US]] ], [ -1, [[LOOP]] ], [ -1, [[LOOP_US]] ], [ -2, [[GUARDED]] ]
; CHECK-NEXT: ret i32 [[COMMON_RET_OP]]
@@ -76,7 +76,7 @@ range_check_failed: ; preds = %guarded
define i32 @test_01_neg_void_profile(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noundef %arr, ptr noundef %x_p) {
; CHECK-LABEL: @test_01_neg_void_profile(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef [[META0]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef !0
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
@@ -133,7 +133,7 @@ range_check_failed: ; preds = %guarded
define i32 @test_01_constants(ptr noundef %p, ptr noundef %arr, ptr noundef %x_p) {
; CHECK-LABEL: @test_01_constants(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef [[META0]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef !0
; CHECK-NEXT: [[INJECTED_COND:%.*]] = icmp ule i32 200, 300
; CHECK-NEXT: br i1 [[INJECTED_COND]], label [[LOOP_US:%.*]], label [[LOOP:%.*]]
; CHECK: loop.us:
@@ -148,7 +148,7 @@ define i32 @test_01_constants(ptr noundef %p, ptr noundef %arr, ptr noundef %x_p
; CHECK-NEXT: store i32 [[IV_US]], ptr [[ARR_PTR_US]], align 4
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
; CHECK-NEXT: [[LOOP_COND_US:%.*]] = icmp slt i32 [[IV_NEXT_US]], 1000
-; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ], [ 0, [[ENTRY]] ]
; CHECK-NEXT: [[EL_PTR:%.*]] = getelementptr i32, ptr [[P]], i32 [[IV]]
@@ -160,7 +160,7 @@ define i32 @test_01_constants(ptr noundef %p, ptr noundef %arr, ptr noundef %x_p
; CHECK-NEXT: store i32 [[IV]], ptr [[ARR_PTR]], align 4
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
; CHECK-NEXT: [[LOOP_COND:%.*]] = icmp slt i32 [[IV_NEXT]], 1000
-; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP7:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: common.ret:
; CHECK-NEXT: [[COMMON_RET_OP:%.*]] = phi i32 [ 0, [[BACKEDGE]] ], [ 0, [[GUARDED_US]] ], [ -1, [[LOOP]] ], [ -1, [[LOOP_US]] ]
; CHECK-NEXT: ret i32 [[COMMON_RET_OP]]
@@ -200,7 +200,7 @@ range_check_failed: ; preds = %guarded
define i32 @test_01_neg_degenerate_profile(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noundef %arr, ptr noundef %x_p) {
; CHECK-LABEL: @test_01_neg_degenerate_profile(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef [[META0]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef !0
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
@@ -210,7 +210,7 @@ define i32 @test_01_neg_degenerate_profile(ptr noundef %p, i32 noundef %n, i32 n
; CHECK-NEXT: br i1 [[BOUND_CHECK]], label [[GUARDED:%.*]], label [[COMMON_RET:%.*]], !prof [[PROF1]]
; CHECK: guarded:
; CHECK-NEXT: [[RANGE_CHECK:%.*]] = icmp ult i32 [[EL]], [[X]]
-; CHECK-NEXT: br i1 [[RANGE_CHECK]], label [[BACKEDGE]], label [[COMMON_RET]], !prof [[PROF8:![0-9]+]]
+; CHECK-NEXT: br i1 [[RANGE_CHECK]], label [[BACKEDGE]], label [[COMMON_RET]], !prof [[PROF5:![0-9]+]]
; CHECK: backedge:
; CHECK-NEXT: [[ARR_PTR:%.*]] = getelementptr i32, ptr [[ARR:%.*]], i32 [[EL]]
; CHECK-NEXT: store i32 [[IV]], ptr [[ARR_PTR]], align 4
@@ -257,7 +257,7 @@ range_check_failed: ; preds = %guarded
define i32 @test_01_neg_cold(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noundef %arr, ptr noundef %x_p) {
; CHECK-LABEL: @test_01_neg_cold(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef [[META0]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef !0
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
@@ -267,7 +267,7 @@ define i32 @test_01_neg_cold(ptr noundef %p, i32 noundef %n, i32 noundef %limit,
; CHECK-NEXT: br i1 [[BOUND_CHECK]], label [[GUARDED:%.*]], label [[COMMON_RET:%.*]], !prof [[PROF1]]
; CHECK: guarded:
; CHECK-NEXT: [[RANGE_CHECK:%.*]] = icmp ult i32 [[EL]], [[X]]
-; CHECK-NEXT: br i1 [[RANGE_CHECK]], label [[BACKEDGE]], label [[COMMON_RET]], !prof [[PROF9:![0-9]+]]
+; CHECK-NEXT: br i1 [[RANGE_CHECK]], label [[BACKEDGE]], label [[COMMON_RET]], !prof [[PROF6:![0-9]+]]
; CHECK: backedge:
; CHECK-NEXT: [[ARR_PTR:%.*]] = getelementptr i32, ptr [[ARR:%.*]], i32 [[EL]]
; CHECK-NEXT: store i32 [[IV]], ptr [[ARR_PTR]], align 4
@@ -314,17 +314,17 @@ range_check_failed: ; preds = %guarded
define i32 @test_01_neg_overflowing_metadata(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noundef %arr, ptr noundef %x_p) {
; CHECK-LABEL: @test_01_neg_overflowing_metadata(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef [[META0]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef !0
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
; CHECK-NEXT: [[EL_PTR:%.*]] = getelementptr i32, ptr [[P:%.*]], i32 [[IV]]
; CHECK-NEXT: [[EL:%.*]] = load i32, ptr [[EL_PTR]], align 4
; CHECK-NEXT: [[BOUND_CHECK:%.*]] = icmp ult i32 [[EL]], [[LIMIT:%.*]]
-; CHECK-NEXT: br i1 [[BOUND_CHECK]], label [[GUARDED:%.*]], label [[COMMON_RET:%.*]], !prof [[PROF10:![0-9]+]]
+; CHECK-NEXT: br i1 [[BOUND_CHECK]], label [[GUARDED:%.*]], label [[COMMON_RET:%.*]], !prof [[PROF7:![0-9]+]]
; CHECK: guarded:
; CHECK-NEXT: [[RANGE_CHECK:%.*]] = icmp ult i32 [[EL]], [[X]]
-; CHECK-NEXT: br i1 [[RANGE_CHECK]], label [[BACKEDGE]], label [[COMMON_RET]], !prof [[PROF10]]
+; CHECK-NEXT: br i1 [[RANGE_CHECK]], label [[BACKEDGE]], label [[COMMON_RET]], !prof [[PROF7]]
; CHECK: backedge:
; CHECK-NEXT: [[ARR_PTR:%.*]] = getelementptr i32, ptr [[ARR:%.*]], i32 [[EL]]
; CHECK-NEXT: store i32 [[IV]], ptr [[ARR_PTR]], align 4
@@ -371,7 +371,7 @@ range_check_failed: ; preds = %guarded
define i32 @test_02(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noundef %arr, ptr noundef %x_p) {
; CHECK-LABEL: @test_02(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef [[META0]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef !0
; CHECK-NEXT: [[INJECTED_COND:%.*]] = icmp ule i32 -2147483648, [[X]]
; CHECK-NEXT: br i1 [[INJECTED_COND]], label [[LOOP_US:%.*]], label [[LOOP:%.*]]
; CHECK: loop.us:
@@ -386,7 +386,7 @@ define i32 @test_02(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noun
; CHECK-NEXT: store i32 [[IV_US]], ptr [[ARR_PTR_US]], align 4
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
; CHECK-NEXT: [[LOOP_COND_US:%.*]] = icmp slt i32 [[IV_NEXT_US]], [[N:%.*]]
-; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]], !llvm.loop [[LOOP11:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ], [ 0, [[ENTRY]] ]
; CHECK-NEXT: [[EL_PTR:%.*]] = getelementptr i32, ptr [[P]], i32 [[IV]]
@@ -401,7 +401,7 @@ define i32 @test_02(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noun
; CHECK-NEXT: store i32 [[IV]], ptr [[ARR_PTR]], align 4
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
; CHECK-NEXT: [[LOOP_COND:%.*]] = icmp slt i32 [[IV_NEXT]], [[N]]
-; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP12:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP8:![0-9]+]]
; CHECK: common.ret:
; CHECK-NEXT: [[COMMON_RET_OP:%.*]] = phi i32 [ 0, [[BACKEDGE]] ], [ 0, [[GUARDED_US]] ], [ -1, [[LOOP]] ], [ -1, [[LOOP_US]] ], [ -2, [[GUARDED]] ]
; CHECK-NEXT: ret i32 [[COMMON_RET_OP]]
@@ -441,7 +441,7 @@ range_check_failed: ; preds = %guarded
define i32 @test_02_inverse(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noundef %arr, ptr noundef %x_p) {
; CHECK-LABEL: @test_02_inverse(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef [[META0]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef !0
; CHECK-NEXT: [[INJECTED_COND:%.*]] = icmp ule i32 -2147483648, [[X]]
; CHECK-NEXT: br i1 [[INJECTED_COND]], label [[LOOP_US:%.*]], label [[LOOP:%.*]]
; CHECK: loop.us:
@@ -456,7 +456,7 @@ define i32 @test_02_inverse(ptr noundef %p, i32 noundef %n, i32 noundef %limit,
; CHECK-NEXT: store i32 [[IV_US]], ptr [[ARR_PTR_US]], align 4
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
; CHECK-NEXT: [[LOOP_COND_US:%.*]] = icmp slt i32 [[IV_NEXT_US]], [[N:%.*]]
-; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]], !llvm.loop [[LOOP13:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ], [ 0, [[ENTRY]] ]
; CHECK-NEXT: [[EL_PTR:%.*]] = getelementptr i32, ptr [[P]], i32 [[IV]]
@@ -471,7 +471,7 @@ define i32 @test_02_inverse(ptr noundef %p, i32 noundef %n, i32 noundef %limit,
; CHECK-NEXT: store i32 [[IV]], ptr [[ARR_PTR]], align 4
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
; CHECK-NEXT: [[LOOP_COND:%.*]] = icmp slt i32 [[IV_NEXT]], [[N]]
-; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP14:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP9:![0-9]+]]
; CHECK: common.ret:
; CHECK-NEXT: [[COMMON_RET_OP:%.*]] = phi i32 [ 0, [[BACKEDGE]] ], [ 0, [[GUARDED_US]] ], [ -1, [[LOOP]] ], [ -1, [[LOOP_US]] ], [ -2, [[GUARDED]] ]
; CHECK-NEXT: ret i32 [[COMMON_RET_OP]]
@@ -511,7 +511,7 @@ range_check_failed: ; preds = %guarded
define i32 @test_03(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noundef %arr, ptr noundef %x_p) {
; CHECK-LABEL: @test_03(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef [[META0]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef !0
; CHECK-NEXT: [[INJECTED_COND:%.*]] = icmp ule i32 -2147483648, [[X]]
; CHECK-NEXT: br i1 [[INJECTED_COND]], label [[LOOP_US:%.*]], label [[LOOP:%.*]]
; CHECK: loop.us:
@@ -519,20 +519,20 @@ define i32 @test_03(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noun
; CHECK-NEXT: [[EL_PTR_US:%.*]] = getelementptr i32, ptr [[P:%.*]], i32 [[IV_US]]
; CHECK-NEXT: [[EL_US:%.*]] = load i32, ptr [[EL_PTR_US]], align 4
; CHECK-NEXT: [[BOUND_CHECK_US:%.*]] = icmp slt i32 [[EL_US]], 0
-; CHECK-NEXT: br i1 [[BOUND_CHECK_US]], label [[COMMON_RET:%.*]], label [[GUARDED_US]], !prof [[PROF15:![0-9]+]]
+; CHECK-NEXT: br i1 [[BOUND_CHECK_US]], label [[COMMON_RET:%.*]], label [[GUARDED_US]], !prof [[PROF10:![0-9]+]]
; CHECK: guarded.us:
; CHECK-NEXT: [[RANGE_CHECK_US:%.*]] = icmp ult i32 [[EL_US]], [[X]]
; CHECK-NEXT: [[ARR_PTR_US:%.*]] = getelementptr i32, ptr [[ARR:%.*]], i32 [[EL_US]]
; CHECK-NEXT: store i32 [[IV_US]], ptr [[ARR_PTR_US]], align 4
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
; CHECK-NEXT: [[LOOP_COND_US:%.*]] = icmp slt i32 [[IV_NEXT_US]], [[N:%.*]]
-; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]], !llvm.loop [[LOOP16:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ], [ 0, [[ENTRY]] ]
; CHECK-NEXT: [[EL_PTR:%.*]] = getelementptr i32, ptr [[P]], i32 [[IV]]
; CHECK-NEXT: [[EL:%.*]] = load i32, ptr [[EL_PTR]], align 4
; CHECK-NEXT: [[BOUND_CHECK:%.*]] = icmp slt i32 [[EL]], 0
-; CHECK-NEXT: br i1 [[BOUND_CHECK]], label [[COMMON_RET]], label [[GUARDED:%.*]], !prof [[PROF15]]
+; CHECK-NEXT: br i1 [[BOUND_CHECK]], label [[COMMON_RET]], label [[GUARDED:%.*]], !prof [[PROF10]]
; CHECK: guarded:
; CHECK-NEXT: [[RANGE_CHECK:%.*]] = icmp ult i32 [[EL]], [[X]]
; CHECK-NEXT: br i1 [[RANGE_CHECK]], label [[BACKEDGE]], label [[COMMON_RET]]
@@ -541,7 +541,7 @@ define i32 @test_03(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noun
; CHECK-NEXT: store i32 [[IV]], ptr [[ARR_PTR]], align 4
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
; CHECK-NEXT: [[LOOP_COND:%.*]] = icmp slt i32 [[IV_NEXT]], [[N]]
-; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP17:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP11:![0-9]+]]
; CHECK: common.ret:
; CHECK-NEXT: [[COMMON_RET_OP:%.*]] = phi i32 [ 0, [[BACKEDGE]] ], [ 0, [[GUARDED_US]] ], [ -1, [[LOOP]] ], [ -1, [[LOOP_US]] ], [ -2, [[GUARDED]] ]
; CHECK-NEXT: ret i32 [[COMMON_RET_OP]]
@@ -581,7 +581,7 @@ range_check_failed: ; preds = %guarded
define i32 @test_04(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noundef %arr, ptr noundef %x_p) {
; CHECK-LABEL: @test_04(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef [[META0]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[X_P:%.*]], align 4, !noundef !0
; CHECK-NEXT: [[INJECTED_COND:%.*]] = icmp ule i32 128, [[X]]
; CHECK-NEXT: br i1 [[INJECTED_COND]], label [[LOOP_US:%.*]], label [[LOOP:%.*]]
; CHECK: loop.us:
@@ -589,7 +589,7 @@ define i32 @test_04(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noun
; CHECK-NEXT: [[EL_PTR_US:%.*]] = getelementptr i8, ptr [[P:%.*]], i32 [[IV_US]]
; CHECK-NEXT: [[EL_US:%.*]] = load i8, ptr [[EL_PTR_US]], align 4
; CHECK-NEXT: [[BOUND_CHECK_US:%.*]] = icmp slt i8 [[EL_US]], 0
-; CHECK-NEXT: br i1 [[BOUND_CHECK_US]], label [[COMMON_RET:%.*]], label [[GUARDED_US]], !prof [[PROF15]]
+; CHECK-NEXT: br i1 [[BOUND_CHECK_US]], label [[COMMON_RET:%.*]], label [[GUARDED_US]], !prof [[PROF10]]
; CHECK: guarded.us:
; CHECK-NEXT: [[EL_WIDE_US:%.*]] = zext i8 [[EL_US]] to i32
; CHECK-NEXT: [[RANGE_CHECK_US:%.*]] = icmp ult i32 [[EL_WIDE_US]], [[X]]
@@ -597,13 +597,13 @@ define i32 @test_04(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noun
; CHECK-NEXT: store i32 [[IV_US]], ptr [[ARR_PTR_US]], align 4
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
; CHECK-NEXT: [[LOOP_COND_US:%.*]] = icmp slt i32 [[IV_NEXT_US]], [[N:%.*]]
-; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]], !llvm.loop [[LOOP18:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND_US]], label [[LOOP_US]], label [[COMMON_RET]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ], [ 0, [[ENTRY]] ]
; CHECK-NEXT: [[EL_PTR:%.*]] = getelementptr i8, ptr [[P]], i32 [[IV]]
; CHECK-NEXT: [[EL:%.*]] = load i8, ptr [[EL_PTR]], align 4
; CHECK-NEXT: [[BOUND_CHECK:%.*]] = icmp slt i8 [[EL]], 0
-; CHECK-NEXT: br i1 [[BOUND_CHECK]], label [[COMMON_RET]], label [[GUARDED:%.*]], !prof [[PROF15]]
+; CHECK-NEXT: br i1 [[BOUND_CHECK]], label [[COMMON_RET]], label [[GUARDED:%.*]], !prof [[PROF10]]
; CHECK: guarded:
; CHECK-NEXT: [[EL_WIDE:%.*]] = zext i8 [[EL]] to i32
; CHECK-NEXT: [[RANGE_CHECK:%.*]] = icmp ult i32 [[EL_WIDE]], [[X]]
@@ -613,7 +613,7 @@ define i32 @test_04(ptr noundef %p, i32 noundef %n, i32 noundef %limit, ptr noun
; CHECK-NEXT: store i32 [[IV]], ptr [[ARR_PTR]], align 4
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
; CHECK-NEXT: [[LOOP_COND:%.*]] = icmp slt i32 [[IV_NEXT]], [[N]]
-; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP19:![0-9]+]]
+; CHECK-NEXT: br i1 [[LOOP_COND]], label [[LOOP]], label [[COMMON_RET]], !llvm.loop [[LOOP12:![0-9]+]]
; CHECK: common.ret:
; CHECK-NEXT: [[COMMON_RET_OP:%.*]] = phi i32 [ 0, [[BACKEDGE]] ], [ 0, [[GUARDED_US]] ], [ -1, [[LOOP]] ], [ -1, [[LOOP_US]] ], [ -2, [[GUARDED]] ]
; CHECK-NEXT: ret i32 [[COMMON_RET_OP]]
@@ -651,24 +651,17 @@ range_check_failed: ; preds = %guarded
ret i32 -2
}
;.
-; CHECK: [[META0]] = !{}
+; CHECK: [[META0:![0-9]+]] = !{}
; CHECK: [[PROF1]] = !{!"branch_weights", i32 100, i32 1}
-; CHECK: [[LOOP2]] = distinct !{[[LOOP2]], [[META3:![0-9]+]]}
-; CHECK: [[META3]] = !{!"llvm.loop.unswitch.nontrivial.disable"}
-; CHECK: [[LOOP4]] = distinct !{[[LOOP4]], [[META5:![0-9]+]]}
-; CHECK: [[META5]] = !{!"llvm.loop.unswitch.injection.disable"}
-; CHECK: [[LOOP6]] = distinct !{[[LOOP6]], [[META3]]}
-; CHECK: [[LOOP7]] = distinct !{[[LOOP7]], [[META5]]}
-; CHECK: [[PROF8]] = !{!"branch_weights", i32 0, i32 0}
-; CHECK: [[PROF9]] = !{!"branch_weights", i32 2, i32 3}
-; CHECK: [[PROF10]] = !{!"branch_weights", i32 -1, i32 -1000}
-; CHECK: [[LOOP11]] = distinct !{[[LOOP11]], [[META3]]}
-; CHECK: [[LOOP12]] = distinct !{[[LOOP12]], [[META5]]}
-; CHECK: [[LOOP13]] = distinct !{[[LOOP13]], [[META3]]}
-; CHECK: [[LOOP14]] = distinct !{[[LOOP14]], [[META5]]}
-; CHECK: [[PROF15]] = !{!"branch_weights", i32 1, i32 100}
-; CHECK: [[LOOP16]] = distinct !{[[LOOP16]], [[META3]]}
-; CHECK: [[LOOP17]] = distinct !{[[LOOP17]], [[META5]]}
-; CHECK: [[LOOP18]] = distinct !{[[LOOP18]], [[META3]]}
-; CHECK: [[LOOP19]] = distinct !{[[LOOP19]], [[META5]]}
+; CHECK: [[LOOP2]] = distinct !{!2, !3}
+; CHECK: [[META3:![0-9]+]] = !{!"llvm.loop.unswitch.injection.disable"}
+; CHECK: [[LOOP4]] = distinct !{!4, !3}
+; CHECK: [[PROF5]] = !{!"branch_weights", i32 0, i32 0}
+; CHECK: [[PROF6]] = !{!"branch_weights", i32 2, i32 3}
+; CHECK: [[PROF7]] = !{!"branch_weights", i32 -1, i32 -1000}
+; CHECK: [[LOOP8]] = distinct !{!8, !3}
+; CHECK: [[LOOP9]] = distinct !{!9, !3}
+; CHECK: [[PROF10]] = !{!"branch_weights", i32 1, i32 100}
+; CHECK: [[LOOP11]] = distinct !{!11, !3}
+; CHECK: [[LOOP12]] = distinct !{!12, !3}
;.
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/invalidate-block-and-loop-dispositions.ll b/llvm/test/Transforms/SimpleLoopUnswitch/invalidate-block-and-loop-dispositions.ll
index 5f713fae9e964..fcef88667449f 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/invalidate-block-and-loop-dispositions.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/invalidate-block-and-loop-dispositions.ll
@@ -14,17 +14,27 @@ define void @test_pr58136(i1 %c.1, i1 %c.2) {
; CHECK-NEXT: [[C_1_FR:%.*]] = freeze i1 [[C_1:%.*]]
; CHECK-NEXT: br i1 [[C_1_FR]], label [[ENTRY_SPLIT_US:%.*]], label [[ENTRY_SPLIT:%.*]]
; CHECK: entry.split.us:
+; CHECK-NEXT: [[C_2_FR:%.*]] = freeze i1 [[C_2:%.*]]
+; CHECK-NEXT: br i1 [[C_2_FR]], label [[ENTRY_SPLIT_US_SPLIT_US:%.*]], label [[ENTRY_SPLIT_US_SPLIT:%.*]]
+; CHECK: entry.split.us.split.us:
; CHECK-NEXT: br label [[LOOP_HEADER_US_US:%.*]]
-; CHECK: loop.header.us:
-; CHECK-NEXT: [[MUL1_US_US:%.*]] = phi i16 [ [[MUL_US_US:%.*]], [[LOOP_LATCH_US:%.*]] ], [ [[GLOB_PROMOTED]], [[ENTRY_SPLIT_US]] ]
+; CHECK: loop.header.us.us:
+; CHECK-NEXT: [[MUL1_US_US:%.*]] = phi i16 [ [[MUL_US_US:%.*]], [[LOOP_LATCH_US_US:%.*]] ], [ [[GLOB_PROMOTED]], [[ENTRY_SPLIT_US_SPLIT_US]] ]
; CHECK-NEXT: [[CALL2_US_US:%.*]] = call i16 @foo()
-; CHECK-NEXT: br label [[LOOP_LATCH_US_US:%.*]]
-; CHECK: then.bb.us:
-; CHECK-NEXT: br i1 [[C_2:%.*]], label [[LOOP_LATCH_US]], label [[EXIT_SPLIT_US:%.*]]
-; CHECK: loop.latch.us:
+; CHECK-NEXT: br label [[THEN_BB_US_US:%.*]]
+; CHECK: then.bb.us.us:
+; CHECK-NEXT: br label [[LOOP_LATCH_US_US]]
+; CHECK: loop.latch.us.us:
; CHECK-NEXT: [[MUL_US_US]] = mul nsw i16 [[MUL1_US_US]], [[L_3]]
; CHECK-NEXT: store i16 [[MUL_US_US]], ptr @glob, align 2
-; CHECK-NEXT: br label [[LOOP_HEADER_US_US]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: br label [[LOOP_HEADER_US_US]]
+; CHECK: entry.split.us.split:
+; CHECK-NEXT: br label [[LOOP_HEADER_US:%.*]]
+; CHECK: loop.header.us:
+; CHECK-NEXT: [[CALL2_US:%.*]] = call i16 @foo()
+; CHECK-NEXT: br label [[THEN_BB_US:%.*]]
+; CHECK: then.bb.us:
+; CHECK-NEXT: br label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -79,7 +89,7 @@ define void @test_pr58158(i1 %c.1) {
; CHECK: outer.loopexit.us:
; CHECK-NEXT: br label [[OUTER_BACKEDGE_US:%.*]]
; CHECK: outer.backedge.us:
-; CHECK-NEXT: br label [[OUTER_US]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-NEXT: br label [[OUTER_US]]
; CHECK: entry.split:
; CHECK-NEXT: br label [[OUTER:%.*]]
; CHECK: outer:
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-freeze.ll b/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-freeze.ll
index d07c2fa4afd5d..8e97cb5cb42f8 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-freeze.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-freeze.ll
@@ -32,7 +32,7 @@ define i32 @test1_freeze(ptr %ptr0, ptr %ptr1, ptr %ptr2) {
; CHECK-NEXT: br label [[LATCH_US:%.*]]
; CHECK: latch.us:
; CHECK-NEXT: [[V_US:%.*]] = load i1, ptr [[PTR0:%.*]], align 1
-; CHECK-NEXT: br i1 [[V_US]], label [[LOOP_BEGIN_US]], label [[LOOP_EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: br i1 [[V_US]], label [[LOOP_BEGIN_US]], label [[LOOP_EXIT_SPLIT_US:%.*]]
; CHECK: loop_exit.split.us:
; CHECK-NEXT: br label [[LOOP_EXIT:%.*]]
; CHECK: entry.split:
@@ -50,7 +50,7 @@ define i32 @test1_freeze(ptr %ptr0, ptr %ptr1, ptr %ptr2) {
; CHECK-NEXT: br label [[LATCH_US2:%.*]]
; CHECK: latch.us2:
; CHECK-NEXT: [[V_US3:%.*]] = load i1, ptr [[PTR0]], align 1
-; CHECK-NEXT: br i1 [[V_US3]], label [[LOOP_BEGIN_US1]], label [[LOOP_EXIT_SPLIT_SPLIT_US:%.*]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-NEXT: br i1 [[V_US3]], label [[LOOP_BEGIN_US1]], label [[LOOP_EXIT_SPLIT_SPLIT_US:%.*]]
; CHECK: loop_exit.split.split.us:
; CHECK-NEXT: br label [[LOOP_EXIT_SPLIT:%.*]]
; CHECK: entry.split.split:
@@ -276,7 +276,7 @@ define i32 @test7b(ptr %ptr, ptr %cond.ptr, ptr %a.ptr, ptr %b.ptr) {
; CHECK-NEXT: [[V4_US:%.*]] = load i1, ptr [[PTR]], align 1
; CHECK-NEXT: br i1 [[V4_US]], label [[INNER_LOOP_EXIT_LOOPEXIT_SPLIT_US:%.*]], label [[INNER_INNER_LOOP_D_US:%.*]]
; CHECK: inner_inner_loop_d.us:
-; CHECK-NEXT: br label [[INNER_INNER_LOOP_BEGIN_US]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-NEXT: br label [[INNER_INNER_LOOP_BEGIN_US]]
; CHECK: inner_inner_loop_exit.split.us:
; CHECK-NEXT: br label [[INNER_INNER_LOOP_EXIT]]
; CHECK: loop_exit.split.us:
@@ -512,7 +512,7 @@ define i32 @test8b(ptr %ptr, ptr %cond.ptr, ptr %a.ptr, ptr %b.ptr) {
; CHECK-NEXT: [[V2_US:%.*]] = load i1, ptr [[PTR]], align 1
; CHECK-NEXT: br i1 [[V2_US]], label [[INNER_INNER_LOOP_LATCH_US:%.*]], label [[INNER_LOOP_EXIT_LOOPEXIT_SPLIT_US:%.*]]
; CHECK: inner_inner_loop_latch.us:
-; CHECK-NEXT: br label [[INNER_INNER_LOOP_BEGIN_US]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK-NEXT: br label [[INNER_INNER_LOOP_BEGIN_US]]
; CHECK: inner_inner_loop_exit.split.us:
; CHECK-NEXT: br label [[INNER_INNER_LOOP_EXIT]]
; CHECK: inner_loop_exit.loopexit.split.us:
@@ -614,7 +614,7 @@ define i32 @test10a(ptr %ptr, i1 %cond, ptr %a.ptr) {
; CHECK-NEXT: [[V2_US:%.*]] = load i1, ptr [[PTR]], align 1
; CHECK-NEXT: br i1 [[V2_US]], label [[LOOP_EXIT_SPLIT_US_LOOPEXIT:%.*]], label [[LOOP_BEGIN_BACKEDGE_US:%.*]]
; CHECK: loop_begin.backedge.us:
-; CHECK-NEXT: br label [[LOOP_BEGIN_US]], !llvm.loop [[LOOP5:![0-9]+]]
+; CHECK-NEXT: br label [[LOOP_BEGIN_US]]
; CHECK: loop_exit.split.us.loopexit:
; CHECK-NEXT: [[A_LCSSA_US_PH:%.*]] = phi i32 [ [[A_US]], [[LOOP_A_US]] ]
; CHECK-NEXT: br label [[LOOP_EXIT_SPLIT_US]]
@@ -682,7 +682,7 @@ define i32 @test10b(ptr %ptr, i1 %cond, ptr %a.ptr) {
; CHECK-NEXT: [[V2_US:%.*]] = load i1, ptr [[PTR]], align 1
; CHECK-NEXT: br i1 [[V2_US]], label [[LOOP_BEGIN_BACKEDGE_US]], label [[LOOP_EXIT_SPLIT_US:%.*]]
; CHECK: loop_begin.backedge.us:
-; CHECK-NEXT: br label [[LOOP_BEGIN_US]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK-NEXT: br label [[LOOP_BEGIN_US]]
; CHECK: loop_exit.split.us:
; CHECK-NEXT: [[A_LCSSA_US:%.*]] = phi i32 [ [[A_US]], [[LOOP_A_US]] ]
; CHECK-NEXT: br label [[LOOP_EXIT:%.*]]
@@ -844,7 +844,7 @@ define i32 @test11b(ptr %ptr, ptr %cond.ptr, ptr %a.ptr, ptr %b.ptr) {
; CHECK-NEXT: br label [[INNER_LOOP_A_US:%.*]]
; CHECK: inner_loop_a.us:
; CHECK-NEXT: [[V2_US:%.*]] = load i1, ptr [[PTR]], align 1
-; CHECK-NEXT: br i1 [[V2_US]], label [[INNER_LOOP_EXIT_SPLIT_US:%.*]], label [[INNER_LOOP_BEGIN_US]], !llvm.loop [[LOOP7:![0-9]+]]
+; CHECK-NEXT: br i1 [[V2_US]], label [[INNER_LOOP_EXIT_SPLIT_US:%.*]], label [[INNER_LOOP_BEGIN_US]]
; CHECK: inner_loop_exit.split.us:
; CHECK-NEXT: [[A_INNER_LCSSA_US:%.*]] = phi i32 [ [[A_US]], [[INNER_LOOP_A_US]] ]
; CHECK-NEXT: br label [[INNER_LOOP_EXIT:%.*]]
@@ -1033,7 +1033,7 @@ define i32 @test12b(ptr %ptr, ptr %cond.ptr, ptr %a.ptr, ptr %b.ptr) {
; CHECK-NEXT: br label [[INNER_INNER_LOOP_A_US:%.*]]
; CHECK: inner_inner_loop_a.us:
; CHECK-NEXT: [[V2_US:%.*]] = load i1, ptr [[PTR]], align 1
-; CHECK-NEXT: br i1 [[V2_US]], label [[INNER_INNER_LOOP_EXIT_SPLIT_US:%.*]], label [[INNER_INNER_LOOP_BEGIN_US]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK-NEXT: br i1 [[V2_US]], label [[INNER_INNER_LOOP_EXIT_SPLIT_US:%.*]], label [[INNER_INNER_LOOP_BEGIN_US]]
; CHECK: inner_inner_loop_exit.split.us:
; CHECK-NEXT: [[A_INNER_INNER_LCSSA_US:%.*]] = phi i32 [ [[A_US]], [[INNER_INNER_LOOP_A_US]] ]
; CHECK-NEXT: br label [[INNER_INNER_LOOP_EXIT:%.*]]
@@ -1142,7 +1142,7 @@ define i32 @test13a(ptr %ptr, i1 %cond, ptr %a.ptr, ptr %b.ptr) {
; CHECK-NEXT: [[V2_US:%.*]] = load i1, ptr [[PTR]], align 1
; CHECK-NEXT: br i1 [[V2_US]], label [[LOOP_EXIT_SPLIT_US:%.*]], label [[LOOP_LATCH_US]]
; CHECK: loop_latch.us:
-; CHECK-NEXT: br label [[LOOP_BEGIN_US]], !llvm.loop [[LOOP9:![0-9]+]]
+; CHECK-NEXT: br label [[LOOP_BEGIN_US]]
; CHECK: loop_exit.split.us:
; CHECK-NEXT: [[LCSSA_US:%.*]] = phi i32 [ [[A_US]], [[LOOP_A_US]] ]
; CHECK-NEXT: br label [[LOOP_EXIT:%.*]]
@@ -1237,7 +1237,7 @@ define i32 @test13b(ptr %ptr, i1 %cond, ptr %a.ptr, ptr %b.ptr) {
; CHECK-NEXT: [[V2_US:%.*]] = load i1, ptr [[PTR]], align 1
; CHECK-NEXT: br i1 [[V2_US]], label [[LOOP_EXIT_SPLIT_US_LOOPEXIT:%.*]], label [[LOOP_LATCH_US:%.*]]
; CHECK: loop_latch.us:
-; CHECK-NEXT: br label [[LOOP_BEGIN_US]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK-NEXT: br label [[LOOP_BEGIN_US]]
; CHECK: loop_exit.split.us.loopexit:
; CHECK-NEXT: [[LCSSA_US_PH:%.*]] = phi i32 [ [[A_US]], [[LOOP_A_US]] ]
; CHECK-NEXT: br label [[LOOP_EXIT_SPLIT_US]]
@@ -1356,7 +1356,7 @@ define void @test23(i1 %arg, ptr %ptr) {
; CHECK-NEXT: br label [[OUTER_LATCH_US:%.*]]
; CHECK: outer.latch.us:
; CHECK-NEXT: [[OUTER_COND_US:%.*]] = load i1, ptr [[PTR]], align 1
-; CHECK-NEXT: br i1 [[OUTER_COND_US]], label [[OUTER_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP11:![0-9]+]]
+; CHECK-NEXT: br i1 [[OUTER_COND_US]], label [[OUTER_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -1426,10 +1426,10 @@ define i32 @test29(i32 %arg) {
; CHECK-NEXT: entry:
; CHECK-NEXT: [[ARG_FR:%.*]] = freeze i32 [[ARG:%.*]]
; CHECK-NEXT: switch i32 [[ARG_FR]], label [[ENTRY_SPLIT:%.*]] [
-; CHECK-NEXT: i32 0, label [[ENTRY_SPLIT_US:%.*]]
-; CHECK-NEXT: i32 1, label [[ENTRY_SPLIT_US]]
-; CHECK-NEXT: i32 2, label [[ENTRY_SPLIT_US1:%.*]]
-; CHECK-NEXT: i32 3, label [[ENTRY_SPLIT]]
+; CHECK-NEXT: i32 0, label [[ENTRY_SPLIT_US:%.*]]
+; CHECK-NEXT: i32 1, label [[ENTRY_SPLIT_US]]
+; CHECK-NEXT: i32 2, label [[ENTRY_SPLIT_US1:%.*]]
+; CHECK-NEXT: i32 3, label [[ENTRY_SPLIT]]
; CHECK-NEXT: ]
; CHECK: entry.split.us:
; CHECK-NEXT: br label [[HEADER_US:%.*]]
@@ -1456,7 +1456,7 @@ define i32 @test29(i32 %arg) {
; CHECK-NEXT: br label [[LATCH_US:%.*]]
; CHECK: latch.us:
; CHECK-NEXT: [[CMP2_US:%.*]] = icmp slt i32 [[TMP_C_SUM_US]], 42
-; CHECK-NEXT: br i1 [[CMP2_US]], label [[HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP12:![0-9]+]]
+; CHECK-NEXT: br i1 [[CMP2_US]], label [[HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: [[LCSSA_PHI_US:%.*]] = phi i32 [ [[TMP_C_SUM_US]], [[LATCH_US]] ]
; CHECK-NEXT: br label [[EXIT:%.*]]
@@ -1485,7 +1485,7 @@ define i32 @test29(i32 %arg) {
; CHECK-NEXT: br label [[LATCH_US18:%.*]]
; CHECK: latch.us18:
; CHECK-NEXT: [[CMP2_US19:%.*]] = icmp slt i32 [[TMP_C_SUM_US17]], 42
-; CHECK-NEXT: br i1 [[CMP2_US19]], label [[HEADER_US2]], label [[EXIT_SPLIT_SPLIT_US:%.*]], !llvm.loop [[LOOP13:![0-9]+]]
+; CHECK-NEXT: br i1 [[CMP2_US19]], label [[HEADER_US2]], label [[EXIT_SPLIT_SPLIT_US:%.*]]
; CHECK: exit.split.split.us:
; CHECK-NEXT: [[LCSSA_PHI_US20:%.*]] = phi i32 [ [[TMP_C_SUM_US17]], [[LATCH_US18]] ]
; CHECK-NEXT: br label [[EXIT_SPLIT:%.*]]
@@ -1587,10 +1587,10 @@ define i32 @test30(i32 %arg) {
; CHECK-NEXT: entry:
; CHECK-NEXT: [[ARG_FR:%.*]] = freeze i32 [[ARG:%.*]]
; CHECK-NEXT: switch i32 [[ARG_FR]], label [[ENTRY_SPLIT:%.*]] [
-; CHECK-NEXT: i32 -1, label [[ENTRY_SPLIT]]
-; CHECK-NEXT: i32 0, label [[ENTRY_SPLIT_US:%.*]]
-; CHECK-NEXT: i32 1, label [[ENTRY_SPLIT_US1:%.*]]
-; CHECK-NEXT: i32 2, label [[ENTRY_SPLIT_US1]]
+; CHECK-NEXT: i32 -1, label [[ENTRY_SPLIT]]
+; CHECK-NEXT: i32 0, label [[ENTRY_SPLIT_US:%.*]]
+; CHECK-NEXT: i32 1, label [[ENTRY_SPLIT_US1:%.*]]
+; CHECK-NEXT: i32 2, label [[ENTRY_SPLIT_US1]]
; CHECK-NEXT: ]
; CHECK: entry.split.us:
; CHECK-NEXT: br label [[HEADER_US:%.*]]
@@ -1612,7 +1612,7 @@ define i32 @test30(i32 %arg) {
; CHECK-NEXT: br label [[LATCH_US:%.*]]
; CHECK: latch.us:
; CHECK-NEXT: [[CMP2_US:%.*]] = icmp slt i32 [[TMP_B_SUM_US]], 42
-; CHECK-NEXT: br i1 [[CMP2_US]], label [[HEADER_US]], label [[LOOP_EXIT2_SPLIT_US:%.*]], !llvm.loop [[LOOP14:![0-9]+]]
+; CHECK-NEXT: br i1 [[CMP2_US]], label [[HEADER_US]], label [[LOOP_EXIT2_SPLIT_US:%.*]]
; CHECK: loop.exit2.split.us:
; CHECK-NEXT: [[L2_PHI_US:%.*]] = phi i32 [ [[TMP_B_SUM_US]], [[LATCH_US]] ]
; CHECK-NEXT: br label [[LOOP_EXIT2:%.*]]
@@ -1636,7 +1636,7 @@ define i32 @test30(i32 %arg) {
; CHECK-NEXT: br label [[LATCH_US14:%.*]]
; CHECK: latch.us14:
; CHECK-NEXT: [[CMP2_US15:%.*]] = icmp slt i32 [[TMP_B_SUM_US13]], 42
-; CHECK-NEXT: br i1 [[CMP2_US15]], label [[HEADER_US2]], label [[LOOP_EXIT2_SPLIT_SPLIT_US:%.*]], !llvm.loop [[LOOP15:![0-9]+]]
+; CHECK-NEXT: br i1 [[CMP2_US15]], label [[HEADER_US2]], label [[LOOP_EXIT2_SPLIT_SPLIT_US:%.*]]
; CHECK: loop.exit2.split.split.us:
; CHECK-NEXT: [[L2_PHI_US16:%.*]] = phi i32 [ [[TMP_B_SUM_US13]], [[LATCH_US14]] ]
; CHECK-NEXT: br label [[LOOP_EXIT2_SPLIT:%.*]]
@@ -2259,9 +2259,9 @@ define void @hoist_inner_loop_switch(ptr %ptr) {
; CHECK-NEXT: [[V1:%.*]] = call i32 @cond.i32()
; CHECK-NEXT: [[V1_FR:%.*]] = freeze i32 [[V1]]
; CHECK-NEXT: switch i32 [[V1_FR]], label [[B_HEADER_SPLIT:%.*]] [
-; CHECK-NEXT: i32 1, label [[B_HEADER_SPLIT_US:%.*]]
-; CHECK-NEXT: i32 2, label [[B_HEADER_SPLIT_US]]
-; CHECK-NEXT: i32 3, label [[B_HEADER_SPLIT_US]]
+; CHECK-NEXT: i32 1, label [[B_HEADER_SPLIT_US:%.*]]
+; CHECK-NEXT: i32 2, label [[B_HEADER_SPLIT_US]]
+; CHECK-NEXT: i32 3, label [[B_HEADER_SPLIT_US]]
; CHECK-NEXT: ]
; CHECK: b.header.split.us:
; CHECK-NEXT: br label [[C_HEADER_US:%.*]]
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-select.ll b/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-select.ll
index 64b18291b22d1..c86fa349200c5 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-select.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-select.ll
@@ -28,7 +28,7 @@ define i32 @basic(i32 %N, i1 %cond, i32 %select_input) {
; CHECK-NEXT: [[UNSWITCHED_SELECT_US:%.*]] = phi i32 [ [[SELECT_INPUT]], [[TMP0]] ]
; CHECK-NEXT: [[ADD_US]] = add nuw nsw i32 [[UNSWITCHED_SELECT_US]], [[RES_US]]
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_US]], 1
-; CHECK-NEXT: br label [[FOR_COND_US]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: br label [[FOR_COND_US]]
; CHECK: for.cond.cleanup.split.us:
; CHECK-NEXT: [[RES_LCSSA_US:%.*]] = phi i32 [ [[RES_US]], [[FOR_COND_US]] ]
; CHECK-NEXT: br label [[FOR_COND_CLEANUP:%.*]]
@@ -132,7 +132,7 @@ define i32 @select_phi_input(i32 %N, i1 %cond) {
; CHECK-NEXT: [[UNSWITCHED_SELECT_US:%.*]] = phi i32 [ [[I_US]], [[TMP0]] ]
; CHECK-NEXT: [[ADD_US]] = add nuw nsw i32 [[UNSWITCHED_SELECT_US]], [[RES_US]]
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_US]], 1
-; CHECK-NEXT: br label [[FOR_COND_US]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-NEXT: br label [[FOR_COND_US]]
; CHECK: for.cond.cleanup.split.us:
; CHECK-NEXT: [[RES_LCSSA_US:%.*]] = phi i32 [ [[RES_US]], [[FOR_COND_US]] ]
; CHECK-NEXT: br label [[FOR_COND_CLEANUP:%.*]]
@@ -195,7 +195,7 @@ define i32 @basic_cond_noundef(i32 %N, i1 noundef %cond) {
; CHECK-NEXT: [[UNSWITCHED_SELECT_US:%.*]] = phi i32 [ [[I_US]], [[TMP0]] ]
; CHECK-NEXT: [[ADD_US]] = add nuw nsw i32 [[UNSWITCHED_SELECT_US]], [[RES_US]]
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_US]], 1
-; CHECK-NEXT: br label [[FOR_COND_US]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-NEXT: br label [[FOR_COND_US]]
; CHECK: for.cond.cleanup.split.us:
; CHECK-NEXT: [[RES_LCSSA_US:%.*]] = phi i32 [ [[RES_US]], [[FOR_COND_US]] ]
; CHECK-NEXT: br label [[FOR_COND_CLEANUP:%.*]]
@@ -285,24 +285,55 @@ define i32 @chained_select(i32 %N, i1 %cond, i1 %cond2) {
; CHECK-NEXT: [[COND_FR:%.*]] = freeze i1 [[COND]]
; CHECK-NEXT: br i1 [[COND_FR]], label [[ENTRY_SPLIT_US:%.*]], label [[ENTRY_SPLIT:%.*]]
; CHECK: entry.split.us:
+; CHECK-NEXT: [[COND2_FR13:%.*]] = freeze i1 [[COND2]]
+; CHECK-NEXT: br i1 [[COND2_FR13]], label [[ENTRY_SPLIT_US_SPLIT_US:%.*]], label [[ENTRY_SPLIT_US_SPLIT:%.*]]
+; CHECK: entry.split.us.split.us:
+; CHECK-NEXT: br label [[FOR_COND_US_US:%.*]]
+; CHECK: for.cond.us.us:
+; CHECK-NEXT: [[RES_US_US:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_US_SPLIT_US]] ], [ [[ADD_US_US:%.*]], [[TMP3:%.*]] ]
+; CHECK-NEXT: [[I_US_US:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_US_SPLIT_US]] ], [ [[INC_US_US:%.*]], [[TMP3]] ]
+; CHECK-NEXT: [[CMP_US_US:%.*]] = icmp slt i32 [[I_US_US]], [[N]]
+; CHECK-NEXT: br i1 [[CMP_US_US]], label [[FOR_BODY_US_US:%.*]], label [[FOR_COND_CLEANUP_SPLIT_US_SPLIT_US:%.*]]
+; CHECK: for.body.us.us:
+; CHECK-NEXT: br label [[TMP0:%.*]]
+; CHECK: 0:
+; CHECK-NEXT: br label [[TMP1:%.*]]
+; CHECK: 1:
+; CHECK-NEXT: [[UNSWITCHED_SELECT_US_US:%.*]] = phi i32 [ [[I_US_US]], [[TMP0]] ]
+; CHECK-NEXT: br label [[TMP2:%.*]]
+; CHECK: 2:
+; CHECK-NEXT: br label [[TMP3]]
+; CHECK: 3:
+; CHECK-NEXT: [[UNSWITCHED_SELECT_US11:%.*]] = phi i32 [ [[UNSWITCHED_SELECT_US_US]], [[TMP2]] ]
+; CHECK-NEXT: [[ADD_US_US]] = add nuw nsw i32 [[UNSWITCHED_SELECT_US11]], [[RES_US_US]]
+; CHECK-NEXT: [[INC_US_US]] = add nuw nsw i32 [[I_US_US]], 1
+; CHECK-NEXT: br label [[FOR_COND_US_US]]
+; CHECK: for.cond.cleanup.split.us.split.us:
+; CHECK-NEXT: [[RES_LCSSA_US_US:%.*]] = phi i32 [ [[RES_US_US]], [[FOR_COND_US_US]] ]
+; CHECK-NEXT: br label [[FOR_COND_CLEANUP_SPLIT_US:%.*]]
+; CHECK: entry.split.us.split:
; CHECK-NEXT: br label [[FOR_COND_US:%.*]]
; CHECK: for.cond.us:
-; CHECK-NEXT: [[RES_US:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_US]] ], [ [[ADD_US:%.*]], [[TMP1:%.*]] ]
-; CHECK-NEXT: [[I_US:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_US]] ], [ [[INC_US:%.*]], [[TMP1]] ]
+; CHECK-NEXT: [[RES_US:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_US_SPLIT]] ], [ [[ADD_US:%.*]], [[TMP6:%.*]] ]
+; CHECK-NEXT: [[I_US:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_US_SPLIT]] ], [ [[INC_US:%.*]], [[TMP6]] ]
; CHECK-NEXT: [[CMP_US:%.*]] = icmp slt i32 [[I_US]], [[N]]
-; CHECK-NEXT: br i1 [[CMP_US]], label [[FOR_BODY_US:%.*]], label [[FOR_COND_CLEANUP_SPLIT_US:%.*]]
+; CHECK-NEXT: br i1 [[CMP_US]], label [[FOR_BODY_US:%.*]], label [[FOR_COND_CLEANUP_SPLIT_US_SPLIT:%.*]]
; CHECK: for.body.us:
-; CHECK-NEXT: br label [[TMP0:%.*]]
-; CHECK: 0:
-; CHECK-NEXT: br label [[TMP1]]
-; CHECK: 1:
-; CHECK-NEXT: [[UNSWITCHED_SELECT_US:%.*]] = phi i32 [ [[I_US]], [[TMP0]] ]
-; CHECK-NEXT: [[SELECT2_US:%.*]] = select i1 [[COND2]], i32 [[UNSWITCHED_SELECT_US]], i32 24
-; CHECK-NEXT: [[ADD_US]] = add nuw nsw i32 [[SELECT2_US]], [[RES_US]]
+; CHECK-NEXT: br label [[TMP4:%.*]]
+; CHECK: 4:
+; CHECK-NEXT: br label [[TMP5:%.*]]
+; CHECK: 5:
+; CHECK-NEXT: [[UNSWITCHED_SELECT_US:%.*]] = phi i32 [ [[I_US]], [[TMP4]] ]
+; CHECK-NEXT: br label [[TMP6]]
+; CHECK: 6:
+; CHECK-NEXT: [[ADD_US]] = add nuw nsw i32 24, [[RES_US]]
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_US]], 1
-; CHECK-NEXT: br label [[FOR_COND_US]], !llvm.loop [[LOOP4:![0-9]+]]
-; CHECK: for.cond.cleanup.split.us:
+; CHECK-NEXT: br label [[FOR_COND_US]]
+; CHECK: for.cond.cleanup.split.us.split:
; CHECK-NEXT: [[RES_LCSSA_US:%.*]] = phi i32 [ [[RES_US]], [[FOR_COND_US]] ]
+; CHECK-NEXT: br label [[FOR_COND_CLEANUP_SPLIT_US]]
+; CHECK: for.cond.cleanup.split.us:
+; CHECK-NEXT: [[DOTUS_PHI12:%.*]] = phi i32 [ [[RES_LCSSA_US]], [[FOR_COND_CLEANUP_SPLIT_US_SPLIT]] ], [ [[RES_LCSSA_US_US]], [[FOR_COND_CLEANUP_SPLIT_US_SPLIT_US]] ]
; CHECK-NEXT: br label [[FOR_COND_CLEANUP:%.*]]
; CHECK: entry.split:
; CHECK-NEXT: [[COND2_FR:%.*]] = freeze i1 [[COND2]]
@@ -310,36 +341,36 @@ define i32 @chained_select(i32 %N, i1 %cond, i1 %cond2) {
; CHECK: entry.split.split.us:
; CHECK-NEXT: br label [[FOR_COND_US1:%.*]]
; CHECK: for.cond.us1:
-; CHECK-NEXT: [[RES_US2:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_SPLIT_US]] ], [ [[ADD_US7:%.*]], [[TMP4:%.*]] ]
-; CHECK-NEXT: [[I_US3:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_SPLIT_US]] ], [ [[INC_US8:%.*]], [[TMP4]] ]
+; CHECK-NEXT: [[RES_US2:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_SPLIT_US]] ], [ [[ADD_US7:%.*]], [[TMP9:%.*]] ]
+; CHECK-NEXT: [[I_US3:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_SPLIT_US]] ], [ [[INC_US8:%.*]], [[TMP9]] ]
; CHECK-NEXT: [[CMP_US4:%.*]] = icmp slt i32 [[I_US3]], [[N]]
; CHECK-NEXT: br i1 [[CMP_US4]], label [[FOR_BODY_US5:%.*]], label [[FOR_COND_CLEANUP_SPLIT_SPLIT_US:%.*]]
; CHECK: for.body.us5:
-; CHECK-NEXT: br label [[TMP2:%.*]]
-; CHECK: 2:
-; CHECK-NEXT: br label [[TMP3:%.*]]
-; CHECK: 3:
-; CHECK-NEXT: br label [[TMP4]]
-; CHECK: 4:
-; CHECK-NEXT: [[UNSWITCHED_SELECT_US6:%.*]] = phi i32 [ 42, [[TMP3]] ]
+; CHECK-NEXT: br label [[TMP7:%.*]]
+; CHECK: 7:
+; CHECK-NEXT: br label [[TMP8:%.*]]
+; CHECK: 8:
+; CHECK-NEXT: br label [[TMP9]]
+; CHECK: 9:
+; CHECK-NEXT: [[UNSWITCHED_SELECT_US6:%.*]] = phi i32 [ 42, [[TMP8]] ]
; CHECK-NEXT: [[ADD_US7]] = add nuw nsw i32 [[UNSWITCHED_SELECT_US6]], [[RES_US2]]
; CHECK-NEXT: [[INC_US8]] = add nuw nsw i32 [[I_US3]], 1
-; CHECK-NEXT: br label [[FOR_COND_US1]], !llvm.loop [[LOOP5:![0-9]+]]
+; CHECK-NEXT: br label [[FOR_COND_US1]]
; CHECK: for.cond.cleanup.split.split.us:
; CHECK-NEXT: [[RES_LCSSA_US9:%.*]] = phi i32 [ [[RES_US2]], [[FOR_COND_US1]] ]
; CHECK-NEXT: br label [[FOR_COND_CLEANUP_SPLIT:%.*]]
; CHECK: entry.split.split:
; CHECK-NEXT: br label [[FOR_COND:%.*]]
; CHECK: for.cond:
-; CHECK-NEXT: [[RES:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_SPLIT]] ], [ [[ADD:%.*]], [[TMP6:%.*]] ]
-; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_SPLIT]] ], [ [[INC:%.*]], [[TMP6]] ]
+; CHECK-NEXT: [[RES:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_SPLIT]] ], [ [[ADD:%.*]], [[TMP11:%.*]] ]
+; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_SPLIT]] ], [ [[INC:%.*]], [[TMP11]] ]
; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[I]], [[N]]
; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP_SPLIT_SPLIT:%.*]]
; CHECK: for.body:
-; CHECK-NEXT: br label [[TMP5:%.*]]
-; CHECK: 5:
-; CHECK-NEXT: br label [[TMP6]]
-; CHECK: 6:
+; CHECK-NEXT: br label [[TMP10:%.*]]
+; CHECK: 10:
+; CHECK-NEXT: br label [[TMP11]]
+; CHECK: 11:
; CHECK-NEXT: [[ADD]] = add nuw nsw i32 24, [[RES]]
; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I]], 1
; CHECK-NEXT: br label [[FOR_COND]]
@@ -350,7 +381,7 @@ define i32 @chained_select(i32 %N, i1 %cond, i1 %cond2) {
; CHECK-NEXT: [[DOTUS_PHI10:%.*]] = phi i32 [ [[RES_LCSSA]], [[FOR_COND_CLEANUP_SPLIT_SPLIT]] ], [ [[RES_LCSSA_US9]], [[FOR_COND_CLEANUP_SPLIT_SPLIT_US]] ]
; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]
; CHECK: for.cond.cleanup:
-; CHECK-NEXT: [[DOTUS_PHI:%.*]] = phi i32 [ [[DOTUS_PHI10]], [[FOR_COND_CLEANUP_SPLIT]] ], [ [[RES_LCSSA_US]], [[FOR_COND_CLEANUP_SPLIT_US]] ]
+; CHECK-NEXT: [[DOTUS_PHI:%.*]] = phi i32 [ [[DOTUS_PHI10]], [[FOR_COND_CLEANUP_SPLIT]] ], [ [[DOTUS_PHI12]], [[FOR_COND_CLEANUP_SPLIT_US]] ]
; CHECK-NEXT: ret i32 [[DOTUS_PHI]]
;
entry:
@@ -396,7 +427,7 @@ define i32 @select_in_if(i32 %N, i1 %cond) {
; CHECK-NEXT: [[P_US:%.*]] = phi i32 [ [[UNSWITCHED_SELECT_US:%.*]], [[TMP1:%.*]] ], [ 24, [[FOR_BODY_US]] ]
; CHECK-NEXT: [[ADD_US]] = add nuw nsw i32 [[P_US]], [[RES_US]]
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_US]], 1
-; CHECK-NEXT: br label [[FOR_COND_US]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK-NEXT: br label [[FOR_COND_US]]
; CHECK: 0:
; CHECK-NEXT: br label [[TMP1]]
; CHECK: 1:
@@ -486,7 +517,7 @@ define i32 @select_in_if_else(i32 %N, i1 %cond) {
; CHECK-NEXT: [[P_US:%.*]] = phi i32 [ [[COND1A_US]], [[FOR_BODY_IF_US]] ], [ [[UNSWITCHED_SELECT_US:%.*]], [[TMP1:%.*]] ]
; CHECK-NEXT: [[ADD_US]] = add nuw nsw i32 [[P_US]], [[RES_US]]
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_US]], 1
-; CHECK-NEXT: br label [[FOR_COND_US]], !llvm.loop [[LOOP7:![0-9]+]]
+; CHECK-NEXT: br label [[FOR_COND_US]]
; CHECK: 0:
; CHECK-NEXT: br label [[TMP1]]
; CHECK: 1:
@@ -575,7 +606,7 @@ define dso_local void @select_nested_loop(i1 noundef zeroext %cond, i32 noundef
; CHECK: for.cond1.for.cond.cleanup3_crit_edge.us.us:
; CHECK-NEXT: [[INC7_US_US]] = add nuw i32 [[I_018_US_US]], 1
; CHECK-NEXT: [[EXITCOND21_NOT_US:%.*]] = icmp eq i32 [[INC7_US_US]], [[N]]
-; CHECK-NEXT: br i1 [[EXITCOND21_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_COND1_PREHEADER_US_US]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND21_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_COND1_PREHEADER_US_US]]
; CHECK: for.cond1.preheader.us.split.us.us:
; CHECK-NEXT: br label [[FOR_BODY4_US_US_US:%.*]]
; CHECK: for.body4.us.us.us:
@@ -588,7 +619,7 @@ define dso_local void @select_nested_loop(i1 noundef zeroext %cond, i32 noundef
; CHECK-NEXT: tail call void @bar(i32 noundef [[UNSWITCHED_SELECT_US_US]])
; CHECK-NEXT: [[INC_US_US_US]] = add nuw i32 [[J_016_US_US_US]], 1
; CHECK-NEXT: [[EXITCOND_NOT_US_US:%.*]] = icmp eq i32 [[INC_US_US_US]], [[M]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT_US_US]], label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_SPLIT_US_US:%.*]], label [[FOR_BODY4_US_US_US]], !llvm.loop [[LOOP9:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT_US_US]], label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_SPLIT_US_US:%.*]], label [[FOR_BODY4_US_US_US]]
; CHECK: for.cond1.for.cond.cleanup3_crit_edge.us.split.us.us:
; CHECK-NEXT: br label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_US]]
; CHECK: for.cond.cleanup.loopexit.split.us:
@@ -676,7 +707,7 @@ define dso_local void @select_invariant_outer_loop(i1 noundef zeroext %cond, i32
; CHECK-NEXT: tail call void @bar(i32 noundef [[UNSWITCHED_SELECT_US]])
; CHECK-NEXT: [[INC_US_US]] = add nuw i32 [[J_019_US_US]], 1
; CHECK-NEXT: [[EXITCOND_NOT_US:%.*]] = icmp eq i32 [[INC_US_US]], [[M]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_SPLIT_US:%.*]], label [[FOR_BODY4_US_US]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_SPLIT_US:%.*]], label [[FOR_BODY4_US_US]]
; CHECK: for.cond1.for.cond.cleanup3_crit_edge.us.split.us:
; CHECK-NEXT: br label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US]]
; CHECK: for.cond1.preheader.us.split:
@@ -751,7 +782,7 @@ define dso_local i32 @trivial_select_cond(i32 noundef %n, i32 noundef %a, i32 no
; CHECK-NEXT: tail call void @bar(i32 noundef [[UNSWITCHED_SELECT_US]])
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_03_US]], 1
; CHECK-NEXT: [[EXITCOND_NOT_US:%.*]] = icmp eq i32 [[INC_US]], [[N]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_BODY_US]], !llvm.loop [[LOOP11:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_BODY_US]]
; CHECK: for.cond.cleanup.loopexit.split.us:
; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]]
; CHECK: for.body.preheader.split:
@@ -808,7 +839,7 @@ define i32 @and_lhs_invariant(i32 %num, i1 %cond) {
; CHECK-NEXT: tail call void @bar(i32 noundef [[UNSWITCHED_SELECT_US]])
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_07_US]], 1
; CHECK-NEXT: [[EXITCOND_NOT_US:%.*]] = icmp eq i32 [[INC_US]], [[NUM]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_BODY_US]], !llvm.loop [[LOOP12:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_BODY_US]]
; CHECK: for.cond.cleanup.loopexit.split.us:
; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]]
; CHECK: for.body.preheader.split:
@@ -873,7 +904,7 @@ define i32 @and_rhs_invariant(i32 %num, i1 %cond) {
; CHECK-NEXT: tail call void @bar(i32 noundef [[UNSWITCHED_SELECT_US]])
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_07_US]], 1
; CHECK-NEXT: [[EXITCOND_NOT_US:%.*]] = icmp eq i32 [[INC_US]], [[NUM]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_BODY_US]], !llvm.loop [[LOOP13:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_BODY_US]]
; CHECK: for.cond.cleanup.loopexit.split.us:
; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]]
; CHECK: for.body.preheader.split:
@@ -940,7 +971,7 @@ define i32 @or_lhs_invariant(i32 %num, i1 %cond) {
; CHECK-NEXT: tail call void @bar(i32 noundef [[UNSWITCHED_SELECT_US]])
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_07_US]], 1
; CHECK-NEXT: [[EXITCOND_NOT_US:%.*]] = icmp eq i32 [[INC_US]], [[NUM]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_BODY_US]], !llvm.loop [[LOOP14:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_BODY_US]]
; CHECK: for.cond.cleanup.loopexit.split.us:
; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]]
; CHECK: for.body.preheader.split:
@@ -1007,7 +1038,7 @@ define i32 @or_rhs_invariant(i32 %num, i1 %cond) {
; CHECK-NEXT: tail call void @bar(i32 noundef [[UNSWITCHED_SELECT_US]])
; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[I_07_US]], 1
; CHECK-NEXT: [[EXITCOND_NOT_US:%.*]] = icmp eq i32 [[INC_US]], [[NUM]]
-; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_BODY_US]], !llvm.loop [[LOOP15:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND_NOT_US]], label [[FOR_COND_CLEANUP_LOOPEXIT_SPLIT_US:%.*]], label [[FOR_BODY_US]]
; CHECK: for.cond.cleanup.loopexit.split.us:
; CHECK-NEXT: br label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]]
; CHECK: for.body.preheader.split:
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch.ll b/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch.ll
index 36f7a9e8cd654..9567b6b793239 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch.ll
@@ -2626,45 +2626,66 @@ loop_a:
; The second unswitched condition.
;
; CHECK: entry.split.us:
-; CHECK-NEXT: br label %loop_begin.us
+; CHECK-NEXT: br i1 %cond2, label %entry.split.us.split.us, label %entry.split.us.split
loop_a_a:
call i32 @a()
br label %latch
; The 'loop_a_a' unswitched loop.
;
-; CHECK: loop_begin.us:
-; CHECK-NEXT: br label %loop_a.us
+; CHECK: entry.split.us.split.us:
+; CHECK-NEXT: br label %loop_begin.us.us
;
-; CHECK: loop_a.us:
-; CHECK-NEXT: br i1 %cond2, label %loop_a_a.us, label %loop_a_c.us
+; CHECK: loop_begin.us.us:
+; CHECK-NEXT: br label %loop_a.us.us
;
-; The 'loop_a_c' unswitched loop.
+; CHECK: loop_a.us.us:
+; CHECK-NEXT: br label %loop_a_a.us.us
;
-; CHECK: loop_a_c.us:
-; CHECK-NEXT: call i32 @c()
-; CHECK-NEXT: br label %latch.us
-;
-; CHECK: loop_a_a.us:
+; CHECK: loop_a_a.us.us:
; CHECK-NEXT: call i32 @a()
-; CHECK-NEXT: br label %latch.us
+; CHECK-NEXT: br label %latch.us.us
;
-; CHECK: latch.us:
+; CHECK: latch.us.us:
; CHECK-NEXT: %[[V:.*]] = load i1, ptr %ptr
-; CHECK-NEXT: br i1 %[[V]], label %loop_begin.us, label %loop_exit.split.us, !llvm.loop !22
+; CHECK-NEXT: br i1 %[[V]], label %loop_begin.us.us, label %loop_exit.split.us.split.us
;
-; CHECK: loop_exit.split.us
-; CHECK-NEXT: br label %loop_exit
+; CHECK: loop_exit.split.us.split.us:
+; CHECK-NEXT: br label %loop_exit.split
loop_a_c:
call i32 @c()
br label %latch
+; The 'loop_a_c' unswitched loop.
+;
+; CHECK: entry.split.us.split:
+; CHECK-NEXT: br label %loop_begin.us
+;
+; CHECK: loop_begin.us:
+; CHECK-NEXT: br label %loop_a.us
+;
+; CHECK: loop_a.us:
+; CHECK-NEXT: br label %loop_a_c.us
+;
+; CHECK: loop_a_c.us:
+; CHECK-NEXT: call i32 @c()
+; CHECK-NEXT: br label %latch
+;
+; CHECK: latch.us:
+; CHECK-NEXT: %[[V:.*]] = load i1, ptr %ptr
+; CHECK-NEXT: br i1 %[[V]], label %loop_begin.us, label %loop_exit.split.us.split
+;
+; CHECK: loop_exit.split.us.split:
+; CHECK-NEXT: br label %loop_exit.split
loop_b:
call i32 @b()
br label %latch
; The 'loop_b' unswitched loop.
;
+; CHECK: entry.split:
+; CHECK-NEXT: br label %loop_begin
+;
; CHECK: loop_begin:
; CHECK-NEXT: br label %loop_b
;
@@ -2964,9 +2985,9 @@ loop_a:
;
; CHECK: [[LOOP_LATCH_A]]:
; CHECK-NEXT: %[[V_A:.*]] = load i1, ptr %ptr
-; CHECK: br i1 %[[V_A]], label %loop_begin.us, label %loop_exit.split.us, !llvm.loop !26
+; CHECK: br i1 %[[V_A]], label %[[LOOP_BEGIN_A]], label %[[LOOP_EXIT_A:.*]]
;
-; CHECK: loop_exit.split.us:
+; CHECK: [[LOOP_EXIT_A]]:
; CHECK-NEXT: br label %loop_exit
loop_b:
@@ -2986,10 +3007,10 @@ loop_b:
;
; CHECK: [[LOOP_LATCH_B]]:
; CHECK-NEXT: %[[V_B:.*]] = load i1, ptr %ptr
-; CHECK: br i1 %[[V_B]], label %loop_begin.us2, label %loop_exit.split.split.us, !llvm.loop !27
+; CHECK: br i1 %[[V_B]], label %[[LOOP_BEGIN_B]], label %[[LOOP_EXIT_B:.*]]
;
-; CHECK: loop_exit.split.split.us:
-; CHECK-NEXT: br label %loop_exit.split
+; CHECK: [[LOOP_EXIT_B]]:
+; CHECK-NEXT: br label %loop_exit
loop_c:
call i32 @c()
@@ -3008,10 +3029,10 @@ loop_c:
;
; CHECK: [[LOOP_LATCH_C]]:
; CHECK-NEXT: %[[V_C:.*]] = load i1, ptr %ptr
-; CHECK: br i1 %[[V_C]], label %loop_begin.us6, label %loop_exit.split.split.split.us, !llvm.loop !28
+; CHECK: br i1 %[[V_C]], label %[[LOOP_BEGIN_C]], label %[[LOOP_EXIT_C:.*]]
;
-; CHECK: loop_exit.split.split.split.us:
-; CHECK-NEXT: br label %loop_exit.split.split
+; CHECK: [[LOOP_EXIT_C]]:
+; CHECK-NEXT: br label %loop_exit
latch:
%v = load i1, ptr %ptr
@@ -3111,9 +3132,9 @@ body.a:
;
; CHECK: [[LATCH_A]]:
; CHECK-NEXT: %[[CMP2_A:.*]] = icmp slt i32 %[[TMP_C_SUM_A]], 42
-; CHECK: br i1 %[[CMP2_A]], label %header.us, label %exit.split.us, !llvm.loop !29
+; CHECK: br i1 %[[CMP2_A]], label %[[HEADER_A]], label %[[LOOP_EXIT_A:.*]]
;
-; CHECK: exit.split.us:
+; CHECK: [[LOOP_EXIT_A]]:
; CHECK-NEXT: %[[LCSSA_A:.*]] = phi i32 [ %[[TMP_C_SUM_A]], %[[LATCH_A]] ]
; CHECK-NEXT: br label %exit
@@ -3155,9 +3176,9 @@ body.b:
;
; CHECK: [[LATCH_B]]:
; CHECK-NEXT: %[[CMP2_B:.*]] = icmp slt i32 %[[TMP_C_SUM_B]], 42
-; CHECK: br i1 %[[CMP2_B]], label %header.us2, label %exit.split.split.us, !llvm.loop !30
+; CHECK: br i1 %[[CMP2_B]], label %[[HEADER_B]], label %[[LOOP_EXIT_B:.*]]
;
-; CHECK: exit.split.split.us:
+; CHECK: [[LOOP_EXIT_B]]:
; CHECK-NEXT: %[[LCSSA_B:.*]] = phi i32 [ %[[TMP_C_SUM_B]], %[[LATCH_B]] ]
; CHECK-NEXT: br label %[[EXIT_SPLIT:.*]]
@@ -3213,11 +3234,11 @@ exit:
%lcssa.phi = phi i32 [ %tmp.c.sum, %latch ]
ret i32 %lcssa.phi
; CHECK: [[EXIT_SPLIT]]:
-; CHECK-NEXT: %[[EXIT_PHI1:.*]] = phi i32 [ %[[LCSSA_C]], %[[LOOP_EXIT_C]] ], [ %[[LCSSA_B]], %exit.split.split.us ]
+; CHECK-NEXT: %[[EXIT_PHI1:.*]] = phi i32 [ %[[LCSSA_C]], %[[LOOP_EXIT_C]] ], [ %[[LCSSA_B]], %[[LOOP_EXIT_B]] ]
; CHECK-NEXT: br label %exit
; CHECK: exit:
-; CHECK-NEXT: %[[EXIT_PHI2:.*]] = phi i32 [ %[[EXIT_PHI1]], %[[EXIT_SPLIT]] ], [ %[[LCSSA_A]], %exit.split.us ]
+; CHECK-NEXT: %[[EXIT_PHI2:.*]] = phi i32 [ %[[EXIT_PHI1]], %[[EXIT_SPLIT]] ], [ %[[LCSSA_A]], %[[LOOP_EXIT_A]] ]
; CHECK-NEXT: ret i32 %[[EXIT_PHI2]]
}
@@ -3283,9 +3304,9 @@ body.a:
;
; CHECK: [[LATCH_A]]:
; CHECK-NEXT: %[[CMP2_A:.*]] = icmp slt i32 %[[TMP_B_SUM_A]], 42
-; CHECK: br i1 %[[CMP2_A]], label %header.us, label %loop.exit2.split.us, !llvm.loop !31
+; CHECK: br i1 %[[CMP2_A]], label %[[HEADER_A]], label %[[LOOP_EXIT_A:.*]]
;
-; CHECK: loop.exit2.split.us:
+; CHECK: [[LOOP_EXIT_A]]:
; CHECK-NEXT: %[[LCSSA_A:.*]] = phi i32 [ %[[TMP_B_SUM_A]], %[[LATCH_A]] ]
; CHECK-NEXT: br label %loop.exit2
@@ -3321,9 +3342,9 @@ body.b:
;
; CHECK: [[LATCH_B]]:
; CHECK-NEXT: %[[CMP2_B:.*]] = icmp slt i32 %[[TMP_B_SUM_B]], 42
-; CHECK: br i1 %[[CMP2_B]], label %header.us2, label %loop.exit2.split.split.us, !llvm.loop !32
+; CHECK: br i1 %[[CMP2_B]], label %[[HEADER_B]], label %[[LOOP_EXIT_B:.*]]
;
-; CHECK: loop.exit2.split.split.us:
+; CHECK: [[LOOP_EXIT_B]]:
; CHECK-NEXT: %[[LCSSA_B:.*]] = phi i32 [ %[[TMP_B_SUM_B]], %[[LATCH_B]] ]
; CHECK-NEXT: br label %[[LOOP_EXIT2_SPLIT:.*]]
@@ -3376,11 +3397,11 @@ loop.exit2:
%l2.phi = phi i32 [ %tmp.b.sum, %latch ]
br label %exit
; CHECK: [[LOOP_EXIT2_SPLIT]]:
-; CHECK-NEXT: %[[LOOP_EXIT_PHI1:.*]] = phi i32 [ %[[L2_PHI]], %[[LOOP_EXIT_EXIT]] ], [ %[[LCSSA_B]], %loop.exit2.split.split.us ]
+; CHECK-NEXT: %[[LOOP_EXIT_PHI1:.*]] = phi i32 [ %[[L2_PHI]], %[[LOOP_EXIT_EXIT]] ], [ %[[LCSSA_B]], %[[LOOP_EXIT_B]] ]
; CHECK-NEXT: br label %loop.exit2
;
; CHECK: loop.exit2:
-; CHECK-NEXT: %[[LOOP_EXIT_PHI2:.*]] = phi i32 [ %[[LOOP_EXIT_PHI1]], %[[LOOP_EXIT2_SPLIT]] ], [ %[[LCSSA_A]], %loop.exit2.split.us ]
+; CHECK-NEXT: %[[LOOP_EXIT_PHI2:.*]] = phi i32 [ %[[LOOP_EXIT_PHI1]], %[[LOOP_EXIT2_SPLIT]] ], [ %[[LCSSA_A]], %[[LOOP_EXIT_A]] ]
; CHECK-NEXT: br label %exit
exit:
@@ -4037,7 +4058,9 @@ entry:
; CHECK-NEXT: ]
;
; CHECK: [[ENTRY_SPLIT_US]]:
-; CHECK-NEXT: br label %outer.header.us
+; CHECK-NEXT: switch i32 %arg, label %[[ENTRY_SPLIT_US_SPLIT:.*]] [
+; CHECK-NEXT: i32 1, label %[[ENTRY_SPLIT_US_SPLIT_US:.*]]
+; CHECK-NEXT: ]
outer.header:
br label %inner.header
@@ -4051,13 +4074,66 @@ inner.header:
inner.body1:
%a = call i32 @a()
br label %inner.latch
+; The (super convoluted) fully unswitched loop around `@a`.
+;
+; CHECK: [[ENTRY_SPLIT_US_SPLIT_US]]:
+; CHECK-NEXT: br label %[[OUTER_HEADER_US_US:.*]]
+;
+; CHECK: [[OUTER_HEADER_US_US]]:
+; CHECK-NEXT: br label %[[OUTER_HEADER_SPLIT_US_US:.*]]
+;
+; CHECK: [[OUTER_LATCH_US_US:.*]]:
+; CHECK-NEXT: %[[OUTER_COND_US_US:.*]] = call i1 @cond()
+; CHECK-NEXT: br i1 %[[OUTER_COND_US_US]], label %[[OUTER_HEADER_US_US]], label %[[EXIT_SPLIT_US_SPLIT_US:.*]]
+;
+; CHECK: [[OUTER_HEADER_SPLIT_US_US]]:
+; CHECK-NEXT: br label %[[OUTER_HEADER_SPLIT_SPLIT_US_US_US:.*]]
+;
+; CHECK: [[INNER_LOOPEXIT2_US_US:.*]]:
+; CHECK-NEXT: br label %[[OUTER_LATCH_US_US]]
+;
+; CHECK: [[OUTER_HEADER_SPLIT_SPLIT_US_US_US]]:
+; CHECK-NEXT: br label %[[INNER_HEADER_US_US_US:.*]]
+;
+; CHECK: [[INNER_HEADER_US_US_US]]:
+; CHECK-NEXT: br label %[[INNER_BODY1_US_US_US:.*]]
+;
+; CHECK: [[INNER_BODY1_US_US_US]]:
+; CHECK-NEXT: %[[A:.*]] = call i32 @a()
+; CHECK-NEXT: br label %[[INNER_LATCH_US_US_US:.*]]
+;
+; CHECK: [[INNER_LATCH_US_US_US]]:
+; CHECK-NEXT: %[[PHI_A:.*]] = phi i32 [ %[[A]], %[[INNER_BODY1_US_US_US]] ]
+; CHECK-NEXT: call void @sink1(i32 0)
+; CHECK-NEXT: call void @sink1(i32 0)
+; CHECK-NEXT: call void @sink1(i32 0)
+; CHECK-NEXT: call void @sink1(i32 0)
+; CHECK-NEXT: call void @sink1(i32 0)
+; CHECK-NEXT: call void @sink1(i32 0)
+; CHECK-NEXT: call void @sink1(i32 0)
+; CHECK-NEXT: call void @sink1(i32 0)
+; CHECK-NEXT: call void @sink1(i32 0)
+; CHECK-NEXT: call void @sink1(i32 0)
+; CHECK-NEXT: call void @sink1(i32 %[[PHI_A]])
+; CHECK-NEXT: %[[INNER_COND_US_US_US:.*]] = call i1 @cond()
+; CHECK-NEXT: br i1 %[[INNER_COND_US_US_US]], label %[[INNER_HEADER_US_US_US]], label %[[INNER_LOOPEXIT2_SPLIT_US_US_US:.*]]
+;
+; CHECK: [[INNER_LOOPEXIT2_SPLIT_US_US_US]]:
+; CHECK-NEXT: br label %[[INNER_LOOPEXIT2_US_US]]
+;
+; CHECK: [[EXIT_SPLIT_US_SPLIT_US]]:
+; CHECK-NEXT: br label %[[EXIT_SPLIT_US:.*]]
+
inner.body2:
%b = call i32 @b()
br label %inner.latch
; The fully unswitched loop around `@b`.
;
-; CHECK: outer.header.us:
+; CHECK: [[ENTRY_SPLIT_US_SPLIT]]:
+; CHECK-NEXT: br label %[[OUTER_HEADER_US:.*]]
+;
+; CHECK: [[OUTER_HEADER_US]]:
; CHECK-NEXT: br label %[[OUTER_HEADER_SPLIT_US:.*]]
;
; CHECK: [[INNER_HEADER_US:.*]]:
@@ -4087,51 +4163,18 @@ inner.body2:
;
; CHECK: [[OUTER_LATCH_US:.*]]:
; CHECK-NEXT: %[[OUTER_COND_US:.*]] = call i1 @cond()
-; CHECK-NEXT: br i1 %[[OUTER_COND_US]], label %outer.header.us, label %exit.split.us, !llvm.loop !33
+; CHECK-NEXT: br i1 %[[OUTER_COND_US]], label %[[OUTER_HEADER_US]], label %[[EXIT_SPLIT_US_SPLIT:.*]]
;
; CHECK: [[OUTER_HEADER_SPLIT_US]]:
-; CHECK-NEXT: switch i32 %arg, label %outer.header.split.split.us5 [
-; CHECK-NEXT: i32 1, label %outer.header.split.split.us.us
-; CHECK-NEXT: ]
+; CHECK-NEXT: br label %[[OUTER_HEADER_SPLIT_SPLIT_US:.*]]
;
-; CHECK: outer.header.split.split.us5:
+; CHECK: [[OUTER_HEADER_SPLIT_SPLIT_US]]:
; CHECK-NEXT: br label %[[INNER_HEADER_US]]
;
; CHECK: [[INNER_LOOPEXIT2_US]]:
; CHECK-NEXT: br label %[[OUTER_LATCH_US]]
-
-; The (super convoluted) fully unswitched loop around `@a`.
-;
-; CHECK: outer.header.split.split.us.us:
-; CHECK-NEXT: br label %[[INNER_HEADER_US_US:.*]]
-;
-; CHECK: [[INNER_HEADER_US_US]]:
-; CHECK-NEXT: br label %[[INNER_BODY1_US_US:.*]]
-;
-; CHECK: [[INNER_BODY1_US_US]]:
-; CHECK-NEXT: %[[A:.*]] = call i32 @a()
-; CHECK-NEXT: br label %[[INNER_LATCH_US_US:.*]]
-;
-; CHECK: [[INNER_LATCH_US_US]]:
-; CHECK-NEXT: %[[PHI_A:.*]] = phi i32 [ %[[A]], %[[INNER_BODY1_US_US]] ]
-; CHECK-NEXT: call void @sink1(i32 0)
-; CHECK-NEXT: call void @sink1(i32 0)
-; CHECK-NEXT: call void @sink1(i32 0)
-; CHECK-NEXT: call void @sink1(i32 0)
-; CHECK-NEXT: call void @sink1(i32 0)
-; CHECK-NEXT: call void @sink1(i32 0)
-; CHECK-NEXT: call void @sink1(i32 0)
-; CHECK-NEXT: call void @sink1(i32 0)
-; CHECK-NEXT: call void @sink1(i32 0)
-; CHECK-NEXT: call void @sink1(i32 0)
-; CHECK-NEXT: call void @sink1(i32 %[[PHI_A]])
-; CHECK-NEXT: %[[INNER_COND_US_US:.*]] = call i1 @cond()
-; CHECK-NEXT: br i1 %[[INNER_COND_US_US]], label %[[INNER_HEADER_US_US]], label %[[INNER_LOOPEXIT2_SPLIT_US_US:.*]], !llvm.loop !34
-;
-; CHECK: [[INNER_LOOPEXIT2_SPLIT_US_US]]:
-; CHECK-NEXT: br label %[[INNER_LOOPEXIT2_US]]
;
-; CHECK: exit.split.us:
+; CHECK: [[EXIT_SPLIT_US]]:
; CHECK-NEXT: br label %exit
inner.latch:
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch-loop-and-block-dispositions.ll b/llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch-loop-and-block-dispositions.ll
index e821dfcd0124c..a169aa47ea7d5 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch-loop-and-block-dispositions.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch-loop-and-block-dispositions.ll
@@ -11,43 +11,59 @@ define void @test_pr58564(i16 %a, i1 %c.1, ptr %dst) {
; CHECK-NEXT: [[TMP0:%.*]] = icmp ult i16 [[A:%.*]], -6
; CHECK-NEXT: br i1 [[TMP0]], label [[ENTRY_SPLIT_US:%.*]], label [[ENTRY_SPLIT:%.*]]
; CHECK: entry.split.us:
+; CHECK-NEXT: br i1 [[C_1:%.*]], label [[ENTRY_SPLIT_US_SPLIT_US:%.*]], label [[ENTRY_SPLIT_US_SPLIT:%.*]]
+; CHECK: entry.split.us.split.us:
+; CHECK-NEXT: br label [[LOOP_1_HEADER_US_US:%.*]]
+; CHECK: loop.1.header.us.us:
+; CHECK-NEXT: br label [[LOOP_1_HEADER_SPLIT_US_US_US:%.*]]
+; CHECK: loop.1.header.split.us.us.us:
+; CHECK-NEXT: br label [[LOOP_1_HEADER_SPLIT_US_SPLIT_US_SPLIT_US_SPLIT_US:%.*]]
+; CHECK: loop.1.header.split.us.split.us.split.us.split.us:
+; CHECK-NEXT: br label [[LOOP_1_HEADER_SPLIT_US_SPLIT_US_SPLIT_US:%.*]]
+; CHECK: entry.split.us.split:
; CHECK-NEXT: br label [[LOOP_1_HEADER_US:%.*]]
; CHECK: loop.1.header.us:
; CHECK-NEXT: br label [[LOOP_1_HEADER_SPLIT_US_US:%.*]]
-; CHECK: loop.4.header.us2:
+; CHECK: loop.4.header.us5:
; CHECK-NEXT: br label [[LOOP_5_US6:%.*]]
-; CHECK: loop.5.us3:
+; CHECK: loop.5.us6:
; CHECK-NEXT: [[IV_US7:%.*]] = phi i16 [ 0, [[LOOP_4_HEADER_US5:%.*]] ], [ [[IV_NEXT_US9:%.*]], [[LOOP_5_US6]] ]
; CHECK-NEXT: [[GEP_US8:%.*]] = getelementptr inbounds ptr, ptr [[DST:%.*]], i16 [[IV_US7]]
; CHECK-NEXT: store ptr null, ptr [[GEP_US8]], align 8
; CHECK-NEXT: [[IV_NEXT_US9]] = add nuw nsw i16 [[IV_US7]], 1
; CHECK-NEXT: [[EC_US10:%.*]] = icmp ne i16 [[IV_US7]], 10000
-; CHECK-NEXT: br i1 [[EC_US10]], label [[LOOP_5_US6]], label [[LOOP_4_LATCH_US8:%.*]], !llvm.loop [[LOOP0:![0-9]+]]
-; CHECK: loop.4.latch.us8:
+; CHECK-NEXT: br i1 [[EC_US10]], label [[LOOP_5_US6]], label [[LOOP_4_LATCH_US11:%.*]]
+; CHECK: loop.4.latch.us11:
; CHECK-NEXT: br label [[LOOP_1_LATCH_US:%.*]]
; CHECK: loop.1.latch.us:
-; CHECK-NEXT: br label [[LOOP_1_HEADER_US]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-NEXT: br label [[LOOP_1_HEADER_US]]
; CHECK: loop.4.header.preheader.us:
-; CHECK-NEXT: br i1 [[C_1:%.*]], label [[LOOP_4_HEADER_PREHEADER_SPLIT1_US_SPLIT_US:%.*]], label [[LOOP_4_HEADER_PREHEADER_SPLIT1_US9:%.*]]
+; CHECK-NEXT: br i1 false, label [[LOOP_4_HEADER_PREHEADER_SPLIT4_US_SPLIT_US:%.*]], label [[LOOP_4_HEADER_PREHEADER_SPLIT4_US15:%.*]]
; CHECK: loop.1.header.split.us.us:
; CHECK-NEXT: br label [[LOOP_1_HEADER_SPLIT_US_SPLIT_US14:%.*]]
-; CHECK: loop.2.header.us.us:
+; CHECK: loop.2.header.us.us12:
; CHECK-NEXT: br label [[LOOP_2_HEADER_SPLIT_US_US_US13:%.*]]
; CHECK: loop.2.latch.us.us:
-; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_1_HEADER_SPLIT_US_SPLIT_US14]], label [[LOOP_4_HEADER_PREHEADER_SPLIT_US_US:%.*]], !llvm.loop [[LOOP3:![0-9]+]]
-; CHECK: loop.2.header.split.us.us.us:
+; CHECK-NEXT: br i1 false, label [[LOOP_2_HEADER_US_US12:%.*]], label [[LOOP_4_HEADER_PREHEADER_SPLIT_US_US:%.*]]
+; CHECK: loop.2.header.split.us.us.us13:
+; CHECK-NEXT: br label [[LOOP_2_HEADER_SPLIT_US_SPLIT_US3_US:%.*]]
+; CHECK: loop.3.header.us.us1.us:
; CHECK-NEXT: br label [[LOOP_3_LATCH_US_US2_US:%.*]]
-; CHECK: loop.3.header.us.us.us:
+; CHECK: loop.3.latch.us.us2.us:
; CHECK-NEXT: br label [[LOOP_2_LATCH_SPLIT_US_US_US:%.*]]
-; CHECK: loop.3.latch.us.us.us:
-; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_3_LATCH_US_US2_US]], label [[LOOP_2_LATCH_SPLIT_US_US_US1:%.*]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: loop.2.latch.split.us.us.us:
+; CHECK-NEXT: br label [[LOOP_2_LATCH_US_US:%.*]]
+; CHECK: loop.2.header.split.us.split.us3.us:
; CHECK-NEXT: br label [[LOOP_3_HEADER_US_US1_US:%.*]]
; CHECK: loop.4.header.preheader.split.us.us:
-; CHECK-NEXT: br label [[LOOP_2_HEADER_US_US12:%.*]]
-; CHECK: loop.4.header.preheader.split1.us9:
+; CHECK-NEXT: br label [[LOOP_4_HEADER_PREHEADER_US:%.*]]
+; CHECK: loop.1.header.split.us.split.us14:
+; CHECK-NEXT: br label [[LOOP_2_HEADER_US_US12]]
+; CHECK: loop.4.header.preheader.split4.us15:
; CHECK-NEXT: br label [[LOOP_4_HEADER_US5]]
-; CHECK: loop.4.header.preheader.split1.us.split.us:
+; CHECK: loop.4.header.preheader.split4.us.split.us:
+; CHECK-NEXT: br label [[LOOP_4_HEADER_PREHEADER_SPLIT4_US:%.*]]
+; CHECK: loop.1.header.split.us.split.us.split.us:
; CHECK-NEXT: br label [[LOOP_1_HEADER_SPLIT_US_SPLIT_US:%.*]]
; CHECK: entry.split:
; CHECK-NEXT: br label [[LOOP_1_HEADER:%.*]]
@@ -55,20 +71,36 @@ define void @test_pr58564(i16 %a, i1 %c.1, ptr %dst) {
; CHECK-NEXT: [[TMP1:%.*]] = icmp ult i16 [[A]], -6
; CHECK-NEXT: br i1 [[TMP1]], label [[LOOP_1_HEADER_SPLIT_US:%.*]], label [[LOOP_1_HEADER_SPLIT:%.*]]
; CHECK: loop.1.header.split.us:
+; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_1_HEADER_SPLIT_US_SPLIT_US_SPLIT:%.*]], label [[LOOP_1_HEADER_SPLIT_US_SPLIT:%.*]]
+; CHECK: loop.1.header.split.us.split.us.split:
+; CHECK-NEXT: br label [[LOOP_1_HEADER_SPLIT_US_SPLIT_US]]
+; CHECK: loop.1.header.split.us.split.us:
+; CHECK-NEXT: br label [[LOOP_2_HEADER_US_US:%.*]]
+; CHECK: loop.2.header.us.us:
+; CHECK-NEXT: br label [[LOOP_2_HEADER_SPLIT_US_US_US:%.*]]
+; CHECK: loop.2.header.split.us.us.us:
+; CHECK-NEXT: br label [[LOOP_2_HEADER_SPLIT_US_SPLIT_US_SPLIT_US_SPLIT_US:%.*]]
+; CHECK: loop.2.header.split.us.split.us.split.us.split.us:
+; CHECK-NEXT: br label [[LOOP_2_HEADER_SPLIT_US_SPLIT_US_SPLIT_US:%.*]]
+; CHECK: loop.1.header.split.us.split:
; CHECK-NEXT: br label [[LOOP_2_HEADER_US:%.*]]
; CHECK: loop.2.header.us:
; CHECK-NEXT: br label [[LOOP_2_HEADER_SPLIT_US_US:%.*]]
; CHECK: loop.2.latch.us:
-; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_2_HEADER_US]], label [[LOOP_4_HEADER_PREHEADER_SPLIT_US:%.*]], !llvm.loop [[LOOP3]]
+; CHECK-NEXT: br i1 false, label [[LOOP_2_HEADER_US]], label [[LOOP_4_HEADER_PREHEADER_SPLIT_US:%.*]]
; CHECK: loop.2.header.split.us.us:
+; CHECK-NEXT: br label [[LOOP_2_HEADER_SPLIT_US_SPLIT_US3:%.*]]
+; CHECK: loop.3.header.us.us1:
; CHECK-NEXT: br label [[LOOP_3_LATCH_US_US2:%.*]]
-; CHECK: loop.3.header.us.us:
+; CHECK: loop.3.latch.us.us2:
; CHECK-NEXT: br label [[LOOP_2_LATCH_SPLIT_US_US:%.*]]
-; CHECK: loop.3.latch.us.us:
-; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_3_LATCH_US_US2]], label [[LOOP_2_LATCH_SPLIT_US_US1:%.*]], !llvm.loop [[LOOP4]]
; CHECK: loop.2.latch.split.us.us:
+; CHECK-NEXT: br label [[LOOP_2_LATCH_US:%.*]]
+; CHECK: loop.2.header.split.us.split.us3:
; CHECK-NEXT: br label [[LOOP_3_HEADER_US_US1:%.*]]
; CHECK: loop.4.header.preheader.split.us:
+; CHECK-NEXT: br label [[LOOP_4_HEADER_PREHEADER:%.*]]
+; CHECK: loop.2.header.split.us.split.us.split.us:
; CHECK-NEXT: br label [[LOOP_2_HEADER_SPLIT_US_SPLIT_US:%.*]]
; CHECK: loop.1.header.split:
; CHECK-NEXT: br label [[LOOP_2_HEADER:%.*]]
@@ -76,11 +108,21 @@ define void @test_pr58564(i16 %a, i1 %c.1, ptr %dst) {
; CHECK-NEXT: [[TMP2:%.*]] = icmp ult i16 [[A]], -6
; CHECK-NEXT: br i1 [[TMP2]], label [[LOOP_2_HEADER_SPLIT_US:%.*]], label [[LOOP_2_HEADER_SPLIT:%.*]]
; CHECK: loop.2.header.split.us:
+; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_2_HEADER_SPLIT_US_SPLIT_US_SPLIT:%.*]], label [[LOOP_2_HEADER_SPLIT_US_SPLIT:%.*]]
+; CHECK: loop.2.header.split.us.split.us.split:
+; CHECK-NEXT: br label [[LOOP_2_HEADER_SPLIT_US_SPLIT_US]]
+; CHECK: loop.2.header.split.us.split.us:
+; CHECK-NEXT: br label [[LOOP_3_HEADER_US_US:%.*]]
+; CHECK: loop.3.header.us.us:
+; CHECK-NEXT: br label [[LOOP_3_LATCH_US_US:%.*]]
+; CHECK: loop.3.latch.us.us:
+; CHECK-NEXT: br label [[LOOP_3_HEADER_US_US]]
+; CHECK: loop.2.header.split.us.split:
; CHECK-NEXT: br label [[LOOP_3_HEADER_US:%.*]]
; CHECK: loop.3.header.us:
; CHECK-NEXT: br label [[LOOP_3_LATCH_US:%.*]]
; CHECK: loop.3.latch.us:
-; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_3_HEADER_US]], label [[LOOP_2_LATCH_SPLIT_US:%.*]], !llvm.loop [[LOOP4]]
+; CHECK-NEXT: br label [[LOOP_2_LATCH_SPLIT_US:%.*]]
; CHECK: loop.2.latch.split.us:
; CHECK-NEXT: br label [[LOOP_2_LATCH:%.*]]
; CHECK: loop.2.header.split:
@@ -92,18 +134,18 @@ define void @test_pr58564(i16 %a, i1 %c.1, ptr %dst) {
; CHECK-NEXT: call void @clobber()
; CHECK-NEXT: br label [[LOOP_3_LATCH]]
; CHECK: loop.3.latch:
-; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_3_HEADER]], label [[LOOP_2_LATCH_SPLIT:%.*]], !llvm.loop [[LOOP5:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_3_HEADER]], label [[LOOP_2_LATCH_SPLIT:%.*]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: loop.2.latch.split:
; CHECK-NEXT: br label [[LOOP_2_LATCH]]
; CHECK: loop.2.latch:
-; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_2_HEADER]], label [[LOOP_4_HEADER_PREHEADER_SPLIT:%.*]], !llvm.loop [[LOOP7:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_2_HEADER]], label [[LOOP_4_HEADER_PREHEADER_SPLIT:%.*]], !llvm.loop [[LOOP2:![0-9]+]]
; CHECK: loop.4.header.preheader.split:
-; CHECK-NEXT: br label [[LOOP_2_HEADER_SPLIT_US_SPLIT_US]]
+; CHECK-NEXT: br label [[LOOP_4_HEADER_PREHEADER]]
; CHECK: loop.4.header.preheader:
; CHECK-NEXT: br i1 [[C_1]], label [[LOOP_4_HEADER_PREHEADER_SPLIT4_US_SPLIT:%.*]], label [[LOOP_4_HEADER_PREHEADER_SPLIT4:%.*]]
-; CHECK: loop.4.header.preheader.split1.us.split:
-; CHECK-NEXT: br label [[LOOP_1_HEADER_SPLIT_US_SPLIT_US]]
-; CHECK: loop.4.header.preheader.split1.us:
+; CHECK: loop.4.header.preheader.split4.us.split:
+; CHECK-NEXT: br label [[LOOP_4_HEADER_PREHEADER_SPLIT4_US]]
+; CHECK: loop.4.header.preheader.split4.us:
; CHECK-NEXT: br label [[LOOP_4_HEADER_US:%.*]]
; CHECK: loop.4.header.us:
; CHECK-NEXT: br label [[LOOP_5_US:%.*]]
@@ -116,7 +158,7 @@ define void @test_pr58564(i16 %a, i1 %c.1, ptr %dst) {
; CHECK-NEXT: br i1 [[EC_US]], label [[LOOP_5_US]], label [[LOOP_4_LATCH_US:%.*]]
; CHECK: loop.4.latch.us:
; CHECK-NEXT: br label [[LOOP_4_HEADER_US]]
-; CHECK: loop.4.header.preheader.split1:
+; CHECK: loop.4.header.preheader.split4:
; CHECK-NEXT: br label [[LOOP_4_HEADER:%.*]]
; CHECK: loop.4.header:
; CHECK-NEXT: br label [[LOOP_5:%.*]]
@@ -126,11 +168,11 @@ define void @test_pr58564(i16 %a, i1 %c.1, ptr %dst) {
; CHECK-NEXT: store ptr null, ptr [[GEP]], align 8
; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i16 [[IV]], 1
; CHECK-NEXT: [[EC:%.*]] = icmp ne i16 [[IV]], 10000
-; CHECK-NEXT: br i1 [[EC]], label [[LOOP_5]], label [[LOOP_4_LATCH:%.*]], !llvm.loop [[LOOP0]]
+; CHECK-NEXT: br i1 [[EC]], label [[LOOP_5]], label [[LOOP_4_LATCH:%.*]]
; CHECK: loop.4.latch:
; CHECK-NEXT: br label [[LOOP_1_LATCH:%.*]]
; CHECK: loop.1.latch:
-; CHECK-NEXT: br label [[LOOP_1_HEADER]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK-NEXT: br label [[LOOP_1_HEADER]], !llvm.loop [[LOOP3:![0-9]+]]
;
entry:
br label %loop.1.header
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll b/llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll
index 108b2406920f2..1d8942079ffd8 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll
@@ -19,7 +19,7 @@ define i32 @partial_unswitch_true_successor(ptr %ptr, i32 %N) {
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -37,7 +37,7 @@ define i32 @partial_unswitch_true_successor(ptr %ptr, i32 %N) {
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:
@@ -84,7 +84,7 @@ define i32 @partial_unswitch_false_successor(ptr %ptr, i32 %N) {
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -102,7 +102,7 @@ define i32 @partial_unswitch_false_successor(ptr %ptr, i32 %N) {
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP5:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP2:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:
@@ -151,7 +151,7 @@ define i32 @partial_unswtich_gep_load_icmp(ptr %ptr, i32 %N) {
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -171,7 +171,7 @@ define i32 @partial_unswtich_gep_load_icmp(ptr %ptr, i32 %N) {
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP7:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP3:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:
@@ -223,7 +223,7 @@ define i32 @partial_unswitch_reduction_phi(ptr %ptr, i32 %N) {
; CHECK-NEXT: [[RED_NEXT_US]] = phi i32 [ [[ADD_10_US]], [[NOCLOBBER_US]] ]
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: [[RED_NEXT_LCSSA_US:%.*]] = phi i32 [ [[RED_NEXT_US]], [[LOOP_LATCH_US]] ]
; CHECK-NEXT: br label [[EXIT:%.*]]
@@ -246,7 +246,7 @@ define i32 @partial_unswitch_reduction_phi(ptr %ptr, i32 %N) {
; CHECK-NEXT: [[RED_NEXT]] = phi i32 [ [[ADD_5]], [[CLOBBER]] ], [ [[ADD_10]], [[NOCLOBBER]] ]
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP9:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: [[RED_NEXT_LCSSA:%.*]] = phi i32 [ [[RED_NEXT]], [[LOOP_LATCH]] ]
; CHECK-NEXT: br label [[EXIT]]
@@ -305,7 +305,7 @@ define i32 @partial_unswitch_true_successor_noclobber(ptr noalias %ptr.1, ptr no
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -325,7 +325,7 @@ define i32 @partial_unswitch_true_successor_noclobber(ptr noalias %ptr.1, ptr no
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP11:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP5:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:
@@ -619,7 +619,7 @@ define i32 @partial_unswitch_true_successor_preheader_insertion(ptr %ptr, i32 %N
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_LOOPEXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP12:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_LOOPEXIT_SPLIT_US:%.*]]
; CHECK: exit.loopexit.split.us:
; CHECK-NEXT: br label [[EXIT_LOOPEXIT:%.*]]
; CHECK: loop.ph.split:
@@ -637,7 +637,7 @@ define i32 @partial_unswitch_true_successor_preheader_insertion(ptr %ptr, i32 %N
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_LOOPEXIT_SPLIT:%.*]], !llvm.loop [[LOOP13:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_LOOPEXIT_SPLIT:%.*]], !llvm.loop [[LOOP6:![0-9]+]]
; CHECK: exit.loopexit.split:
; CHECK-NEXT: br label [[EXIT_LOOPEXIT]]
; CHECK: exit.loopexit:
@@ -695,7 +695,7 @@ define i32 @partial_unswitch_true_successor_insert_point(ptr %ptr, i32 %N) {
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP14:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -713,7 +713,7 @@ define i32 @partial_unswitch_true_successor_insert_point(ptr %ptr, i32 %N) {
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP15:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP7:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:
@@ -765,7 +765,7 @@ define i32 @partial_unswitch_true_successor_hoist_invariant(ptr %ptr, i32 %N) {
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP16:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -784,7 +784,7 @@ define i32 @partial_unswitch_true_successor_hoist_invariant(ptr %ptr, i32 %N) {
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP17:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP8:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:
@@ -1057,7 +1057,7 @@ define i32 @partial_unswitch_true_to_latch(ptr %ptr, i32 %N) {
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP18:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -1073,7 +1073,7 @@ define i32 @partial_unswitch_true_to_latch(ptr %ptr, i32 %N) {
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP19:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP9:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:
@@ -1112,11 +1112,19 @@ define i32 @partial_unswitch_exiting_block_with_multiple_unswitch_candidates(i32
; CHECK-NEXT: [[TMP3:%.*]] = icmp ult i32 [[TMP2]], 41
; CHECK-NEXT: br i1 [[TMP3]], label [[ENTRY_SPLIT:%.*]], label [[ENTRY_SPLIT_US:%.*]]
; CHECK: entry.split.us:
+; CHECK-NEXT: br i1 [[EXIT_COND]], label [[ENTRY_SPLIT_US_SPLIT_US:%.*]], label [[ENTRY_SPLIT_US_SPLIT:%.*]]
+; CHECK: entry.split.us.split.us:
+; CHECK-NEXT: br label [[LOOP_US_US:%.*]]
+; CHECK: loop.us.us:
+; CHECK-NEXT: br label [[EXITING_US_US:%.*]]
+; CHECK: exiting.us.us:
+; CHECK-NEXT: br label [[LOOP_US_US]]
+; CHECK: entry.split.us.split:
; CHECK-NEXT: br label [[LOOP_US:%.*]]
; CHECK: loop.us:
; CHECK-NEXT: br label [[EXITING_US:%.*]]
; CHECK: exiting.us:
-; CHECK-NEXT: br i1 [[EXIT_COND]], label [[LOOP_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP20:![0-9]+]]
+; CHECK-NEXT: br label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: [[RET_VAL_US:%.*]] = phi i32 [ 1, [[EXITING_US]] ]
; CHECK-NEXT: br label [[EXIT:%.*]]
@@ -1130,7 +1138,7 @@ define i32 @partial_unswitch_exiting_block_with_multiple_unswitch_candidates(i32
; CHECK-NEXT: store i32 [[TMP1:%.*]], ptr [[PTR]], align 16
; CHECK-NEXT: br label [[EXITING]]
; CHECK: exiting:
-; CHECK-NEXT: br i1 [[EXIT_COND]], label [[LOOP]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP21:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXIT_COND]], label [[LOOP]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP10:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: [[RET_VAL:%.*]] = phi i32 [ 1, [[EXITING]] ]
; CHECK-NEXT: br label [[EXIT]]
@@ -1177,7 +1185,7 @@ define i32 @partial_unswitch_true_successor_for_cost_calculation(ptr %ptr, i32 %
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP22:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -1241,7 +1249,7 @@ define i32 @partial_unswitch_true_successor_for_cost_calculation(ptr %ptr, i32 %
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP23:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP11:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:
@@ -1334,7 +1342,7 @@ define i32 @partial_unswitch_true_successor_trunc(ptr %ptr, i32 %N) {
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP24:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -1352,7 +1360,7 @@ define i32 @partial_unswitch_true_successor_trunc(ptr %ptr, i32 %N) {
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP25:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP12:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:
@@ -1399,7 +1407,7 @@ define i32 @partial_unswitch_false_successor_trunc(ptr %ptr, i32 %N) {
; CHECK: loop.latch.us:
; CHECK-NEXT: [[C_US:%.*]] = icmp ult i32 [[IV_US]], [[N:%.*]]
; CHECK-NEXT: [[IV_NEXT_US]] = add i32 [[IV_US]], 1
-; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]], !llvm.loop [[LOOP26:![0-9]+]]
+; CHECK-NEXT: br i1 [[C_US]], label [[LOOP_HEADER_US]], label [[EXIT_SPLIT_US:%.*]]
; CHECK: exit.split.us:
; CHECK-NEXT: br label [[EXIT:%.*]]
; CHECK: entry.split:
@@ -1417,7 +1425,7 @@ define i32 @partial_unswitch_false_successor_trunc(ptr %ptr, i32 %N) {
; CHECK: loop.latch:
; CHECK-NEXT: [[C:%.*]] = icmp ult i32 [[IV]], [[N]]
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
-; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP27:![0-9]+]]
+; CHECK-NEXT: br i1 [[C]], label [[LOOP_HEADER]], label [[EXIT_SPLIT:%.*]], !llvm.loop [[LOOP13:![0-9]+]]
; CHECK: exit.split:
; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:
@@ -1448,15 +1456,15 @@ exit:
ret i32 10
}
-; CHECK: [[LOOP2]] = distinct !{[[LOOP2]], [[UNSWITCH_PARTIAL_DISABLE:![0-9]+]]}
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[UNSWITCH_PARTIAL_DISABLE:![0-9]+]]}
; CHECK: [[UNSWITCH_PARTIAL_DISABLE]] = !{!"llvm.loop.unswitch.partial.disable"}
+; CHECK: [[LOOP2]] = distinct !{[[LOOP2]], [[UNSWITCH_PARTIAL_DISABLE]]}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[UNSWITCH_PARTIAL_DISABLE]]}
+; CHECK: [[LOOP4]] = distinct !{[[LOOP4]], [[UNSWITCH_PARTIAL_DISABLE]]}
; CHECK: [[LOOP5]] = distinct !{[[LOOP5]], [[UNSWITCH_PARTIAL_DISABLE]]}
+; CHECK: [[LOOP6]] = distinct !{[[LOOP6]], [[UNSWITCH_PARTIAL_DISABLE]]}
; CHECK: [[LOOP7]] = distinct !{[[LOOP7]], [[UNSWITCH_PARTIAL_DISABLE]]}
+; CHECK: [[LOOP8]] = distinct !{[[LOOP8]], [[UNSWITCH_PARTIAL_DISABLE]]}
; CHECK: [[LOOP9]] = distinct !{[[LOOP9]], [[UNSWITCH_PARTIAL_DISABLE]]}
+; CHECK: [[LOOP10]] = distinct !{[[LOOP10]], [[UNSWITCH_PARTIAL_DISABLE]]}
; CHECK: [[LOOP11]] = distinct !{[[LOOP11]], [[UNSWITCH_PARTIAL_DISABLE]]}
-; CHECK: [[LOOP13]] = distinct !{[[LOOP13]], [[UNSWITCH_PARTIAL_DISABLE]]}
-; CHECK: [[LOOP15]] = distinct !{[[LOOP15]], [[UNSWITCH_PARTIAL_DISABLE]]}
-; CHECK: [[LOOP17]] = distinct !{[[LOOP17]], [[UNSWITCH_PARTIAL_DISABLE]]}
-; CHECK: [[LOOP19]] = distinct !{[[LOOP19]], [[UNSWITCH_PARTIAL_DISABLE]]}
-; CHECK: [[LOOP21]] = distinct !{[[LOOP21]], [[UNSWITCH_PARTIAL_DISABLE]]}
-; CHECK: [[LOOP23]] = distinct !{[[LOOP23]], [[UNSWITCH_PARTIAL_DISABLE]]}
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll b/llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll
deleted file mode 100644
index e24d17f088427..0000000000000
--- a/llvm/test/Transforms/SimpleLoopUnswitch/pr138509.ll
+++ /dev/null
@@ -1,49 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
-; RUN: opt -S -passes="loop-mssa(loop-simplifycfg,licm,loop-rotate,simple-loop-unswitch<nontrivial>)" < %s | FileCheck %s
-
-@a = global i32 0, align 4
-@b = global i32 0, align 4
-@c = global i32 0, align 4
-@d = global i32 0, align 4
-
-define i32 @main() {
-entry:
- br label %outer.loop.header
-
-outer.loop.header: ; preds = %outer.loop.latch, %entry
- br i1 false, label %exit, label %outer.loop.body
-
-outer.loop.body: ; preds = %inner.loop.header, %outer.loop.header
- store i32 1, ptr @c, align 4
- %cmp = icmp sgt i32 0, -1
- br i1 %cmp, label %outer.loop.latch, label %exit
-
-inner.loop.header: ; preds = %outer.loop.latch, %inner.loop.body
- %a_val = load i32, ptr @a, align 4
- %c_val = load i32, ptr @c, align 4
- %mul = mul nsw i32 %c_val, %a_val
- store i32 %mul, ptr @b, align 4
- %cmp2 = icmp sgt i32 %mul, -1
- br i1 %cmp2, label %inner.loop.body, label %outer.loop.body
-
-inner.loop.body: ; preds = %inner.loop.header
- %mul2 = mul nsw i32 %c_val, 3
- store i32 %mul2, ptr @c, align 4
- store i32 %c_val, ptr @d, align 4
- %mul3 = mul nsw i32 %c_val, %a_val
- %cmp3 = icmp sgt i32 %mul3, -1
- br i1 %cmp3, label %inner.loop.header, label %exit
-
-outer.loop.latch: ; preds = %outer.loop.body
- %d_val = load i32, ptr @d, align 4
- store i32 %d_val, ptr @b, align 4
- %cmp4 = icmp eq i32 %d_val, 0
- br i1 %cmp4, label %inner.loop.header, label %outer.loop.header
-
-exit: ; preds = %inner.loop.body, %outer.loop.body, %outer.loop.header
- ret i32 0
-}
-
-; CHECK: [[LOOP0:.*]] = distinct !{[[LOOP0]], [[META1:![0-9]+]]}
-; CHECK: [[META1]] = !{!"llvm.loop.unswitch.nontrivial.disable"}
-; CHECK: [[LOOP2:.*]] = distinct !{[[LOOP2]], [[META1]]}
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/update-scev-3.ll b/llvm/test/Transforms/SimpleLoopUnswitch/update-scev-3.ll
index 4e428cbc30bb6..ef00d7ea8f2bb 100644
--- a/llvm/test/Transforms/SimpleLoopUnswitch/update-scev-3.ll
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/update-scev-3.ll
@@ -19,42 +19,56 @@ define i32 @foo(i1 %not) {
; CHECK-NEXT: [[FALSE:%.*]] = and i1 true, false
; CHECK-NEXT: br i1 [[NOT]], label [[ENTRY_SPLIT_US:%.*]], label [[ENTRY_SPLIT:%.*]]
; CHECK: entry.split.us:
+; CHECK-NEXT: br i1 [[FALSE]], label [[ENTRY_SPLIT_US_SPLIT_US:%.*]], label [[ENTRY_SPLIT_US_SPLIT:%.*]]
+; CHECK: entry.split.us.split.us:
+; CHECK-NEXT: br label [[FOR_COND_US_US:%.*]]
+; CHECK: for.cond.us.us:
+; CHECK-NEXT: br label [[FOR_COND_SPLIT_US_US_US:%.*]]
+; CHECK: for.cond.split.us.us.us:
+; CHECK-NEXT: br label [[FOR_COND_SPLIT_US_SPLIT_US_SPLIT_US_SPLIT_US:%.*]]
+; CHECK: for.cond.split.us.split.us.split.us.split.us:
+; CHECK-NEXT: br label [[FOR_COND_SPLIT_US_SPLIT_US_SPLIT_US:%.*]]
+; CHECK: entry.split.us.split:
; CHECK-NEXT: br label [[FOR_COND_US:%.*]]
; CHECK: for.cond.us:
; CHECK-NEXT: br label [[FOR_COND_SPLIT_US_US:%.*]]
; CHECK: for.inc11.us:
-; CHECK-NEXT: br label [[FOR_COND_US]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: br label [[FOR_COND_US]]
; CHECK: for.cond.split.us.us:
-; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_US_US:%.*]]
-; CHECK: for.cond5.preheader.us.us:
-; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT_US_US_US:%.*]]
+; CHECK-NEXT: br label [[FOR_COND_SPLIT_US_SPLIT_US11:%.*]]
+; CHECK: for.cond5.preheader.us.us9:
+; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT_US_US_US10:%.*]]
; CHECK: for.inc8.us.us:
-; CHECK-NEXT: br i1 [[FALSE]], label [[FOR_INC8_FOR_COND5_PREHEADER_CRIT_EDGE_US_US:%.*]], label [[FOR_INC11_SPLIT_US_US:%.*]]
+; CHECK-NEXT: br i1 false, label [[FOR_INC8_FOR_COND5_PREHEADER_CRIT_EDGE_US_US:%.*]], label [[FOR_INC11_SPLIT_US_US:%.*]]
; CHECK: for.inc8.for.cond5.preheader_crit_edge.us.us:
-; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_US_US]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_US_US9:%.*]]
; CHECK: for.end.us.us:
-; CHECK-NEXT: br i1 [[FALSE]], label [[FOR_INC8_US_US:%.*]], label [[CLEANUP15_SPLIT_US_SPLIT_US:%.*]]
-; CHECK: for.cond5.preheader.split.us.us.us:
-; CHECK-NEXT: br label [[FOR_BODY7_US_US_US:%.*]]
-; CHECK: for.body7.us.us.us:
-; CHECK-NEXT: br label [[HANDLER_POINTER_OVERFLOW_US_US_US:%.*]]
-; CHECK: handler.pointer_overflow.us.us.us:
-; CHECK-NEXT: br label [[CONT_US_US_US:%.*]]
-; CHECK: cont.us.us.us:
-; CHECK-NEXT: br i1 [[FALSE]], label [[CONT_FOR_BODY7_CRIT_EDGE_US_US_US:%.*]], label [[FOR_END_SPLIT_US_US_US:%.*]]
-; CHECK: cont.for.body7_crit_edge.us.us.us:
-; CHECK-NEXT: br label [[FOR_BODY7_US_US_US]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-NEXT: br i1 false, label [[FOR_INC8_US_US:%.*]], label [[CLEANUP15_SPLIT_US_SPLIT_US:%.*]]
+; CHECK: for.cond5.preheader.split.us.us.us10:
+; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT_US_SPLIT_US7_US:%.*]]
+; CHECK: for.body7.us.us4.us:
+; CHECK-NEXT: br label [[HANDLER_POINTER_OVERFLOW_US_US5_US:%.*]]
+; CHECK: handler.pointer_overflow.us.us5.us:
+; CHECK-NEXT: br label [[CONT_US_US6_US:%.*]]
+; CHECK: cont.us.us6.us:
+; CHECK-NEXT: br label [[FOR_END_SPLIT_US_US_US:%.*]]
; CHECK: for.end.split.us.us.us:
; CHECK-NEXT: br label [[FOR_END_US_US:%.*]]
+; CHECK: for.cond5.preheader.split.us.split.us7.us:
+; CHECK-NEXT: br label [[FOR_BODY7_US_US4_US:%.*]]
; CHECK: for.inc11.split.us.us:
; CHECK-NEXT: br label [[FOR_INC11_US:%.*]]
+; CHECK: for.cond.split.us.split.us11:
+; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_US_US9]]
+; CHECK: for.cond.split.us.split.us.split.us:
+; CHECK-NEXT: br label [[FOR_COND_SPLIT_US_SPLIT_US:%.*]]
; CHECK: cleanup15.split.us.split.us:
; CHECK-NEXT: br label [[CLEANUP15_SPLIT_US:%.*]]
; CHECK: entry.split:
; CHECK-NEXT: br i1 [[FALSE]], label [[ENTRY_SPLIT_SPLIT_US:%.*]], label [[ENTRY_SPLIT_SPLIT:%.*]]
; CHECK: entry.split.split.us:
-; CHECK-NEXT: br label [[FOR_COND_US5:%.*]]
-; CHECK: for.cond.us5:
+; CHECK-NEXT: br label [[FOR_COND_US12:%.*]]
+; CHECK: for.cond.us12:
; CHECK-NEXT: br label [[FOR_COND_SPLIT_US:%.*]]
; CHECK: for.cond.split.us:
; CHECK-NEXT: br label [[FOR_COND_SPLIT_SPLIT_US_SPLIT_US:%.*]]
@@ -64,13 +78,23 @@ define i32 @foo(i1 %not) {
; CHECK-NEXT: br label [[FOR_COND:%.*]]
; CHECK: for.cond:
; CHECK-NEXT: br label [[FOR_COND_SPLIT:%.*]]
+; CHECK: for.cond.split.us.split.us:
+; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_US_US:%.*]]
+; CHECK: for.cond5.preheader.us.us:
+; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT_US_US_US:%.*]]
+; CHECK: for.cond5.preheader.split.us.us.us:
+; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT_US_SPLIT_US_SPLIT_US_SPLIT_US:%.*]]
+; CHECK: for.cond5.preheader.split.us.split.us.split.us.split.us:
+; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT_US_SPLIT_US_SPLIT_US:%.*]]
; CHECK: cleanup15.split.us:
; CHECK-NEXT: br label [[CLEANUP15:%.*]]
+; CHECK: for.cond5.preheader.split.us.split.us.split.us:
+; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT_US_SPLIT_US:%.*]]
; CHECK: for.cond.split:
; CHECK-NEXT: br label [[FOR_COND_SPLIT_SPLIT:%.*]]
; CHECK: for.cond.split.split.us:
-; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_US4:%.*]]
-; CHECK: for.cond5.preheader.us4:
+; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_US8:%.*]]
+; CHECK: for.cond5.preheader.us8:
; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT_US:%.*]]
; CHECK: for.cond5.preheader.split.us:
; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT_SPLIT_US_SPLIT_US:%.*]]
@@ -80,6 +104,16 @@ define i32 @foo(i1 %not) {
; CHECK-NEXT: br label [[FOR_COND5_PREHEADER:%.*]]
; CHECK: for.cond5.preheader:
; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT:%.*]]
+; CHECK: for.cond5.preheader.split.us.split.us:
+; CHECK-NEXT: br label [[FOR_BODY7_US_US:%.*]]
+; CHECK: for.body7.us.us:
+; CHECK-NEXT: br label [[HANDLER_POINTER_OVERFLOW_US_US:%.*]]
+; CHECK: handler.pointer_overflow.us.us:
+; CHECK-NEXT: br label [[CONT_US_US:%.*]]
+; CHECK: cont.us.us:
+; CHECK-NEXT: br label [[CONT_FOR_BODY7_CRIT_EDGE_US_US:%.*]]
+; CHECK: cont.for.body7_crit_edge.us.us:
+; CHECK-NEXT: br label [[FOR_BODY7_US_US]]
; CHECK: for.cond5.preheader.split:
; CHECK-NEXT: br label [[FOR_COND5_PREHEADER_SPLIT_SPLIT:%.*]]
; CHECK: for.cond5.preheader.split.split.us:
>From c67d27dad02ab7debfff6c7f7fc3ea8abf064b6a Mon Sep 17 00:00:00 2001
From: Jeremy Kun <jkun at google.com>
Date: Mon, 18 Aug 2025 08:47:47 -0700
Subject: [PATCH 047/112] [mlir][Presburger] NFC: return var index from
IntegerRelation::addLocalFloorDiv (#153463)
addLocalFloorDiv currently returns void, so the caller has to know the
index at which the newly added local variable ends up. This commit
returns the index of the newly added variable so that callers need not
depend on this implementation detail.
I found one relevant call site demonstrating this and updated it. I am
using this API out of tree and wanted to make our out-of-tree code
more resilient to upstream changes.
---
.../include/mlir/Analysis/Presburger/IntegerRelation.h | 10 ++++++----
mlir/lib/Analysis/FlatLinearValueConstraints.cpp | 2 +-
mlir/lib/Analysis/Presburger/IntegerRelation.cpp | 8 +++++---
mlir/lib/Analysis/Presburger/Simplex.cpp | 2 +-
mlir/lib/Dialect/Affine/Analysis/AffineStructures.cpp | 4 ++--
5 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/mlir/include/mlir/Analysis/Presburger/IntegerRelation.h b/mlir/include/mlir/Analysis/Presburger/IntegerRelation.h
index 335a2dddc7561..e6d2f8dcca7d5 100644
--- a/mlir/include/mlir/Analysis/Presburger/IntegerRelation.h
+++ b/mlir/include/mlir/Analysis/Presburger/IntegerRelation.h
@@ -479,10 +479,12 @@ class IntegerRelation {
/// respect to a positive constant `divisor`. Two constraints are added to the
/// system to capture equivalence with the floordiv:
/// q = dividend floordiv c <=> c*q <= dividend <= c*q + c - 1.
- void addLocalFloorDiv(ArrayRef<DynamicAPInt> dividend,
- const DynamicAPInt &divisor);
- void addLocalFloorDiv(ArrayRef<int64_t> dividend, int64_t divisor) {
- addLocalFloorDiv(getDynamicAPIntVec(dividend), DynamicAPInt(divisor));
+ /// Returns the column position of the new local variable.
+ unsigned addLocalFloorDiv(ArrayRef<DynamicAPInt> dividend,
+ const DynamicAPInt &divisor);
+ unsigned addLocalFloorDiv(ArrayRef<int64_t> dividend, int64_t divisor) {
+ return addLocalFloorDiv(getDynamicAPIntVec(dividend),
+ DynamicAPInt(divisor));
}
/// Adds a new local variable as the modulus of an affine function of other
diff --git a/mlir/lib/Analysis/FlatLinearValueConstraints.cpp b/mlir/lib/Analysis/FlatLinearValueConstraints.cpp
index f4b02b496a5c5..30ce1fb320017 100644
--- a/mlir/lib/Analysis/FlatLinearValueConstraints.cpp
+++ b/mlir/lib/Analysis/FlatLinearValueConstraints.cpp
@@ -60,7 +60,7 @@ struct AffineExprFlattener : public SimpleAffineExprFlattener {
AffineExpr localExpr) override {
SimpleAffineExprFlattener::addLocalFloorDivId(dividend, divisor, localExpr);
// Update localVarCst.
- localVarCst.addLocalFloorDiv(dividend, divisor);
+ (void)localVarCst.addLocalFloorDiv(dividend, divisor);
}
LogicalResult addLocalIdSemiAffine(ArrayRef<int64_t> lhs,
diff --git a/mlir/lib/Analysis/Presburger/IntegerRelation.cpp b/mlir/lib/Analysis/Presburger/IntegerRelation.cpp
index 1d1e4ded19db1..0dcdd5bb97bc8 100644
--- a/mlir/lib/Analysis/Presburger/IntegerRelation.cpp
+++ b/mlir/lib/Analysis/Presburger/IntegerRelation.cpp
@@ -1500,12 +1500,13 @@ void IntegerRelation::addBound(BoundType type, ArrayRef<DynamicAPInt> expr,
/// respect to a positive constant 'divisor'. Two constraints are added to the
/// system to capture equivalence with the floordiv.
/// q = expr floordiv c <=> c*q <= expr <= c*q + c - 1.
-void IntegerRelation::addLocalFloorDiv(ArrayRef<DynamicAPInt> dividend,
- const DynamicAPInt &divisor) {
+/// Returns the column position of the new local variable.
+unsigned IntegerRelation::addLocalFloorDiv(ArrayRef<DynamicAPInt> dividend,
+ const DynamicAPInt &divisor) {
assert(dividend.size() == getNumCols() && "incorrect dividend size");
assert(divisor > 0 && "positive divisor expected");
- appendVar(VarKind::Local);
+ unsigned newVar = appendVar(VarKind::Local);
SmallVector<DynamicAPInt, 8> dividendCopy(dividend);
dividendCopy.insert(dividendCopy.end() - 1, DynamicAPInt(0));
@@ -1513,6 +1514,7 @@ void IntegerRelation::addLocalFloorDiv(ArrayRef<DynamicAPInt> dividend,
getDivLowerBound(dividendCopy, divisor, dividendCopy.size() - 2));
addInequality(
getDivUpperBound(dividendCopy, divisor, dividendCopy.size() - 2));
+ return newVar;
}
unsigned IntegerRelation::addLocalModulo(ArrayRef<DynamicAPInt> exprs,
diff --git a/mlir/lib/Analysis/Presburger/Simplex.cpp b/mlir/lib/Analysis/Presburger/Simplex.cpp
index 08290db55f2c7..51e2007db45e6 100644
--- a/mlir/lib/Analysis/Presburger/Simplex.cpp
+++ b/mlir/lib/Analysis/Presburger/Simplex.cpp
@@ -433,7 +433,7 @@ LogicalResult SymbolicLexSimplex::addSymbolicCut(unsigned row) {
normalizeDiv(divCoeffs, divDenom);
domainSimplex.addDivisionVariable(divCoeffs, divDenom);
- domainPoly.addLocalFloorDiv(divCoeffs, divDenom);
+ (void)domainPoly.addLocalFloorDiv(divCoeffs, divDenom);
// Update `this` to account for the additional symbol we just added.
appendSymbol();
diff --git a/mlir/lib/Dialect/Affine/Analysis/AffineStructures.cpp b/mlir/lib/Dialect/Affine/Analysis/AffineStructures.cpp
index 86edc2bcc2761..b405ec2201bf8 100644
--- a/mlir/lib/Dialect/Affine/Analysis/AffineStructures.cpp
+++ b/mlir/lib/Dialect/Affine/Analysis/AffineStructures.cpp
@@ -93,13 +93,13 @@ FlatAffineValueConstraints::addAffineForOpDomain(AffineForOp forOp) {
int64_t lb = forOp.getConstantLowerBound();
dividend[pos] = 1;
dividend.back() -= lb;
- addLocalFloorDiv(dividend, step);
+ unsigned qPos = addLocalFloorDiv(dividend, step);
// Second constraint: (iv - lb) - step * q = 0.
SmallVector<int64_t, 8> eq(getNumCols(), 0);
eq[pos] = 1;
eq.back() -= lb;
// For the local var just added above.
- eq[getNumCols() - 2] = -step;
+ eq[qPos] = -step;
addEquality(eq);
}
}
>From 3ecfc0330d93a6c3a3f3d3e427390b01cb52a88d Mon Sep 17 00:00:00 2001
From: Yitzhak Mandelbaum <ymand at users.noreply.github.com>
Date: Mon, 18 Aug 2025 11:55:12 -0400
Subject: [PATCH 048/112] [clang][dataflow] Add support for serialization and
deserialization. (#152487)
Adds support for compact serialization of Formulas, and a corresponding
parse function. Extends Environment and AnalysisContext with necessary
functions for serializing and deserializing all formula-related parts of
the environment.
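For readers unfamiliar with the compact format, the following is a rough
standalone sketch of prefix-notation serialization and parsing. It mirrors
the sigils used in this patch (`T`, `F`, `V<n>`, `!`, `&`, `|`, `>`, `=`) but
is not the clang implementation: `Node`, `serialize`, `parsePrefix`, and
`parse` are hypothetical names, there is no Arena or AtomMap, and failure is
reported as a null pointer rather than an `llvm::Error`.

```cpp
#include <cassert>
#include <cctype>
#include <memory>
#include <string>
#include <string_view>

// Hypothetical toy formula node; `kind` doubles as the serialized sigil.
struct Node {
  char kind = 'T'; // 'V' atom, 'T'/'F' literal, '!' unary, "&|>=" binary.
  unsigned atom = 0;
  std::unique_ptr<Node> lhs, rhs;
};

// Writes `n` in prefix form. No parentheses are needed because every
// operator has a fixed arity.
void serialize(const Node &n, std::string &out) {
  out += n.kind;
  if (n.kind == 'V')
    out += std::to_string(n.atom);
  if (n.lhs)
    serialize(*n.lhs, out);
  if (n.rhs)
    serialize(*n.rhs, out);
}

// Parses one formula from the front of `s`, consuming what it reads.
// Returns nullptr on malformed input.
std::unique_ptr<Node> parsePrefix(std::string_view &s) {
  if (s.empty())
    return nullptr;
  auto n = std::make_unique<Node>();
  n->kind = s.front();
  s.remove_prefix(1);
  switch (n->kind) {
  case 'T':
  case 'F':
    return n;
  case 'V': {
    auto isDigit = [](char c) {
      return std::isdigit(static_cast<unsigned char>(c)) != 0;
    };
    if (s.empty() || !isDigit(s.front()))
      return nullptr; // Expected an atom id.
    while (!s.empty() && isDigit(s.front())) {
      n->atom = n->atom * 10 + static_cast<unsigned>(s.front() - '0');
      s.remove_prefix(1);
    }
    return n;
  }
  case '!':
    n->lhs = parsePrefix(s);
    if (!n->lhs)
      return nullptr;
    return n;
  case '&':
  case '|':
  case '>':
  case '=':
    n->lhs = parsePrefix(s);
    if (!n->lhs)
      return nullptr;
    n->rhs = parsePrefix(s);
    if (!n->rhs)
      return nullptr;
    return n;
  default:
    return nullptr; // Unexpected prefix character.
  }
}

// Parses `s` and additionally rejects any unconsumed suffix.
std::unique_ptr<Node> parse(std::string_view s) {
  auto n = parsePrefix(s);
  if (!n || !s.empty())
    return nullptr;
  return n;
}
```

In this sketch, `parse("=|V0V1&V0V1")` yields a tree that serializes back to
the same string, while inputs such as `"V0Hello"` or `"&V0"` are rejected.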
---
.../FlowSensitive/DataflowAnalysisContext.h | 34 +++
.../FlowSensitive/DataflowEnvironment.h | 14 +-
.../clang/Analysis/FlowSensitive/Formula.h | 19 +-
.../FlowSensitive/FormulaSerialization.h | 40 ++++
.../lib/Analysis/FlowSensitive/CMakeLists.txt | 1 +
.../FlowSensitive/DataflowAnalysisContext.cpp | 79 +++++++
.../FlowSensitive/FormulaSerialization.cpp | 153 +++++++++++++
.../Analysis/FlowSensitive/CMakeLists.txt | 1 +
.../DataflowAnalysisContextTest.cpp | 96 +++++++++
.../Analysis/FlowSensitive/FormulaTest.cpp | 201 ++++++++++++++++++
10 files changed, 626 insertions(+), 12 deletions(-)
create mode 100644 clang/include/clang/Analysis/FlowSensitive/FormulaSerialization.h
create mode 100644 clang/lib/Analysis/FlowSensitive/FormulaSerialization.cpp
create mode 100644 clang/unittests/Analysis/FlowSensitive/FormulaTest.cpp
diff --git a/clang/include/clang/Analysis/FlowSensitive/DataflowAnalysisContext.h b/clang/include/clang/Analysis/FlowSensitive/DataflowAnalysisContext.h
index 5be4a1145f40d..11042e865c4e6 100644
--- a/clang/include/clang/Analysis/FlowSensitive/DataflowAnalysisContext.h
+++ b/clang/include/clang/Analysis/FlowSensitive/DataflowAnalysisContext.h
@@ -42,6 +42,18 @@ struct ContextSensitiveOptions {
unsigned Depth = 2;
};
+/// A simple representation of essential elements of the logical context used in
+/// environments. Designed for import/export for applications requiring
+/// serialization support.
+struct SimpleLogicalContext {
+ // Global invariant that applies for all definitions in the context.
+ const Formula *Invariant;
+ // Flow-condition tokens in the context.
+ llvm::DenseMap<Atom, const Formula *> TokenDefs;
+ // Dependencies between flow-condition definitions.
+ llvm::DenseMap<Atom, llvm::DenseSet<Atom>> TokenDeps;
+};
+
/// Owns objects that encompass the state of a program and stores context that
/// is used during dataflow analysis.
class DataflowAnalysisContext {
@@ -140,6 +152,15 @@ class DataflowAnalysisContext {
/// Adds `Constraint` to the flow condition identified by `Token`.
void addFlowConditionConstraint(Atom Token, const Formula &Constraint);
+ /// Adds `Deps` to the dependencies of the flow condition identified by
+ /// `Token`. Intended for use in deserializing contexts. The formula alone
+ /// doesn't have enough information to indicate its deps.
+ void addFlowConditionDeps(Atom Token, const llvm::DenseSet<Atom> &Deps) {
+ // Avoid creating an entry for `Token` with an empty set.
+ if (!Deps.empty())
+ FlowConditionDeps[Token].insert(Deps.begin(), Deps.end());
+ }
+
/// Creates a new flow condition with the same constraints as the flow
/// condition identified by `Token` and returns its token.
Atom forkFlowCondition(Atom Token);
@@ -207,6 +228,14 @@ class DataflowAnalysisContext {
return {};
}
+ /// Export the logical-context portions of `AC`, limited to the given target
+ /// flow-condition tokens.
+ SimpleLogicalContext
+ exportLogicalContext(llvm::DenseSet<dataflow::Atom> TargetTokens) const;
+
+ /// Initializes this context's "logical" components with `LC`.
+ void initLogicalContext(SimpleLogicalContext LC);
+
private:
friend class Environment;
@@ -228,6 +257,11 @@ class DataflowAnalysisContext {
DataflowAnalysisContext(Solver &S, std::unique_ptr<Solver> &&OwnedSolver,
Options Opts);
+ /// Computes the transitive closure of dependencies of (flow-condition)
+ /// `Tokens`. That is, the set of flow-condition tokens reachable from
+ /// `Tokens` in the dependency graph.
+ llvm::DenseSet<Atom> collectDependencies(llvm::DenseSet<Atom> Tokens) const;
+
// Extends the set of modeled field declarations.
void addModeledFields(const FieldSet &Fields);
diff --git a/clang/include/clang/Analysis/FlowSensitive/DataflowEnvironment.h b/clang/include/clang/Analysis/FlowSensitive/DataflowEnvironment.h
index 097ff2bdfe7ad..076714462bb2a 100644
--- a/clang/include/clang/Analysis/FlowSensitive/DataflowEnvironment.h
+++ b/clang/include/clang/Analysis/FlowSensitive/DataflowEnvironment.h
@@ -157,10 +157,18 @@ class Environment {
};
/// Creates an environment that uses `DACtx` to store objects that encompass
- /// the state of a program.
+ /// the state of a program. `FlowConditionToken` sets the flow condition
+ /// associated with the environment. Generally, new environments should be
+ /// initialized with a fresh token, by using one of the other
+ /// constructors. This constructor is for specialized use, including
+ /// deserialization and delegation from other constructors.
+ Environment(DataflowAnalysisContext &DACtx, Atom FlowConditionToken)
+ : DACtx(&DACtx), FlowConditionToken(FlowConditionToken) {}
+
+ /// Creates an environment that uses `DACtx` to store objects that encompass
+ /// the state of a program. Populates a fresh atom as flow condition token.
explicit Environment(DataflowAnalysisContext &DACtx)
- : DACtx(&DACtx),
- FlowConditionToken(DACtx.arena().makeFlowConditionToken()) {}
+ : Environment(DACtx, DACtx.arena().makeFlowConditionToken()) {}
/// Creates an environment that uses `DACtx` to store objects that encompass
/// the state of a program, with `S` as the statement to analyze.
diff --git a/clang/include/clang/Analysis/FlowSensitive/Formula.h b/clang/include/clang/Analysis/FlowSensitive/Formula.h
index 0e6352403a832..3959bc98619b9 100644
--- a/clang/include/clang/Analysis/FlowSensitive/Formula.h
+++ b/clang/include/clang/Analysis/FlowSensitive/Formula.h
@@ -85,21 +85,17 @@ class alignas(const Formula *) Formula {
}
using AtomNames = llvm::DenseMap<Atom, std::string>;
- // Produce a stable human-readable representation of this formula.
- // For example: (V3 | !(V1 & V2))
- // If AtomNames is provided, these override the default V0, V1... names.
+ /// Produces a stable human-readable representation of this formula.
+ /// For example: (V3 | !(V1 & V2))
+ /// If AtomNames is provided, these override the default V0, V1... names.
void print(llvm::raw_ostream &OS, const AtomNames * = nullptr) const;
- // Allocate Formulas using Arena rather than calling this function directly.
+ /// Allocate Formulas using Arena rather than calling this function directly.
static const Formula &create(llvm::BumpPtrAllocator &Alloc, Kind K,
ArrayRef<const Formula *> Operands,
unsigned Value = 0);
-private:
- Formula() = default;
- Formula(const Formula &) = delete;
- Formula &operator=(const Formula &) = delete;
-
+ /// Count of operands (sub-formulas) associated with Formulas of kind `K`.
static unsigned numOperands(Kind K) {
switch (K) {
case AtomRef:
@@ -116,6 +112,11 @@ class alignas(const Formula *) Formula {
llvm_unreachable("Unhandled Formula::Kind enum");
}
+private:
+ Formula() = default;
+ Formula(const Formula &) = delete;
+ Formula &operator=(const Formula &) = delete;
+
Kind FormulaKind;
// Some kinds of formula have scalar values, e.g. AtomRef's atom number.
unsigned Value;
diff --git a/clang/include/clang/Analysis/FlowSensitive/FormulaSerialization.h b/clang/include/clang/Analysis/FlowSensitive/FormulaSerialization.h
new file mode 100644
index 0000000000000..119f93e5d73f6
--- /dev/null
+++ b/clang/include/clang/Analysis/FlowSensitive/FormulaSerialization.h
@@ -0,0 +1,40 @@
+//=== FormulaSerialization.h - Formula De/Serialization support -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CLANG_ANALYSIS_FLOWSENSITIVE_FORMULA_SERIALIZATION_H
+#define LLVM_CLANG_ANALYSIS_FLOWSENSITIVE_FORMULA_SERIALIZATION_H
+
+#include "clang/Analysis/FlowSensitive/Arena.h"
+#include "clang/Analysis/FlowSensitive/Formula.h"
+#include "clang/Basic/LLVM.h"
+#include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseMapInfo.h"
+#include "llvm/Support/Allocator.h"
+#include "llvm/Support/raw_ostream.h"
+#include <cassert>
+#include <string>
+
+namespace clang::dataflow {
+
+/// Prints `F` to `OS` in a compact format, optimized for easy parsing
+/// (deserialization) rather than human use.
+void serializeFormula(const Formula &F, llvm::raw_ostream &OS);
+
+/// Parses `Str`, a serialized Formula, and builds the corresponding Formula.
+/// @returns error on parse failure or if parsing does not fully consume `Str`.
+/// @param A used to construct the formula components.
+/// @param AtomMap maps serialized Atom identifiers (unsigned ints) to Atoms.
+/// This map is provided by the caller to enable consistency across
+/// multiple formulas in a single file.
+llvm::Expected<const Formula *>
+parseFormula(llvm::StringRef Str, Arena &A,
+ llvm::DenseMap<unsigned, Atom> &AtomMap);
+
+} // namespace clang::dataflow
+#endif
diff --git a/clang/lib/Analysis/FlowSensitive/CMakeLists.txt b/clang/lib/Analysis/FlowSensitive/CMakeLists.txt
index 0c30df8b4b194..97e09c9bce95f 100644
--- a/clang/lib/Analysis/FlowSensitive/CMakeLists.txt
+++ b/clang/lib/Analysis/FlowSensitive/CMakeLists.txt
@@ -6,6 +6,7 @@ add_clang_library(clangAnalysisFlowSensitive
DataflowAnalysisContext.cpp
DataflowEnvironment.cpp
Formula.cpp
+ FormulaSerialization.cpp
HTMLLogger.cpp
Logger.cpp
RecordOps.cpp
diff --git a/clang/lib/Analysis/FlowSensitive/DataflowAnalysisContext.cpp b/clang/lib/Analysis/FlowSensitive/DataflowAnalysisContext.cpp
index 6421ad3883d10..06a88784a6f94 100644
--- a/clang/lib/Analysis/FlowSensitive/DataflowAnalysisContext.cpp
+++ b/clang/lib/Analysis/FlowSensitive/DataflowAnalysisContext.cpp
@@ -208,6 +208,24 @@ bool DataflowAnalysisContext::equivalentFormulas(const Formula &Val1,
return isUnsatisfiable(std::move(Constraints));
}
+llvm::DenseSet<Atom> DataflowAnalysisContext::collectDependencies(
+ llvm::DenseSet<Atom> Tokens) const {
+ // Use a worklist algorithm, with `Remaining` holding the worklist and
+ // `Tokens` tracking which atoms have already been added to the worklist.
+ std::vector<Atom> Remaining(Tokens.begin(), Tokens.end());
+ while (!Remaining.empty()) {
+ Atom CurrentToken = Remaining.back();
+ Remaining.pop_back();
+ if (auto DepsIt = FlowConditionDeps.find(CurrentToken);
+ DepsIt != FlowConditionDeps.end())
+ for (Atom A : DepsIt->second)
+ if (Tokens.insert(A).second)
+ Remaining.push_back(A);
+ }
+
+ return Tokens;
+}
+
void DataflowAnalysisContext::addTransitiveFlowConditionConstraints(
Atom Token, llvm::SetVector<const Formula *> &Constraints) {
llvm::DenseSet<Atom> AddedTokens;
@@ -224,6 +242,8 @@ void DataflowAnalysisContext::addTransitiveFlowConditionConstraints(
auto ConstraintsIt = FlowConditionConstraints.find(Token);
if (ConstraintsIt == FlowConditionConstraints.end()) {
+ // The flow condition is unconstrained. Just add the atom directly, which
+ // is equivalent to asserting it is true.
Constraints.insert(&arena().makeAtomRef(Token));
} else {
// Bind flow condition token via `iff` to its set of constraints:
@@ -239,6 +259,65 @@ void DataflowAnalysisContext::addTransitiveFlowConditionConstraints(
}
}
+static void getReferencedAtoms(const Formula &F,
+ llvm::DenseSet<dataflow::Atom> &Refs) {
+ switch (F.kind()) {
+ case Formula::AtomRef:
+ Refs.insert(F.getAtom());
+ break;
+ case Formula::Literal:
+ break;
+ case Formula::Not:
+ getReferencedAtoms(*F.operands()[0], Refs);
+ break;
+ case Formula::And:
+ case Formula::Or:
+ case Formula::Implies:
+ case Formula::Equal:
+ ArrayRef<const Formula *> Operands = F.operands();
+ getReferencedAtoms(*Operands[0], Refs);
+ getReferencedAtoms(*Operands[1], Refs);
+ break;
+ }
+}
+
+SimpleLogicalContext DataflowAnalysisContext::exportLogicalContext(
+ llvm::DenseSet<dataflow::Atom> TargetTokens) const {
+ SimpleLogicalContext LC;
+
+ if (Invariant != nullptr) {
+ LC.Invariant = Invariant;
+ getReferencedAtoms(*Invariant, TargetTokens);
+ }
+
+ llvm::DenseSet<dataflow::Atom> Dependencies =
+ collectDependencies(std::move(TargetTokens));
+
+ for (dataflow::Atom Token : Dependencies) {
+ // Only process the token if it is constrained. Unconstrained tokens don't
+ // have dependencies.
+ const Formula *Constraints = FlowConditionConstraints.lookup(Token);
+ if (Constraints == nullptr)
+ continue;
+ LC.TokenDefs[Token] = Constraints;
+
+ if (auto DepsIt = FlowConditionDeps.find(Token);
+ DepsIt != FlowConditionDeps.end())
+ LC.TokenDeps[Token] = DepsIt->second;
+ }
+
+ return LC;
+}
+
+void DataflowAnalysisContext::initLogicalContext(SimpleLogicalContext LC) {
+ Invariant = LC.Invariant;
+ FlowConditionConstraints = std::move(LC.TokenDefs);
+ // TODO: The dependencies in `LC.TokenDeps` can be reconstructed from
+ // `LC.TokenDefs`. Give the caller the option to reconstruct, rather than
+ // providing them directly, to save caller space (memory/disk).
+ FlowConditionDeps = std::move(LC.TokenDeps);
+}
+
static void printAtomList(const llvm::SmallVector<Atom> &Atoms,
llvm::raw_ostream &OS) {
OS << "(";
diff --git a/clang/lib/Analysis/FlowSensitive/FormulaSerialization.cpp b/clang/lib/Analysis/FlowSensitive/FormulaSerialization.cpp
new file mode 100644
index 0000000000000..df15a1d6eaadb
--- /dev/null
+++ b/clang/lib/Analysis/FlowSensitive/FormulaSerialization.cpp
@@ -0,0 +1,153 @@
+//===- FormulaSerialization.cpp ---------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "clang/Analysis/FlowSensitive/FormulaSerialization.h"
+#include "clang/Analysis/FlowSensitive/Arena.h"
+#include "clang/Analysis/FlowSensitive/Formula.h"
+#include "clang/Basic/LLVM.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/Allocator.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/ErrorHandling.h"
+#include <cassert>
+
+namespace clang::dataflow {
+
+// Returns the leading indicator of operation formulas. `AtomRef` and `Literal`
+// are handled differently.
+static char compactSigil(Formula::Kind K) {
+ switch (K) {
+ case Formula::AtomRef:
+ case Formula::Literal:
+ // No sigil.
+ return '\0';
+ case Formula::Not:
+ return '!';
+ case Formula::And:
+ return '&';
+ case Formula::Or:
+ return '|';
+ case Formula::Implies:
+ return '>';
+ case Formula::Equal:
+ return '=';
+ }
+ llvm_unreachable("unhandled formula kind");
+}
+
+void serializeFormula(const Formula &F, llvm::raw_ostream &OS) {
+ switch (Formula::numOperands(F.kind())) {
+ case 0:
+ switch (F.kind()) {
+ case Formula::AtomRef:
+ OS << F.getAtom();
+ break;
+ case Formula::Literal:
+ OS << (F.literal() ? 'T' : 'F');
+ break;
+ default:
+ llvm_unreachable("unhandled formula kind");
+ }
+ break;
+ case 1:
+ OS << compactSigil(F.kind());
+ serializeFormula(*F.operands()[0], OS);
+ break;
+ case 2:
+ OS << compactSigil(F.kind());
+ serializeFormula(*F.operands()[0], OS);
+ serializeFormula(*F.operands()[1], OS);
+ break;
+ default:
+ llvm_unreachable("unhandled formula arity");
+ }
+}
+
+static llvm::Expected<const Formula *>
+parsePrefix(llvm::StringRef &Str, Arena &A,
+ llvm::DenseMap<unsigned, Atom> &AtomMap) {
+ if (Str.empty())
+ return llvm::createStringError(llvm::inconvertibleErrorCode(),
+ "unexpected end of input");
+
+ char Prefix = Str[0];
+ Str = Str.drop_front();
+
+ switch (Prefix) {
+ case 'T':
+ return &A.makeLiteral(true);
+ case 'F':
+ return &A.makeLiteral(false);
+ case 'V': {
+ unsigned AtomID;
+ if (Str.consumeInteger(10, AtomID))
+ return llvm::createStringError(llvm::inconvertibleErrorCode(),
+ "expected atom id");
+ auto [It, Inserted] = AtomMap.try_emplace(AtomID, Atom());
+ if (Inserted)
+ It->second = A.makeAtom();
+ return &A.makeAtomRef(It->second);
+ }
+ case '!': {
+ auto OperandOrErr = parsePrefix(Str, A, AtomMap);
+ if (!OperandOrErr)
+ return OperandOrErr.takeError();
+ return &A.makeNot(**OperandOrErr);
+ }
+ case '&':
+ case '|':
+ case '>':
+ case '=': {
+ auto LeftOrErr = parsePrefix(Str, A, AtomMap);
+ if (!LeftOrErr)
+ return LeftOrErr.takeError();
+
+ auto RightOrErr = parsePrefix(Str, A, AtomMap);
+ if (!RightOrErr)
+ return RightOrErr.takeError();
+
+ const Formula &LHS = **LeftOrErr;
+ const Formula &RHS = **RightOrErr;
+
+ switch (Prefix) {
+ case '&':
+ return &A.makeAnd(LHS, RHS);
+ case '|':
+ return &A.makeOr(LHS, RHS);
+ case '>':
+ return &A.makeImplies(LHS, RHS);
+ case '=':
+ return &A.makeEquals(LHS, RHS);
+ default:
+ llvm_unreachable("unexpected binary op");
+ }
+ }
+ default:
+ return llvm::createStringError(llvm::inconvertibleErrorCode(),
+ "unexpected prefix character: %c", Prefix);
+ }
+}
+
+llvm::Expected<const Formula *>
+parseFormula(llvm::StringRef Str, Arena &A,
+ llvm::DenseMap<unsigned, Atom> &AtomMap) {
+ size_t OriginalSize = Str.size();
+ llvm::Expected<const Formula *> F = parsePrefix(Str, A, AtomMap);
+ if (!F)
+ return F.takeError();
+ if (!Str.empty())
+ return llvm::createStringError(llvm::inconvertibleErrorCode(),
+ ("unexpected suffix of length: " +
+ llvm::Twine(Str.size()))
+ .str());
+ return F;
+}
+
+} // namespace clang::dataflow
diff --git a/clang/unittests/Analysis/FlowSensitive/CMakeLists.txt b/clang/unittests/Analysis/FlowSensitive/CMakeLists.txt
index 4ac563143cd68..3bd4a6e21bee7 100644
--- a/clang/unittests/Analysis/FlowSensitive/CMakeLists.txt
+++ b/clang/unittests/Analysis/FlowSensitive/CMakeLists.txt
@@ -8,6 +8,7 @@ add_clang_unittest(ClangAnalysisFlowSensitiveTests
DataflowEnvironmentTest.cpp
DebugSupportTest.cpp
DeterminismTest.cpp
+ FormulaTest.cpp
LoggerTest.cpp
MapLatticeTest.cpp
MatchSwitchTest.cpp
diff --git a/clang/unittests/Analysis/FlowSensitive/DataflowAnalysisContextTest.cpp b/clang/unittests/Analysis/FlowSensitive/DataflowAnalysisContextTest.cpp
index 4f7a72c502ccf..92b687a5a18a4 100644
--- a/clang/unittests/Analysis/FlowSensitive/DataflowAnalysisContextTest.cpp
+++ b/clang/unittests/Analysis/FlowSensitive/DataflowAnalysisContextTest.cpp
@@ -17,6 +17,9 @@ namespace {
using namespace clang;
using namespace dataflow;
+using ::testing::IsEmpty;
+using ::testing::UnorderedElementsAre;
+
class DataflowAnalysisContextTest : public ::testing::Test {
protected:
DataflowAnalysisContextTest()
@@ -171,4 +174,97 @@ TEST_F(DataflowAnalysisContextTest, EquivBoolVals) {
A.makeAnd(X, A.makeAnd(Y, Z))));
}
+using ExportLogicalContextTest = DataflowAnalysisContextTest;
+
+TEST_F(ExportLogicalContextTest, EmptySet) {
+ EXPECT_THAT(Context.exportLogicalContext({}).TokenDefs, IsEmpty());
+}
+
+// Only constrained tokens are included in the output.
+TEST_F(ExportLogicalContextTest, UnconstrainedIgnored) {
+ Atom FC1 = A.makeFlowConditionToken();
+ EXPECT_THAT(Context.exportLogicalContext({FC1}).TokenDefs, IsEmpty());
+}
+
+TEST_F(ExportLogicalContextTest, SingletonSet) {
+ Atom FC1 = A.makeFlowConditionToken();
+ auto &C1 = A.makeAtomRef(A.makeAtom());
+ Context.addFlowConditionConstraint(FC1, C1);
+ EXPECT_THAT(Context.exportLogicalContext({FC1}).TokenDefs.keys(),
+ UnorderedElementsAre(FC1));
+}
+
+TEST_F(ExportLogicalContextTest, NoDependency) {
+ Atom FC1 = A.makeFlowConditionToken();
+ Atom FC2 = A.makeFlowConditionToken();
+ Atom FC3 = A.makeFlowConditionToken();
+ auto &C1 = A.makeAtomRef(A.makeAtom());
+ auto &C2 = A.makeAtomRef(A.makeAtom());
+ auto &C3 = A.makeAtomRef(A.makeAtom());
+
+ Context.addFlowConditionConstraint(FC1, C1);
+ Context.addFlowConditionConstraint(FC2, C2);
+ Context.addFlowConditionConstraint(FC3, C3);
+
+ // FCs are independent.
+ EXPECT_THAT(Context.exportLogicalContext({FC1}).TokenDefs.keys(),
+ UnorderedElementsAre(FC1));
+ EXPECT_THAT(Context.exportLogicalContext({FC2}).TokenDefs.keys(),
+ UnorderedElementsAre(FC2));
+ EXPECT_THAT(Context.exportLogicalContext({FC3}).TokenDefs.keys(),
+ UnorderedElementsAre(FC3));
+}
+
+TEST_F(ExportLogicalContextTest, SimpleDependencyChain) {
+ Atom FC1 = A.makeFlowConditionToken();
+ const Formula &C = A.makeAtomRef(A.makeAtom());
+ Context.addFlowConditionConstraint(FC1, C);
+ Atom FC2 = Context.forkFlowCondition(FC1);
+ Atom FC3 = Context.forkFlowCondition(FC2);
+
+ EXPECT_THAT(Context.exportLogicalContext({FC3}).TokenDefs.keys(),
+ UnorderedElementsAre(FC1, FC2, FC3));
+}
+
+TEST_F(ExportLogicalContextTest, DependencyTree) {
+ Atom FC1 = A.makeFlowConditionToken();
+ const Formula &C = A.makeAtomRef(A.makeAtom());
+ Context.addFlowConditionConstraint(FC1, C);
+ Atom FC2 = Context.forkFlowCondition(FC1);
+ Atom FC3 = A.makeFlowConditionToken();
+ Context.addFlowConditionConstraint(FC3, C);
+ Atom FC4 = Context.joinFlowConditions(FC2, FC3);
+
+ EXPECT_THAT(Context.exportLogicalContext({FC4}).TokenDefs.keys(),
+ UnorderedElementsAre(FC1, FC2, FC3, FC4));
+}
+
+TEST_F(ExportLogicalContextTest, DependencyDAG) {
+ Atom FC1 = A.makeFlowConditionToken();
+ const Formula &C = A.makeAtomRef(A.makeAtom());
+ Context.addFlowConditionConstraint(FC1, C);
+
+ Atom FC2 = Context.forkFlowCondition(FC1);
+ Atom FC3 = Context.forkFlowCondition(FC1);
+ Atom FC4 = Context.joinFlowConditions(FC2, FC3);
+
+ EXPECT_THAT(Context.exportLogicalContext({FC4}).TokenDefs.keys(),
+ UnorderedElementsAre(FC1, FC2, FC3, FC4));
+}
+
+TEST_F(ExportLogicalContextTest, MixedDependencies) {
+ Atom FC1 = A.makeFlowConditionToken();
+ const Formula &C = A.makeAtomRef(A.makeAtom());
+ Context.addFlowConditionConstraint(FC1, C);
+
+ Atom FC2 = Context.forkFlowCondition(FC1);
+ Atom FC3 = Context.forkFlowCondition(FC2);
+ (void)FC3; // unused, and we test below that it is not included.
+
+ Atom FC4 = A.makeFlowConditionToken();
+ Context.addFlowConditionConstraint(FC4, C);
+
+ EXPECT_THAT(Context.exportLogicalContext({FC2, FC4}).TokenDefs.keys(),
+ UnorderedElementsAre(FC1, FC2, FC4));
+}
} // namespace
diff --git a/clang/unittests/Analysis/FlowSensitive/FormulaTest.cpp b/clang/unittests/Analysis/FlowSensitive/FormulaTest.cpp
new file mode 100644
index 0000000000000..cabcd59fffedc
--- /dev/null
+++ b/clang/unittests/Analysis/FlowSensitive/FormulaTest.cpp
@@ -0,0 +1,201 @@
+//===- unittests/Analysis/FlowSensitive/FormulaTest.cpp -------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "clang/Analysis/FlowSensitive/Formula.h"
+#include "clang/Analysis/FlowSensitive/Arena.h"
+#include "clang/Analysis/FlowSensitive/FormulaSerialization.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Testing/Support/Error.h"
+#include "gmock/gmock.h"
+#include "gtest/gtest.h"
+
+namespace {
+
+using namespace clang;
+using namespace dataflow;
+
+using ::llvm::Failed;
+using ::llvm::HasValue;
+using ::llvm::Succeeded;
+using ::testing::ElementsAre;
+using ::testing::IsEmpty;
+
+class SerializeFormulaTest : public ::testing::Test {
+protected:
+ Arena A;
+ std::string Out;
+ llvm::raw_string_ostream OS{Out};
+
+ const Formula &A1 = A.makeAtomRef(A.makeAtom());
+ const Formula &A2 = A.makeAtomRef(A.makeAtom());
+};
+
+TEST_F(SerializeFormulaTest, Atom) {
+ serializeFormula(A1, OS);
+ EXPECT_EQ(Out, "V0");
+ Out = "";
+
+ serializeFormula(A2, OS);
+ EXPECT_EQ(Out, "V1");
+}
+
+TEST_F(SerializeFormulaTest, LiteralTrue) {
+ serializeFormula(A.makeLiteral(true), OS);
+ EXPECT_EQ(Out, "T");
+}
+
+TEST_F(SerializeFormulaTest, LiteralFalse) {
+ serializeFormula(A.makeLiteral(false), OS);
+ EXPECT_EQ(Out, "F");
+}
+
+TEST_F(SerializeFormulaTest, Not) {
+ serializeFormula(A.makeNot(A1), OS);
+ EXPECT_EQ(Out, "!V0");
+}
+
+TEST_F(SerializeFormulaTest, Or) {
+ serializeFormula(A.makeOr(A1, A2), OS);
+ EXPECT_EQ(Out, "|V0V1");
+}
+
+TEST_F(SerializeFormulaTest, And) {
+ serializeFormula(A.makeAnd(A1, A2), OS);
+ EXPECT_EQ(Out, "&V0V1");
+}
+
+TEST_F(SerializeFormulaTest, Implies) {
+ serializeFormula(A.makeImplies(A1, A2), OS);
+ EXPECT_EQ(Out, ">V0V1");
+}
+
+TEST_F(SerializeFormulaTest, Equal) {
+ serializeFormula(A.makeEquals(A1, A2), OS);
+ EXPECT_EQ(Out, "=V0V1");
+}
+
+TEST_F(SerializeFormulaTest, NestedBinaryUnary) {
+ serializeFormula(A.makeEquals(A.makeOr(A1, A2), A2), OS);
+ EXPECT_EQ(Out, "=|V0V1V1");
+}
+
+TEST_F(SerializeFormulaTest, NestedBinaryBinary) {
+ serializeFormula(A.makeEquals(A.makeOr(A1, A2), A.makeAnd(A1, A2)), OS);
+ EXPECT_EQ(Out, "=|V0V1&V0V1");
+}
+
+class ParseFormulaTest : public ::testing::Test {
+protected:
+ void SetUp() override {
+ AtomMap[0] = Atom1;
+ AtomMap[1] = Atom2;
+ }
+
+ // Convenience wrapper for `parseFormula`.
+ llvm::Expected<const Formula *> testParseFormula(llvm::StringRef Str) {
+ return parseFormula(Str, A, AtomMap);
+ }
+
+ Arena A;
+ std::string Out;
+ llvm::raw_string_ostream OS{Out};
+
+ Atom Atom1 = A.makeAtom();
+ Atom Atom2 = A.makeAtom();
+ const Formula &A1 = A.makeAtomRef(Atom1);
+ const Formula &A2 = A.makeAtomRef(Atom2);
+ llvm::DenseMap<unsigned, Atom> AtomMap;
+};
+
+TEST_F(ParseFormulaTest, Atom) {
+ EXPECT_THAT_EXPECTED(testParseFormula("V0"), HasValue(&A1));
+ EXPECT_THAT_EXPECTED(testParseFormula("V1"), HasValue(&A2));
+}
+
+TEST_F(ParseFormulaTest, LiteralTrue) {
+ EXPECT_THAT_EXPECTED(testParseFormula("T"), HasValue(&A.makeLiteral(true)));
+}
+
+TEST_F(ParseFormulaTest, LiteralFalse) {
+ EXPECT_THAT_EXPECTED(testParseFormula("F"), HasValue(&A.makeLiteral(false)));
+}
+
+TEST_F(ParseFormulaTest, Not) {
+ EXPECT_THAT_EXPECTED(testParseFormula("!V0"), HasValue(&A.makeNot(A1)));
+}
+
+TEST_F(ParseFormulaTest, Or) {
+ EXPECT_THAT_EXPECTED(testParseFormula("|V0V1"), HasValue(&A.makeOr(A1, A2)));
+}
+
+TEST_F(ParseFormulaTest, And) {
+ EXPECT_THAT_EXPECTED(testParseFormula("&V0V1"), HasValue(&A.makeAnd(A1, A2)));
+}
+
+TEST_F(ParseFormulaTest, OutOfNumericOrder) {
+ EXPECT_THAT_EXPECTED(testParseFormula("&V1V0"), HasValue(&A.makeAnd(A2, A1)));
+}
+
+TEST_F(ParseFormulaTest, Implies) {
+ EXPECT_THAT_EXPECTED(testParseFormula(">V0V1"),
+ HasValue(&A.makeImplies(A1, A2)));
+}
+
+TEST_F(ParseFormulaTest, Equal) {
+ EXPECT_THAT_EXPECTED(testParseFormula("=V0V1"),
+ HasValue(&A.makeEquals(A1, A2)));
+}
+
+TEST_F(ParseFormulaTest, NestedBinaryUnary) {
+ EXPECT_THAT_EXPECTED(testParseFormula("=|V0V1V1"),
+ HasValue(&A.makeEquals(A.makeOr(A1, A2), A2)));
+}
+
+TEST_F(ParseFormulaTest, NestedBinaryBinary) {
+ EXPECT_THAT_EXPECTED(
+ testParseFormula("=|V0V1&V0V1"),
+ HasValue(&A.makeEquals(A.makeOr(A1, A2), A.makeAnd(A1, A2))));
+}
+
+// Verifies that parsing generates fresh atoms, if they are not already in the
+// map.
+TEST_F(ParseFormulaTest, GeneratesAtoms) {
+ llvm::DenseMap<unsigned, Atom> FreshAtomMap;
+ ASSERT_THAT_EXPECTED(parseFormula("=V0V1", A, FreshAtomMap), Succeeded());
+ // The map contains two unique elements.
+ ASSERT_EQ(FreshAtomMap.size(), 2U);
+ EXPECT_NE(FreshAtomMap[0], FreshAtomMap[1]);
+}
+
+TEST_F(ParseFormulaTest, MalformedFormulaFails) {
+ // Arbitrary string.
+ EXPECT_THAT_EXPECTED(testParseFormula("Hello"), Failed());
+ // Empty string.
+ EXPECT_THAT_EXPECTED(testParseFormula(""), Failed());
+ // Malformed atom.
+ EXPECT_THAT_EXPECTED(testParseFormula("Vabc"), Failed());
+ // Irrelevant suffix.
+ EXPECT_THAT_EXPECTED(testParseFormula("V0Hello"), Failed());
+ EXPECT_THAT_EXPECTED(testParseFormula("=V0V1Hello"), Failed());
+ // Sequence without operator.
+ EXPECT_THAT_EXPECTED(testParseFormula("TF"), Failed());
+ // Bad subformula.
+ EXPECT_THAT_EXPECTED(testParseFormula("!G"), Failed());
+ // Incomplete formulas.
+ EXPECT_THAT_EXPECTED(testParseFormula("V"), Failed());
+ EXPECT_THAT_EXPECTED(testParseFormula("&"), Failed());
+ EXPECT_THAT_EXPECTED(testParseFormula("|"), Failed());
+ EXPECT_THAT_EXPECTED(testParseFormula(">"), Failed());
+ EXPECT_THAT_EXPECTED(testParseFormula("="), Failed());
+ EXPECT_THAT_EXPECTED(testParseFormula("&V0"), Failed());
+ EXPECT_THAT_EXPECTED(testParseFormula("|V0"), Failed());
+ EXPECT_THAT_EXPECTED(testParseFormula(">V0"), Failed());
+ EXPECT_THAT_EXPECTED(testParseFormula("=V0"), Failed());
+}
+
+} // namespace
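The worklist traversal that `collectDependencies` performs in the patch above
can be sketched in isolation as follows. This is a toy version using standard
containers instead of `llvm::DenseSet`/`llvm::DenseMap`; `Token`, `DepGraph`,
and `collectReachable` are hypothetical names, not part of the clang API.

```cpp
#include <cassert>
#include <set>
#include <unordered_map>
#include <vector>

using Token = unsigned;
// Maps a token to the tokens it directly depends on.
using DepGraph = std::unordered_map<Token, std::set<Token>>;

// Returns every token reachable from `seen` in `deps`, including the
// starting tokens. `seen` doubles as the visited set, so each token is
// pushed onto the worklist at most once, even in a DAG with shared nodes.
std::set<Token> collectReachable(const DepGraph &deps, std::set<Token> seen) {
  std::vector<Token> worklist(seen.begin(), seen.end());
  while (!worklist.empty()) {
    Token current = worklist.back();
    worklist.pop_back();
    auto it = deps.find(current);
    if (it == deps.end())
      continue; // No recorded dependencies for this token.
    for (Token dep : it->second)
      if (seen.insert(dep).second)
        worklist.push_back(dep);
  }
  return seen;
}
```

With a diamond-shaped graph (4 depends on 2 and 3, both of which depend on 1),
starting from `{4}` collects `{1, 2, 3, 4}`, analogous to the `DependencyDAG`
test above; tokens that are not reachable from the starting set are left out.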
>From cc49f3b3e1ddb77ebd888ef87c3b0e87c69e201b Mon Sep 17 00:00:00 2001
From: Damyan Pepper <damyanp at microsoft.com>
Date: Mon, 18 Aug 2025 08:58:33 -0700
Subject: [PATCH 049/112] [NFC][HLSL] Remove confusing enum aliases /
duplicates (#153909)
Remove:
* DescriptorType enum - this almost exactly shadowed the ResourceClass
enum
* ClauseType aliased ResourceClass
Although these were introduced to make the HLSL root signature handling
code a bit cleaner, they were ultimately causing confusion as they
appeared to be unique enums that needed to be converted between each
other.
Closes #153890
---
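Reviewer note (not part of the commit): the sketch below illustrates the
version 1.1 default-flag selection once RootDescriptor stores the shared
resource class directly. It is a minimal, self-contained approximation.
The enums here are hypothetical stand-ins for llvm::dxil::ResourceClass
and llvm::dxbc::RootDescriptorFlags, and defaultRootDescriptorFlagsV11 is
an illustrative name, not a function in the tree.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for llvm::dxil::ResourceClass.
enum class ResourceClass : uint8_t { SRV = 0, UAV, CBuffer, Sampler };

// Hypothetical stand-in for the two root descriptor flags used below.
// The real flag values are not reproduced here.
enum class RootDescriptorFlags : uint8_t {
  DataVolatile,
  DataStaticWhileSetAtExecute
};

// Sketch of the version 1.1 branch of RootDescriptor::setDefaultFlags.
// Because the shared enum also has a Sampler member, the switch must now
// handle it explicitly, even though a root descriptor can never be one.
RootDescriptorFlags defaultRootDescriptorFlagsV11(ResourceClass Type) {
  switch (Type) {
  case ResourceClass::CBuffer:
  case ResourceClass::SRV:
    return RootDescriptorFlags::DataStaticWhileSetAtExecute;
  case ResourceClass::UAV:
    return RootDescriptorFlags::DataVolatile;
  case ResourceClass::Sampler:
    break; // Invalid for root descriptors; rejected below.
  }
  assert(false && "ResourceClass::Sampler is not valid for RootDescriptors");
  return RootDescriptorFlags::DataVolatile;
}
```

With a single enum there is no cast between DescriptorType and ClauseType
left to get wrong; the cost is the extra unreachable Sampler case.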
clang/lib/Parse/ParseHLSLRootSignature.cpp | 14 +++---
.../Parse/ParseHLSLRootSignatureTest.cpp | 46 +++++++++----------
.../llvm/Frontend/HLSL/HLSLRootSignature.h | 25 +++++-----
llvm/lib/Frontend/HLSL/HLSLRootSignature.cpp | 6 +--
.../Frontend/HLSL/RootSignatureMetadata.cpp | 3 +-
.../Frontend/HLSLRootSignatureDumpTest.cpp | 23 +++++-----
6 files changed, 59 insertions(+), 58 deletions(-)
diff --git a/clang/lib/Parse/ParseHLSLRootSignature.cpp b/clang/lib/Parse/ParseHLSLRootSignature.cpp
index 98dc458f7adc5..5490c61f52356 100644
--- a/clang/lib/Parse/ParseHLSLRootSignature.cpp
+++ b/clang/lib/Parse/ParseHLSLRootSignature.cpp
@@ -234,15 +234,15 @@ std::optional<RootDescriptor> RootSignatureParser::parseRootDescriptor() {
default:
llvm_unreachable("Switch for consumed token was not provided");
case TokenKind::kw_CBV:
- Descriptor.Type = DescriptorType::CBuffer;
+ Descriptor.Type = ResourceClass::CBuffer;
ExpectedReg = TokenKind::bReg;
break;
case TokenKind::kw_SRV:
- Descriptor.Type = DescriptorType::SRV;
+ Descriptor.Type = ResourceClass::SRV;
ExpectedReg = TokenKind::tReg;
break;
case TokenKind::kw_UAV:
- Descriptor.Type = DescriptorType::UAV;
+ Descriptor.Type = ResourceClass::UAV;
ExpectedReg = TokenKind::uReg;
break;
}
@@ -360,19 +360,19 @@ RootSignatureParser::parseDescriptorTableClause() {
default:
llvm_unreachable("Switch for consumed token was not provided");
case TokenKind::kw_CBV:
- Clause.Type = ClauseType::CBuffer;
+ Clause.Type = ResourceClass::CBuffer;
ExpectedReg = TokenKind::bReg;
break;
case TokenKind::kw_SRV:
- Clause.Type = ClauseType::SRV;
+ Clause.Type = ResourceClass::SRV;
ExpectedReg = TokenKind::tReg;
break;
case TokenKind::kw_UAV:
- Clause.Type = ClauseType::UAV;
+ Clause.Type = ResourceClass::UAV;
ExpectedReg = TokenKind::uReg;
break;
case TokenKind::kw_Sampler:
- Clause.Type = ClauseType::Sampler;
+ Clause.Type = ResourceClass::Sampler;
ExpectedReg = TokenKind::sReg;
break;
}
diff --git a/clang/unittests/Parse/ParseHLSLRootSignatureTest.cpp b/clang/unittests/Parse/ParseHLSLRootSignatureTest.cpp
index 44f6b0469f38e..44c0978a243bc 100644
--- a/clang/unittests/Parse/ParseHLSLRootSignatureTest.cpp
+++ b/clang/unittests/Parse/ParseHLSLRootSignatureTest.cpp
@@ -180,7 +180,7 @@ TEST_F(ParseHLSLRootSignatureTest, ValidParseDTClausesTest) {
// First Descriptor Table with 4 elements
RootElement Elem = Elements[0].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::CBuffer);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::CBuffer);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Reg.ViewType,
RegisterType::BReg);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Reg.Number, 0u);
@@ -193,7 +193,7 @@ TEST_F(ParseHLSLRootSignatureTest, ValidParseDTClausesTest) {
Elem = Elements[1].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::SRV);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::SRV);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Reg.ViewType,
RegisterType::TReg);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Reg.Number, 42u);
@@ -205,7 +205,7 @@ TEST_F(ParseHLSLRootSignatureTest, ValidParseDTClausesTest) {
Elem = Elements[2].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::Sampler);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::Sampler);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Reg.ViewType,
RegisterType::SReg);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Reg.Number, 987u);
@@ -218,7 +218,7 @@ TEST_F(ParseHLSLRootSignatureTest, ValidParseDTClausesTest) {
Elem = Elements[3].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::UAV);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::UAV);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Reg.ViewType,
RegisterType::UReg);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Reg.Number, 4294967294u);
@@ -445,7 +445,7 @@ TEST_F(ParseHLSLRootSignatureTest, ValidSamplerFlagsTest) {
auto Elements = Parser.getElements();
RootElement Elem = Elements[0].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::Sampler);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::Sampler);
auto ValidSamplerFlags =
llvm::dxbc::DescriptorRangeFlags::DescriptorsVolatile;
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Flags, ValidSamplerFlags);
@@ -591,7 +591,7 @@ TEST_F(ParseHLSLRootSignatureTest, ValidParseRootDescriptorsTest) {
RootElement Elem = Elements[0].getElement();
ASSERT_TRUE(std::holds_alternative<RootDescriptor>(Elem));
- ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, DescriptorType::CBuffer);
+ ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, ResourceClass::CBuffer);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Reg.ViewType, RegisterType::BReg);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Reg.Number, 0u);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Space, 0u);
@@ -602,7 +602,7 @@ TEST_F(ParseHLSLRootSignatureTest, ValidParseRootDescriptorsTest) {
Elem = Elements[1].getElement();
ASSERT_TRUE(std::holds_alternative<RootDescriptor>(Elem));
- ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, DescriptorType::SRV);
+ ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, ResourceClass::SRV);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Reg.ViewType, RegisterType::TReg);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Reg.Number, 42u);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Space, 4u);
@@ -616,7 +616,7 @@ TEST_F(ParseHLSLRootSignatureTest, ValidParseRootDescriptorsTest) {
Elem = Elements[2].getElement();
ASSERT_TRUE(std::holds_alternative<RootDescriptor>(Elem));
- ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, DescriptorType::UAV);
+ ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, ResourceClass::UAV);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Reg.ViewType, RegisterType::UReg);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Reg.Number, 34893247u);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Space, 0u);
@@ -628,7 +628,7 @@ TEST_F(ParseHLSLRootSignatureTest, ValidParseRootDescriptorsTest) {
RootDescriptorFlags::DataVolatile);
Elem = Elements[3].getElement();
- ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, DescriptorType::CBuffer);
+ ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, ResourceClass::CBuffer);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Reg.ViewType, RegisterType::BReg);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Reg.Number, 0u);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Space, 0u);
@@ -696,17 +696,17 @@ TEST_F(ParseHLSLRootSignatureTest, ValidVersion10Test) {
auto DefRootDescriptorFlag = llvm::dxbc::RootDescriptorFlags::DataVolatile;
RootElement Elem = Elements[0].getElement();
ASSERT_TRUE(std::holds_alternative<RootDescriptor>(Elem));
- ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, DescriptorType::CBuffer);
+ ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, ResourceClass::CBuffer);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Flags, DefRootDescriptorFlag);
Elem = Elements[1].getElement();
ASSERT_TRUE(std::holds_alternative<RootDescriptor>(Elem));
- ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, DescriptorType::SRV);
+ ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, ResourceClass::SRV);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Flags, DefRootDescriptorFlag);
Elem = Elements[2].getElement();
ASSERT_TRUE(std::holds_alternative<RootDescriptor>(Elem));
- ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, DescriptorType::UAV);
+ ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, ResourceClass::UAV);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Flags, DefRootDescriptorFlag);
auto ValidNonSamplerFlags =
@@ -714,22 +714,22 @@ TEST_F(ParseHLSLRootSignatureTest, ValidVersion10Test) {
llvm::dxbc::DescriptorRangeFlags::DataVolatile;
Elem = Elements[3].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::CBuffer);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::CBuffer);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Flags, ValidNonSamplerFlags);
Elem = Elements[4].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::SRV);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::SRV);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Flags, ValidNonSamplerFlags);
Elem = Elements[5].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::UAV);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::UAV);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Flags, ValidNonSamplerFlags);
Elem = Elements[6].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::Sampler);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::Sampler);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Flags,
llvm::dxbc::DescriptorRangeFlags::DescriptorsVolatile);
@@ -767,43 +767,43 @@ TEST_F(ParseHLSLRootSignatureTest, ValidVersion11Test) {
auto Elements = Parser.getElements();
RootElement Elem = Elements[0].getElement();
ASSERT_TRUE(std::holds_alternative<RootDescriptor>(Elem));
- ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, DescriptorType::CBuffer);
+ ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, ResourceClass::CBuffer);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Flags,
llvm::dxbc::RootDescriptorFlags::DataStaticWhileSetAtExecute);
Elem = Elements[1].getElement();
ASSERT_TRUE(std::holds_alternative<RootDescriptor>(Elem));
- ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, DescriptorType::SRV);
+ ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, ResourceClass::SRV);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Flags,
llvm::dxbc::RootDescriptorFlags::DataStaticWhileSetAtExecute);
Elem = Elements[2].getElement();
ASSERT_TRUE(std::holds_alternative<RootDescriptor>(Elem));
- ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, DescriptorType::UAV);
+ ASSERT_EQ(std::get<RootDescriptor>(Elem).Type, ResourceClass::UAV);
ASSERT_EQ(std::get<RootDescriptor>(Elem).Flags,
llvm::dxbc::RootDescriptorFlags::DataVolatile);
Elem = Elements[3].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::CBuffer);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::CBuffer);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Flags,
llvm::dxbc::DescriptorRangeFlags::DataStaticWhileSetAtExecute);
Elem = Elements[4].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::SRV);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::SRV);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Flags,
llvm::dxbc::DescriptorRangeFlags::DataStaticWhileSetAtExecute);
Elem = Elements[5].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::UAV);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::UAV);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Flags,
llvm::dxbc::DescriptorRangeFlags::DataVolatile);
Elem = Elements[6].getElement();
ASSERT_TRUE(std::holds_alternative<DescriptorTableClause>(Elem));
- ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ClauseType::Sampler);
+ ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Type, ResourceClass::Sampler);
ASSERT_EQ(std::get<DescriptorTableClause>(Elem).Flags,
llvm::dxbc::DescriptorRangeFlags::None);
diff --git a/llvm/include/llvm/Frontend/HLSL/HLSLRootSignature.h b/llvm/include/llvm/Frontend/HLSL/HLSLRootSignature.h
index e44612af071bc..87777fddc9157 100644
--- a/llvm/include/llvm/Frontend/HLSL/HLSLRootSignature.h
+++ b/llvm/include/llvm/Frontend/HLSL/HLSLRootSignature.h
@@ -42,10 +42,9 @@ struct RootConstants {
dxbc::ShaderVisibility Visibility = dxbc::ShaderVisibility::All;
};
-enum class DescriptorType : uint8_t { SRV = 0, UAV, CBuffer };
// Models RootDescriptor : CBV | SRV | UAV, by collecting like parameters
struct RootDescriptor {
- DescriptorType Type;
+ dxil::ResourceClass Type;
Register Reg;
uint32_t Space = 0;
dxbc::ShaderVisibility Visibility = dxbc::ShaderVisibility::All;
@@ -60,13 +59,16 @@ struct RootDescriptor {
assert(Version == llvm::dxbc::RootSignatureVersion::V1_1 &&
"Specified an invalid root signature version");
switch (Type) {
- case DescriptorType::CBuffer:
- case DescriptorType::SRV:
+ case dxil::ResourceClass::CBuffer:
+ case dxil::ResourceClass::SRV:
Flags = dxbc::RootDescriptorFlags::DataStaticWhileSetAtExecute;
break;
- case DescriptorType::UAV:
+ case dxil::ResourceClass::UAV:
Flags = dxbc::RootDescriptorFlags::DataVolatile;
break;
+ case dxil::ResourceClass::Sampler:
+ llvm_unreachable(
+ "ResourceClass::Sampler is not valid for RootDescriptors");
}
}
};
@@ -82,9 +84,8 @@ struct DescriptorTable {
static const uint32_t NumDescriptorsUnbounded = 0xffffffff;
static const uint32_t DescriptorTableOffsetAppend = 0xffffffff;
// Models DTClause : CBV | SRV | UAV | Sampler, by collecting like parameters
-using ClauseType = llvm::dxil::ResourceClass;
struct DescriptorTableClause {
- ClauseType Type;
+ dxil::ResourceClass Type;
Register Reg;
uint32_t NumDescriptors = 1;
uint32_t Space = 0;
@@ -94,7 +95,7 @@ struct DescriptorTableClause {
void setDefaultFlags(dxbc::RootSignatureVersion Version) {
if (Version == dxbc::RootSignatureVersion::V1_0) {
Flags = dxbc::DescriptorRangeFlags::DescriptorsVolatile;
- if (Type != ClauseType::Sampler)
+ if (Type != dxil::ResourceClass::Sampler)
Flags |= dxbc::DescriptorRangeFlags::DataVolatile;
return;
}
@@ -102,14 +103,14 @@ struct DescriptorTableClause {
assert(Version == dxbc::RootSignatureVersion::V1_1 &&
"Specified an invalid root signature version");
switch (Type) {
- case ClauseType::CBuffer:
- case ClauseType::SRV:
+ case dxil::ResourceClass::CBuffer:
+ case dxil::ResourceClass::SRV:
Flags = dxbc::DescriptorRangeFlags::DataStaticWhileSetAtExecute;
break;
- case ClauseType::UAV:
+ case dxil::ResourceClass::UAV:
Flags = dxbc::DescriptorRangeFlags::DataVolatile;
break;
- case ClauseType::Sampler:
+ case dxil::ResourceClass::Sampler:
Flags = dxbc::DescriptorRangeFlags::None;
break;
}
diff --git a/llvm/lib/Frontend/HLSL/HLSLRootSignature.cpp b/llvm/lib/Frontend/HLSL/HLSLRootSignature.cpp
index ac2c974fb11a1..92c62b83fadb0 100644
--- a/llvm/lib/Frontend/HLSL/HLSLRootSignature.cpp
+++ b/llvm/lib/Frontend/HLSL/HLSLRootSignature.cpp
@@ -93,7 +93,8 @@ static raw_ostream &operator<<(raw_ostream &OS,
return OS;
}
-static raw_ostream &operator<<(raw_ostream &OS, const ClauseType &Type) {
+static raw_ostream &operator<<(raw_ostream &OS,
+ const dxil::ResourceClass &Type) {
OS << dxil::getResourceClassName(Type);
return OS;
}
@@ -152,8 +153,7 @@ raw_ostream &operator<<(raw_ostream &OS, const DescriptorTableClause &Clause) {
}
raw_ostream &operator<<(raw_ostream &OS, const RootDescriptor &Descriptor) {
- ClauseType Type = ClauseType(llvm::to_underlying(Descriptor.Type));
- OS << "Root" << Type << "(" << Descriptor.Reg
+ OS << "Root" << Descriptor.Type << "(" << Descriptor.Reg
<< ", space = " << Descriptor.Space
<< ", visibility = " << Descriptor.Visibility
<< ", flags = " << Descriptor.Flags << ")";
diff --git a/llvm/lib/Frontend/HLSL/RootSignatureMetadata.cpp b/llvm/lib/Frontend/HLSL/RootSignatureMetadata.cpp
index f822d918fae41..dece8f197aaf7 100644
--- a/llvm/lib/Frontend/HLSL/RootSignatureMetadata.cpp
+++ b/llvm/lib/Frontend/HLSL/RootSignatureMetadata.cpp
@@ -120,8 +120,7 @@ MDNode *MetadataBuilder::BuildRootConstants(const RootConstants &Constants) {
MDNode *MetadataBuilder::BuildRootDescriptor(const RootDescriptor &Descriptor) {
IRBuilder<> Builder(Ctx);
- StringRef ResName =
- dxil::getResourceClassName(dxil::ResourceClass(Descriptor.Type));
+ StringRef ResName = dxil::getResourceClassName(Descriptor.Type);
assert(!ResName.empty() && "Provided an invalid Resource Class");
SmallString<7> Name({"Root", ResName});
Metadata *Operands[] = {
diff --git a/llvm/unittests/Frontend/HLSLRootSignatureDumpTest.cpp b/llvm/unittests/Frontend/HLSLRootSignatureDumpTest.cpp
index 98b33fdfb8c12..1eb03f16527ec 100644
--- a/llvm/unittests/Frontend/HLSLRootSignatureDumpTest.cpp
+++ b/llvm/unittests/Frontend/HLSLRootSignatureDumpTest.cpp
@@ -10,12 +10,13 @@
#include "gtest/gtest.h"
using namespace llvm::hlsl::rootsig;
+using llvm::dxil::ResourceClass;
namespace {
TEST(HLSLRootSignatureTest, DescriptorCBVClauseDump) {
DescriptorTableClause Clause;
- Clause.Type = ClauseType::CBuffer;
+ Clause.Type = ResourceClass::CBuffer;
Clause.Reg = {RegisterType::BReg, 0};
Clause.setDefaultFlags(llvm::dxbc::RootSignatureVersion::V1_1);
@@ -32,7 +33,7 @@ TEST(HLSLRootSignatureTest, DescriptorCBVClauseDump) {
TEST(HLSLRootSignatureTest, DescriptorSRVClauseDump) {
DescriptorTableClause Clause;
- Clause.Type = ClauseType::SRV;
+ Clause.Type = ResourceClass::SRV;
Clause.Reg = {RegisterType::TReg, 0};
Clause.NumDescriptors = NumDescriptorsUnbounded;
Clause.Space = 42;
@@ -52,7 +53,7 @@ TEST(HLSLRootSignatureTest, DescriptorSRVClauseDump) {
TEST(HLSLRootSignatureTest, DescriptorUAVClauseDump) {
using llvm::dxbc::DescriptorRangeFlags;
DescriptorTableClause Clause;
- Clause.Type = ClauseType::UAV;
+ Clause.Type = ResourceClass::UAV;
Clause.Reg = {RegisterType::UReg, 92374};
Clause.NumDescriptors = 3298;
Clause.Space = 932847;
@@ -82,7 +83,7 @@ TEST(HLSLRootSignatureTest, DescriptorUAVClauseDump) {
TEST(HLSLRootSignatureTest, DescriptorSamplerClauseDump) {
DescriptorTableClause Clause;
- Clause.Type = ClauseType::Sampler;
+ Clause.Type = ResourceClass::Sampler;
Clause.Reg = {RegisterType::SReg, 0};
Clause.NumDescriptors = 2;
Clause.Space = 42;
@@ -102,7 +103,7 @@ TEST(HLSLRootSignatureTest, DescriptorSamplerClauseDump) {
TEST(HLSLRootSignatureTest, DescriptorCBVV10ClauseDump) {
DescriptorTableClause Clause;
- Clause.Type = ClauseType::CBuffer;
+ Clause.Type = ResourceClass::CBuffer;
Clause.Reg = {RegisterType::BReg, 0};
Clause.setDefaultFlags(llvm::dxbc::RootSignatureVersion::V1_0);
@@ -119,7 +120,7 @@ TEST(HLSLRootSignatureTest, DescriptorCBVV10ClauseDump) {
TEST(HLSLRootSignatureTest, DescriptorSamplerV10ClauseDump) {
DescriptorTableClause Clause;
- Clause.Type = ClauseType::Sampler;
+ Clause.Type = ResourceClass::Sampler;
Clause.Reg = {RegisterType::SReg, 0};
Clause.setDefaultFlags(llvm::dxbc::RootSignatureVersion::V1_0);
@@ -151,7 +152,7 @@ TEST(HLSLRootSignatureTest, DescriptorTableDump) {
TEST(HLSLRootSignatureTest, RootCBVDump) {
RootDescriptor Descriptor;
- Descriptor.Type = DescriptorType::CBuffer;
+ Descriptor.Type = ResourceClass::CBuffer;
Descriptor.Reg = {RegisterType::BReg, 0};
Descriptor.setDefaultFlags(llvm::dxbc::RootSignatureVersion::V1_1);
@@ -168,7 +169,7 @@ TEST(HLSLRootSignatureTest, RootCBVDump) {
TEST(HLSLRootSignatureTest, RootSRV10Dump) {
RootDescriptor Descriptor;
- Descriptor.Type = DescriptorType::SRV;
+ Descriptor.Type = ResourceClass::SRV;
Descriptor.Reg = {RegisterType::TReg, 0};
Descriptor.setDefaultFlags(llvm::dxbc::RootSignatureVersion::V1_0);
@@ -185,7 +186,7 @@ TEST(HLSLRootSignatureTest, RootSRV10Dump) {
TEST(HLSLRootSignatureTest, RootUAVV10Dump) {
RootDescriptor Descriptor;
- Descriptor.Type = DescriptorType::UAV;
+ Descriptor.Type = ResourceClass::UAV;
Descriptor.Reg = {RegisterType::UReg, 0};
Descriptor.setDefaultFlags(llvm::dxbc::RootSignatureVersion::V1_0);
@@ -202,7 +203,7 @@ TEST(HLSLRootSignatureTest, RootUAVV10Dump) {
TEST(HLSLRootSignatureTest, RootSRVDump) {
RootDescriptor Descriptor;
- Descriptor.Type = DescriptorType::SRV;
+ Descriptor.Type = ResourceClass::SRV;
Descriptor.Reg = {RegisterType::TReg, 0};
Descriptor.Space = 42;
Descriptor.Visibility = llvm::dxbc::ShaderVisibility::Geometry;
@@ -221,7 +222,7 @@ TEST(HLSLRootSignatureTest, RootSRVDump) {
TEST(HLSLRootSignatureTest, RootUAVDump) {
using llvm::dxbc::RootDescriptorFlags;
RootDescriptor Descriptor;
- Descriptor.Type = DescriptorType::UAV;
+ Descriptor.Type = ResourceClass::UAV;
Descriptor.Reg = {RegisterType::UReg, 92374};
Descriptor.Space = 932847;
Descriptor.Visibility = llvm::dxbc::ShaderVisibility::Hull;
>From d6e0922a5e2eb85fb44076b19791c0d39f189a97 Mon Sep 17 00:00:00 2001
From: LLVM GN Syncbot <llvmgnsyncbot at gmail.com>
Date: Mon, 18 Aug 2025 16:02:02 +0000
Subject: [PATCH 050/112] [gn build] Port 3ecfc0330d93
---
.../utils/gn/secondary/clang/lib/Analysis/FlowSensitive/BUILD.gn | 1 +
.../gn/secondary/clang/unittests/Analysis/FlowSensitive/BUILD.gn | 1 +
2 files changed, 2 insertions(+)
diff --git a/llvm/utils/gn/secondary/clang/lib/Analysis/FlowSensitive/BUILD.gn b/llvm/utils/gn/secondary/clang/lib/Analysis/FlowSensitive/BUILD.gn
index 0b6fa7cc5f5ce..74b2fe204537a 100644
--- a/llvm/utils/gn/secondary/clang/lib/Analysis/FlowSensitive/BUILD.gn
+++ b/llvm/utils/gn/secondary/clang/lib/Analysis/FlowSensitive/BUILD.gn
@@ -31,6 +31,7 @@ static_library("FlowSensitive") {
"DataflowEnvironment.cpp",
"DebugSupport.cpp",
"Formula.cpp",
+ "FormulaSerialization.cpp",
"HTMLLogger.cpp",
"Logger.cpp",
"RecordOps.cpp",
diff --git a/llvm/utils/gn/secondary/clang/unittests/Analysis/FlowSensitive/BUILD.gn b/llvm/utils/gn/secondary/clang/unittests/Analysis/FlowSensitive/BUILD.gn
index e4727d5a3298c..1afd342f67ce4 100644
--- a/llvm/utils/gn/secondary/clang/unittests/Analysis/FlowSensitive/BUILD.gn
+++ b/llvm/utils/gn/secondary/clang/unittests/Analysis/FlowSensitive/BUILD.gn
@@ -27,6 +27,7 @@ unittest("ClangAnalysisFlowSensitiveTests") {
"DataflowEnvironmentTest.cpp",
"DebugSupportTest.cpp",
"DeterminismTest.cpp",
+ "FormulaTest.cpp",
"LoggerTest.cpp",
"MapLatticeTest.cpp",
"MatchSwitchTest.cpp",
>From 4a9d038acd637c5742e6d1622d4ad803059825bd Mon Sep 17 00:00:00 2001
From: Nishant Patel <nishant.b.patel at intel.com>
Date: Mon, 18 Aug 2025 09:45:29 -0700
Subject: [PATCH 051/112] [MLIR][XeGPU] Distribute load_nd/store_nd/prefetch_nd
with offsets from Wg to Sg (#153432)
This PR adds patterns to distribute the load_nd/store_nd/prefetch_nd ops
with offsets from workgroup to subgroup IR. It is part of the transition
that moves offsets from create_nd to the load/store/prefetch nd ops.
Create_nd PR : #152351
---
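Reviewer note (not part of the commit): the offset arithmetic performed by
the computeOffsets helper in this patch can be sketched with plain integers.
This is a simplified model under stated assumptions: the real helper works
on OpFoldResult/Value and emits index::AddOp through the rewriter, whereas
here every offset is assumed to be a known int64_t. The name
computeFinalOffsets and the Offsets alias are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Offsets = std::vector<int64_t>;

// For each subgroup, add its relative offsets to the trailing `rank`
// entries of the original (workgroup-level) offsets. Leading entries of
// origOffsets beyond the subgroup rank are not part of the result.
std::vector<Offsets>
computeFinalOffsets(const std::vector<Offsets> &sgOffsetsList,
                    const Offsets &origOffsets) {
  std::vector<Offsets> finalOffsets;
  for (const Offsets &sgOffsets : sgOffsetsList) {
    size_t rank = sgOffsets.size();
    assert(rank <= origOffsets.size() && "subgroup rank exceeds offset count");
    Offsets newOffsets;
    for (size_t i = 0; i < rank; ++i) {
      // Align the subgroup dimensions with the innermost original offsets.
      size_t idx = origOffsets.size() - rank + i;
      newOffsets.push_back(sgOffsets[i] + origOffsets[idx]);
    }
    finalOffsets.push_back(std::move(newOffsets));
  }
  return finalOffsets;
}
```

For example, two subgroups at relative offsets {0, 0} and {0, 16} applied to
original offsets {32, 64} yield {32, 64} and {32, 80}, one offset list per
new subgroup-level op.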
.../include/mlir/Dialect/XeGPU/IR/XeGPUOps.td | 18 +-
mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp | 46 ++++
.../Transforms/XeGPUWgToSgDistribute.cpp | 218 +++++++++++++++-
.../XeGPU/xegpu-wg-to-sg-unify-ops-rr.mlir | 73 +++++-
.../XeGPU/xegpu-wg-to-sg-unify-ops.mlir | 242 ++++++++++++++++++
5 files changed, 586 insertions(+), 11 deletions(-)
diff --git a/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td b/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
index abc291c81a76c..eb54d6887681d 100644
--- a/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
+++ b/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
@@ -272,6 +272,11 @@ def XeGPU_PrefetchNdOp : XeGPU_Op<"prefetch_nd", []> {
let builders = [
OpBuilder<(ins "Value": $TensorDesc,
+ "xegpu::CachePolicyAttr": $l1_hint,
+ "xegpu::CachePolicyAttr": $l2_hint,
+ "xegpu::CachePolicyAttr": $l3_hint)>,
+ OpBuilder<(ins "Value": $TensorDesc,
+ "ArrayRef<OpFoldResult>": $offsets,
"xegpu::CachePolicyAttr": $l1_hint,
"xegpu::CachePolicyAttr": $l2_hint,
"xegpu::CachePolicyAttr": $l3_hint)>
@@ -348,6 +353,12 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
let builders = [
OpBuilder<(ins "Type": $value, "Value": $TensorDesc,
+ "UnitAttr": $packed, "DenseI64ArrayAttr": $transpose,
+ "xegpu::CachePolicyAttr": $l1_hint,
+ "xegpu::CachePolicyAttr": $l2_hint,
+ "xegpu::CachePolicyAttr": $l3_hint)>,
+ OpBuilder<(ins "Type": $value, "Value": $TensorDesc,
+ "ArrayRef<OpFoldResult>": $offsets,
"UnitAttr": $packed, "DenseI64ArrayAttr": $transpose,
"xegpu::CachePolicyAttr": $l1_hint,
"xegpu::CachePolicyAttr": $l2_hint,
@@ -419,7 +430,12 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
OpBuilder<(ins "Value": $value, "Value": $TensorDesc,
"xegpu::CachePolicyAttr": $l1_hint,
"xegpu::CachePolicyAttr": $l2_hint,
- "xegpu::CachePolicyAttr": $l3_hint)>
+ "xegpu::CachePolicyAttr": $l3_hint)>,
+ OpBuilder<(ins "Value": $value, "Value": $TensorDesc,
+ "ArrayRef<OpFoldResult>": $offsets,
+ "xegpu::CachePolicyAttr": $l1_hint,
+ "xegpu::CachePolicyAttr": $l2_hint,
+ "xegpu::CachePolicyAttr": $l3_hint)>
];
diff --git a/mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp b/mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp
index eee0fdc7160de..906c71d8b8dad 100644
--- a/mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp
+++ b/mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp
@@ -385,6 +385,21 @@ void PrefetchNdOp::build(OpBuilder &builder, OperationState &state,
l1_hint, l2_hint, l3_hint);
}
+void PrefetchNdOp::build(OpBuilder &builder, OperationState &state,
+ Value tensorDesc, ArrayRef<OpFoldResult> offsets,
+ xegpu::CachePolicyAttr l1_hint,
+ xegpu::CachePolicyAttr l2_hint,
+ xegpu::CachePolicyAttr l3_hint) {
+ SmallVector<Value> dynamicOffsets;
+ SmallVector<int64_t> staticOffsets;
+ dispatchIndexOpFoldResults(offsets, dynamicOffsets, staticOffsets);
+
+ auto staticOffsetsAttr = builder.getDenseI64ArrayAttr(staticOffsets);
+
+ build(builder, state, tensorDesc, dynamicOffsets, staticOffsetsAttr, l1_hint,
+ l2_hint, l3_hint);
+}
+
LogicalResult PrefetchNdOp::verify() {
auto tdescTy = getTensorDescType();
if (tdescTy.isScattered())
@@ -427,6 +442,22 @@ void LoadNdOp::build(OpBuilder &builder, OperationState &state, Type retType,
l3_hint);
}
+void LoadNdOp::build(OpBuilder &builder, OperationState &state, Type retType,
+ Value tensorDesc, ArrayRef<OpFoldResult> offsets,
+ UnitAttr packed, DenseI64ArrayAttr transpose,
+ xegpu::CachePolicyAttr l1_hint,
+ xegpu::CachePolicyAttr l2_hint,
+ xegpu::CachePolicyAttr l3_hint) {
+ SmallVector<Value> dynamicOffsets;
+ SmallVector<int64_t> staticOffsets;
+ dispatchIndexOpFoldResults(offsets, dynamicOffsets, staticOffsets);
+
+ auto staticOffsetsAttr = builder.getDenseI64ArrayAttr(staticOffsets);
+
+ build(builder, state, retType, tensorDesc, dynamicOffsets, staticOffsetsAttr,
+ packed, transpose, l1_hint, l2_hint, l3_hint);
+}
+
LogicalResult LoadNdOp::verify() {
auto tdescTy = getTensorDescType();
auto valueTy = getType();
@@ -533,6 +564,21 @@ void StoreNdOp::build(OpBuilder &builder, OperationState &state, Value value,
DenseI64ArrayAttr(), l1_hint, l2_hint, l3_hint);
}
+void StoreNdOp::build(OpBuilder &builder, OperationState &state, Value value,
+ Value tensorDesc, ArrayRef<OpFoldResult> offsets,
+ xegpu::CachePolicyAttr l1_hint,
+ xegpu::CachePolicyAttr l2_hint,
+ xegpu::CachePolicyAttr l3_hint) {
+ SmallVector<Value> dynamicOffsets;
+ SmallVector<int64_t> staticOffsets;
+ dispatchIndexOpFoldResults(offsets, dynamicOffsets, staticOffsets);
+
+ auto staticOffsetsAttr = builder.getDenseI64ArrayAttr(staticOffsets);
+
+ build(builder, state, value, tensorDesc, dynamicOffsets, staticOffsetsAttr,
+ l1_hint, l2_hint, l3_hint);
+}
+
LogicalResult StoreNdOp::verify() {
auto dstTy = getTensorDescType(); // Tile
auto valTy = getValueType(); // Vector
diff --git a/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp b/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp
index ecec186fe3fc9..8f1208e77ca5d 100644
--- a/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp
+++ b/mlir/lib/Dialect/XeGPU/Transforms/XeGPUWgToSgDistribute.cpp
@@ -182,16 +182,16 @@ struct WgToSgCreateNdOp : public OpConversionPattern<xegpu::CreateNdDescOp> {
layout.dropSgLayoutAndData());
SmallVector<Value> newCreateNdOps;
- SmallVector<OpFoldResult> wgOffsets = op.getMixedOffsets();
+ SmallVector<OpFoldResult> origOffsets = op.getMixedOffsets();
for (auto tdescOffsets : *maybeTdescOffsets) {
SmallVector<OpFoldResult> sgOffsets;
size_t rank = tdescOffsets.size();
for (size_t i = 0; i < rank; i++) {
- size_t idx = wgOffsets.size() - rank + i;
+ size_t idx = origOffsets.size() - rank + i;
Value add = rewriter.createOrFold<index::AddOp>(
loc, tdescOffsets[i],
- getValueOrCreateConstantIndexOp(rewriter, loc, wgOffsets[idx]));
+ getValueOrCreateConstantIndexOp(rewriter, loc, origOffsets[idx]));
sgOffsets.push_back(add);
}
@@ -296,6 +296,205 @@ struct WgToSgStoreNdOp : public OpConversionPattern<xegpu::StoreNdOp> {
}
};
+// Utility function to compute global offsets for subgroup operations.
+// Returns a vector of new offsets for each subgroup, given the original op's
+// offsets and subgroup relative offsets.
+static SmallVector<SmallVector<OpFoldResult>>
+computeOffsets(Operation *op, ArrayRef<SmallVector<Value>> sgOffsetsList,
+ ArrayRef<OpFoldResult> origOffsets,
+ ConversionPatternRewriter &rewriter) {
+ SmallVector<SmallVector<OpFoldResult>> finalOffsets;
+ Location loc = op->getLoc();
+ for (const auto &sgOffsets : sgOffsetsList) {
+ SmallVector<OpFoldResult> newOffsets;
+ size_t rank = sgOffsets.size();
+ for (size_t i = 0; i < rank; i++) {
+ size_t idx = origOffsets.size() - rank + i;
+ Value add = rewriter.createOrFold<index::AddOp>(
+ loc, sgOffsets[i],
+ getValueOrCreateConstantIndexOp(rewriter, loc, origOffsets[idx]));
+ newOffsets.push_back(add);
+ }
+ finalOffsets.push_back(std::move(newOffsets));
+ }
+ return finalOffsets;
+}
+
+// Utility function to get sgShape, sgOffsetList for a given
+// op.
+template <typename OpTy, typename AdaptorTy>
+LogicalResult getSgOffsets(OpTy op, AdaptorTy adaptor,
+ ConversionPatternRewriter &rewriter,
+ SmallVector<int64_t> &sgShape,
+ SmallVector<SmallVector<Value>> &sgOffsetList) {
+ int64_t offsetSize = static_cast<int64_t>(op.getOffsets().size());
+ if (offsetSize == 0 && (!op.getConstOffsetsAttr()))
+ return failure();
+
+ Location loc = op.getLoc();
+ Value tdesc = op.getTensorDesc();
+ auto tdescTy = dyn_cast<xegpu::TensorDescType>(tdesc.getType());
+ if (!tdescTy)
+ return failure();
+ auto layout = dyn_cast<xegpu::LayoutAttr>(tdescTy.getLayout());
+ if (!layout)
+ return failure();
+
+ SmallVector<int64_t> sgLayout;
+ auto sgLayoutAttr = layout.getSgLayout();
+ if (!sgLayoutAttr)
+ return rewriter.notifyMatchFailure(
+ op, "sgLayout attribute is required in layout");
+ sgLayout = llvm::to_vector_of<int64_t>(sgLayoutAttr.asArrayRef());
+
+ ArrayRef<int64_t> wgShape = tdescTy.getShape();
+ int count;
+ std::tie(sgShape, count) = getSgShapeAndCount(wgShape, layout);
+
+ // Get the subgroup ID
+ Value linearSgId =
+ gpu::SubgroupIdOp::create(rewriter, loc, /*upper_bound=*/nullptr);
+
+ int64_t startOfRange = -1, endOfRange = -1;
+ bool sgIdRangeSpecified = isSgIdRangeSpecified(op, startOfRange, endOfRange);
+
+ if (sgIdRangeSpecified) {
+ int64_t sgCount = endOfRange - startOfRange;
+ if (computeProduct(sgLayout) != sgCount)
+ return rewriter.notifyMatchFailure(
+ op, "sg_layout size must match the sg_id_range");
+ Value startOfRangeVal =
+ rewriter.create<arith::ConstantIndexOp>(loc, startOfRange);
+ linearSgId =
+ rewriter.createOrFold<index::SubOp>(loc, linearSgId, startOfRangeVal);
+ }
+
+ auto sgOffsets = layout.getOffsets(rewriter, loc, linearSgId, wgShape);
+ if (failed(sgOffsets))
+ return failure();
+
+ sgOffsetList = *sgOffsets;
+ return success();
+}
+
+template <typename OpTy>
+SmallVector<OpFoldResult> getOffsets(OpTy op,
+ ConversionPatternRewriter &rewriter) {
+ SmallVector<OpFoldResult> origOffsets;
+ if (auto constOffsets = op.getConstOffsetsAttr()) {
+ for (auto attr : constOffsets.asArrayRef())
+ origOffsets.push_back(rewriter.getIndexAttr(attr));
+ }
+ for (auto v : op.getOffsets())
+ origOffsets.push_back(v);
+ return origOffsets;
+}
+
+// This pattern transforms the LoadNdOp with explicit offsets to load
+// subgroup data.
+struct WgToSgLoadNdOpWithOffset : public OpConversionPattern<xegpu::LoadNdOp> {
+ using OpConversionPattern<xegpu::LoadNdOp>::OpConversionPattern;
+ LogicalResult
+ matchAndRewrite(xegpu::LoadNdOp op, OneToNOpAdaptor adaptor,
+ ConversionPatternRewriter &rewriter) const override {
+
+ SmallVector<int64_t> sgShape;
+ SmallVector<SmallVector<Value>> sgOffsetList;
+
+ // Do the distribution from workgroup to subgroup and get subgroup offsets
+ if (failed(getSgOffsets(op, adaptor, rewriter, sgShape, sgOffsetList)))
+ return failure();
+
+ // Get the original workgroup offsets
+ SmallVector<OpFoldResult> origOffsets = getOffsets(op, rewriter);
+
+ // Calculate the final offsets for each subgroup
+ auto finalOffsets = computeOffsets(op, sgOffsetList, origOffsets, rewriter);
+
+ SmallVector<Value> newLoadOps;
+ for (auto [offsets, tdesc] :
+ llvm::zip(finalOffsets, adaptor.getTensorDesc())) {
+ VectorType newResTy = VectorType::get(
+ sgShape,
+ dyn_cast<xegpu::TensorDescType>(tdesc.getType()).getElementType());
+ auto newLoadOp = rewriter.create<xegpu::LoadNdOp>(
+ op.getLoc(), newResTy, tdesc, offsets,
+ /*packed=*/nullptr,
+ /*transpose=*/nullptr, op.getL1HintAttr(), op.getL2HintAttr(),
+ op.getL3HintAttr());
+ newLoadOps.push_back(newLoadOp);
+ }
+ rewriter.replaceOpWithMultiple(op, {newLoadOps});
+ return success();
+ }
+};
+
+// This pattern transforms the StoreNdOp with explicit offsets to store
+// subgroup data.
+struct WgToSgStoreNdOpWithOffset
+ : public OpConversionPattern<xegpu::StoreNdOp> {
+ using OpConversionPattern<xegpu::StoreNdOp>::OpConversionPattern;
+ LogicalResult
+ matchAndRewrite(xegpu::StoreNdOp op, OneToNOpAdaptor adaptor,
+ ConversionPatternRewriter &rewriter) const override {
+
+ SmallVector<int64_t> sgShape;
+ SmallVector<SmallVector<Value>> sgOffsetList;
+
+ // Do the distribution from workgroup to subgroup and get subgroup offsets
+ if (failed(getSgOffsets(op, adaptor, rewriter, sgShape, sgOffsetList)))
+ return failure();
+
+ // Get the original workgroup offsets
+ SmallVector<OpFoldResult> origOffsets = getOffsets(op, rewriter);
+
+ // Calculate the final offsets for each subgroup
+ auto finalOffsets = computeOffsets(op, sgOffsetList, origOffsets, rewriter);
+
+ for (auto [offsets, tdesc, value] :
+ llvm::zip(finalOffsets, adaptor.getTensorDesc(), adaptor.getValue())) {
+ rewriter.create<xegpu::StoreNdOp>(op.getLoc(), value, tdesc, offsets,
+ op.getL1HintAttr(), op.getL2HintAttr(),
+ op.getL3HintAttr());
+ }
+ rewriter.eraseOp(op);
+ return success();
+ }
+};
+
+// This pattern transforms the PrefetchNdOp with explicit offsets to prefetch
+// subgroup data.
+struct WgToSgPrefetchNdOpWithOffset
+ : public OpConversionPattern<xegpu::PrefetchNdOp> {
+ using OpConversionPattern<xegpu::PrefetchNdOp>::OpConversionPattern;
+ LogicalResult
+ matchAndRewrite(xegpu::PrefetchNdOp op, OneToNOpAdaptor adaptor,
+ ConversionPatternRewriter &rewriter) const override {
+
+ SmallVector<int64_t> sgShape;
+ SmallVector<SmallVector<Value>> sgOffsetList;
+
+ // Do the distribution from workgroup to subgroup and get subgroup offsets
+ if (failed(getSgOffsets(op, adaptor, rewriter, sgShape, sgOffsetList)))
+ return failure();
+
+ // Get the original workgroup offsets
+ SmallVector<OpFoldResult> origOffsets = getOffsets(op, rewriter);
+
+ // Calculate the final offsets for each subgroup
+ auto finalOffsets = computeOffsets(op, sgOffsetList, origOffsets, rewriter);
+
+ for (auto [offsets, tdesc] :
+ llvm::zip(finalOffsets, adaptor.getTensorDesc())) {
+ rewriter.create<xegpu::PrefetchNdOp>(
+ op.getLoc(), tdesc, offsets, op.getL1HintAttr(), op.getL2HintAttr(),
+ op.getL3HintAttr());
+ }
+ rewriter.eraseOp(op);
+ return success();
+ }
+};
+
/// This pattern transforms the UpdateNdOffsetOp to update the offsets of a
/// subgroup descriptor. It creates an UpdateNdOffsetOp op to update the
/// offsets of the new subgroup src tensor descriptors.
@@ -690,12 +889,13 @@ struct WgToSgArithConstantOp : public OpConversionPattern<arith::ConstantOp> {
namespace mlir {
namespace xegpu {
void populateXeGPUWgToSgDistributePatterns(RewritePatternSet &patterns) {
- patterns.add<WgToSgCreateNdOp, WgToSgCreateNdOpNoOffset, WgToSgLoadNdOp,
- WgToSgStoreNdOp, WgToSgUpdateNdOffsetOp, WgToSgDpasOp,
- WgToSgPrefetchNdOp, UnrealizedConversionCastOpPattern,
- WgToSgElementwiseOp, WgToSgVectorBroadcastOp,
- WgToSgConvertLayoutOp, WgToSgArithConstantOp>(
- patterns.getContext());
+ patterns
+ .add<WgToSgCreateNdOp, WgToSgCreateNdOpNoOffset, WgToSgLoadNdOp,
+ WgToSgLoadNdOpWithOffset, WgToSgStoreNdOp, WgToSgStoreNdOpWithOffset,
+ WgToSgUpdateNdOffsetOp, WgToSgDpasOp, WgToSgPrefetchNdOp,
+ WgToSgPrefetchNdOpWithOffset, UnrealizedConversionCastOpPattern,
+ WgToSgElementwiseOp, WgToSgVectorBroadcastOp, WgToSgConvertLayoutOp,
+ WgToSgArithConstantOp>(patterns.getContext());
}
} // namespace xegpu
} // namespace mlir
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops-rr.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops-rr.mlir
index b6f44b5bc0b68..6ff7a94d678a3 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops-rr.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops-rr.mlir
@@ -10,5 +10,76 @@ gpu.module @test_distribution {
%tdesc = xegpu.create_nd_tdesc %src: memref<256x128xf32>
-> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>>
gpu.return
- }
+ }
+
+ // CHECK-LABEL: load_nd_tdesc_with_offset
+ gpu.func @load_nd_tdesc_with_offset(%src: memref<256x128xf32>) {
+ // CHECK-COUNT-4: xegpu.load_nd {{%.*}}[{{%.*}}, {{%.*}}]
+ // CHECK-SAME-COUNT-4: : !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+ // CHECK-SAME-COUNT-4: -> vector<16x16xf32>
+ // CHECK-NOT: xegpu.load_nd
+ %tdesc = xegpu.create_nd_tdesc %src: memref<256x128xf32>
+ -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>>
+ %load = xegpu.load_nd %tdesc[0, 0]
+ : !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>>
+ -> vector<256x128xf32>
+ gpu.return
+ }
+
+ // CHECK-LABEL: store_nd_with_offset
+ gpu.func @store_nd_with_offset(%src: memref<256x128xf32>) {
+ // CHECK-COUNT-4: xegpu.store_nd %{{.*}}, {{%.*}}[{{%.*}}, {{%.*}}]
+ // CHECK-SAME-COUNT-4: : !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+ // CHECK-NOT: xegpu.store_nd
+ %tdesc = xegpu.create_nd_tdesc %src: memref<256x128xf32>
+ -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>>
+ %load = xegpu.load_nd %tdesc[0, 0]
+ : !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>>
+ -> vector<256x128xf32>
+ xegpu.store_nd %load, %tdesc[0, 0]
+ : vector<256x128xf32>, !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>>
+ gpu.return
+ }
+
+ // CHECK-LABEL: prefetch_nd_tdesc_with_offset
+ // CHECK-SAME: %[[ARG_0:.*]]: memref<256x128xf32>
+ gpu.func @prefetch_nd_tdesc_with_offset(%src: memref<256x128xf32>) {
+ // CHECK-COUNT-4: xegpu.prefetch_nd {{%.*}}[{{%.*}}, {{%.*}}]
+ // CHECK-SAME-COUNT-4: !xegpu.tensor_desc<256x128xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+ // CHECK-NOT: xegpu.prefetch_nd
+ %tdesc = xegpu.create_nd_tdesc %src : memref<256x128xf32>
+ -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>>
+ xegpu.prefetch_nd %tdesc[0, 0]
+ : !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>>
+ gpu.return
+ }
+
+ // CHECK-LABEL: dpas
+ // CHECK-SAME: (%[[ARG_0:.*]]: memref<256x128xf16>, %[[ARG_1:.*]]: memref<128x256xf16>)
+ gpu.func @dpas(%a: memref<256x128xf16>, %b: memref<128x256xf16>) {
+ // CHECK-COUNT-4: xegpu.create_nd_tdesc %[[ARG_0]] : memref<256x128xf16>
+ // CHECK-SAME-COUNT-4: -> !xegpu.tensor_desc<16x16xf16, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+ // CHECK-NOT: xegpu.create_nd_tdesc
+ // CHECK-COUNT-4: xegpu.create_nd_tdesc %[[ARG_1]] : memref<128x256xf16>
+ // CHECK-SAME-COUNT-4: -> !xegpu.tensor_desc<16x16xf16, #xegpu.layout<lane_layout = [4, 8], lane_data = [1, 1]>>
+ // CHECK-NOT: xegpu.create_nd_tdesc
+ // CHECK-COUNT-16: xegpu.dpas %{{.*}}, %{{.*}}
+ // CHECK-SAME-COUNT-16: {layout = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>}
+ // CHECK-SAME-COUNT-16: : vector<16x16xf16>, vector<16x16xf16> -> vector<16x16xf32>
+ // CHECK-NOT: xegpu.dpas
+ %tdesc_a = xegpu.create_nd_tdesc %a : memref<256x128xf16>
+ -> !xegpu.tensor_desc<256x128xf16, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>>
+ %load_a = xegpu.load_nd %tdesc_a[0, 0]
+ : !xegpu.tensor_desc<256x128xf16, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>>
+ -> vector<256x128xf16>
+ %tdesc_b = xegpu.create_nd_tdesc %b : memref<128x256xf16>
+ -> !xegpu.tensor_desc<128x256xf16, #xegpu.layout<sg_layout = [4, 8], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [2, 1]>>
+ %load_b = xegpu.load_nd %tdesc_b[0, 0]
+ : !xegpu.tensor_desc<128x256xf16, #xegpu.layout<sg_layout = [4, 8], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [2, 1]>>
+ -> vector<128x256xf16>
+ %dpas = xegpu.dpas %load_a, %load_b
+ {layout_result_0 = #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>}
+ : vector<256x128xf16>, vector<128x256xf16> -> vector<256x256xf32>
+ gpu.return
+ }
}
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir
index 025d48e22307e..07a0b86223c33 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir
@@ -1,5 +1,7 @@
// RUN: mlir-opt --xegpu-wg-to-sg-distribute -split-input-file %s | FileCheck %s
+//CHECK: #map = affine_map<()[s0] -> (s0 floordiv 4)>
+//CHECK: #map1 = affine_map<()[s0] -> (s0 mod 4)>
gpu.module @test_distribution {
// CHECK-LABEL: create_nd_tdesc_no_offset
// CHECK-SAME: %[[ARG_0:.*]]: memref<256x128xf32>
@@ -21,4 +23,244 @@ gpu.module @test_distribution {
-> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
gpu.return
}
+
+ // CHECK-LABEL: load_nd_tdesc_with_offset
+ // CHECK-SAME: %[[ARG_0:.*]]: memref<256x128xf32>
+ gpu.func @load_nd_tdesc_with_offset(%src: memref<256x128xf32>) {
+ //CHECK: [[SGID:%.+]] = gpu.subgroup_id : index
+ //CHECK: [[SGIDY:%.+]] = affine.apply #map()[[[SGID]]]
+ //CHECK: [[SGIDX:%.+]] = affine.apply #map1()[[[SGID]]]
+ //CHECK: %[[LOAD:.*]] = xegpu.load_nd {{%.*}}[{{%.*}}, {{%.*}}] : !xegpu.tensor_desc<32x32xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>> -> vector<32x32xf32>
+ %tdesc = xegpu.create_nd_tdesc %src: memref<256x128xf32>
+ -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
+ %load = xegpu.load_nd %tdesc[0, 0]
+ : !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
+ -> vector<256x128xf32>
+ gpu.return
+ }
+
+ // CHECK-LABEL: store_nd_with_offsets
+ // CHECK-SAME: %[[ARG_0:.*]]: memref<256x128xf32>
+ gpu.func @store_nd_with_offsets(%src: memref<256x128xf32>) {
+ //CHECK: [[SGID:%.+]] = gpu.subgroup_id : index
+ //CHECK: [[SGIDY:%.+]] = affine.apply #map()[[[SGID]]]
+ //CHECK: [[SGIDX:%.+]] = affine.apply #map1()[[[SGID]]]
+ //CHECK: xegpu.store_nd %{{.*}}, {{%.*}}[{{%.*}}, {{%.*}}] : vector<32x32xf32>, !xegpu.tensor_desc<32x32xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+ %tdesc = xegpu.create_nd_tdesc %src: memref<256x128xf32>
+ -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
+ %load = xegpu.load_nd %tdesc[0, 0]
+ : !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
+ -> vector<256x128xf32>
+ xegpu.store_nd %load, %tdesc[0, 0]
+ : vector<256x128xf32>, !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
+ gpu.return
+}
+
+ // CHECK-LABEL: prefetch_nd_tdesc_with_offset
+ // CHECK-SAME: %[[ARG_0:.*]]: memref<256x128xf32>
+ gpu.func @prefetch_nd_tdesc_with_offset(%src: memref<256x128xf32>) {
+ //CHECK: [[SGID:%.+]] = gpu.subgroup_id : index
+ //CHECK: [[SGIDY:%.+]] = affine.apply #map()[[[SGID]]]
+ //CHECK: [[SGIDX:%.+]] = affine.apply #map1()[[[SGID]]]
+ //CHECK: xegpu.prefetch_nd %{{.*}}[{{%.*}}, {{%.*}}] : !xegpu.tensor_desc<32x32xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+ %cst0 = arith.constant 0 : index
+ %tdesc = xegpu.create_nd_tdesc %src : memref<256x128xf32>
+ -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
+ xegpu.prefetch_nd %tdesc[%cst0, %cst0]
+ : !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
+ gpu.return
+ }
+
+ // CHECK-LABEL: dpas
+ gpu.func @dpas(%a: memref<128x128xf16>, %b: memref<128x128xf16>) {
+ // CHECK: %[[DPAS:.*]] = xegpu.dpas %{{.*}}, %{{.*}} {layout_result_0 = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>} : vector<16x128xf16>, vector<128x16xf16> -> vector<16x16xf32>
+ %tdesc_a = xegpu.create_nd_tdesc %a : memref<128x128xf16>
+ -> !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 128], lane_layout = [1, 16], lane_data = [1, 1]>>
+ %load_a = xegpu.load_nd %tdesc_a[0, 0]
+ : !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 128], lane_layout = [1, 16], lane_data = [1, 1]>>
+ -> vector<128x128xf16>
+ %tdesc_b = xegpu.create_nd_tdesc %b : memref<128x128xf16>
+ -> !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], sg_data = [128, 16], lane_layout = [1, 16], lane_data = [2, 1]>>
+ %load_b = xegpu.load_nd %tdesc_b[0, 0]
+ : !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], sg_data = [128, 16], lane_layout = [1, 16], lane_data = [2, 1]>>
+ -> vector<128x128xf16>
+ %dpas = xegpu.dpas %load_a, %load_b
+ {layout_result_0 = #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>}
+ : vector<128x128xf16>, vector<128x128xf16> -> vector<128x128xf32>
+ gpu.return
+ }
+
+ // CHECK-LABEL: dpas_no_sg_data
+ gpu.func @dpas_no_sg_data(%a: memref<128x128xf16>, %b: memref<128x128xf16>) {
+ // CHECK: %[[DPAS:.*]] = xegpu.dpas %{{.*}}, %{{.*}} {layout_result_0 = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1], order = [1, 0]>} : vector<16x16xf16>, vector<16x16xf16> -> vector<16x16xf32>
+ %tdesc_a = xegpu.create_nd_tdesc %a : memref<128x128xf16>
+ -> !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], lane_layout = [1, 16], lane_data = [1, 1],
+ order = [1, 0]>>
+ %load_a = xegpu.load_nd %tdesc_a[0, 0]
+ : !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], lane_layout = [1, 16], lane_data = [1, 1],
+ order = [1, 0]>>
+ -> vector<128x128xf16>
+ %tdesc_b = xegpu.create_nd_tdesc %b : memref<128x128xf16>
+ -> !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], lane_layout = [1, 16], lane_data = [2, 1],
+ order = [1, 0]>>
+ %load_b = xegpu.load_nd %tdesc_b[0, 0]
+ : !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], lane_layout = [1, 16], lane_data = [2, 1],
+ order = [1, 0]>>
+ -> vector<128x128xf16>
+ %dpas = xegpu.dpas %load_a, %load_b
+ {layout_result_0 = #xegpu.layout<sg_layout = [8, 8], lane_layout = [1, 16], lane_data = [1, 1], order = [1, 0]>}
+ : vector<128x128xf16>, vector<128x128xf16> -> vector<128x128xf32>
+ gpu.return
+ }
+
+ // CHECK-LABEL: dpas_with_no_create_nd_desc
+ gpu.func @dpas_with_no_create_nd_desc(%a: vector<256x128xf32>, %b: vector<128x256xf32>) {
+ // CHECK-NOT: vector<32x32xf32>
+ %dpas = xegpu.dpas %a, %b
+ {layout = #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 16], lane_layout = [1, 16], lane_data = [1, 1]>}
+ : vector<256x128xf32>, vector<128x256xf32> -> vector<256x256xf32>
+ gpu.return
+ }
+
+ // CHECK-LABEL: broadcast_dim1
+ // CHECK-SAME: %[[ARG_0:.*]]: memref<256x1xf32>
+ gpu.func @broadcast_dim1(%src: memref<256x1xf32>) {
+ %tdesc = xegpu.create_nd_tdesc %src : memref<256x1xf32>
+ -> !xegpu.tensor_desc<256x1xf32, #xegpu.layout<sg_layout = [8, 1], sg_data = [32, 1], lane_layout = [8, 1], lane_data = [1, 1]>>
+ %load = xegpu.load_nd %tdesc[0, 0]
+ : !xegpu.tensor_desc<256x1xf32, #xegpu.layout<sg_layout = [8, 1], sg_data = [32, 1], lane_layout = [8, 1], lane_data = [1, 1]>>
+ -> vector<256x1xf32>
+ // CHECK: vector.broadcast {{.*}} {layout_result_0 = #xegpu.layout<lane_layout = [8, 1], lane_data = [1, 1]>}
+ // CHECK-SAME: : vector<32x1xf32> to vector<32x32xf32>
+ %broadcast = vector.broadcast %load
+ {layout_result_0 = #xegpu.layout<sg_layout = [8, 1], sg_data = [32, 32], lane_layout = [8, 1], lane_data = [1, 1]>}
+ : vector<256x1xf32> to vector<256x32xf32>
+ gpu.return
+ }
+
+ // CHECK-LABEL: broadcast_dim0
+ // CHECK-SAME: %[[ARG_0:.*]]: memref<1x128xf32>
+ gpu.func @broadcast_dim0(%src: memref<1x128xf32>) {
+ %tdesc = xegpu.create_nd_tdesc %src : memref<1x128xf32>
+ -> !xegpu.tensor_desc<1x128xf32, #xegpu.layout<sg_layout = [1, 4], sg_data = [1, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
+ %load = xegpu.load_nd %tdesc[0, 0]
+ : !xegpu.tensor_desc<1x128xf32, #xegpu.layout<sg_layout = [1, 4], sg_data = [1, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
+ -> vector<1x128xf32>
+ // CHECK: vector.broadcast {{.*}} {layout_result_0 = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>}
+ // CHECK-SAME: : vector<1x32xf32> to vector<32x32xf32>
+ %broadcast = vector.broadcast %load
+ {layout_result_0 = #xegpu.layout<sg_layout = [1, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>}
+ : vector<1x128xf32> to vector<32x128xf32>
+ gpu.return
+ }
+
+ // CHECK-LABEL: gemm_with_load_store_offset
+ // CHECK-SAME: %[[ARG_0:.*]]: memref<1024x1024xf16>, %[[ARG_1:.*]]: memref<1024x1024xf16>, %[[ARG_2:.*]]: memref<1024x1024xf32>
+ gpu.func @gemm_with_load_store_offset(%arg0: memref<1024x1024xf16>, %arg1: memref<1024x1024xf16>, %arg2: memref<1024x1024xf32>) {
+ //CHECK: [[c0:%.+]] = arith.constant 0 : index
+ //CHECK: [[c128:%.+]] = arith.constant 128 : index
+ //CHECK: [[c1024:%.+]] = arith.constant 1024 : index
+ %c0 = arith.constant 0 : index
+ %c128 = arith.constant 128 : index
+ %c1024 = arith.constant 1024 : index
+ %block_id_x = gpu.block_id x
+ %block_id_y = gpu.block_id y
+ %0 = arith.muli %block_id_x, %c128 : index
+ %1 = arith.muli %block_id_y, %c128 : index
+ %2 = xegpu.create_nd_tdesc %arg2 : memref<1024x1024xf32> -> !xegpu.tensor_desc<128x128xf32, #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 16]>>
+ // CHECK: [[DESC_A:%.+]] = xegpu.create_nd_tdesc %[[ARG_0]] : memref<1024x1024xf16> -> !xegpu.tensor_desc<16x128xf16>
+ // CHECK: [[DESC_B:%.+]] = xegpu.create_nd_tdesc %[[ARG_1]] : memref<1024x1024xf16> -> !xegpu.tensor_desc<128x16xf16>
+ %3 = xegpu.create_nd_tdesc %arg0 : memref<1024x1024xf16> -> !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 128]>>
+ %4 = xegpu.create_nd_tdesc %arg1 : memref<1024x1024xf16> -> !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], sg_data = [128, 16]>>
+ // load_nd with offset
+ %5 = xegpu.load_nd %2[%0, %1] : !xegpu.tensor_desc<128x128xf32, #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 16]>> -> vector<128x128xf32>
+ %6 = xegpu.load_nd %3[%0, %c0] : !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 128]>> -> vector<128x128xf16>
+ %7 = xegpu.load_nd %4[%c0, %1] : !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], sg_data = [128, 16]>> -> vector<128x128xf16>
+ // scf.for loop
+ // CHECK: [[scf:%.+]]:3 = scf.for [[arg3:%.+]] = [[c0]] to [[c1024]] step [[c128]]
+ // CHECK-SAME: iter_args([[arg4:%.+]] = {{.*}}, [[arg5:%.+]] = {{.*}}, [[arg6:%.+]] = {{.*}}) ->
+ // CHECK-SAME: (vector<16x128xf16>, vector<128x16xf16>, vector<16x16xf32>)
+ // CHECK: [[c:%.+]] = xegpu.dpas [[arg4]], [[arg5]], [[arg6]] : vector<16x128xf16>, vector<128x16xf16>, vector<16x16xf32> -> vector<16x16xf32>
+ // CHECK: [[a:%.+]] = xegpu.load_nd [[DESC_A]][{{%.*}}, {{%.*}}] : !xegpu.tensor_desc<16x128xf16> -> vector<16x128xf16>
+ // CHECK: [[b:%.+]] = xegpu.load_nd [[DESC_B]][{{%.*}}, {{%.*}}] : !xegpu.tensor_desc<128x16xf16> -> vector<128x16xf16>
+ // CHECK: scf.yield [[a]], [[b]], [[c]] : vector<16x128xf16>, vector<128x16xf16>, vector<16x16xf32>
+ %8:3 = scf.for %arg3 = %c0 to %c1024 step %c128 iter_args(%arg4 = %6, %arg5 = %7, %arg6 = %5)
+ -> (vector<128x128xf16>, vector<128x128xf16>, vector<128x128xf32>) {
+ // load_nd with offset inside loop
+ %9 = xegpu.dpas %arg4, %arg5, %arg6 {layout_result_0 = #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 16]>}
+ : vector<128x128xf16>, vector<128x128xf16>, vector<128x128xf32> -> vector<128x128xf32>
+ %10 = xegpu.load_nd %3[%arg3, %c0] : !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 128]>> -> vector<128x128xf16>
+ %11 = xegpu.load_nd %4[%c0, %arg3] : !xegpu.tensor_desc<128x128xf16, #xegpu.layout<sg_layout = [8, 8], sg_data = [128, 16]>> -> vector<128x128xf16>
+ scf.yield %10, %11, %9 : vector<128x128xf16>, vector<128x128xf16>, vector<128x128xf32>
+ }
+ // store_nd with offset
+ xegpu.store_nd %8#2, %2[%0, %1] : vector<128x128xf32>, !xegpu.tensor_desc<128x128xf32, #xegpu.layout<sg_layout = [8, 8], sg_data = [16, 16]>>
+ gpu.return
+ }
+
+ // CHECK-LABEL: @subgroup_id_range
+ gpu.func @subgroup_id_range(%src: memref<256x128xf32>, %src1: memref<128x256xf32>, %src2: memref<128x64xf32>) {
+ %sg_id = gpu.subgroup_id : index
+ %c0 = arith.constant 0 : index
+ %c1 = arith.constant 1 : index
+ %c2 = arith.constant 2 : index
+ %c31 = arith.constant 31 : index
+ %c3 = arith.constant 3 : index
+ %cond1 = arith.cmpi sge, %sg_id, %c0 : index
+ %cond2 = arith.cmpi slt, %sg_id, %c1 : index
+ %cond = arith.andi %cond1, %cond2 : i1
+ scf.if %cond {
+ // CHECK-NOT: index.sub
+ %tdesc = xegpu.create_nd_tdesc %src : memref<256x128xf32>
+ -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [8, 4], lane_data = [1, 1]>>
+ %load = xegpu.load_nd %tdesc[0, 0]
+ : !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [8, 4], lane_data = [1, 1]>>
+ -> vector<256x128xf32>
+ } {sg_id_range = #xegpu.range<[0, 32]>}
+ %cond3 = arith.cmpi sge, %sg_id, %c2 : index
+ %cond4 = arith.cmpi slt, %sg_id, %c31 : index
+ %cond5 = arith.andi %cond3, %cond4 : i1
+ scf.if %cond5 {
+ // CHECK: %[[SGID:.*]] = gpu.subgroup_id : index
+ // CHECK: %[[C2:.*]] = arith.constant 2 : index
+ // CHECK: %[[SUB:.*]] = index.sub %{{.*}}, %[[C2]]
+ %tdesc = xegpu.create_nd_tdesc %src2 : memref<128x64xf32>
+ -> !xegpu.tensor_desc<128x64xf32, #xegpu.layout<sg_layout = [4, 4], sg_data = [32, 16], lane_layout = [8, 4], lane_data = [1, 1]>>
+ %load = xegpu.load_nd %tdesc[0, 0]
+ : !xegpu.tensor_desc<128x64xf32, #xegpu.layout<sg_layout = [4, 4], sg_data = [32, 16], lane_layout = [8, 4], lane_data = [1, 1]>>
+ -> vector<128x64xf32>
+ %exp = math.exp %load {layout_result_0 = #xegpu.layout<sg_layout = [4, 4], sg_data = [32, 16], lane_layout = [8, 4], lane_data = [1, 1]>} : vector<128x64xf32>
+ }{sg_id_range = #xegpu.range<[2, 18]>}
+ gpu.return
+ }
+
+ // CHECK-LABEL: @subgroup_id_range_nested_if
+ gpu.func @subgroup_id_range_nested_if(%src: memref<256x128xf32>, %src1: memref<128x64xf32>) {
+ %sg_id = gpu.subgroup_id : index
+ %c1 = arith.constant 1 : i1
+ %c3 = arith.constant 3 : index
+ %c32 = arith.constant 32 : index
+ %tdesc = xegpu.create_nd_tdesc %src : memref<256x128xf32>
+ -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [8, 4], lane_data = [1, 1]>>
+ %load = xegpu.load_nd %tdesc[0, 0]
+ : !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [8, 4], lane_data = [1, 1]>>
+ -> vector<256x128xf32>
+ %cond1 = arith.cmpi sge, %sg_id, %c3 : index
+ %cond2 = arith.cmpi slt, %sg_id, %c32 : index
+ %cond = arith.andi %cond1, %cond2 : i1
+ scf.if %c1 {
+ scf.if %cond {
+ // CHECK: %[[SGID:.*]] = gpu.subgroup_id : index
+ // CHECK: %[[C3:.*]] = arith.constant 3 : index
+ // CHECK: %[[SUB:.*]] = index.sub %{{.*}}, %[[C3]]
+ %td = xegpu.create_nd_tdesc %src1 : memref<128x64xf32>
+ -> !xegpu.tensor_desc<128x64xf32, #xegpu.layout<sg_layout = [4, 4], sg_data = [32, 16], lane_layout = [8, 4], lane_data = [1, 1]>>
+ %ld = xegpu.load_nd %td[0, 0]
+ : !xegpu.tensor_desc<128x64xf32, #xegpu.layout<sg_layout = [4, 4], sg_data = [32, 16], lane_layout = [8, 4], lane_data = [1, 1]>>
+ -> vector<128x64xf32>
+ %exp = math.exp %ld {layout_result_0 = #xegpu.layout<sg_layout = [4, 4], sg_data = [32, 16], lane_layout = [8, 4], lane_data = [1, 1]>} : vector<128x64xf32>
+ }
+ } {sg_id_range = #xegpu.range<[3, 19]>}
+ gpu.return
+ }
}
>From 1b60236200735abc39e5bd3a2280123e9789dec5 Mon Sep 17 00:00:00 2001
From: Andreas Jonson <andjo403 at hotmail.com>
Date: Mon, 18 Aug 2025 18:45:52 +0200
Subject: [PATCH 052/112] [SimplifyCFG] Avoid redundant calls in gather. (NFC)
(#154133)
Split out from https://github.com/llvm/llvm-project/pull/154007, as it
showed compile-time improvements.
NFC, as at least two icmps need to be part of the chain.
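
The change can be pictured with a small stand-alone sketch. This is not the LLVM code: `Node`, `Gatherer`, `isEq` and `usedICmps` are hypothetical stand-ins for `Value`, `ConstantComparesGatherer`, `IsEq` and `UsedICmps`, and the `kind` test replaces the `m_LogicalOr`/`m_LogicalAnd` matchers. It only illustrates classifying the root once and keeping the answer for the caller.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an IR value: a logical or/and node, or a leaf
// compare. Not an LLVM type.
struct Node {
  enum Kind { Or, And, ICmp } kind;
  const Node *lhs = nullptr;
  const Node *rhs = nullptr;
};

// Toy analogue of the gatherer: the root is classified once and the result
// is stored, so a caller can read isEq instead of matching the root again.
struct Gatherer {
  unsigned usedICmps = 0;
  bool isEq = false; // true for an || chain, false for an && chain

  explicit Gatherer(const Node *root) { gather(root); }

private:
  void gather(const Node *root) {
    if (root->kind == Node::Or)
      isEq = true;
    else if (root->kind == Node::And)
      isEq = false;
    else
      return; // a lone compare is not a chain, so nothing is gathered

    const Node::Kind chainKind = root->kind;
    // Seed the worklist with the root's operands; the root itself is done.
    std::vector<const Node *> worklist{root->lhs, root->rhs};
    while (!worklist.empty()) {
      const Node *n = worklist.back();
      worklist.pop_back();
      if (n->kind == chainKind) {
        worklist.push_back(n->rhs);
        worklist.push_back(n->lhs);
      } else if (n->kind == Node::ICmp) {
        ++usedICmps;
      }
    }
  }
};
```

In this sketch a lone compare returns early with `usedICmps == 0`, which mirrors why the early return is harmless when the caller already rejects chains with fewer than two compares.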
---
llvm/lib/Transforms/Utils/SimplifyCFG.cpp | 28 ++++++++++++-----------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
index 0ca7188470d8e..055e8cadaab76 100644
--- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
+++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
@@ -565,6 +565,9 @@ struct ConstantComparesGatherer {
/// Number of comparisons matched in the and/or chain
unsigned UsedICmps = 0;
+ /// If the elements in Vals match the comparisons
+ bool IsEq = false;
+
/// Construct and compute the result for the comparison instruction Cond
ConstantComparesGatherer(Instruction *Cond, const DataLayout &DL) : DL(DL) {
gather(Cond);
@@ -736,23 +739,23 @@ struct ConstantComparesGatherer {
/// vector.
/// One "Extra" case is allowed to differ from the other.
void gather(Value *V) {
- bool isEQ = match(V, m_LogicalOr(m_Value(), m_Value()));
-
+ Value *Op0, *Op1;
+ if (match(V, m_LogicalOr(m_Value(Op0), m_Value(Op1))))
+ IsEq = true;
+ else if (match(V, m_LogicalAnd(m_Value(Op0), m_Value(Op1))))
+ IsEq = false;
+ else
+ return;
// Keep a stack (SmallVector for efficiency) for depth-first traversal
- SmallVector<Value *, 8> DFT;
- SmallPtrSet<Value *, 8> Visited;
-
- // Initialize
- Visited.insert(V);
- DFT.push_back(V);
+ SmallVector<Value *, 8> DFT{Op0, Op1};
+ SmallPtrSet<Value *, 8> Visited{V, Op0, Op1};
while (!DFT.empty()) {
V = DFT.pop_back_val();
if (Instruction *I = dyn_cast<Instruction>(V)) {
// If it is a || (or && depending on isEQ), process the operands.
- Value *Op0, *Op1;
- if (isEQ ? match(I, m_LogicalOr(m_Value(Op0), m_Value(Op1)))
+ if (IsEq ? match(I, m_LogicalOr(m_Value(Op0), m_Value(Op1)))
: match(I, m_LogicalAnd(m_Value(Op0), m_Value(Op1)))) {
if (Visited.insert(Op1).second)
DFT.push_back(Op1);
@@ -763,7 +766,7 @@ struct ConstantComparesGatherer {
}
// Try to match the current instruction
- if (matchInstruction(I, isEQ))
+ if (matchInstruction(I, IsEq))
// Match succeed, continue the loop
continue;
}
@@ -5103,6 +5106,7 @@ bool SimplifyCFGOpt::simplifyBranchOnICmpChain(BranchInst *BI,
Value *CompVal = ConstantCompare.CompValue;
unsigned UsedICmps = ConstantCompare.UsedICmps;
Value *ExtraCase = ConstantCompare.Extra;
+ bool TrueWhenEqual = ConstantCompare.IsEq;
// If we didn't have a multiply compared value, fail.
if (!CompVal)
@@ -5112,8 +5116,6 @@ bool SimplifyCFGOpt::simplifyBranchOnICmpChain(BranchInst *BI,
if (UsedICmps <= 1)
return false;
- bool TrueWhenEqual = match(Cond, m_LogicalOr(m_Value(), m_Value()));
-
// There might be duplicate constants in the list, which the switch
// instruction can't handle, remove them now.
array_pod_sort(Values.begin(), Values.end(), constantIntSortPredicate);
>From 97f554249c564e769956abfcb3266925745482c5 Mon Sep 17 00:00:00 2001
From: Ramkumar Ramachandra <ramkumar.ramachandra at codasip.com>
Date: Mon, 18 Aug 2025 17:48:42 +0100
Subject: [PATCH 053/112] [VPlan] Preserve nusw in createInBoundsPtrAdd
(#151549)
Rename createInBoundsPtrAdd to createNoWrapPtrAdd, and preserve nusw as
well as inbounds at the callsite.
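
As a rough illustration of the difference, the stand-alone sketch below contrasts the two call shapes. It is a simplification, not the VPlan code: `NoWrapFlags`, `PtrAdd`, `createOld` and `createNoWrap` are hypothetical names, and the flags are modelled as independent bits, ignoring any relationship `GEPNoWrapFlags` defines between them.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a set of GEP no-wrap flags, modelled as
// independent bits for the purpose of this sketch.
enum NoWrapFlags : std::uint8_t {
  None = 0,
  InBounds = 1 << 0,
  NUSW = 1 << 1,
};

// Hypothetical stand-in for a PtrAdd recipe; it records only the flags.
struct PtrAdd {
  std::uint8_t flags;
};

// Old shape: the call site branches on inbounds, so any other flag is lost.
inline PtrAdd createOld(std::uint8_t nw) {
  return (nw & InBounds) ? PtrAdd{InBounds} : PtrAdd{None};
}

// New shape: the caller's flags are forwarded unchanged.
inline PtrAdd createNoWrap(std::uint8_t nw) { return PtrAdd{nw}; }
```

With a `NUSW`-only input, `createOld` yields no flags while `createNoWrap` keeps `NUSW`, which corresponds to the `getelementptr nusw` lines in the updated tests.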
---
.../Vectorize/LoopVectorizationPlanner.h | 14 +-
.../Transforms/Vectorize/VPlanTransforms.cpp | 4 +-
...aved-accesses-different-insert-position.ll | 2 +-
.../interleaved-accesses-gep-nowrap-flags.ll | 148 ++++++++++++++++++
4 files changed, 158 insertions(+), 10 deletions(-)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 4856ebebb596f..838476dcae661 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -256,13 +256,15 @@ class VPBuilder {
new VPInstruction(VPInstruction::PtrAdd, {Ptr, Offset},
GEPNoWrapFlags::none(), DL, Name));
}
- VPInstruction *createInBoundsPtrAdd(VPValue *Ptr, VPValue *Offset,
- DebugLoc DL = DebugLoc::getUnknown(),
- const Twine &Name = "") {
- return tryInsertInstruction(
- new VPInstruction(VPInstruction::PtrAdd, {Ptr, Offset},
- GEPNoWrapFlags::inBounds(), DL, Name));
+
+ VPInstruction *createNoWrapPtrAdd(VPValue *Ptr, VPValue *Offset,
+ GEPNoWrapFlags GEPFlags,
+ DebugLoc DL = DebugLoc::getUnknown(),
+ const Twine &Name = "") {
+ return tryInsertInstruction(new VPInstruction(
+ VPInstruction::PtrAdd, {Ptr, Offset}, GEPFlags, DL, Name));
}
+
VPInstruction *createWidePtrAdd(VPValue *Ptr, VPValue *Offset,
DebugLoc DL = DebugLoc::getUnknown(),
const Twine &Name = "") {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 05c12b7a1adcc..14532244d5748 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2615,9 +2615,7 @@ void VPlanTransforms::createInterleaveGroups(
VPValue *OffsetVPV =
Plan.getOrAddLiveIn(ConstantInt::get(Plan.getContext(), -Offset));
VPBuilder B(InsertPos);
- Addr = NW.isInBounds()
- ? B.createInBoundsPtrAdd(InsertPos->getAddr(), OffsetVPV)
- : B.createPtrAdd(InsertPos->getAddr(), OffsetVPV);
+ Addr = B.createNoWrapPtrAdd(InsertPos->getAddr(), OffsetVPV, NW);
}
// If the group is reverse, adjust the index to refer to the last vector
// lane instead of the first. We adjust the index from the first vector
diff --git a/llvm/test/Transforms/LoopVectorize/interleaved-accesses-different-insert-position.ll b/llvm/test/Transforms/LoopVectorize/interleaved-accesses-different-insert-position.ll
index fa339f45fcdd9..dd6b829fcb5c9 100644
--- a/llvm/test/Transforms/LoopVectorize/interleaved-accesses-different-insert-position.ll
+++ b/llvm/test/Transforms/LoopVectorize/interleaved-accesses-different-insert-position.ll
@@ -86,7 +86,7 @@ define void @test_ig_insert_pos_at_end_of_vpbb(ptr noalias %dst, ptr noalias %sr
; CHECK: [[VECTOR_BODY]]:
; CHECK-NEXT: [[TMP3:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr nusw { i16, i16, i16, i16 }, ptr [[SRC]], i64 [[TMP3]], i32 2
-; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[TMP4]], i32 -4
+; CHECK-NEXT: [[TMP5:%.*]] = getelementptr nusw i8, ptr [[TMP4]], i32 -4
; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <16 x i16>, ptr [[TMP5]], align 2
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <16 x i16> [[WIDE_VEC]], <16 x i16> poison, <4 x i32> <i32 0, i32 4, i32 8, i32 12>
; CHECK-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <16 x i16> [[WIDE_VEC]], <16 x i16> poison, <4 x i32> <i32 2, i32 6, i32 10, i32 14>
diff --git a/llvm/test/Transforms/LoopVectorize/interleaved-accesses-gep-nowrap-flags.ll b/llvm/test/Transforms/LoopVectorize/interleaved-accesses-gep-nowrap-flags.ll
index 552f6a4ec62d9..a6ba29ed1ca0e 100644
--- a/llvm/test/Transforms/LoopVectorize/interleaved-accesses-gep-nowrap-flags.ll
+++ b/llvm/test/Transforms/LoopVectorize/interleaved-accesses-gep-nowrap-flags.ll
@@ -185,3 +185,151 @@ loop:
exit:
ret void
}
+
+define void @nusw_preservation_2(ptr %src, ptr noalias %dst) {
+; CHECK-LABEL: define void @nusw_preservation_2(
+; CHECK-SAME: ptr [[SRC:%.*]], ptr noalias [[DST:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 2
+; CHECK-NEXT: [[TMP0:%.*]] = or disjoint i64 [[OFFSET_IDX]], 1
+; CHECK-NEXT: [[TMP1:%.*]] = getelementptr nusw i8, ptr [[SRC]], i64 [[TMP0]]
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr nusw i8, ptr [[TMP1]], i32 -1
+; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <8 x i8>, ptr [[TMP2]], align 1
+; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i8> [[WIDE_VEC]], <8 x i8> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+; CHECK-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <8 x i8> [[WIDE_VEC]], <8 x i8> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i8> [[STRIDED_VEC1]], [[STRIDED_VEC]]
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr nusw i8, ptr [[DST]], i64 [[INDEX]]
+; CHECK-NEXT: store <4 x i8> [[TMP3]], ptr [[TMP4]], align 1
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 100
+; CHECK-NEXT: br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br [[EXIT:label %.*]]
+; CHECK: [[SCALAR_PH]]:
+;
+entry:
+ br label %loop
+
+loop: ; preds = %loop, %entry
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %iv2 = phi i64 [ 0, %entry ], [ %iv2.next, %loop ]
+ %or.1 = or disjoint i64 %iv2, 1
+ %gep.src.or.1 = getelementptr nusw i8, ptr %src, i64 %or.1
+ %load.src.1 = load i8, ptr %gep.src.or.1, align 1
+ %gep.src.iv2 = getelementptr nusw i8, ptr %src, i64 %iv2
+ %load.src.2 = load i8, ptr %gep.src.iv2, align 1
+ %add = add i8 %load.src.1, %load.src.2
+ %gep.dst.iv = getelementptr nusw i8, ptr %dst, i64 %iv
+ store i8 %add, ptr %gep.dst.iv, align 1
+ %iv2.next = add i64 %iv2, 2
+ %iv.next = add i64 %iv, 1
+ %exit.cond = icmp eq i64 %iv.next, 100
+ br i1 %exit.cond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+define void @inbounds_preservation_2(ptr %src, ptr noalias %dst) {
+; CHECK-LABEL: define void @inbounds_preservation_2(
+; CHECK-SAME: ptr [[SRC:%.*]], ptr noalias [[DST:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 2
+; CHECK-NEXT: [[TMP0:%.*]] = or disjoint i64 [[OFFSET_IDX]], 1
+; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[SRC]], i64 [[TMP0]]
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 -1
+; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <8 x i8>, ptr [[TMP2]], align 1
+; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i8> [[WIDE_VEC]], <8 x i8> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+; CHECK-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <8 x i8> [[WIDE_VEC]], <8 x i8> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i8> [[STRIDED_VEC1]], [[STRIDED_VEC]]
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[INDEX]]
+; CHECK-NEXT: store <4 x i8> [[TMP3]], ptr [[TMP4]], align 1
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 100
+; CHECK-NEXT: br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br [[EXIT:label %.*]]
+; CHECK: [[SCALAR_PH]]:
+;
+entry:
+ br label %loop
+
+loop: ; preds = %loop, %entry
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %iv2 = phi i64 [ 0, %entry ], [ %iv2.next, %loop ]
+ %or.1 = or disjoint i64 %iv2, 1
+ %gep.src.or.1 = getelementptr inbounds i8, ptr %src, i64 %or.1
+ %load.src.1 = load i8, ptr %gep.src.or.1, align 1
+ %gep.src.iv2 = getelementptr inbounds i8, ptr %src, i64 %iv2
+ %load.src.2 = load i8, ptr %gep.src.iv2, align 1
+ %add = add i8 %load.src.1, %load.src.2
+ %gep.dst.iv = getelementptr inbounds i8, ptr %dst, i64 %iv
+ store i8 %add, ptr %gep.dst.iv, align 1
+ %iv2.next = add i64 %iv2, 2
+ %iv.next = add i64 %iv, 1
+ %exit.cond = icmp eq i64 %iv.next, 100
+ br i1 %exit.cond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+define void @nuw_drop_2(ptr %src, ptr noalias %dst) {
+; CHECK-LABEL: define void @nuw_drop_2(
+; CHECK-SAME: ptr [[SRC:%.*]], ptr noalias [[DST:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 2
+; CHECK-NEXT: [[TMP0:%.*]] = or disjoint i64 [[OFFSET_IDX]], 1
+; CHECK-NEXT: [[TMP1:%.*]] = getelementptr nuw i8, ptr [[SRC]], i64 [[TMP0]]
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr i8, ptr [[TMP1]], i32 -1
+; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <8 x i8>, ptr [[TMP2]], align 1
+; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i8> [[WIDE_VEC]], <8 x i8> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+; CHECK-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <8 x i8> [[WIDE_VEC]], <8 x i8> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i8> [[STRIDED_VEC1]], [[STRIDED_VEC]]
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr nuw i8, ptr [[DST]], i64 [[INDEX]]
+; CHECK-NEXT: store <4 x i8> [[TMP3]], ptr [[TMP4]], align 1
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 100
+; CHECK-NEXT: br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br [[EXIT:label %.*]]
+; CHECK: [[SCALAR_PH]]:
+;
+entry:
+ br label %loop
+
+loop: ; preds = %loop, %entry
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %iv2 = phi i64 [ 0, %entry ], [ %iv2.next, %loop ]
+ %or.1 = or disjoint i64 %iv2, 1
+ %gep.src.or.1 = getelementptr nuw i8, ptr %src, i64 %or.1
+ %load.src.1 = load i8, ptr %gep.src.or.1, align 1
+ %gep.src.iv2 = getelementptr nuw i8, ptr %src, i64 %iv2
+ %load.src.2 = load i8, ptr %gep.src.iv2, align 1
+ %add = add i8 %load.src.1, %load.src.2
+ %gep.dst.iv = getelementptr nuw i8, ptr %dst, i64 %iv
+ store i8 %add, ptr %gep.dst.iv, align 1
+ %iv2.next = add i64 %iv2, 2
+ %iv.next = add i64 %iv, 1
+ %exit.cond = icmp eq i64 %iv.next, 100
+ br i1 %exit.cond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
From 8135b7c1abd7d22f98cf3dbd7d7a93c9fc7755c6 Mon Sep 17 00:00:00 2001
From: Tobias Stadler <mail at stadler-tobias.de>
Date: Mon, 18 Aug 2025 18:04:53 +0100
Subject: [PATCH 054/112] [LV] Emit all remarks for unvectorizable instructions
(#153833)
If ExtraAnalysis is requested, emit all remarks caused by unvectorizable instructions instead of only the first.
This is in line with how other places handle DoExtraAnalysis, and it can be quite helpful to get info about all instructions in a loop that prevent vectorization.
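
As a rough illustration of the control flow (a self-contained sketch, not the LLVM code: `Inst`, the `Remarks` vector and the simplified signatures are hypothetical stand-ins for `Instruction`, the remark emitter and the real member functions), the per-instruction result is accumulated, and the early return is taken only when extra analysis is off:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an instruction: a name plus a flag saying
// whether the (imaginary) legality check accepts it.
struct Inst {
  std::string Name;
  bool Vectorizable;
};

// Checks a single instruction and records one remark if it is rejected.
bool canVectorizeInstr(const Inst &I, std::vector<std::string> &Remarks) {
  if (I.Vectorizable)
    return true;
  Remarks.push_back("cannot vectorize: " + I.Name);
  return false;
}

// Walks all instructions. Without extra analysis it stops at the first
// failure; with extra analysis it keeps going so every failure is reported.
bool canVectorizeInstrs(const std::vector<Inst> &Insts, bool DoExtraAnalysis,
                        std::vector<std::string> &Remarks) {
  bool Result = true;
  for (const Inst &I : Insts) {
    Result &= canVectorizeInstr(I, Remarks);
    if (!DoExtraAnalysis && !Result)
      return false;
  }
  return Result;
}
```

With three instructions of which the first and last are rejected, the sketch records one remark when `DoExtraAnalysis` is false and two when it is true; the function returns false in both cases.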
---
.../Vectorize/LoopVectorizationLegality.h | 3 +
.../Vectorize/LoopVectorizationLegality.cpp | 504 +++++++++---------
.../X86/vectorization-remarks-missed.ll | 36 ++
3 files changed, 299 insertions(+), 244 deletions(-)
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index 43ff084816d18..48ee93acbe008 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -493,6 +493,9 @@ class LoopVectorizationLegality {
/// and we only need to check individual instructions.
bool canVectorizeInstrs();
+ /// Check if an individual instruction is vectorizable.
+ bool canVectorizeInstr(Instruction &I);
+
/// When we vectorize loops we may change the order in which
/// we read and write from memory. This method checks if it is
/// legal to vectorize the code, considering only memory constrains.
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index c47fd9421fddd..789047a2a28e7 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -793,280 +793,296 @@ static bool canWidenCallReturnType(Type *Ty) {
}
bool LoopVectorizationLegality::canVectorizeInstrs() {
- BasicBlock *Header = TheLoop->getHeader();
+ bool DoExtraAnalysis = ORE->allowExtraAnalysis(DEBUG_TYPE);
+ bool Result = true;
// For each block in the loop.
for (BasicBlock *BB : TheLoop->blocks()) {
// Scan the instructions in the block and look for hazards.
for (Instruction &I : *BB) {
- if (auto *Phi = dyn_cast<PHINode>(&I)) {
- Type *PhiTy = Phi->getType();
- // Check that this PHI type is allowed.
- if (!PhiTy->isIntegerTy() && !PhiTy->isFloatingPointTy() &&
- !PhiTy->isPointerTy()) {
- reportVectorizationFailure("Found a non-int non-pointer PHI",
- "loop control flow is not understood by vectorizer",
- "CFGNotUnderstood", ORE, TheLoop);
- return false;
- }
+ Result &= canVectorizeInstr(I);
+ if (!DoExtraAnalysis && !Result)
+ return false;
+ }
+ }
- // If this PHINode is not in the header block, then we know that we
- // can convert it to select during if-conversion. No need to check if
- // the PHIs in this block are induction or reduction variables.
- if (BB != Header) {
- // Non-header phi nodes that have outside uses can be vectorized. Add
- // them to the list of allowed exits.
- // Unsafe cyclic dependencies with header phis are identified during
- // legalization for reduction, induction and fixed order
- // recurrences.
- AllowedExit.insert(&I);
- continue;
- }
+ if (!PrimaryInduction) {
+ if (Inductions.empty()) {
+ reportVectorizationFailure(
+ "Did not find one integer induction var",
+ "loop induction variable could not be identified",
+ "NoInductionVariable", ORE, TheLoop);
+ return false;
+ }
+ if (!WidestIndTy) {
+ reportVectorizationFailure(
+ "Did not find one integer induction var",
+ "integer loop induction variable could not be identified",
+ "NoIntegerInductionVariable", ORE, TheLoop);
+ return false;
+ }
+ LLVM_DEBUG(dbgs() << "LV: Did not find one integer induction var.\n");
+ }
- // We only allow if-converted PHIs with exactly two incoming values.
- if (Phi->getNumIncomingValues() != 2) {
- reportVectorizationFailure("Found an invalid PHI",
- "loop control flow is not understood by vectorizer",
- "CFGNotUnderstood", ORE, TheLoop, Phi);
- return false;
- }
+ // Now we know the widest induction type, check if our found induction
+ // is the same size. If it's not, unset it here and InnerLoopVectorizer
+ // will create another.
+ if (PrimaryInduction && WidestIndTy != PrimaryInduction->getType())
+ PrimaryInduction = nullptr;
- RecurrenceDescriptor RedDes;
- if (RecurrenceDescriptor::isReductionPHI(Phi, TheLoop, RedDes, DB, AC,
- DT, PSE.getSE())) {
- Requirements->addExactFPMathInst(RedDes.getExactFPMathInst());
- AllowedExit.insert(RedDes.getLoopExitInstr());
- Reductions[Phi] = RedDes;
- continue;
- }
+ return Result;
+}
- // We prevent matching non-constant strided pointer IVS to preserve
- // historical vectorizer behavior after a generalization of the
- // IVDescriptor code. The intent is to remove this check, but we
- // have to fix issues around code quality for such loops first.
- auto IsDisallowedStridedPointerInduction =
- [](const InductionDescriptor &ID) {
- if (AllowStridedPointerIVs)
- return false;
- return ID.getKind() == InductionDescriptor::IK_PtrInduction &&
- ID.getConstIntStepValue() == nullptr;
- };
-
- // TODO: Instead of recording the AllowedExit, it would be good to
- // record the complementary set: NotAllowedExit. These include (but may
- // not be limited to):
- // 1. Reduction phis as they represent the one-before-last value, which
- // is not available when vectorized
- // 2. Induction phis and increment when SCEV predicates cannot be used
- // outside the loop - see addInductionPhi
- // 3. Non-Phis with outside uses when SCEV predicates cannot be used
- // outside the loop - see call to hasOutsideLoopUser in the non-phi
- // handling below
- // 4. FixedOrderRecurrence phis that can possibly be handled by
- // extraction.
- // By recording these, we can then reason about ways to vectorize each
- // of these NotAllowedExit.
- InductionDescriptor ID;
- if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID) &&
- !IsDisallowedStridedPointerInduction(ID)) {
- addInductionPhi(Phi, ID, AllowedExit);
- Requirements->addExactFPMathInst(ID.getExactFPMathInst());
- continue;
- }
+bool LoopVectorizationLegality::canVectorizeInstr(Instruction &I) {
+ BasicBlock *BB = I.getParent();
+ BasicBlock *Header = TheLoop->getHeader();
- if (RecurrenceDescriptor::isFixedOrderRecurrence(Phi, TheLoop, DT)) {
- AllowedExit.insert(Phi);
- FixedOrderRecurrences.insert(Phi);
- continue;
- }
+ if (auto *Phi = dyn_cast<PHINode>(&I)) {
+ Type *PhiTy = Phi->getType();
+ // Check that this PHI type is allowed.
+ if (!PhiTy->isIntegerTy() && !PhiTy->isFloatingPointTy() &&
+ !PhiTy->isPointerTy()) {
+ reportVectorizationFailure(
+ "Found a non-int non-pointer PHI",
+ "loop control flow is not understood by vectorizer",
+ "CFGNotUnderstood", ORE, TheLoop);
+ return false;
+ }
- // As a last resort, coerce the PHI to a AddRec expression
- // and re-try classifying it a an induction PHI.
- if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID, true) &&
- !IsDisallowedStridedPointerInduction(ID)) {
- addInductionPhi(Phi, ID, AllowedExit);
- continue;
- }
+ // If this PHINode is not in the header block, then we know that we
+ // can convert it to select during if-conversion. No need to check if
+ // the PHIs in this block are induction or reduction variables.
+ if (BB != Header) {
+ // Non-header phi nodes that have outside uses can be vectorized. Add
+ // them to the list of allowed exits.
+ // Unsafe cyclic dependencies with header phis are identified during
+ // legalization for reduction, induction and fixed order
+ // recurrences.
+ AllowedExit.insert(&I);
+ return true;
+ }
- reportVectorizationFailure("Found an unidentified PHI",
- "value that could not be identified as "
- "reduction is used outside the loop",
- "NonReductionValueUsedOutsideLoop", ORE, TheLoop, Phi);
- return false;
- } // end of PHI handling
-
- // We handle calls that:
- // * Have a mapping to an IR intrinsic.
- // * Have a vector version available.
- auto *CI = dyn_cast<CallInst>(&I);
-
- if (CI && !getVectorIntrinsicIDForCall(CI, TLI) &&
- !(CI->getCalledFunction() && TLI &&
- (!VFDatabase::getMappings(*CI).empty() ||
- isTLIScalarize(*TLI, *CI)))) {
- // If the call is a recognized math libary call, it is likely that
- // we can vectorize it given loosened floating-point constraints.
- LibFunc Func;
- bool IsMathLibCall =
- TLI && CI->getCalledFunction() &&
- CI->getType()->isFloatingPointTy() &&
- TLI->getLibFunc(CI->getCalledFunction()->getName(), Func) &&
- TLI->hasOptimizedCodeGen(Func);
-
- if (IsMathLibCall) {
- // TODO: Ideally, we should not use clang-specific language here,
- // but it's hard to provide meaningful yet generic advice.
- // Also, should this be guarded by allowExtraAnalysis() and/or be part
- // of the returned info from isFunctionVectorizable()?
- reportVectorizationFailure(
- "Found a non-intrinsic callsite",
- "library call cannot be vectorized. "
- "Try compiling with -fno-math-errno, -ffast-math, "
- "or similar flags",
- "CantVectorizeLibcall", ORE, TheLoop, CI);
- } else {
- reportVectorizationFailure("Found a non-intrinsic callsite",
- "call instruction cannot be vectorized",
- "CantVectorizeLibcall", ORE, TheLoop, CI);
- }
- return false;
- }
+ // We only allow if-converted PHIs with exactly two incoming values.
+ if (Phi->getNumIncomingValues() != 2) {
+ reportVectorizationFailure(
+ "Found an invalid PHI",
+ "loop control flow is not understood by vectorizer",
+ "CFGNotUnderstood", ORE, TheLoop, Phi);
+ return false;
+ }
- // Some intrinsics have scalar arguments and should be same in order for
- // them to be vectorized (i.e. loop invariant).
- if (CI) {
- auto *SE = PSE.getSE();
- Intrinsic::ID IntrinID = getVectorIntrinsicIDForCall(CI, TLI);
- for (unsigned Idx = 0; Idx < CI->arg_size(); ++Idx)
- if (isVectorIntrinsicWithScalarOpAtArg(IntrinID, Idx, TTI)) {
- if (!SE->isLoopInvariant(PSE.getSCEV(CI->getOperand(Idx)),
- TheLoop)) {
- reportVectorizationFailure("Found unvectorizable intrinsic",
- "intrinsic instruction cannot be vectorized",
- "CantVectorizeIntrinsic", ORE, TheLoop, CI);
- return false;
- }
- }
- }
+ RecurrenceDescriptor RedDes;
+ if (RecurrenceDescriptor::isReductionPHI(Phi, TheLoop, RedDes, DB, AC, DT,
+ PSE.getSE())) {
+ Requirements->addExactFPMathInst(RedDes.getExactFPMathInst());
+ AllowedExit.insert(RedDes.getLoopExitInstr());
+ Reductions[Phi] = RedDes;
+ return true;
+ }
- // If we found a vectorized variant of a function, note that so LV can
- // make better decisions about maximum VF.
- if (CI && !VFDatabase::getMappings(*CI).empty())
- VecCallVariantsFound = true;
-
- auto CanWidenInstructionTy = [](Instruction const &Inst) {
- Type *InstTy = Inst.getType();
- if (!isa<StructType>(InstTy))
- return canVectorizeTy(InstTy);
-
- // For now, we only recognize struct values returned from calls where
- // all users are extractvalue as vectorizable. All element types of the
- // struct must be types that can be widened.
- return isa<CallInst>(Inst) && canWidenCallReturnType(InstTy) &&
- all_of(Inst.users(), IsaPred<ExtractValueInst>);
- };
+ // We prevent matching non-constant strided pointer IVS to preserve
+ // historical vectorizer behavior after a generalization of the
+ // IVDescriptor code. The intent is to remove this check, but we
+ // have to fix issues around code quality for such loops first.
+ auto IsDisallowedStridedPointerInduction =
+ [](const InductionDescriptor &ID) {
+ if (AllowStridedPointerIVs)
+ return false;
+ return ID.getKind() == InductionDescriptor::IK_PtrInduction &&
+ ID.getConstIntStepValue() == nullptr;
+ };
+
+ // TODO: Instead of recording the AllowedExit, it would be good to
+ // record the complementary set: NotAllowedExit. These include (but may
+ // not be limited to):
+ // 1. Reduction phis as they represent the one-before-last value, which
+ // is not available when vectorized
+ // 2. Induction phis and increment when SCEV predicates cannot be used
+ // outside the loop - see addInductionPhi
+ // 3. Non-Phis with outside uses when SCEV predicates cannot be used
+ // outside the loop - see call to hasOutsideLoopUser in the non-phi
+ // handling below
+ // 4. FixedOrderRecurrence phis that can possibly be handled by
+ // extraction.
+ // By recording these, we can then reason about ways to vectorize each
+ // of these NotAllowedExit.
+ InductionDescriptor ID;
+ if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID) &&
+ !IsDisallowedStridedPointerInduction(ID)) {
+ addInductionPhi(Phi, ID, AllowedExit);
+ Requirements->addExactFPMathInst(ID.getExactFPMathInst());
+ return true;
+ }
- // Check that the instruction return type is vectorizable.
- // We can't vectorize casts from vector type to scalar type.
- // Also, we can't vectorize extractelement instructions.
- if (!CanWidenInstructionTy(I) ||
- (isa<CastInst>(I) &&
- !VectorType::isValidElementType(I.getOperand(0)->getType())) ||
- isa<ExtractElementInst>(I)) {
- reportVectorizationFailure("Found unvectorizable type",
- "instruction return type cannot be vectorized",
- "CantVectorizeInstructionReturnType", ORE, TheLoop, &I);
- return false;
- }
+ if (RecurrenceDescriptor::isFixedOrderRecurrence(Phi, TheLoop, DT)) {
+ AllowedExit.insert(Phi);
+ FixedOrderRecurrences.insert(Phi);
+ return true;
+ }
+
+ // As a last resort, coerce the PHI to a AddRec expression
+ // and re-try classifying it a an induction PHI.
+ if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID, true) &&
+ !IsDisallowedStridedPointerInduction(ID)) {
+ addInductionPhi(Phi, ID, AllowedExit);
+ return true;
+ }
- // Check that the stored type is vectorizable.
- if (auto *ST = dyn_cast<StoreInst>(&I)) {
- Type *T = ST->getValueOperand()->getType();
- if (!VectorType::isValidElementType(T)) {
- reportVectorizationFailure("Store instruction cannot be vectorized",
- "CantVectorizeStore", ORE, TheLoop, ST);
+ reportVectorizationFailure("Found an unidentified PHI",
+ "value that could not be identified as "
+ "reduction is used outside the loop",
+ "NonReductionValueUsedOutsideLoop", ORE, TheLoop,
+ Phi);
+ return false;
+ } // end of PHI handling
+
+ // We handle calls that:
+ // * Have a mapping to an IR intrinsic.
+ // * Have a vector version available.
+ auto *CI = dyn_cast<CallInst>(&I);
+
+ if (CI && !getVectorIntrinsicIDForCall(CI, TLI) &&
+ !(CI->getCalledFunction() && TLI &&
+ (!VFDatabase::getMappings(*CI).empty() || isTLIScalarize(*TLI, *CI)))) {
+ // If the call is a recognized math libary call, it is likely that
+ // we can vectorize it given loosened floating-point constraints.
+ LibFunc Func;
+ bool IsMathLibCall =
+ TLI && CI->getCalledFunction() && CI->getType()->isFloatingPointTy() &&
+ TLI->getLibFunc(CI->getCalledFunction()->getName(), Func) &&
+ TLI->hasOptimizedCodeGen(Func);
+
+ if (IsMathLibCall) {
+ // TODO: Ideally, we should not use clang-specific language here,
+ // but it's hard to provide meaningful yet generic advice.
+ // Also, should this be guarded by allowExtraAnalysis() and/or be part
+ // of the returned info from isFunctionVectorizable()?
+ reportVectorizationFailure(
+ "Found a non-intrinsic callsite",
+ "library call cannot be vectorized. "
+ "Try compiling with -fno-math-errno, -ffast-math, "
+ "or similar flags",
+ "CantVectorizeLibcall", ORE, TheLoop, CI);
+ } else {
+ reportVectorizationFailure("Found a non-intrinsic callsite",
+ "call instruction cannot be vectorized",
+ "CantVectorizeLibcall", ORE, TheLoop, CI);
+ }
+ return false;
+ }
+
+ // Some intrinsics have scalar arguments and should be same in order for
+ // them to be vectorized (i.e. loop invariant).
+ if (CI) {
+ auto *SE = PSE.getSE();
+ Intrinsic::ID IntrinID = getVectorIntrinsicIDForCall(CI, TLI);
+ for (unsigned Idx = 0; Idx < CI->arg_size(); ++Idx)
+ if (isVectorIntrinsicWithScalarOpAtArg(IntrinID, Idx, TTI)) {
+ if (!SE->isLoopInvariant(PSE.getSCEV(CI->getOperand(Idx)), TheLoop)) {
+ reportVectorizationFailure(
+ "Found unvectorizable intrinsic",
+ "intrinsic instruction cannot be vectorized",
+ "CantVectorizeIntrinsic", ORE, TheLoop, CI);
return false;
}
+ }
+ }
- // For nontemporal stores, check that a nontemporal vector version is
- // supported on the target.
- if (ST->getMetadata(LLVMContext::MD_nontemporal)) {
- // Arbitrarily try a vector of 2 elements.
- auto *VecTy = FixedVectorType::get(T, /*NumElts=*/2);
- assert(VecTy && "did not find vectorized version of stored type");
- if (!TTI->isLegalNTStore(VecTy, ST->getAlign())) {
- reportVectorizationFailure(
- "nontemporal store instruction cannot be vectorized",
- "CantVectorizeNontemporalStore", ORE, TheLoop, ST);
- return false;
- }
- }
+ // If we found a vectorized variant of a function, note that so LV can
+ // make better decisions about maximum VF.
+ if (CI && !VFDatabase::getMappings(*CI).empty())
+ VecCallVariantsFound = true;
+
+ auto CanWidenInstructionTy = [](Instruction const &Inst) {
+ Type *InstTy = Inst.getType();
+ if (!isa<StructType>(InstTy))
+ return canVectorizeTy(InstTy);
+
+ // For now, we only recognize struct values returned from calls where
+ // all users are extractvalue as vectorizable. All element types of the
+ // struct must be types that can be widened.
+ return isa<CallInst>(Inst) && canWidenCallReturnType(InstTy) &&
+ all_of(Inst.users(), IsaPred<ExtractValueInst>);
+ };
- } else if (auto *LD = dyn_cast<LoadInst>(&I)) {
- if (LD->getMetadata(LLVMContext::MD_nontemporal)) {
- // For nontemporal loads, check that a nontemporal vector version is
- // supported on the target (arbitrarily try a vector of 2 elements).
- auto *VecTy = FixedVectorType::get(I.getType(), /*NumElts=*/2);
- assert(VecTy && "did not find vectorized version of load type");
- if (!TTI->isLegalNTLoad(VecTy, LD->getAlign())) {
- reportVectorizationFailure(
- "nontemporal load instruction cannot be vectorized",
- "CantVectorizeNontemporalLoad", ORE, TheLoop, LD);
- return false;
- }
- }
+ // Check that the instruction return type is vectorizable.
+ // We can't vectorize casts from vector type to scalar type.
+ // Also, we can't vectorize extractelement instructions.
+ if (!CanWidenInstructionTy(I) ||
+ (isa<CastInst>(I) &&
+ !VectorType::isValidElementType(I.getOperand(0)->getType())) ||
+ isa<ExtractElementInst>(I)) {
+ reportVectorizationFailure("Found unvectorizable type",
+ "instruction return type cannot be vectorized",
+ "CantVectorizeInstructionReturnType", ORE,
+ TheLoop, &I);
+ return false;
+ }
+
+ // Check that the stored type is vectorizable.
+ if (auto *ST = dyn_cast<StoreInst>(&I)) {
+ Type *T = ST->getValueOperand()->getType();
+ if (!VectorType::isValidElementType(T)) {
+ reportVectorizationFailure("Store instruction cannot be vectorized",
+ "CantVectorizeStore", ORE, TheLoop, ST);
+ return false;
+ }
- // FP instructions can allow unsafe algebra, thus vectorizable by
- // non-IEEE-754 compliant SIMD units.
- // This applies to floating-point math operations and calls, not memory
- // operations, shuffles, or casts, as they don't change precision or
- // semantics.
- } else if (I.getType()->isFloatingPointTy() && (CI || I.isBinaryOp()) &&
- !I.isFast()) {
- LLVM_DEBUG(dbgs() << "LV: Found FP op with unsafe algebra.\n");
- Hints->setPotentiallyUnsafe();
+ // For nontemporal stores, check that a nontemporal vector version is
+ // supported on the target.
+ if (ST->getMetadata(LLVMContext::MD_nontemporal)) {
+ // Arbitrarily try a vector of 2 elements.
+ auto *VecTy = FixedVectorType::get(T, /*NumElts=*/2);
+ assert(VecTy && "did not find vectorized version of stored type");
+ if (!TTI->isLegalNTStore(VecTy, ST->getAlign())) {
+ reportVectorizationFailure(
+ "nontemporal store instruction cannot be vectorized",
+ "CantVectorizeNontemporalStore", ORE, TheLoop, ST);
+ return false;
}
+ }
- // Reduction instructions are allowed to have exit users.
- // All other instructions must not have external users.
- if (hasOutsideLoopUser(TheLoop, &I, AllowedExit)) {
- // We can safely vectorize loops where instructions within the loop are
- // used outside the loop only if the SCEV predicates within the loop is
- // same as outside the loop. Allowing the exit means reusing the SCEV
- // outside the loop.
- if (PSE.getPredicate().isAlwaysTrue()) {
- AllowedExit.insert(&I);
- continue;
- }
- reportVectorizationFailure("Value cannot be used outside the loop",
- "ValueUsedOutsideLoop", ORE, TheLoop, &I);
+ } else if (auto *LD = dyn_cast<LoadInst>(&I)) {
+ if (LD->getMetadata(LLVMContext::MD_nontemporal)) {
+ // For nontemporal loads, check that a nontemporal vector version is
+ // supported on the target (arbitrarily try a vector of 2 elements).
+ auto *VecTy = FixedVectorType::get(I.getType(), /*NumElts=*/2);
+ assert(VecTy && "did not find vectorized version of load type");
+ if (!TTI->isLegalNTLoad(VecTy, LD->getAlign())) {
+ reportVectorizationFailure(
+ "nontemporal load instruction cannot be vectorized",
+ "CantVectorizeNontemporalLoad", ORE, TheLoop, LD);
return false;
}
- } // next instr.
+ }
+
+ // FP instructions can allow unsafe algebra, thus vectorizable by
+ // non-IEEE-754 compliant SIMD units.
+ // This applies to floating-point math operations and calls, not memory
+ // operations, shuffles, or casts, as they don't change precision or
+ // semantics.
+ } else if (I.getType()->isFloatingPointTy() && (CI || I.isBinaryOp()) &&
+ !I.isFast()) {
+ LLVM_DEBUG(dbgs() << "LV: Found FP op with unsafe algebra.\n");
+ Hints->setPotentiallyUnsafe();
}
- if (!PrimaryInduction) {
- if (Inductions.empty()) {
- reportVectorizationFailure("Did not find one integer induction var",
- "loop induction variable could not be identified",
- "NoInductionVariable", ORE, TheLoop);
- return false;
- }
- if (!WidestIndTy) {
- reportVectorizationFailure("Did not find one integer induction var",
- "integer loop induction variable could not be identified",
- "NoIntegerInductionVariable", ORE, TheLoop);
- return false;
+ // Reduction instructions are allowed to have exit users.
+ // All other instructions must not have external users.
+ if (hasOutsideLoopUser(TheLoop, &I, AllowedExit)) {
+ // We can safely vectorize loops where instructions within the loop are
+ // used outside the loop only if the SCEV predicates within the loop is
+ // same as outside the loop. Allowing the exit means reusing the SCEV
+ // outside the loop.
+ if (PSE.getPredicate().isAlwaysTrue()) {
+ AllowedExit.insert(&I);
+ return true;
}
- LLVM_DEBUG(dbgs() << "LV: Did not find one integer induction var.\n");
+ reportVectorizationFailure("Value cannot be used outside the loop",
+ "ValueUsedOutsideLoop", ORE, TheLoop, &I);
+ return false;
}
- // Now we know the widest induction type, check if our found induction
- // is the same size. If it's not, unset it here and InnerLoopVectorizer
- // will create another.
- if (PrimaryInduction && WidestIndTy != PrimaryInduction->getType())
- PrimaryInduction = nullptr;
-
return true;
}
diff --git a/llvm/test/Transforms/LoopVectorize/X86/vectorization-remarks-missed.ll b/llvm/test/Transforms/LoopVectorize/X86/vectorization-remarks-missed.ll
index 70134fa6bc78d..5ec093c5af6ba 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/vectorization-remarks-missed.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/vectorization-remarks-missed.ll
@@ -117,6 +117,33 @@
; YAML-NEXT: ...
; YAML-NEXT: --- !Analysis
; YAML-NEXT: Pass: loop-vectorize
+; YAML-NEXT: Name: NonReductionValueUsedOutsideLoop
+; YAML-NEXT: DebugLoc: { File: source.cpp, Line: 27, Column: 3 }
+; YAML-NEXT: Function: test_multiple_failures
+; YAML-NEXT: Args:
+; YAML-NEXT: - String: 'loop not vectorized: '
+; YAML-NEXT: - String: value that could not be identified as reduction is used outside the loop
+; YAML-NEXT: ...
+; YAML-NEXT: --- !Analysis
+; YAML-NEXT: Pass: loop-vectorize
+; YAML-NEXT: Name: CantVectorizeLibcall
+; YAML-NEXT: DebugLoc: { File: source.cpp, Line: 29, Column: 11 }
+; YAML-NEXT: Function: test_multiple_failures
+; YAML-NEXT: Args:
+; YAML-NEXT: - String: 'loop not vectorized: '
+; YAML-NEXT: - String: call instruction cannot be vectorized
+; YAML-NEXT: ...
+; YAML-NEXT: --- !Analysis
+; YAML-NEXT: Pass: loop-vectorize
+; YAML-NEXT: Name: NoInductionVariable
+; YAML-NEXT: DebugLoc: { File: source.cpp, Line: 27, Column: 3 }
+; YAML-NEXT: Function: test_multiple_failures
+; YAML-NEXT: Args:
+; YAML-NEXT: - String: 'loop not vectorized: '
+; YAML-NEXT: - String: loop induction variable could not be identified
+; YAML-NEXT: ...
+; YAML-NEXT: --- !Analysis
+; YAML-NEXT: Pass: loop-vectorize
; YAML-NEXT: Name: UnsupportedUncountableLoop
; YAML-NEXT: DebugLoc: { File: source.cpp, Line: 27, Column: 3 }
; YAML-NEXT: Function: test_multiple_failures
@@ -124,6 +151,15 @@
; YAML-NEXT: - String: 'loop not vectorized: '
; YAML-NEXT: - String: Cannot vectorize uncountable loop
; YAML-NEXT: ...
+; YAML-NEXT: --- !Analysis
+; YAML-NEXT: Pass: loop-vectorize
+; YAML-NEXT: Name: CantComputeNumberOfIterations
+; YAML-NEXT: DebugLoc: { File: source.cpp, Line: 27, Column: 3 }
+; YAML-NEXT: Function: test_multiple_failures
+; YAML-NEXT: Args:
+; YAML-NEXT: - String: 'loop not vectorized: '
+; YAML-NEXT: - String: could not determine number of loop iterations
+; YAML-NEXT: ...
; YAML: --- !Missed
; YAML-NEXT: Pass: loop-vectorize
; YAML-NEXT: Name: MissedDetails
>From 4eb1a07d7d1a9722e84490b0ff79d3ae5e260f76 Mon Sep 17 00:00:00 2001
From: Yang Bai <baiyang0132 at gmail.com>
Date: Tue, 19 Aug 2025 01:09:12 +0800
Subject: [PATCH 055/112] [mlir][vector] Support multi-dimensional vectors in
VectorFromElementsLowering (#151175)
This patch introduces a new unrolling-based approach for lowering
multi-dimensional `vector.from_elements` operations.
**Implementation Details:**
1. **New Transform Pattern**: Added `UnrollFromElements`, which unrolls an
N-D (N >= 2) from_elements op into (N-1)-D from_elements ops along the
outermost dimension.
2. **Utility Functions**: Added `unrollVectorOp` so that the unrolling
algorithm of vector.gather can be reused for vector.from_elements.
3. **Integration**: Added the unrolling pattern to the
convert-vector-to-llvm pass as a temporary transformation.
4. **Direct LLVM Lowering**: `VectorFromElementsLowering` now uses direct
LLVM dialect operations instead of intermediate vector.insert operations,
for efficiency.
**Example:**
```mlir
// unroll
%v = vector.from_elements %e0, %e1, %e2, %e3 : vector<2x2xf32>
=>
%poison_2d = ub.poison : vector<2x2xf32>
%vec_1d_0 = vector.from_elements %e0, %e1 : vector<2xf32>
%vec_2d_0 = vector.insert %vec_1d_0, %poison_2d [0] : vector<2xf32> into vector<2x2xf32>
%vec_1d_1 = vector.from_elements %e2, %e3 : vector<2xf32>
%result = vector.insert %vec_1d_1, %vec_2d_0 [1] : vector<2xf32> into vector<2x2xf32>
// convert-vector-to-llvm
%v = vector.from_elements %e0, %e1, %e2, %e3 : vector<2x2xf32>
=>
%poison_2d = ub.poison : vector<2x2xf32>
%poison_2d_cast = builtin.unrealized_conversion_cast %poison_2d : vector<2x2xf32> to !llvm.array<2 x vector<2xf32>>
%poison_1d_0 = llvm.mlir.poison : vector<2xf32>
%c0_0 = llvm.mlir.constant(0 : i64) : i64
%vec_1d_0_0 = llvm.insertelement %e0, %poison_1d_0[%c0_0 : i64] : vector<2xf32>
%c1_0 = llvm.mlir.constant(1 : i64) : i64
%vec_1d_0_1 = llvm.insertelement %e1, %vec_1d_0_0[%c1_0 : i64] : vector<2xf32>
%vec_2d_0 = llvm.insertvalue %vec_1d_0_1, %poison_2d_cast[0] : !llvm.array<2 x vector<2xf32>>
%poison_1d_1 = llvm.mlir.poison : vector<2xf32>
%c0_1 = llvm.mlir.constant(0 : i64) : i64
%vec_1d_1_0 = llvm.insertelement %e2, %poison_1d_1[%c0_1 : i64] : vector<2xf32>
%c1_1 = llvm.mlir.constant(1 : i64) : i64
%vec_1d_1_1 = llvm.insertelement %e3, %vec_1d_1_0[%c1_1 : i64] : vector<2xf32>
%vec_2d_1 = llvm.insertvalue %vec_1d_1_1, %vec_2d_0[1] : !llvm.array<2 x vector<2xf32>>
%result = builtin.unrealized_conversion_cast %vec_2d_1 : !llvm.array<2 x vector<2xf32>> to vector<2x2xf32>
```
---------
Co-authored-by: Nicolas Vasilache <Nico.Vasilache at amd.com>
Co-authored-by: Yang Bai <yangb at nvidia.com>
Co-authored-by: James Newling <james.newling at gmail.com>
Co-authored-by: Diego Caballero <dieg0ca6aller0 at gmail.com>
---
.../Vector/TransformOps/VectorTransformOps.td | 11 ++++
.../Vector/Transforms/LoweringPatterns.h | 8 +++
.../mlir/Dialect/Vector/Utils/VectorUtils.h | 17 +++++
.../VectorToLLVM/ConvertVectorToLLVM.cpp | 14 ++--
.../VectorToLLVM/ConvertVectorToLLVMPass.cpp | 1 +
.../TransformOps/VectorTransformOps.cpp | 5 ++
.../Dialect/Vector/Transforms/CMakeLists.txt | 1 +
.../Transforms/LowerVectorFromElements.cpp | 65 +++++++++++++++++++
.../Vector/Transforms/LowerVectorGather.cpp | 33 +++-------
mlir/lib/Dialect/Vector/Utils/VectorUtils.cpp | 26 ++++++++
.../VectorToLLVM/vector-to-llvm.mlir | 37 +++++++++++
.../Vector/vector-from-elements-lowering.mlir | 45 +++++++++++++
.../Vector/vector-gather-lowering.mlir | 2 +-
.../Dialect/Vector/TestVectorTransforms.cpp | 24 +++++++
.../python/dialects/transform_vector_ext.py | 2 +
15 files changed, 261 insertions(+), 30 deletions(-)
create mode 100644 mlir/lib/Dialect/Vector/Transforms/LowerVectorFromElements.cpp
create mode 100644 mlir/test/Dialect/Vector/vector-from-elements-lowering.mlir
diff --git a/mlir/include/mlir/Dialect/Vector/TransformOps/VectorTransformOps.td b/mlir/include/mlir/Dialect/Vector/TransformOps/VectorTransformOps.td
index 299f198e4ab9c..07a4117a37b2c 100644
--- a/mlir/include/mlir/Dialect/Vector/TransformOps/VectorTransformOps.td
+++ b/mlir/include/mlir/Dialect/Vector/TransformOps/VectorTransformOps.td
@@ -254,6 +254,17 @@ def ApplyLowerGatherPatternsOp : Op<Transform_Dialect,
let assemblyFormat = "attr-dict";
}
+def ApplyUnrollFromElementsPatternsOp : Op<Transform_Dialect,
+ "apply_patterns.vector.unroll_from_elements",
+ [DeclareOpInterfaceMethods<PatternDescriptorOpInterface>]> {
+ let description = [{
+ Indicates that vector from_elements operations should be unrolled
+ along the outermost dimension.
+ }];
+
+ let assemblyFormat = "attr-dict";
+}
+
def ApplyLowerScanPatternsOp : Op<Transform_Dialect,
"apply_patterns.vector.lower_scan",
[DeclareOpInterfaceMethods<PatternDescriptorOpInterface>]> {
diff --git a/mlir/include/mlir/Dialect/Vector/Transforms/LoweringPatterns.h b/mlir/include/mlir/Dialect/Vector/Transforms/LoweringPatterns.h
index e03f0dabece52..47f96112a9433 100644
--- a/mlir/include/mlir/Dialect/Vector/Transforms/LoweringPatterns.h
+++ b/mlir/include/mlir/Dialect/Vector/Transforms/LoweringPatterns.h
@@ -303,6 +303,14 @@ void populateVectorRankReducingFMAPattern(RewritePatternSet &patterns);
void populateVectorToFromElementsToShuffleTreePatterns(
RewritePatternSet &patterns, PatternBenefit benefit = 1);
+/// Populate the pattern set with the following patterns:
+///
+/// [UnrollFromElements]
+/// Unrolls 2 or more dimensional `vector.from_elements` ops by unrolling the
+/// outermost dimension.
+void populateVectorFromElementsLoweringPatterns(RewritePatternSet &patterns,
+ PatternBenefit benefit = 1);
+
/// Populate the pattern set with the following patterns:
///
/// [ContractionOpToMatmulOpLowering]
diff --git a/mlir/include/mlir/Dialect/Vector/Utils/VectorUtils.h b/mlir/include/mlir/Dialect/Vector/Utils/VectorUtils.h
index 8bd54cf31b893..ace26990601c8 100644
--- a/mlir/include/mlir/Dialect/Vector/Utils/VectorUtils.h
+++ b/mlir/include/mlir/Dialect/Vector/Utils/VectorUtils.h
@@ -12,6 +12,7 @@
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"
+#include "mlir/Dialect/UB/IR/UBOps.h"
#include "mlir/Dialect/Utils/IndexingUtils.h"
#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/IR/BuiltinAttributes.h"
@@ -238,6 +239,22 @@ Value createReadOrMaskedRead(OpBuilder &builder, Location loc, Value source,
/// static sizes in `shape`.
LogicalResult isValidMaskedInputVector(ArrayRef<int64_t> shape,
ArrayRef<int64_t> inputVectorSizes);
+
+/// Generic utility for unrolling n-D vector operations to (n-1)-D operations.
+/// This handles the common pattern of:
+/// 1. Check if already 1-D. If so, return failure.
+/// 2. Check for scalable dimensions. If so, return failure.
+/// 3. Create poison initialized result.
+/// 4. Loop through the outermost dimension, execute the UnrollVectorOpFn to
+/// create sub vectors.
+/// 5. Insert the sub vectors back into the final vector.
+/// 6. Replace the original op with the new result.
+using UnrollVectorOpFn =
+ function_ref<Value(PatternRewriter &, Location, VectorType, int64_t)>;
+
+LogicalResult unrollVectorOp(Operation *op, PatternRewriter &rewriter,
+ UnrollVectorOpFn unrollFn);
+
} // namespace vector
/// Constructs a permutation map of invariant memref indices to vector
diff --git a/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp b/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp
index f9e2a01dbf969..afc3d1b12ac0d 100644
--- a/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp
+++ b/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp
@@ -1891,15 +1891,21 @@ struct VectorFromElementsLowering
ConversionPatternRewriter &rewriter) const override {
Location loc = fromElementsOp.getLoc();
VectorType vectorType = fromElementsOp.getType();
- // TODO: Multi-dimensional vectors lower to !llvm.array<... x vector<>>.
- // Such ops should be handled in the same way as vector.insert.
+ // Only support 1-D vectors. Multi-dimensional vectors should have been
+ // transformed to 1-D vectors by the vector-to-vector transformations before
+ // this.
if (vectorType.getRank() > 1)
return rewriter.notifyMatchFailure(fromElementsOp,
"rank > 1 vectors are not supported");
Type llvmType = typeConverter->convertType(vectorType);
+ Type llvmIndexType = typeConverter->convertType(rewriter.getIndexType());
Value result = LLVM::PoisonOp::create(rewriter, loc, llvmType);
- for (auto [idx, val] : llvm::enumerate(adaptor.getElements()))
- result = vector::InsertOp::create(rewriter, loc, val, result, idx);
+ for (auto [idx, val] : llvm::enumerate(adaptor.getElements())) {
+ auto constIdx =
+ LLVM::ConstantOp::create(rewriter, loc, llvmIndexType, idx);
+ result = LLVM::InsertElementOp::create(rewriter, loc, llvmType, result,
+ val, constIdx);
+ }
rewriter.replaceOp(fromElementsOp, result);
return success();
}
diff --git a/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVMPass.cpp b/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVMPass.cpp
index cf108690c3741..9852df6970fdc 100644
--- a/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVMPass.cpp
+++ b/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVMPass.cpp
@@ -94,6 +94,7 @@ void ConvertVectorToLLVMPass::runOnOperation() {
populateVectorStepLoweringPatterns(patterns);
populateVectorRankReducingFMAPattern(patterns);
populateVectorGatherLoweringPatterns(patterns);
+ populateVectorFromElementsLoweringPatterns(patterns);
if (armI8MM) {
if (armNeon)
arm_neon::populateLowerContractionToNeonI8MMPatterns(patterns);
diff --git a/mlir/lib/Dialect/Vector/TransformOps/VectorTransformOps.cpp b/mlir/lib/Dialect/Vector/TransformOps/VectorTransformOps.cpp
index 2d5cc070558c3..fe066dc04ad55 100644
--- a/mlir/lib/Dialect/Vector/TransformOps/VectorTransformOps.cpp
+++ b/mlir/lib/Dialect/Vector/TransformOps/VectorTransformOps.cpp
@@ -139,6 +139,11 @@ void transform::ApplyLowerGatherPatternsOp::populatePatterns(
vector::populateVectorGatherLoweringPatterns(patterns);
}
+void transform::ApplyUnrollFromElementsPatternsOp::populatePatterns(
+ RewritePatternSet &patterns) {
+ vector::populateVectorFromElementsLoweringPatterns(patterns);
+}
+
void transform::ApplyLowerScanPatternsOp::populatePatterns(
RewritePatternSet &patterns) {
vector::populateVectorScanLoweringPatterns(patterns);
diff --git a/mlir/lib/Dialect/Vector/Transforms/CMakeLists.txt b/mlir/lib/Dialect/Vector/Transforms/CMakeLists.txt
index 9e287fc109990..acbf2b746037b 100644
--- a/mlir/lib/Dialect/Vector/Transforms/CMakeLists.txt
+++ b/mlir/lib/Dialect/Vector/Transforms/CMakeLists.txt
@@ -3,6 +3,7 @@ add_mlir_dialect_library(MLIRVectorTransforms
LowerVectorBitCast.cpp
LowerVectorBroadcast.cpp
LowerVectorContract.cpp
+ LowerVectorFromElements.cpp
LowerVectorGather.cpp
LowerVectorInterleave.cpp
LowerVectorMask.cpp
diff --git a/mlir/lib/Dialect/Vector/Transforms/LowerVectorFromElements.cpp b/mlir/lib/Dialect/Vector/Transforms/LowerVectorFromElements.cpp
new file mode 100644
index 0000000000000..c22fd54cef46b
--- /dev/null
+++ b/mlir/lib/Dialect/Vector/Transforms/LowerVectorFromElements.cpp
@@ -0,0 +1,65 @@
+//===- LowerVectorFromElements.cpp - Lower 'vector.from_elements' op -----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements target-independent rewrites and utilities to lower the
+// 'vector.from_elements' operation.
+//
+//===----------------------------------------------------------------------===//
+
+#include "mlir/Dialect/Vector/IR/VectorOps.h"
+#include "mlir/Dialect/Vector/Transforms/LoweringPatterns.h"
+
+#define DEBUG_TYPE "lower-vector-from-elements"
+
+using namespace mlir;
+
+namespace {
+
+/// Unrolls 2 or more dimensional `vector.from_elements` ops by unrolling the
+/// outermost dimension. For example:
+/// ```
+/// %v = vector.from_elements %e0, %e1, %e2, %e3, %e4, %e5 : vector<2x3xf32>
+///
+/// ==>
+///
+/// %0 = ub.poison : vector<2x3xf32>
+/// %v0 = vector.from_elements %e0, %e1, %e2 : vector<3xf32>
+/// %1 = vector.insert %v0, %0 [0] : vector<3xf32> into vector<2x3xf32>
+/// %v1 = vector.from_elements %e3, %e4, %e5 : vector<3xf32>
+/// %v = vector.insert %v1, %1 [1] : vector<3xf32> into vector<2x3xf32>
+/// ```
+///
+/// When applied exhaustively, this will produce a sequence of 1-d from_elements
+/// ops.
+struct UnrollFromElements : OpRewritePattern<vector::FromElementsOp> {
+ using OpRewritePattern::OpRewritePattern;
+
+ LogicalResult matchAndRewrite(vector::FromElementsOp op,
+ PatternRewriter &rewriter) const override {
+ ValueRange allElements = op.getElements();
+
+ auto unrollFromElementsFn = [&](PatternRewriter &rewriter, Location loc,
+ VectorType subTy, int64_t index) {
+ size_t subTyNumElements = subTy.getNumElements();
+ assert((index + 1) * subTyNumElements <= allElements.size() &&
+ "out of bounds");
+ ValueRange subElements =
+ allElements.slice(index * subTyNumElements, subTyNumElements);
+ return vector::FromElementsOp::create(rewriter, loc, subTy, subElements);
+ };
+
+ return unrollVectorOp(op, rewriter, unrollFromElementsFn);
+ }
+};
+
+} // namespace
+
+void mlir::vector::populateVectorFromElementsLoweringPatterns(
+ RewritePatternSet &patterns, PatternBenefit benefit) {
+ patterns.add<UnrollFromElements>(patterns.getContext(), benefit);
+}
diff --git a/mlir/lib/Dialect/Vector/Transforms/LowerVectorGather.cpp b/mlir/lib/Dialect/Vector/Transforms/LowerVectorGather.cpp
index e062f55f87679..90f21c53246b0 100644
--- a/mlir/lib/Dialect/Vector/Transforms/LowerVectorGather.cpp
+++ b/mlir/lib/Dialect/Vector/Transforms/LowerVectorGather.cpp
@@ -54,27 +54,13 @@ struct UnrollGather : OpRewritePattern<vector::GatherOp> {
LogicalResult matchAndRewrite(vector::GatherOp op,
PatternRewriter &rewriter) const override {
- VectorType resultTy = op.getType();
- if (resultTy.getRank() < 2)
- return rewriter.notifyMatchFailure(op, "already 1-D");
-
- // Unrolling doesn't take vscale into account. Pattern is disabled for
- // vectors with leading scalable dim(s).
- if (resultTy.getScalableDims().front())
- return rewriter.notifyMatchFailure(op, "cannot unroll scalable dim");
-
- Location loc = op.getLoc();
Value indexVec = op.getIndexVec();
Value maskVec = op.getMask();
Value passThruVec = op.getPassThru();
- Value result = arith::ConstantOp::create(rewriter, loc, resultTy,
- rewriter.getZeroAttr(resultTy));
-
- VectorType subTy = VectorType::Builder(resultTy).dropDim(0);
-
- for (int64_t i = 0, e = resultTy.getShape().front(); i < e; ++i) {
- int64_t thisIdx[1] = {i};
+ auto unrollGatherFn = [&](PatternRewriter &rewriter, Location loc,
+ VectorType subTy, int64_t index) {
+ int64_t thisIdx[1] = {index};
Value indexSubVec =
vector::ExtractOp::create(rewriter, loc, indexVec, thisIdx);
@@ -82,15 +68,12 @@ struct UnrollGather : OpRewritePattern<vector::GatherOp> {
vector::ExtractOp::create(rewriter, loc, maskVec, thisIdx);
Value passThruSubVec =
vector::ExtractOp::create(rewriter, loc, passThruVec, thisIdx);
- Value subGather = vector::GatherOp::create(
- rewriter, loc, subTy, op.getBase(), op.getIndices(), indexSubVec,
- maskSubVec, passThruSubVec);
- result =
- vector::InsertOp::create(rewriter, loc, subGather, result, thisIdx);
- }
+ return vector::GatherOp::create(rewriter, loc, subTy, op.getBase(),
+ op.getIndices(), indexSubVec, maskSubVec,
+ passThruSubVec);
+ };
- rewriter.replaceOp(op, result);
- return success();
+ return unrollVectorOp(op, rewriter, unrollGatherFn);
}
};
diff --git a/mlir/lib/Dialect/Vector/Utils/VectorUtils.cpp b/mlir/lib/Dialect/Vector/Utils/VectorUtils.cpp
index 6e2fa35e1279a..841e1384e03b3 100644
--- a/mlir/lib/Dialect/Vector/Utils/VectorUtils.cpp
+++ b/mlir/lib/Dialect/Vector/Utils/VectorUtils.cpp
@@ -392,3 +392,29 @@ vector::isValidMaskedInputVector(ArrayRef<int64_t> shape,
}
return success();
}
+
+LogicalResult vector::unrollVectorOp(Operation *op, PatternRewriter &rewriter,
+ vector::UnrollVectorOpFn unrollFn) {
+ assert(op->getNumResults() == 1 && "expected single result");
+ assert(isa<VectorType>(op->getResult(0).getType()) && "expected vector type");
+ VectorType resultTy = cast<VectorType>(op->getResult(0).getType());
+ if (resultTy.getRank() < 2)
+ return rewriter.notifyMatchFailure(op, "already 1-D");
+
+ // Unrolling doesn't take vscale into account. Pattern is disabled for
+ // vectors with leading scalable dim(s).
+ if (resultTy.getScalableDims().front())
+ return rewriter.notifyMatchFailure(op, "cannot unroll scalable dim");
+
+ Location loc = op->getLoc();
+ Value result = ub::PoisonOp::create(rewriter, loc, resultTy);
+ VectorType subTy = VectorType::Builder(resultTy).dropDim(0);
+
+ for (int64_t i = 0, e = resultTy.getShape().front(); i < e; ++i) {
+ Value subVector = unrollFn(rewriter, loc, subTy, i);
+ result = vector::InsertOp::create(rewriter, loc, subVector, result, i);
+ }
+
+ rewriter.replaceOp(op, result);
+ return success();
+}
diff --git a/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir b/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir
index 72810b5dddaa3..07d335117de01 100644
--- a/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir
+++ b/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir
@@ -1737,3 +1737,40 @@ func.func @step() -> vector<4xindex> {
%0 = vector.step : vector<4xindex>
return %0 : vector<4xindex>
}
+
+
+// -----
+
+//===----------------------------------------------------------------------===//
+// vector.from_elements
+//===----------------------------------------------------------------------===//
+
+// NOTE: We unroll multi-dimensional from_elements ops with pattern `UnrollFromElements`
+// and then convert the 1-D from_elements ops to llvm.
+
+// CHECK-LABEL: func @from_elements_3d
+// CHECK-SAME: %[[ARG_0:.*]]: f32, %[[ARG_1:.*]]: f32, %[[ARG_2:.*]]: f32, %[[ARG_3:.*]]: f32)
+// CHECK: %[[UNDEF_RES:.*]] = ub.poison : vector<2x1x2xf32>
+// CHECK: %[[UNDEF_RES_LLVM:.*]] = builtin.unrealized_conversion_cast %[[UNDEF_RES]] : vector<2x1x2xf32> to !llvm.array<2 x array<1 x vector<2xf32>>>
+// CHECK: %[[UNDEF_VEC_RANK_2:.*]] = ub.poison : vector<1x2xf32>
+// CHECK: %[[UNDEF_VEC_RANK_2_LLVM:.*]] = builtin.unrealized_conversion_cast %[[UNDEF_VEC_RANK_2]] : vector<1x2xf32> to !llvm.array<1 x vector<2xf32>>
+// CHECK: %[[UNDEF_VEC0:.*]] = llvm.mlir.poison : vector<2xf32>
+// CHECK: %[[C0_0:.*]] = llvm.mlir.constant(0 : i64) : i64
+// CHECK: %[[VEC0_0:.*]] = llvm.insertelement %[[ARG_0]], %[[UNDEF_VEC0]][%[[C0_0]] : i64] : vector<2xf32>
+// CHECK: %[[C1_0:.*]] = llvm.mlir.constant(1 : i64) : i64
+// CHECK: %[[VEC0_1:.*]] = llvm.insertelement %[[ARG_1]], %[[VEC0_0]][%[[C1_0]] : i64] : vector<2xf32>
+// CHECK: %[[RES_RANK_2_0:.*]] = llvm.insertvalue %[[VEC0_1]], %[[UNDEF_VEC_RANK_2_LLVM]][0] : !llvm.array<1 x vector<2xf32>>
+// CHECK: %[[RES_0:.*]] = llvm.insertvalue %[[RES_RANK_2_0]], %[[UNDEF_RES_LLVM]][0] : !llvm.array<2 x array<1 x vector<2xf32>>>
+// CHECK: %[[UNDEF_VEC1:.*]] = llvm.mlir.poison : vector<2xf32>
+// CHECK: %[[C0_1:.*]] = llvm.mlir.constant(0 : i64) : i64
+// CHECK: %[[VEC1_0:.*]] = llvm.insertelement %[[ARG_2]], %[[UNDEF_VEC1]][%[[C0_1]] : i64] : vector<2xf32>
+// CHECK: %[[C1_1:.*]] = llvm.mlir.constant(1 : i64) : i64
+// CHECK: %[[VEC1_1:.*]] = llvm.insertelement %[[ARG_3]], %[[VEC1_0]][%[[C1_1]] : i64] : vector<2xf32>
+// CHECK: %[[RES_RANK_2_1:.*]] = llvm.insertvalue %[[VEC1_1]], %[[UNDEF_VEC_RANK_2_LLVM]][0] : !llvm.array<1 x vector<2xf32>>
+// CHECK: %[[RES_1:.*]] = llvm.insertvalue %[[RES_RANK_2_1]], %[[RES_0]][1] : !llvm.array<2 x array<1 x vector<2xf32>>>
+// CHECK: %[[CAST:.*]] = builtin.unrealized_conversion_cast %[[RES_1]] : !llvm.array<2 x array<1 x vector<2xf32>>> to vector<2x1x2xf32>
+// CHECK: return %[[CAST]]
+func.func @from_elements_3d(%arg0: f32, %arg1: f32, %arg2: f32, %arg3: f32) -> vector<2x1x2xf32> {
+ %0 = vector.from_elements %arg0, %arg1, %arg2, %arg3 : vector<2x1x2xf32>
+ return %0 : vector<2x1x2xf32>
+}
diff --git a/mlir/test/Dialect/Vector/vector-from-elements-lowering.mlir b/mlir/test/Dialect/Vector/vector-from-elements-lowering.mlir
new file mode 100644
index 0000000000000..8fac608ed5692
--- /dev/null
+++ b/mlir/test/Dialect/Vector/vector-from-elements-lowering.mlir
@@ -0,0 +1,45 @@
+// RUN: mlir-opt %s -test-unroll-vector-from-elements | FileCheck %s --check-prefix=CHECK-UNROLL
+
+//===----------------------------------------------------------------------===//
+// Test UnrollFromElements.
+//===----------------------------------------------------------------------===//
+
+// CHECK-UNROLL-LABEL: @unroll_from_elements_2d
+// CHECK-UNROLL-SAME: (%[[ARG0:.*]]: f32, %[[ARG1:.*]]: f32, %[[ARG2:.*]]: f32, %[[ARG3:.*]]: f32)
+// CHECK-UNROLL-NEXT: %[[UNDEF_RES:.*]] = ub.poison : vector<2x2xf32>
+// CHECK-UNROLL-NEXT: %[[VEC_0:.*]] = vector.from_elements %[[ARG0]], %[[ARG1]] : vector<2xf32>
+// CHECK-UNROLL-NEXT: %[[RES_0:.*]] = vector.insert %[[VEC_0]], %[[UNDEF_RES]] [0] : vector<2xf32> into vector<2x2xf32>
+// CHECK-UNROLL-NEXT: %[[VEC_1:.*]] = vector.from_elements %[[ARG2]], %[[ARG3]] : vector<2xf32>
+// CHECK-UNROLL-NEXT: %[[RES_1:.*]] = vector.insert %[[VEC_1]], %[[RES_0]] [1] : vector<2xf32> into vector<2x2xf32>
+// CHECK-UNROLL-NEXT: return %[[RES_1]] : vector<2x2xf32>
+func.func @unroll_from_elements_2d(%arg0: f32, %arg1: f32, %arg2: f32, %arg3: f32) -> vector<2x2xf32> {
+ %0 = vector.from_elements %arg0, %arg1, %arg2, %arg3 : vector<2x2xf32>
+ return %0 : vector<2x2xf32>
+}
+
+// CHECK-UNROLL-LABEL: @unroll_from_elements_3d
+// CHECK-UNROLL-SAME: (%[[ARG0:.*]]: f32, %[[ARG1:.*]]: f32, %[[ARG2:.*]]: f32, %[[ARG3:.*]]: f32)
+// CHECK-UNROLL-NEXT: %[[UNDEF_RES:.*]] = ub.poison : vector<2x1x2xf32>
+// CHECK-UNROLL-NEXT: %[[UNDEF_RANK_2:.*]] = ub.poison : vector<1x2xf32>
+// CHECK-UNROLL-NEXT: %[[VEC_0:.*]] = vector.from_elements %[[ARG0]], %[[ARG1]] : vector<2xf32>
+// CHECK-UNROLL-NEXT: %[[RANK_2_0:.*]] = vector.insert %[[VEC_0]], %[[UNDEF_RANK_2]] [0] : vector<2xf32> into vector<1x2xf32>
+// CHECK-UNROLL-NEXT: %[[RES_0:.*]] = vector.insert %[[RANK_2_0]], %[[UNDEF_RES]] [0] : vector<1x2xf32> into vector<2x1x2xf32>
+// CHECK-UNROLL-NEXT: %[[VEC_1:.*]] = vector.from_elements %[[ARG2]], %[[ARG3]] : vector<2xf32>
+// CHECK-UNROLL-NEXT: %[[RANK_2_1:.*]] = vector.insert %[[VEC_1]], %[[UNDEF_RANK_2]] [0] : vector<2xf32> into vector<1x2xf32>
+// CHECK-UNROLL-NEXT: %[[RES_1:.*]] = vector.insert %[[RANK_2_1]], %[[RES_0]] [1] : vector<1x2xf32> into vector<2x1x2xf32>
+// CHECK-UNROLL-NEXT: return %[[RES_1]] : vector<2x1x2xf32>
+func.func @unroll_from_elements_3d(%arg0: f32, %arg1: f32, %arg2: f32, %arg3: f32) -> vector<2x1x2xf32> {
+ %0 = vector.from_elements %arg0, %arg1, %arg2, %arg3 : vector<2x1x2xf32>
+ return %0 : vector<2x1x2xf32>
+}
+
+// 1-D vector.from_elements should not be unrolled.
+
+// CHECK-UNROLL-LABEL: @negative_unroll_from_elements_1d
+// CHECK-UNROLL-SAME: (%[[ARG0:.*]]: f32, %[[ARG1:.*]]: f32)
+// CHECK-UNROLL-NEXT: %[[RES:.*]] = vector.from_elements %[[ARG0]], %[[ARG1]] : vector<2xf32>
+// CHECK-UNROLL-NEXT: return %[[RES]] : vector<2xf32>
+func.func @negative_unroll_from_elements_1d(%arg0: f32, %arg1: f32) -> vector<2xf32> {
+ %0 = vector.from_elements %arg0, %arg1 : vector<2xf32>
+ return %0 : vector<2xf32>
+}
diff --git a/mlir/test/Dialect/Vector/vector-gather-lowering.mlir b/mlir/test/Dialect/Vector/vector-gather-lowering.mlir
index 5be267c1be984..9c2a508671e06 100644
--- a/mlir/test/Dialect/Vector/vector-gather-lowering.mlir
+++ b/mlir/test/Dialect/Vector/vector-gather-lowering.mlir
@@ -81,7 +81,7 @@ func.func @gather_memref_1d_i32_index(%base: memref<?xf32>, %v: vector<2xi32>, %
// CHECK-SAME: %[[PASS:.*]]: vector<2x[3]xf32>
// CHECK: %[[C0:.*]] = arith.constant 0 : index
// CHECK: %[[C1:.*]] = arith.constant 1 : index
-// CHECK: %[[INIT:.*]] = arith.constant dense<0.000000e+00> : vector<2x[3]xf32>
+// CHECK: %[[INIT:.*]] = ub.poison : vector<2x[3]xf32>
// CHECK: %[[IDXVEC0:.*]] = vector.extract %[[IDXVEC]][0] : vector<[3]xindex> from vector<2x[3]xindex>
// CHECK: %[[MASK0:.*]] = vector.extract %[[MASK]][0] : vector<[3]xi1> from vector<2x[3]xi1>
// CHECK: %[[PASS0:.*]] = vector.extract %[[PASS]][0] : vector<[3]xf32> from vector<2x[3]xf32>
diff --git a/mlir/test/lib/Dialect/Vector/TestVectorTransforms.cpp b/mlir/test/lib/Dialect/Vector/TestVectorTransforms.cpp
index f89c944b5c564..bb1598ee3efe5 100644
--- a/mlir/test/lib/Dialect/Vector/TestVectorTransforms.cpp
+++ b/mlir/test/lib/Dialect/Vector/TestVectorTransforms.cpp
@@ -786,6 +786,28 @@ struct TestVectorGatherLowering
}
};
+struct TestUnrollVectorFromElements
+ : public PassWrapper<TestUnrollVectorFromElements,
+ OperationPass<func::FuncOp>> {
+ MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(TestUnrollVectorFromElements)
+
+ StringRef getArgument() const final {
+ return "test-unroll-vector-from-elements";
+ }
+ StringRef getDescription() const final {
+ return "Test unrolling patterns for from_elements ops";
+ }
+ void getDependentDialects(DialectRegistry &registry) const override {
+ registry.insert<func::FuncDialect, vector::VectorDialect, ub::UBDialect>();
+ }
+
+ void runOnOperation() override {
+ RewritePatternSet patterns(&getContext());
+ populateVectorFromElementsLoweringPatterns(patterns);
+ (void)applyPatternsGreedily(getOperation(), std::move(patterns));
+ }
+};
+
struct TestFoldArithExtensionIntoVectorContractPatterns
: public PassWrapper<TestFoldArithExtensionIntoVectorContractPatterns,
OperationPass<func::FuncOp>> {
@@ -1059,6 +1081,8 @@ void registerTestVectorLowerings() {
PassRegistration<TestVectorGatherLowering>();
+ PassRegistration<TestUnrollVectorFromElements>();
+
PassRegistration<TestFoldArithExtensionIntoVectorContractPatterns>();
PassRegistration<TestVectorEmulateMaskedLoadStore>();
diff --git a/mlir/test/python/dialects/transform_vector_ext.py b/mlir/test/python/dialects/transform_vector_ext.py
index a51f2154d1f7d..5a648fe073315 100644
--- a/mlir/test/python/dialects/transform_vector_ext.py
+++ b/mlir/test/python/dialects/transform_vector_ext.py
@@ -46,6 +46,8 @@ def non_configurable_patterns():
vector.ApplyLowerOuterProductPatternsOp()
# CHECK: transform.apply_patterns.vector.lower_gather
vector.ApplyLowerGatherPatternsOp()
+ # CHECK: transform.apply_patterns.vector.unroll_from_elements
+ vector.ApplyUnrollFromElementsPatternsOp()
# CHECK: transform.apply_patterns.vector.lower_scan
vector.ApplyLowerScanPatternsOp()
# CHECK: transform.apply_patterns.vector.lower_shape_cast
>From c2e7fad44691ed44281bde9e8322e70be0e6aeec Mon Sep 17 00:00:00 2001
From: Panagiotis Karouzakis <45971450+karouzakisp at users.noreply.github.com>
Date: Mon, 18 Aug 2025 20:11:16 +0300
Subject: [PATCH 056/112] [DemandedBits] Support non-constant shift amounts
(#148880)
This patch adds support for non-constant shift amounts to the shift
operators in DemandedBits.
ashr proof: https://alive2.llvm.org/ce/z/EN-siK
lshr proof: https://alive2.llvm.org/ce/z/eeGzyB
shl proof: https://alive2.llvm.org/ce/z/dpvbkq
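The core idea is that, for a shift amount known only to lie in [Min, Max], the demanded operand bits are the union of the demanded output bits shifted by every amount in that range. The sketch below is an illustration under stated assumptions, not the patch's code: it uses a fixed 8-bit width held in a `uint32_t` instead of `APInt`, and the names `shiftBy`, `shiftedRangeNaive`, and `shiftedRangeDoubling` are hypothetical. The doubling variant follows the loop structure of `GetShiftedRange` in the diff, and the naive variant serves as a reference.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 8-bit stand-in for APInt; all names here are illustrative.
constexpr unsigned kBitWidth = 8;
constexpr uint32_t kAllOnes = (1u << kBitWidth) - 1;

// Shifts an 8-bit mask left or right and truncates back to 8 bits.
uint32_t shiftBy(uint32_t mask, unsigned amount, bool shiftLeft) {
  assert(amount < kBitWidth && "shift amount must be below the bit width");
  return (shiftLeft ? (mask << amount) : (mask >> amount)) & kAllOnes;
}

// Reference: OR together one shifted copy of aOut per amount in [min, max].
uint32_t shiftedRangeNaive(uint32_t aOut, unsigned min, unsigned max,
                           bool shiftLeft) {
  uint32_t result = 0;
  for (unsigned s = min; s <= max; ++s)
    result |= shiftBy(aOut, s, shiftLeft);
  return result;
}

// Doubling variant: builds the same union in O(log(max - min)) steps.
uint32_t shiftedRangeDoubling(uint32_t aOut, unsigned min, unsigned max,
                              bool shiftLeft) {
  unsigned loopRange = max - min;
  uint32_t mask = aOut;    // Covers the shift amount 0.
  uint32_t shifted = aOut; // Union of shifts in [0, amount).
  for (unsigned amount = 1; amount <= loopRange; amount <<= 1) {
    if (loopRange & amount) {
      // Cover the shift amounts in (loopRange - amount, loopRange].
      mask |= shiftBy(shifted, loopRange - amount + 1, shiftLeft);
      loopRange -= amount; // Clear the bit that was just handled.
    }
    // Extend coverage from [0, amount) to [0, 2 * amount).
    shifted |= shiftBy(shifted, amount, shiftLeft);
  }
  // Move the coverage from [0, max - min] to [min, max].
  return shiftBy(mask, min, shiftLeft);
}
```

With `aOut = 0x01`, a range of [1, 3], and left shifts (the direction the diff passes for lshr and ashr, since their operand bits sit above the output bits), both functions return `0x0E`, that is, bits 1 through 3.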
---
llvm/lib/Analysis/DemandedBits.cpp | 69 +++++++++
llvm/test/Analysis/DemandedBits/ashr.ll | 198 ++++++++++++++++++++++++
llvm/test/Analysis/DemandedBits/lshr.ll | 198 ++++++++++++++++++++++++
llvm/test/Analysis/DemandedBits/shl.ll | 134 +++++++++++++++-
4 files changed, 598 insertions(+), 1 deletion(-)
create mode 100644 llvm/test/Analysis/DemandedBits/ashr.ll
create mode 100644 llvm/test/Analysis/DemandedBits/lshr.ll
diff --git a/llvm/lib/Analysis/DemandedBits.cpp b/llvm/lib/Analysis/DemandedBits.cpp
index 6694d5cc06c8c..e0881751aef7e 100644
--- a/llvm/lib/Analysis/DemandedBits.cpp
+++ b/llvm/lib/Analysis/DemandedBits.cpp
@@ -76,6 +76,26 @@ void DemandedBits::determineLiveOperandBits(
computeKnownBits(V2, Known2, DL, &AC, UserI, &DT);
}
};
+ auto GetShiftedRange = [&](uint64_t Min, uint64_t Max, bool ShiftLeft) {
+ auto ShiftF = [ShiftLeft](const APInt &Mask, unsigned ShiftAmnt) {
+ return ShiftLeft ? Mask.shl(ShiftAmnt) : Mask.lshr(ShiftAmnt);
+ };
+ AB = APInt::getZero(BitWidth);
+ uint64_t LoopRange = Max - Min;
+ APInt Mask = AOut;
+ APInt Shifted = AOut; // AOut | (AOut << 1) | ... | (AOut << (ShiftAmnt - 1)
+ for (unsigned ShiftAmnt = 1; ShiftAmnt <= LoopRange; ShiftAmnt <<= 1) {
+ if (LoopRange & ShiftAmnt) {
+ // Account for (LoopRange - ShiftAmnt, LoopRange]
+ Mask |= ShiftF(Shifted, LoopRange - ShiftAmnt + 1);
+ // Clears the low bit.
+ LoopRange -= ShiftAmnt;
+ }
+ // [0, ShiftAmnt) -> [0, ShiftAmnt * 2)
+ Shifted |= ShiftF(Shifted, ShiftAmnt);
+ }
+ AB = ShiftF(Mask, Min);
+ };
switch (UserI->getOpcode()) {
default: break;
@@ -183,6 +203,17 @@ void DemandedBits::determineLiveOperandBits(
AB |= APInt::getHighBitsSet(BitWidth, ShiftAmt+1);
else if (S->hasNoUnsignedWrap())
AB |= APInt::getHighBitsSet(BitWidth, ShiftAmt);
+ } else {
+ ComputeKnownBits(BitWidth, UserI->getOperand(1), nullptr);
+ uint64_t Min = Known.getMinValue().getLimitedValue(BitWidth - 1);
+ uint64_t Max = Known.getMaxValue().getLimitedValue(BitWidth - 1);
+ // similar to Lshr case
+ GetShiftedRange(Min, Max, /*ShiftLeft=*/false);
+ const auto *S = cast<ShlOperator>(UserI);
+ if (S->hasNoSignedWrap())
+ AB |= APInt::getHighBitsSet(BitWidth, Max + 1);
+ else if (S->hasNoUnsignedWrap())
+ AB |= APInt::getHighBitsSet(BitWidth, Max);
}
}
break;
@@ -197,6 +228,24 @@ void DemandedBits::determineLiveOperandBits(
// (they must be zero).
if (cast<LShrOperator>(UserI)->isExact())
AB |= APInt::getLowBitsSet(BitWidth, ShiftAmt);
+ } else {
+ ComputeKnownBits(BitWidth, UserI->getOperand(1), nullptr);
+ uint64_t Min = Known.getMinValue().getLimitedValue(BitWidth - 1);
+ uint64_t Max = Known.getMaxValue().getLimitedValue(BitWidth - 1);
+ // Suppose AOut == 0b0000 0001
+ // [min, max] = [1, 3]
+ // iteration 1 shift by 1 mask is 0b0000 0011
+ // iteration 2 shift by 2 mask is 0b0000 1111
+ // iteration 3, shiftAmnt = 4 > max - min, we stop.
+ //
+ // After the iterations we need one more shift by min,
+ // to move from 0b0000 1111 to --> 0b0001 1110.
+ // The loop populates the mask relative to (0,...,max-min),
+ // but we need coverage from (min, max).
+ // This is why the shift by min is needed.
+ GetShiftedRange(Min, Max, /*ShiftLeft=*/true);
+ if (cast<LShrOperator>(UserI)->isExact())
+ AB |= APInt::getLowBitsSet(BitWidth, Max);
}
}
break;
@@ -217,6 +266,26 @@ void DemandedBits::determineLiveOperandBits(
// (they must be zero).
if (cast<AShrOperator>(UserI)->isExact())
AB |= APInt::getLowBitsSet(BitWidth, ShiftAmt);
+ } else {
+ ComputeKnownBits(BitWidth, UserI->getOperand(1), nullptr);
+ uint64_t Min = Known.getMinValue().getLimitedValue(BitWidth - 1);
+ uint64_t Max = Known.getMaxValue().getLimitedValue(BitWidth - 1);
+ GetShiftedRange(Min, Max, /*ShiftLeft=*/true);
+ if (Max &&
+ (AOut & APInt::getHighBitsSet(BitWidth, Max)).getBoolValue()) {
+ // Suppose AOut = 0011 1100
+ // [min, max] = [1, 3]
+ // ShiftAmount = 1 : Mask is 1000 0000
+ // ShiftAmount = 2 : Mask is 1100 0000
+ // ShiftAmount = 3 : Mask is 1110 0000
+ // The Mask with Max covers every case in [min, max],
+ // so we are done
+ AB.setSignBit();
+ }
+ // If the shift is exact, then the low bits are not dead
+ // (they must be zero).
+ if (cast<AShrOperator>(UserI)->isExact())
+ AB |= APInt::getLowBitsSet(BitWidth, Max);
}
}
break;
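
For readers following the mask arithmetic in the LShr and AShr comments above, here is a minimal standalone sketch of the same idea on plain 32-bit integers. It is a reconstruction from those comments, not the patch's `GetShiftedRange` lambda: the helper names `shiftedRangeLeft` and `demandedForAShr` are hypothetical, `uint32_t` stands in for `APInt`, and the precondition `Min <= Max <= 31` is assumed (the patch clamps both values to `BitWidth - 1`).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: conservative union of AOut shifted left by every
// amount in [Min, Max]. The loop ORs in shifts by 1, 2, 4, ... while the
// step is <= Max - Min, which covers at least the offsets 0..Max-Min
// (possibly a few more). The final shift by Min moves that coverage from
// (0..Max-Min) to (Min..Max).
// Precondition (assumed): Min <= Max <= 31.
inline uint32_t shiftedRangeLeft(uint32_t AOut, unsigned Min, unsigned Max) {
  uint32_t Mask = AOut;
  for (unsigned Amt = 1; Amt <= Max - Min; Amt <<= 1)
    Mask |= Mask << Amt;
  return Mask << Min;
}

// Hypothetical helper: demanded bits of the shifted operand of an ashr whose
// amount lies in [Min, Max]. If any of the top Max output bits is demanded,
// that bit may be a copy of the sign bit, so the sign bit is demanded too.
inline uint32_t demandedForAShr(uint32_t AOut, unsigned Min, unsigned Max) {
  uint32_t AB = shiftedRangeLeft(AOut, Min, Max);
  if (Max != 0 && (AOut & (~0u << (32 - Max))) != 0)
    AB |= 0x80000000u;
  return AB;
}
```

With `AOut = 0b0000'0001` and `[Min, Max] = [1, 3]` the first helper yields `0b0001'1110`, as in the LShr comment. With `AOut = 0xff` and `[0, 3]` both helpers yield `0x7ff`, the value the `test_lshr_range_1` and `test_ashr_range_1` CHECK lines expect. The doubling loop trades precision for at most five iterations on a 32-bit value, so the mask can include a few bits beyond the exact union.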
diff --git a/llvm/test/Analysis/DemandedBits/ashr.ll b/llvm/test/Analysis/DemandedBits/ashr.ll
new file mode 100644
index 0000000000000..6185d4c492d86
--- /dev/null
+++ b/llvm/test/Analysis/DemandedBits/ashr.ll
@@ -0,0 +1,198 @@
+; RUN: opt -S -disable-output -passes="print<demanded-bits>" < %s 2>&1 | FileCheck %s
+
+define i8 @test_ashr_const_amount_4(i32 %a) {
+; CHECK-LABEL: 'test_ashr_const_amount_4'
+; CHECK-DAG: DemandedBits: 0xff for %ashr = ashr i32 %a, 4
+; CHECK-DAG: DemandedBits: 0xff0 for %a in %ashr = ashr i32 %a, 4
+; CHECK-DAG: DemandedBits: 0xffffffff for 4 in %ashr = ashr i32 %a, 4
+; CHECK-DAG: DemandedBits: 0xff for %ashr.t = trunc i32 %ashr to i8
+; CHECK-DAG: DemandedBits: 0xff for %ashr in %ashr.t = trunc i32 %ashr to i8
+;
+ %ashr = ashr i32 %a, 4
+ %ashr.t = trunc i32 %ashr to i8
+ ret i8 %ashr.t
+}
+
+define i8 @test_ashr_const_amount_5(i32 %a) {
+; CHECK-LABEL: 'test_ashr_const_amount_5'
+; CHECK-DAG: DemandedBits: 0xff for %ashr = ashr i32 %a, 5
+; CHECK-DAG: DemandedBits: 0x1fe0 for %a in %ashr = ashr i32 %a, 5
+; CHECK-DAG: DemandedBits: 0xffffffff for 5 in %ashr = ashr i32 %a, 5
+; CHECK-DAG: DemandedBits: 0xff for %ashr.t = trunc i32 %ashr to i8
+; CHECK-DAG: DemandedBits: 0xff for %ashr in %ashr.t = trunc i32 %ashr to i8
+;
+ %ashr = ashr i32 %a, 5
+ %ashr.t = trunc i32 %ashr to i8
+ ret i8 %ashr.t
+}
+
+define i8 @test_ashr_const_amount_8(i32 %a) {
+; CHECK-LABEL: 'test_ashr_const_amount_8'
+; CHECK-DAG: DemandedBits: 0xff for %ashr = ashr i32 %a, 8
+; CHECK-DAG: DemandedBits: 0xff00 for %a in %ashr = ashr i32 %a, 8
+; CHECK-DAG: DemandedBits: 0xffffffff for 8 in %ashr = ashr i32 %a, 8
+; CHECK-DAG: DemandedBits: 0xff for %ashr.t = trunc i32 %ashr to i8
+; CHECK-DAG: DemandedBits: 0xff for %ashr in %ashr.t = trunc i32 %ashr to i8
+;
+ %ashr = ashr i32 %a, 8
+ %ashr.t = trunc i32 %ashr to i8
+ ret i8 %ashr.t
+}
+
+define i8 @test_ashr_const_amount_9(i32 %a) {
+
+; CHECK-LABEL: 'test_ashr_const_amount_9'
+; CHECK-DAG: DemandedBits: 0xff for %ashr.t = trunc i32 %ashr to i8
+; CHECK-DAG: DemandedBits: 0xff for %ashr in %ashr.t = trunc i32 %ashr to i8
+; CHECK-DAG: DemandedBits: 0xff for %ashr = ashr i32 %a, 8
+; CHECK-DAG: DemandedBits: 0xff00 for %a in %ashr = ashr i32 %a, 8
+; CHECK-DAG: DemandedBits: 0xffffffff for 8 in %ashr = ashr i32 %a, 8
+;
+ %ashr = ashr i32 %a, 8
+ %ashr.t = trunc i32 %ashr to i8
+ ret i8 %ashr.t
+}
+
+define i8 @test_ashr(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_ashr'
+; CHECK-DAG: DemandedBits: 0xff for %ashr = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %ashr = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %ashr = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xff for %ashr.t = trunc i32 %ashr to i8
+; CHECK-DAG: DemandedBits: 0xff for %ashr in %ashr.t = trunc i32 %ashr to i8
+;
+ %ashr = ashr i32 %a, %b
+ %ashr.t = trunc i32 %ashr to i8
+ ret i8 %ashr.t
+}
+
+define i8 @test_ashr_range_1(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_ashr_range_1'
+; CHECK-DAG: DemandedBits: 0xff for %shl.t = trunc i32 %ashr to i8
+; CHECK-DAG: DemandedBits: 0xff for %ashr in %shl.t = trunc i32 %ashr to i8
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0x3 for %b in %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0xffffffff for 3 in %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0xff for %ashr = ashr i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0x7ff for %a in %ashr = ashr i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 in %ashr = ashr i32 %a, %b2
+;
+ %b2 = and i32 %b, 3
+ %ashr = ashr i32 %a, %b2
+ %shl.t = trunc i32 %ashr to i8
+ ret i8 %shl.t
+}
+
+define i32 @test_ashr_range_2(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_ashr_range_2'
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0x3 for %b in %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0xffffffff for 3 in %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0xffffffff for %ashr = ashr i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %ashr = ashr i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 in %ashr = ashr i32 %a, %b2
+;
+ %b2 = and i32 %b, 3
+ %ashr = ashr i32 %a, %b2
+ ret i32 %ashr
+}
+
+define i32 @test_ashr_range_3(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_ashr_range_3'
+; CHECK-DAG: DemandedBits: 0xffff for %ashr = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %ashr = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %ashr = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %shl = shl i32 %ashr, 16
+; CHECK-DAG: DemandedBits: 0xffff for %ashr in %shl = shl i32 %ashr, 16
+; CHECK-DAG: DemandedBits: 0xffffffff for 16 in %shl = shl i32 %ashr, 16
+;
+ %ashr = ashr i32 %a, %b
+ %shl = shl i32 %ashr, 16
+ ret i32 %shl
+}
+define i32 @test_ashr_range_4(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_ashr_range_4'
+; CHECK-DAG: DemandedBits: 0xffffffff for %shr = lshr i32 %ashr, 8
+; CHECK-DAG: DemandedBits: 0xffffff00 for %ashr in %shr = lshr i32 %ashr, 8
+; CHECK-DAG: DemandedBits: 0xffffffff for 8 in %shr = lshr i32 %ashr, 8
+; CHECK-DAG: DemandedBits: 0xffffff00 for %ashr = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffff00 for %a in %ashr = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %ashr = ashr i32 %a, %b
+ %ashr = ashr i32 %a, %b
+ %shr = lshr i32 %ashr, 8
+ ret i32 %shr
+}
+
+define i32 @test_ashr_range_5(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_ashr_range_5'
+; CHECK-DAG: DemandedBits: 0xffffffff for %2 = and i32 %1, 255
+; CHECK-DAG: DemandedBits: 0xff for %1 in %2 = and i32 %1, 255
+; CHECK-DAG: DemandedBits: 0xffffffff for 255 in %2 = and i32 %1, 255
+; CHECK-DAG: DemandedBits: 0xff for %1 = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %1 = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %1 = ashr i32 %a, %b
+;
+ %1 = ashr i32 %a, %b
+ %2 = and i32 %1, 255
+ ret i32 %2
+}
+
+define i32 @test_ashr_range_6(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_ashr_range_6'
+; CHECK-DAG: DemandedBits: 0xffff0000 for %lshr.1 = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffff0000 for %a in %lshr.1 = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %lshr.1 = ashr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %lshr.2 = ashr i32 %lshr.1, 16
+; CHECK-DAG: DemandedBits: 0xffff0000 for %lshr.1 in %lshr.2 = ashr i32 %lshr.1, 16
+; CHECK-DAG: DemandedBits: 0xffffffff for 16 in %lshr.2 = ashr i32 %lshr.1, 16
+;
+ %lshr.1 = ashr i32 %a, %b
+ %lshr.2 = ashr i32 %lshr.1, 16
+ ret i32 %lshr.2
+}
+
+define i8 @test_ashr_var_amount(i32 %a, i32 %b){
+; CHECK-LABEL: 'test_ashr_var_amount'
+; CHECK-DAG: DemandedBits: 0xff for %4 = ashr i32 %1, %3
+; CHECK-DAG: DemandedBits: 0xffffffff for %1 in %4 = ashr i32 %1, %3
+; CHECK-DAG: DemandedBits: 0xffffffff for %3 in %4 = ashr i32 %1, %3
+; CHECK-DAG: DemandedBits: 0xff for %2 = trunc i32 %1 to i8
+; CHECK-DAG: DemandedBits: 0xff for %1 in %2 = trunc i32 %1 to i8
+; CHECK-DAG: DemandedBits: 0xffffffff for %1 = add nsw i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %1 = add nsw i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %1 = add nsw i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %3 = zext i8 %2 to i32
+; CHECK-DAG: DemandedBits: 0xff for %2 in %3 = zext i8 %2 to i32
+; CHECK-DAG: DemandedBits: 0xff for %5 = trunc i32 %4 to i8
+; CHECK-DAG: DemandedBits: 0xff for %4 in %5 = trunc i32 %4 to i8
+;
+ %1 = add nsw i32 %a, %b
+ %2 = trunc i32 %1 to i8
+ %3 = zext i8 %2 to i32
+ %4 = ashr i32 %1, %3
+ %5 = trunc i32 %4 to i8
+ ret i8 %5
+}
+
+define i8 @test_ashr_var_amount_nsw(i32 %a, i32 %b){
+ ; CHECK-LABEL 'test_ashr_var_amount_nsw'
+ ; CHECK-DAG: DemandedBits: 0xff for %5 = trunc i32 %4 to i8
+ ; CHECK-DAG: DemandedBits: 0xff for %4 in %5 = trunc i32 %4 to i8
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %1 = add nsw i32 %a, %b
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %a in %1 = add nsw i32 %a, %b
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %b in %1 = add nsw i32 %a, %b
+ ; CHECK-DAG: DemandedBits: 0xff for %2 = trunc i32 %1 to i8
+ ; CHECK-DAG: DemandedBits: 0xff for %1 in %2 = trunc i32 %1 to i8
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %3 = zext i8 %2 to i32
+ ; CHECK-DAG: DemandedBits: 0xff for %2 in %3 = zext i8 %2 to i32
+ ; CHECK-DAG: DemandedBits: 0xff for %4 = ashr exact i32 %1, %3
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %1 in %4 = ashr exact i32 %1, %3
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %3 in %4 = ashr exact i32 %1, %3
+ ;
+ %1 = add nsw i32 %a, %b
+ %2 = trunc i32 %1 to i8
+ %3 = zext i8 %2 to i32
+ %4 = ashr exact i32 %1, %3
+ %5 = trunc i32 %4 to i8
+ ret i8 %5
+}
diff --git a/llvm/test/Analysis/DemandedBits/lshr.ll b/llvm/test/Analysis/DemandedBits/lshr.ll
new file mode 100644
index 0000000000000..e07f994a1b304
--- /dev/null
+++ b/llvm/test/Analysis/DemandedBits/lshr.ll
@@ -0,0 +1,198 @@
+; RUN: opt -S -disable-output -passes="print<demanded-bits>" < %s 2>&1 | FileCheck %s
+
+define i8 @test_lshr_const_amount_4(i32 %a) {
+; CHECK-LABEL: 'test_lshr_const_amount_4'
+; CHECK-DAG: DemandedBits: 0xff for %lshr.t = trunc i32 %lshr to i8
+; CHECK-DAG: DemandedBits: 0xff for %lshr in %lshr.t = trunc i32 %lshr to i8
+; CHECK-DAG: DemandedBits: 0xff for %lshr = lshr i32 %a, 4
+; CHECK-DAG: DemandedBits: 0xff0 for %a in %lshr = lshr i32 %a, 4
+; CHECK-DAG: DemandedBits: 0xffffffff for 4 in %lshr = lshr i32 %a, 4
+;
+ %lshr = lshr i32 %a, 4
+ %lshr.t = trunc i32 %lshr to i8
+ ret i8 %lshr.t
+}
+
+define i8 @test_lshr_const_amount_5(i32 %a) {
+; CHECK-LABEL: 'test_lshr_const_amount_5'
+; CHECK-DAG: DemandedBits: 0xff for %lshr = lshr i32 %a, 5
+; CHECK-DAG: DemandedBits: 0x1fe0 for %a in %lshr = lshr i32 %a, 5
+; CHECK-DAG: DemandedBits: 0xffffffff for 5 in %lshr = lshr i32 %a, 5
+; CHECK-DAG: DemandedBits: 0xff for %lshr.t = trunc i32 %lshr to i8
+; CHECK-DAG: DemandedBits: 0xff for %lshr in %lshr.t = trunc i32 %lshr to i8
+;
+ %lshr = lshr i32 %a, 5
+ %lshr.t = trunc i32 %lshr to i8
+ ret i8 %lshr.t
+}
+define i8 @test_lshr_const_amount_8(i32 %a) {
+; CHECK-LABEL: 'test_lshr_const_amount_8'
+; CHECK-DAG: DemandedBits: 0xff for %lshr.t = trunc i32 %lshr to i8
+; CHECK-DAG: DemandedBits: 0xff for %lshr in %lshr.t = trunc i32 %lshr to i8
+; CHECK-DAG: DemandedBits: 0xff for %lshr = lshr i32 %a, 8
+; CHECK-DAG: DemandedBits: 0xff00 for %a in %lshr = lshr i32 %a, 8
+; CHECK-DAG: DemandedBits: 0xffffffff for 8 in %lshr = lshr i32 %a, 8
+;
+ %lshr = lshr i32 %a, 8
+ %lshr.t = trunc i32 %lshr to i8
+ ret i8 %lshr.t
+}
+
+define i8 @test_lshr_const_amount_9(i32 %a) {
+; CHECK-LABEL: 'test_lshr_const_amount_9'
+; CHECK-DAG: DemandedBits: 0xff for %lshr.t = trunc i32 %lshr to i8
+; CHECK-DAG: DemandedBits: 0xff for %lshr in %lshr.t = trunc i32 %lshr to i8
+; CHECK-DAG: DemandedBits: 0xff for %lshr = lshr i32 %a, 9
+; CHECK-DAG: DemandedBits: 0x1fe00 for %a in %lshr = lshr i32 %a, 9
+; CHECK-DAG: DemandedBits: 0xffffffff for 9 in %lshr = lshr i32 %a, 9
+;
+ %lshr = lshr i32 %a, 9
+ %lshr.t = trunc i32 %lshr to i8
+ ret i8 %lshr.t
+}
+
+define i8 @test_lshr(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_lshr'
+; CHECK-DAG: DemandedBits: 0xff for %lshr = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %lshr = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %lshr = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xff for %lshr.t = trunc i32 %lshr to i8
+; CHECK-DAG: DemandedBits: 0xff for %lshr in %lshr.t = trunc i32 %lshr to i8
+;
+ %lshr = lshr i32 %a, %b
+ %lshr.t = trunc i32 %lshr to i8
+ ret i8 %lshr.t
+}
+
+define i8 @test_lshr_range_1(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_lshr_range_1'
+; CHECK-DAG: DemandedBits: 0xff for %shl.t = trunc i32 %lshr to i8
+; CHECK-DAG: DemandedBits: 0xff for %lshr in %shl.t = trunc i32 %lshr to i8
+; CHECK-DAG: DemandedBits: 0xff for %lshr = lshr i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0x7ff for %a in %lshr = lshr i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 in %lshr = lshr i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0x3 for %b in %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0xffffffff for 3 in %b2 = and i32 %b, 3
+;
+ %b2 = and i32 %b, 3
+ %lshr = lshr i32 %a, %b2
+ %shl.t = trunc i32 %lshr to i8
+ ret i8 %shl.t
+}
+
+define i32 @test_lshr_range_2(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_lshr_range_2'
+; CHECK-DAG: DemandedBits: 0xffffffff for %lshr = lshr i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %lshr = lshr i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 in %lshr = lshr i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0x3 for %b in %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0xffffffff for 3 in %b2 = and i32 %b, 3
+;
+ %b2 = and i32 %b, 3
+ %lshr = lshr i32 %a, %b2
+ ret i32 %lshr
+}
+
+define i32 @test_lshr_range_3(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_lshr_range_3'
+; CHECK-DAG: DemandedBits: 0xffff for %lshr = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %lshr = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %lshr = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %shl = shl i32 %lshr, 16
+; CHECK-DAG: DemandedBits: 0xffff for %lshr in %shl = shl i32 %lshr, 16
+; CHECK-DAG: DemandedBits: 0xffffffff for 16 in %shl = shl i32 %lshr, 16
+;
+ %lshr = lshr i32 %a, %b
+ %shl = shl i32 %lshr, 16
+ ret i32 %shl
+}
+
+define i32 @test_lshr_range_4(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_lshr_range_4'
+; CHECK-DAG: DemandedBits: 0xffffff00 for %lshr = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffff00 for %a in %lshr = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %lshr = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %shr = ashr i32 %lshr, 8
+; CHECK-DAG: DemandedBits: 0xffffff00 for %lshr in %shr = ashr i32 %lshr, 8
+; CHECK-DAG: DemandedBits: 0xffffffff for 8 in %shr = ashr i32 %lshr, 8
+ %lshr = lshr i32 %a, %b
+ %shr = ashr i32 %lshr, 8
+ ret i32 %shr
+}
+
+define i32 @test_lshr_range_5(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_lshr_range_5'
+; CHECK-DAG: DemandedBits: 0xff for %1 = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %1 = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %1 = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %2 = and i32 %1, 255
+; CHECK-DAG: DemandedBits: 0xff for %1 in %2 = and i32 %1, 255
+; CHECK-DAG: DemandedBits: 0xffffffff for 255 in %2 = and i32 %1, 255
+;
+ %1 = lshr i32 %a, %b
+ %2 = and i32 %1, 255
+ ret i32 %2
+}
+
+define i32 @test_lshr_range_6(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_lshr_range_6'
+; CHECK-DAG: DemandedBits: 0xffff0000 for %lshr.1 = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffff0000 for %a in %lshr.1 = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %lshr.1 = lshr i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %lshr.2 = lshr i32 %lshr.1, 16
+; CHECK-DAG: DemandedBits: 0xffff0000 for %lshr.1 in %lshr.2 = lshr i32 %lshr.1, 16
+; CHECK-DAG: DemandedBits: 0xffffffff for 16 in %lshr.2 = lshr i32 %lshr.1, 16
+;
+ %lshr.1 = lshr i32 %a, %b
+ %lshr.2 = lshr i32 %lshr.1, 16
+ ret i32 %lshr.2
+}
+
+
+define i8 @test_lshr_var_amount(i32 %a, i32 %b){
+; CHECK-LABEL: 'test_lshr_var_amount'
+; CHECK-DAG: DemandedBits: 0xff for %4 = lshr i32 %1, %3
+; CHECK-DAG: DemandedBits: 0xffffffff for %1 in %4 = lshr i32 %1, %3
+; CHECK-DAG: DemandedBits: 0xffffffff for %3 in %4 = lshr i32 %1, %3
+; CHECK-DAG: DemandedBits: 0xff for %5 = trunc i32 %4 to i8
+; CHECK-DAG: DemandedBits: 0xff for %4 in %5 = trunc i32 %4 to i8
+; CHECK-DAG: DemandedBits: 0xffffffff for %3 = zext i8 %2 to i32
+; CHECK-DAG: DemandedBits: 0xff for %2 in %3 = zext i8 %2 to i32
+; CHECK-DAG: DemandedBits: 0xffffffff for %1 = add nsw i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %1 = add nsw i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %1 = add nsw i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xff for %2 = trunc i32 %1 to i8
+; CHECK-DAG: DemandedBits: 0xff for %1 in %2 = trunc i32 %1 to i8
+;
+ %1 = add nsw i32 %a, %b
+ %2 = trunc i32 %1 to i8
+ %3 = zext i8 %2 to i32
+ %4 = lshr i32 %1, %3
+ %5 = trunc i32 %4 to i8
+ ret i8 %5
+}
+
+define i8 @test_lshr_var_amount_exact(i32 %a, i32 %b){
+ ; CHECK-LABEL 'test_lshr_var_amount_nsw'
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %1 = add nsw i32 %a, %b
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %a in %1 = add nsw i32 %a, %b
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %b in %1 = add nsw i32 %a, %b
+ ; CHECK-DAG: DemandedBits: 0xff for %2 = trunc i32 %1 to i8
+ ; CHECK-DAG: DemandedBits: 0xff for %1 in %2 = trunc i32 %1 to i8
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %3 = zext i8 %2 to i32
+ ; CHECK-DAG: DemandedBits: 0xff for %2 in %3 = zext i8 %2 to i32
+ ; CHECK-DAG: DemandedBits: 0xff for %4 = lshr exact i32 %1, %3
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %1 in %4 = lshr exact i32 %1, %3
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %3 in %4 = lshr exact i32 %1, %3
+ ; CHECK-DAG: DemandedBits: 0xff for %5 = trunc i32 %4 to i8
+ ; CHECK-DAG: DemandedBits: 0xff for %4 in %5 = trunc i32 %4 to i8
+ ;
+ %1 = add nsw i32 %a, %b
+ %2 = trunc i32 %1 to i8
+ %3 = zext i8 %2 to i32
+ %4 = lshr exact i32 %1, %3
+ %5 = trunc i32 %4 to i8
+ ret i8 %5
+}
diff --git a/llvm/test/Analysis/DemandedBits/shl.ll b/llvm/test/Analysis/DemandedBits/shl.ll
index e41f5f4107735..c872d2d854e83 100644
--- a/llvm/test/Analysis/DemandedBits/shl.ll
+++ b/llvm/test/Analysis/DemandedBits/shl.ll
@@ -57,10 +57,142 @@ define i8 @test_shl(i32 %a, i32 %b) {
; CHECK-DAG: DemandedBits: 0xff for %shl.t = trunc i32 %shl to i8
; CHECK-DAG: DemandedBits: 0xff for %shl in %shl.t = trunc i32 %shl to i8
; CHECK-DAG: DemandedBits: 0xff for %shl = shl i32 %a, %b
-; CHECK-DAG: DemandedBits: 0xffffffff for %a in %shl = shl i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xff for %a in %shl = shl i32 %a, %b
; CHECK-DAG: DemandedBits: 0xffffffff for %b in %shl = shl i32 %a, %b
;
%shl = shl i32 %a, %b
%shl.t = trunc i32 %shl to i8
ret i8 %shl.t
}
+
+define i8 @test_shl_range_1(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_shl_range_1'
+; CHECK-DAG: DemandedBits: 0xff for %shl = shl i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xff for %a in %shl = shl i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 in %shl = shl i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xff for %shl.t = trunc i32 %shl to i8
+; CHECK-DAG: DemandedBits: 0xff for %shl in %shl.t = trunc i32 %shl to i8
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0x3 for %b in %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0xffffffff for 3 in %b2 = and i32 %b, 3
+;
+ %b2 = and i32 %b, 3
+ %shl = shl i32 %a, %b2
+ %shl.t = trunc i32 %shl to i8
+ ret i8 %shl.t
+}
+
+define i32 @test_shl_range_2(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_shl_range_2'
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0x3 for %b in %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0xffffffff for 3 in %b2 = and i32 %b, 3
+; CHECK-DAG: DemandedBits: 0xffffffff for %shl = shl i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %shl = shl i32 %a, %b2
+; CHECK-DAG: DemandedBits: 0xffffffff for %b2 in %shl = shl i32 %a, %b2
+;
+ %b2 = and i32 %b, 3
+ %shl = shl i32 %a, %b2
+ ret i32 %shl
+}
+
+define i32 @test_shl_range_3(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_shl_range_3'
+; CHECK-DAG: DemandedBits: 0xffffffff for %shr = lshr i32 %shl, 16
+; CHECK-DAG: DemandedBits: 0xffff0000 for %shl in %shr = lshr i32 %shl, 16
+; CHECK-DAG: DemandedBits: 0xffffffff for 16 in %shr = lshr i32 %shl, 16
+; CHECK-DAG: DemandedBits: 0xffff0000 for %shl = shl i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %shl = shl i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %shl = shl i32 %a, %b
+;
+ %shl = shl i32 %a, %b
+ %shr = lshr i32 %shl, 16
+ ret i32 %shr
+}
+
+define i32 @test_shl_range_4(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_shl_range_4'
+; CHECK-DAG: DemandedBits: 0xffffffff for %shr = ashr i32 %shl, 8
+; CHECK-DAG: DemandedBits: 0xffffff00 for %shl in %shr = ashr i32 %shl, 8
+; CHECK-DAG: DemandedBits: 0xffffffff for 8 in %shr = ashr i32 %shl, 8
+; CHECK-DAG: DemandedBits: 0xffffff00 for %shl = shl i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %a in %shl = shl i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %shl = shl i32 %a, %b
+ %shl = shl i32 %a, %b
+ %shr = ashr i32 %shl, 8
+ ret i32 %shr
+}
+
+define i32 @test_shl_range_5(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_shl_range_5'
+; CHECK-DAG: DemandedBits: 0xff for %1 = shl i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xff for %a in %1 = shl i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %1 = shl i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %2 = and i32 %1, 255
+; CHECK-DAG: DemandedBits: 0xff for %1 in %2 = and i32 %1, 255
+; CHECK-DAG: DemandedBits: 0xffffffff for 255 in %2 = and i32 %1, 255
+;
+ %1 = shl i32 %a, %b
+ %2 = and i32 %1, 255
+ ret i32 %2
+}
+
+define i32 @test_shl_range_6(i32 %a, i32 %b) {
+; CHECK-LABEL: 'test_shl_range_6'
+; CHECK-DAG: DemandedBits: 0xffffffff for %shl.2 = shl i32 %shl.1, 16
+; CHECK-DAG: DemandedBits: 0xffff for %shl.1 in %shl.2 = shl i32 %shl.1, 16
+; CHECK-DAG: DemandedBits: 0xffffffff for 16 in %shl.2 = shl i32 %shl.1, 16
+; CHECK-DAG: DemandedBits: 0xffff for %shl.1 = shl i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffff for %a in %shl.1 = shl i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xffffffff for %b in %shl.1 = shl i32 %a, %b
+;
+ %shl.1 = shl i32 %a, %b
+ %shl.2 = shl i32 %shl.1, 16
+ ret i32 %shl.2
+}
+
+define i8 @test_shl_var_amount(i32 %a, i32 %b){
+; CHECK-LABEL: 'test_shl_var_amount'
+; CHECK-DAG: DemandedBits: 0xff for %5 = trunc i32 %4 to i8
+; CHECK-DAG: DemandedBits: 0xff for %4 in %5 = trunc i32 %4 to i8
+; CHECK-DAG: DemandedBits: 0xff for %4 = shl i32 %1, %3
+; CHECK-DAG: DemandedBits: 0xff for %1 in %4 = shl i32 %1, %3
+; CHECK-DAG: DemandedBits: 0xffffffff for %3 in %4 = shl i32 %1, %3
+; CHECK-DAG: DemandedBits: 0xff for %2 = trunc i32 %1 to i8
+; CHECK-DAG: DemandedBits: 0xff for %1 in %2 = trunc i32 %1 to i8
+; CHECK-DAG: DemandedBits: 0xffffffff for %3 = zext i8 %2 to i32
+; CHECK-DAG: DemandedBits: 0xff for %2 in %3 = zext i8 %2 to i32
+; CHECK-DAG: DemandedBits: 0xff for %1 = add nsw i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xff for %a in %1 = add nsw i32 %a, %b
+; CHECK-DAG: DemandedBits: 0xff for %b in %1 = add nsw i32 %a, %b
+;
+ %1 = add nsw i32 %a, %b
+ %2 = trunc i32 %1 to i8
+ %3 = zext i8 %2 to i32
+ %4 = shl i32 %1, %3
+ %5 = trunc i32 %4 to i8
+ ret i8 %5
+}
+
+define i8 @test_shl_var_amount_nsw(i32 %a, i32 %b){
+ ; CHECK-LABEL 'test_shl_var_amount_nsw'
+ ; CHECK-DAG: DemandedBits: 0xff for %5 = trunc i32 %4 to i8
+ ; CHECK-DAG: DemandedBits: 0xff for %4 in %5 = trunc i32 %4 to i8
+ ; CHECK-DAG: DemandedBits: 0xff for %4 = shl nsw i32 %1, %3
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %1 in %4 = shl nsw i32 %1, %3
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %3 in %4 = shl nsw i32 %1, %3
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %3 = zext i8 %2 to i32
+ ; CHECK-DAG: DemandedBits: 0xff for %2 in %3 = zext i8 %2 to i32
+ ; CHECK-DAG: DemandedBits: 0xff for %2 = trunc i32 %1 to i8
+ ; CHECK-DAG: DemandedBits: 0xff for %1 in %2 = trunc i32 %1 to i8
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %1 = add nsw i32 %a, %b
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %a in %1 = add nsw i32 %a, %b
+ ; CHECK-DAG: DemandedBits: 0xffffffff for %b in %1 = add nsw i32 %a, %b
+ ;
+ %1 = add nsw i32 %a, %b
+ %2 = trunc i32 %1 to i8
+ %3 = zext i8 %2 to i32
+ %4 = shl nsw i32 %1, %3
+ %5 = trunc i32 %4 to i8
+ ret i8 %5
+}
>From 6960bf556c3eb7e3fcd5da3de28f55310bea341e Mon Sep 17 00:00:00 2001
From: Aiden Grossman <aidengrossman at google.com>
Date: Mon, 18 Aug 2025 10:20:31 -0700
Subject: [PATCH 057/112] [Github] Drop llvm-project-tests
All users of this have been cleaned up so we can now drop it fully.
Reviewers: cmtice, tstellar
Reviewed By: cmtice
Pull Request: https://github.com/llvm/llvm-project/pull/153877
---
.github/workflows/llvm-project-tests.yml | 149 ------------------
.../workflows/llvm-project-workflow-tests.yml | 32 ----
2 files changed, 181 deletions(-)
delete mode 100644 .github/workflows/llvm-project-tests.yml
delete mode 100644 .github/workflows/llvm-project-workflow-tests.yml
diff --git a/.github/workflows/llvm-project-tests.yml b/.github/workflows/llvm-project-tests.yml
deleted file mode 100644
index 8621a3b59218e..0000000000000
--- a/.github/workflows/llvm-project-tests.yml
+++ /dev/null
@@ -1,149 +0,0 @@
-name: LLVM Project Tests
-
-permissions:
- contents: read
-
-on:
- workflow_dispatch:
- inputs:
- build_target:
- required: false
- projects:
- required: false
- extra_cmake_args:
- required: false
- os_list:
- required: false
- default: '["ubuntu-24.04", "windows-2019", "macOS-13"]'
- python_version:
- required: false
- type: string
- default: '3.11'
- workflow_call:
- inputs:
- build_target:
- required: false
- type: string
- default: "all"
-
- projects:
- required: true
- type: string
-
- extra_cmake_args:
- required: false
- type: string
-
- os_list:
- required: false
- type: string
- # Use windows-2019 due to:
- # https://developercommunity.visualstudio.com/t/Prev-Issue---with-__assume-isnan-/1597317
- default: '["ubuntu-24.04", "windows-2019", "macOS-13"]'
-
- python_version:
- required: false
- type: string
- default: '3.11'
-
-concurrency:
- # Skip intermediate builds: always.
- # Cancel intermediate builds: only if it is a pull request build.
- # If the group name here is the same as the group name in the workflow that includes
- # this one, then the action will try to wait on itself and get stuck.
- group: llvm-project-${{ github.workflow }}-${{ inputs.projects }}-${{ inputs.python_version }}${{ github.ref }}
- cancel-in-progress: ${{ startsWith(github.ref, 'refs/pull/') }}
-
-jobs:
- lit-tests:
- name: Lit Tests
- runs-on: ${{ matrix.os }}
- container:
- image: ${{(startsWith(matrix.os, 'ubuntu') && 'ghcr.io/llvm/ci-ubuntu-24.04:latest') || null}}
- volumes:
- - /mnt/:/mnt/
- strategy:
- fail-fast: false
- matrix:
- os: ${{ fromJSON(inputs.os_list) }}
- steps:
- - name: Setup Windows
- if: startsWith(matrix.os, 'windows')
- uses: llvm/actions/setup-windows@main
- with:
- arch: amd64
- # On Windows, starting with win19/20220814.1, cmake choose the 32-bit
- # python3.10.6 libraries instead of the 64-bit libraries when building
- # lldb. Using this setup-python action to make 3.10 the default
- # python fixes this.
- - name: Setup Python
- uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
- with:
- python-version: ${{ inputs.python_version }}
- - name: Install Ninja
- if: runner.os != 'Linux'
- uses: llvm/actions/install-ninja@main
- # actions/checkout deletes any existing files in the new git directory,
- # so this needs to either run before ccache-action or it has to use
- # clean: false.
- - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
- with:
- fetch-depth: 250
- - name: Setup ccache
- uses: hendrikmuhs/ccache-action@a1209f81afb8c005c13b4296c32e363431bffea5 # v1.2.17
- with:
- # A full build of llvm, clang, lld, and lldb takes about 250MB
- # of ccache space. There's not much reason to have more than this,
- # because we usually won't need to save cache entries from older
- # builds. Also, there is an overall 10GB cache limit, and each
- # run creates a new cache entry so we want to ensure that we have
- # enough cache space for all the tests to run at once and still
- # fit under the 10 GB limit.
- # Default to 2G to workaround: https://github.com/hendrikmuhs/ccache-action/issues/174
- max-size: 2G
- key: ${{ matrix.os }}
- variant: sccache
- - name: Build and Test
- env:
- # Workaround for https://github.com/actions/virtual-environments/issues/5900.
- # This should be a no-op for non-mac OSes
- PKG_CONFIG_PATH: /usr/local/Homebrew/Library/Homebrew/os/mac/pkgconfig//12
- shell: bash
- id: build-llvm
- run: |
- if [ "${{ runner.os }}" == "Linux" ]; then
- builddir="/mnt/build/"
- sudo mkdir -p $builddir
- sudo chown gha $builddir
- extra_cmake_args="-DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang"
- else
- builddir="$(pwd)"/build
- fi
- if [ "${{ runner.os }}" == "macOS" ]; then
- # Workaround test failure on some lld tests on MacOS
- # https://github.com/llvm/llvm-project/issues/81967
- extra_cmake_args="-DLLVM_DISABLE_ASSEMBLY_FILES=ON"
- fi
- echo "llvm-builddir=$builddir" >> "$GITHUB_OUTPUT"
- cmake -G Ninja \
- -B "$builddir" \
- -S llvm \
- -DLLVM_ENABLE_PROJECTS="${{ inputs.projects }}" \
- -DCMAKE_BUILD_TYPE=Release \
- -DLLVM_ENABLE_ASSERTIONS=ON \
- -DLLDB_INCLUDE_TESTS=OFF \
- -DLIBCLC_TARGETS_TO_BUILD="amdgcn--;amdgcn--amdhsa;r600--;nvptx--;nvptx64--;nvptx--nvidiacl;nvptx64--nvidiacl" \
- -DCMAKE_C_COMPILER_LAUNCHER=sccache \
- -DCMAKE_CXX_COMPILER_LAUNCHER=sccache \
- $extra_cmake_args \
- ${{ inputs.extra_cmake_args }}
- ninja -C "$builddir" '${{ inputs.build_target }}'
-
- - name: Build and Test libclc
- if: "!startsWith(matrix.os, 'windows') && contains(inputs.projects, 'libclc')"
- env:
- LLVM_BUILDDIR: ${{ steps.build-llvm.outputs.llvm-builddir }}
- run: |
- # The libclc tests don't have a generated check target so all we can
- # do is build it.
- ninja -C "$LLVM_BUILDDIR"
diff --git a/.github/workflows/llvm-project-workflow-tests.yml b/.github/workflows/llvm-project-workflow-tests.yml
deleted file mode 100644
index a2539b279be0a..0000000000000
--- a/.github/workflows/llvm-project-workflow-tests.yml
+++ /dev/null
@@ -1,32 +0,0 @@
-# This workflow will test the llvm-project-tests workflow in PRs
-# targetting the main branch. Since this workflow doesn't normally
-# run on main PRs, we need some way to test it to ensure new updates
-# don't break it.
-
-name: LLVM Workflow Test
-
-permissions:
- contents: read
-
-on:
- pull_request:
- branches:
- - 'main'
- paths:
- - '.github/workflows/llvm-project-tests.yml'
- - '.github/workflows/llvm-project-workflow-tests.yml'
-
-concurrency:
- # Skip intermediate builds: always.
- # Cancel intermediate builds: only if it is a pull request build.
- group: ${{ github.workflow }}-${{ github.ref }}
- cancel-in-progress: ${{ startsWith(github.ref, 'refs/pull/') }}
-
-jobs:
- llvm-test:
- if: github.repository_owner == 'llvm'
- name: Build and Test
- uses: ./.github/workflows/llvm-project-tests.yml
- with:
- build_target: check-all
- projects: clang;lld;libclc;lldb
>From 99829573cc8460782e4f10713ef24d5af9f82036 Mon Sep 17 00:00:00 2001
From: Shafik Yaghmour <shafik.yaghmour at intel.com>
Date: Mon, 18 Aug 2025 10:27:37 -0700
Subject: [PATCH 058/112] [Clang][Webassembly] Remove unreachable code in
ParseTypeQualifierListOpt (#153729)
Static analysis flagged this goto as unreachable and indeed it is, so
removing it.
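
As a small illustration of why the `goto` could never execute, here is a sketch of a loop with a `switch` inside it. The function `countFuncrefs` and its one-character "tokens" are hypothetical stand-ins, not the parser's real code: inside a loop, `continue` jumps straight to the next iteration, so any statement placed between it and the next `case` label is dead.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a qualifier-parsing loop. 'f' plays the role of
// a __funcref-like token, 'p' some other qualifier, and anything else ends
// the qualifier list.
inline int countFuncrefs(const std::string &Tokens) {
  int Count = 0;
  for (char Tok : Tokens) {
    switch (Tok) {
    case 'f':
      ++Count;
      continue; // Jumps to the next loop iteration; a statement placed
                // here, before the next case label, would never run.
    case 'p':
      break; // Leaves only the switch; the loop keeps going.
    default:
      return Count; // Plays the role of leaving the qualifier loop.
    }
  }
  return Count;
}
```

Calling `countFuncrefs("ffp f")` counts the two leading `f` tokens, skips `p`, and stops at the space.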
---
clang/lib/Parse/ParseDecl.cpp | 1 -
1 file changed, 1 deletion(-)
diff --git a/clang/lib/Parse/ParseDecl.cpp b/clang/lib/Parse/ParseDecl.cpp
index fd53cca5a13ff..96f1a53922d1f 100644
--- a/clang/lib/Parse/ParseDecl.cpp
+++ b/clang/lib/Parse/ParseDecl.cpp
@@ -6224,7 +6224,6 @@ void Parser::ParseTypeQualifierListOpt(
case tok::kw___funcref:
ParseWebAssemblyFuncrefTypeAttribute(DS.getAttributes());
continue;
- goto DoneWithTypeQuals;
case tok::kw___pascal:
if (AttrReqs & AR_VendorAttributesParsed) {
>From 7f27482a32180def47c71f490501ea0e560bfa9f Mon Sep 17 00:00:00 2001
From: Krzysztof Drewniak <Krzysztof.Drewniak at amd.com>
Date: Mon, 18 Aug 2025 13:32:54 -0400
Subject: [PATCH 059/112] [AMDGPU][LowerBufferFatPointers] Fix lack of rewrite
when loading/storing null (#154128)
Fixes #154056.
The fat buffer lowering pass was erroneously detecting that it did not
need to run on functions that only load/store to the null constant (or
other such constants). We thought this would be covered by specializing
constants out to instructions, but that doesn't account for trivial
constants like null. Therefore, we check the operands of instructions
for buffer fat pointers in order to find such constants and ensure the
pass runs.
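
The detection change described above can be illustrated with a small toy model. This is only a sketch, not the pass itself: `Instr`, `remap_type`, and both `contains_fat_pointers_*` helpers are hypothetical stand-ins, and the remapped struct type is an assumption made for illustration. It shows why scanning only instruction result types misses `load i32, ptr addrspace(7) null`, while also scanning operand types catches it.

```python
from dataclasses import dataclass, field

FAT_PTR = "ptr addrspace(7)"  # buffer fat pointer type used in the tests


@dataclass
class Instr:
    """Hypothetical model of an instruction: a result type plus operand types."""
    result_type: str
    operand_types: list = field(default_factory=list)


def remap_type(ty: str) -> str:
    # Hypothetical stand-in for TypeMap->remapType(): only fat pointers change.
    # The exact replacement type is an assumption for this sketch.
    return "{ptr addrspace(8), i32}" if ty == FAT_PTR else ty


def contains_fat_pointers_old(instrs) -> bool:
    # Old behavior: look only at each instruction's own (result) type.
    return any(i.result_type != remap_type(i.result_type) for i in instrs)


def contains_fat_pointers_new(instrs) -> bool:
    # New behavior: also look at operand types, which catches constants
    # such as null or poison that never appear as instruction results.
    return any(
        i.result_type != remap_type(i.result_type)
        or any(t != remap_type(t) for t in i.operand_types)
        for i in instrs
    )


# Models:  %x = load i32, ptr addrspace(7) null   followed by   ret i32 %x
load_null = [Instr("i32", [FAT_PTR]), Instr("void", ["i32"])]

print(contains_fat_pointers_old(load_null))
print(contains_fat_pointers_new(load_null))
```

In this model the load produces an `i32`, so the result-only scan has nothing to flag; the fat pointer is visible only as an operand.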
---------
Co-authored-by: Nikita Popov <github at npopov.com>
---
.../AMDGPU/AMDGPULowerBufferFatPointers.cpp | 6 ++-
.../lower-buffer-fat-pointers-constants.ll | 40 +++++++++++++++++++
2 files changed, 45 insertions(+), 1 deletion(-)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
index ed73dc8903908..139cad60ebcb2 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
@@ -2366,8 +2366,12 @@ static bool containsBufferFatPointers(const Function &F,
BufferFatPtrToStructTypeMap *TypeMap) {
bool HasFatPointers = false;
for (const BasicBlock &BB : F)
- for (const Instruction &I : BB)
+ for (const Instruction &I : BB) {
HasFatPointers |= (I.getType() != TypeMap->remapType(I.getType()));
+ // Catch null pointer constants in loads, stores, etc.
+ for (const Value *V : I.operand_values())
+ HasFatPointers |= (V->getType() != TypeMap->remapType(V->getType()));
+ }
return HasFatPointers;
}
diff --git a/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-constants.ll b/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-constants.ll
index a0c1e573f8fbb..a09e392b89e63 100644
--- a/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-constants.ll
+++ b/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-constants.ll
@@ -223,3 +223,43 @@ define i32 @fancy_zero() {
ptr addrspace(7) addrspacecast (ptr addrspace(8) @buf to ptr addrspace(7))
to i32)
}
+
+define i32 @load_null() {
+; CHECK-LABEL: define i32 @load_null
+; CHECK-SAME: () #[[ATTR0]] {
+; CHECK-NEXT: [[X:%.*]] = call i32 @llvm.amdgcn.raw.ptr.buffer.load.i32(ptr addrspace(8) align 4 null, i32 0, i32 0, i32 0)
+; CHECK-NEXT: ret i32 [[X]]
+;
+ %x = load i32, ptr addrspace(7) null, align 4
+ ret i32 %x
+}
+
+define void @store_null() {
+; CHECK-LABEL: define void @store_null
+; CHECK-SAME: () #[[ATTR0]] {
+; CHECK-NEXT: call void @llvm.amdgcn.raw.ptr.buffer.store.i32(i32 0, ptr addrspace(8) align 4 null, i32 0, i32 0, i32 0)
+; CHECK-NEXT: ret void
+;
+ store i32 0, ptr addrspace(7) null, align 4
+ ret void
+}
+
+define i32 @load_poison() {
+; CHECK-LABEL: define i32 @load_poison
+; CHECK-SAME: () #[[ATTR0]] {
+; CHECK-NEXT: [[X:%.*]] = call i32 @llvm.amdgcn.raw.ptr.buffer.load.i32(ptr addrspace(8) align 4 poison, i32 poison, i32 0, i32 0)
+; CHECK-NEXT: ret i32 [[X]]
+;
+ %x = load i32, ptr addrspace(7) poison, align 4
+ ret i32 %x
+}
+
+define void @store_poison() {
+; CHECK-LABEL: define void @store_poison
+; CHECK-SAME: () #[[ATTR0]] {
+; CHECK-NEXT: call void @llvm.amdgcn.raw.ptr.buffer.store.i32(i32 0, ptr addrspace(8) align 4 poison, i32 poison, i32 0, i32 0)
+; CHECK-NEXT: ret void
+;
+ store i32 0, ptr addrspace(7) poison, align 4
+ ret void
+}
>From 350f4a3e3b0ebd9695f9c2194db5fd86ff551489 Mon Sep 17 00:00:00 2001
From: LauraElanorJones <laura.elanor.jones at gmail.com>
Date: Mon, 18 Aug 2025 10:47:14 -0700
Subject: [PATCH 060/112] Decent to Descent (#154040)
[lldb] Rename RecursiveDecentFormatter to RecursiveDescentFormatter (NFC)
---
lldb/packages/Python/lldbsuite/test/lldbutil.py | 7 +++----
lldb/test/API/python_api/value/TestValueAPI.py | 2 +-
lldb/utils/lui/lldbutil.py | 7 +++----
3 files changed, 7 insertions(+), 9 deletions(-)
diff --git a/lldb/packages/Python/lldbsuite/test/lldbutil.py b/lldb/packages/Python/lldbsuite/test/lldbutil.py
index 8112705438c1f..b8a78b71f5ec1 100644
--- a/lldb/packages/Python/lldbsuite/test/lldbutil.py
+++ b/lldb/packages/Python/lldbsuite/test/lldbutil.py
@@ -1464,8 +1464,8 @@ def format(self, value, buffer=None):
return output.getvalue()
-class RecursiveDecentFormatter(BasicFormatter):
- """The recursive decent formatter prints the value and the decendents.
+class RecursiveDescentFormatter(BasicFormatter):
+ """The recursive descent formatter prints the value and the descendents.
The constructor takes two keyword args: indent_level, which defaults to 0,
and indent_child, which defaults to 2. The current indentation level is
@@ -1482,7 +1482,6 @@ def format(self, value, buffer=None):
output = io.StringIO()
else:
output = buffer
-
BasicFormatter.format(self, value, buffer=output, indent=self.lindent)
new_indent = self.lindent + self.cindent
for child in value:
@@ -1490,7 +1489,7 @@ def format(self, value, buffer=None):
BasicFormatter.format(self, child, buffer=output, indent=new_indent)
else:
if child.GetNumChildren() > 0:
- rdf = RecursiveDecentFormatter(indent_level=new_indent)
+ rdf = RecursiveDescentFormatter(indent_level=new_indent)
rdf.format(child, buffer=output)
else:
BasicFormatter.format(self, child, buffer=output, indent=new_indent)
diff --git a/lldb/test/API/python_api/value/TestValueAPI.py b/lldb/test/API/python_api/value/TestValueAPI.py
index 0da57346212d0..907992bf05c04 100644
--- a/lldb/test/API/python_api/value/TestValueAPI.py
+++ b/lldb/test/API/python_api/value/TestValueAPI.py
@@ -83,7 +83,7 @@ def test(self):
fmt = lldbutil.BasicFormatter()
cvf = lldbutil.ChildVisitingFormatter(indent_child=2)
- rdf = lldbutil.RecursiveDecentFormatter(indent_child=2)
+ rdf = lldbutil.RecursiveDescentFormatter(indent_child=2)
if self.TraceOn():
print(fmt.format(days_of_week))
print(cvf.format(days_of_week))
diff --git a/lldb/utils/lui/lldbutil.py b/lldb/utils/lui/lldbutil.py
index 6cbf4a302f65f..140317af3670b 100644
--- a/lldb/utils/lui/lldbutil.py
+++ b/lldb/utils/lui/lldbutil.py
@@ -1040,8 +1040,8 @@ def format(self, value, buffer=None):
return output.getvalue()
-class RecursiveDecentFormatter(BasicFormatter):
- """The recursive decent formatter prints the value and the decendents.
+class RecursiveDescentFormatter(BasicFormatter):
+ """The recursive descent formatter prints the value and the descendents.
The constructor takes two keyword args: indent_level, which defaults to 0,
and indent_child, which defaults to 2. The current indentation level is
@@ -1058,7 +1058,6 @@ def format(self, value, buffer=None):
output = io.StringIO()
else:
output = buffer
-
BasicFormatter.format(self, value, buffer=output, indent=self.lindent)
new_indent = self.lindent + self.cindent
for child in value:
@@ -1066,7 +1065,7 @@ def format(self, value, buffer=None):
BasicFormatter.format(self, child, buffer=output, indent=new_indent)
else:
if child.GetNumChildren() > 0:
- rdf = RecursiveDecentFormatter(indent_level=new_indent)
+ rdf = RecursiveDescentFormatter(indent_level=new_indent)
rdf.format(child, buffer=output)
else:
BasicFormatter.format(self, child, buffer=output, indent=new_indent)
>From 58de8f2c25291549dc1cabe364d399e564bca042 Mon Sep 17 00:00:00 2001
From: Justin Fargnoli <jfargnoli at nvidia.com>
Date: Mon, 18 Aug 2025 10:48:49 -0700
Subject: [PATCH 061/112] [Inliner] Add option (default off) to inline all
calls regardless of the cost (#152365)
Add a default-off option to the inline cost calculation that always
inlines all viable calls, regardless of the cost/benefit and
cost/threshold calculations.
For performance reasons, some users require that all calls be inlined.
Rather than forcing them to adjust the inlining threshold to an
arbitrarily high value, offer an option to inline all calls.
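
As a rough sketch of where the new check sits in the decision order, consider the toy model below. It is not the LLVM implementation: `get_inline_cost` and its parameters are hypothetical, and the final cost-versus-threshold comparison is a deliberate simplification of the real analysis. The ordering (explicit user decision first, then the forced path for viable callees, then the normal cost analysis) follows the patch.

```python
def get_inline_cost(user_decision, viable, cost, threshold,
                    inline_all_viable_calls=False):
    """Toy model of the decision order; all names here are hypothetical.

    user_decision: True (alwaysinline), False (noinline), or None.
    viable:        whether the callee can legally be inlined at all.
    """
    # 1. Explicit user decisions (alwaysinline / noinline) are respected first.
    if user_decision is not None:
        return "always" if user_decision else "never"
    # 2. With the option enabled, any viable callee is inlined regardless of cost.
    if inline_all_viable_calls and viable:
        return "always"
    # 3. Otherwise fall back to a (greatly simplified) cost/threshold comparison.
    return "inline" if viable and cost < threshold else "never"


# A viable callee that exceeds a threshold of 0, as in the test's RUN line.
print(get_inline_cost(None, True, cost=35, threshold=0))
print(get_inline_cost(None, True, cost=35, threshold=0,
                      inline_all_viable_calls=True))
```

The cost value 35 is arbitrary. Non-viable callees (the recursive and `indirectbr` cases in the test) and `noinline` callees are left alone even with the option enabled.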
---
llvm/lib/Analysis/InlineCost.cpp | 8 ++
.../Inline/inline-all-viable-calls.ll | 114 ++++++++++++++++++
2 files changed, 122 insertions(+)
create mode 100644 llvm/test/Transforms/Inline/inline-all-viable-calls.ll
diff --git a/llvm/lib/Analysis/InlineCost.cpp b/llvm/lib/Analysis/InlineCost.cpp
index 22f4d08448a22..757f68999691e 100644
--- a/llvm/lib/Analysis/InlineCost.cpp
+++ b/llvm/lib/Analysis/InlineCost.cpp
@@ -180,6 +180,10 @@ static cl::opt<bool> DisableGEPConstOperand(
"disable-gep-const-evaluation", cl::Hidden, cl::init(false),
cl::desc("Disables evaluation of GetElementPtr with constant operands"));
+static cl::opt<bool> InlineAllViableCalls(
+ "inline-all-viable-calls", cl::Hidden, cl::init(false),
+ cl::desc("Inline all viable calls, even if they exceed the inlining "
+ "threshold"));
namespace llvm {
std::optional<int> getStringFnAttrAsInt(const Attribute &Attr) {
if (Attr.isValid()) {
@@ -3272,6 +3276,10 @@ InlineCost llvm::getInlineCost(
return llvm::InlineCost::getNever(UserDecision->getFailureReason());
}
+ if (InlineAllViableCalls && isInlineViable(*Callee).isSuccess())
+ return llvm::InlineCost::getAlways(
+ "Inlining forced by -inline-all-viable-calls");
+
LLVM_DEBUG(llvm::dbgs() << " Analyzing call of " << Callee->getName()
<< "... (caller:" << Call.getCaller()->getName()
<< ")\n");
diff --git a/llvm/test/Transforms/Inline/inline-all-viable-calls.ll b/llvm/test/Transforms/Inline/inline-all-viable-calls.ll
new file mode 100644
index 0000000000000..a06ec1acd4ef3
--- /dev/null
+++ b/llvm/test/Transforms/Inline/inline-all-viable-calls.ll
@@ -0,0 +1,114 @@
+; RUN: opt -passes=inline -inline-threshold=0 -inline-all-viable-calls -S < %s | FileCheck %s
+
+; Check that viable calls that are beyond the cost threshold are still inlined.
+define i32 @callee_simple(i32 %x) {
+ %1 = add i32 %x, 1
+ %2 = mul i32 %1, 2
+ %3 = sub i32 %2, 1
+ %4 = add i32 %3, 3
+ %5 = mul i32 %4, 2
+ %6 = sub i32 %5, 2
+ %7 = add i32 %6, 1
+ ret i32 %7
+}
+
+; Check that user decisions are respected.
+define i32 @callee_alwaysinline(i32 %x) alwaysinline {
+ %sub = sub i32 %x, 3
+ ret i32 %sub
+}
+
+define i32 @callee_noinline(i32 %x) noinline {
+ %div = sdiv i32 %x, 2
+ ret i32 %div
+}
+
+define i32 @callee_optnone(i32 %x) optnone noinline {
+ %rem = srem i32 %x, 2
+ ret i32 %rem
+}
+
+define i32 @caller(i32 %a) {
+; CHECK-LABEL: define i32 @caller(
+; CHECK-SAME: i32 [[A:%.*]]) {
+; CHECK-NEXT: [[TMP7:%.*]] = add i32 [[A]], 1
+; CHECK-NEXT: [[TMP8:%.*]] = mul i32 [[TMP7]], 2
+; CHECK-NEXT: [[TMP3:%.*]] = sub i32 [[TMP8]], 1
+; CHECK-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 3
+; CHECK-NEXT: [[TMP5:%.*]] = mul i32 [[TMP4]], 2
+; CHECK-NEXT: [[TMP6:%.*]] = sub i32 [[TMP5]], 2
+; CHECK-NEXT: [[ADD_I:%.*]] = add i32 [[TMP6]], 1
+; CHECK-NEXT: [[SUB_I:%.*]] = sub i32 [[ADD_I]], 3
+; CHECK-NEXT: [[TMP1:%.*]] = call i32 @callee_noinline(i32 [[SUB_I]])
+; CHECK-NEXT: [[TMP2:%.*]] = call i32 @callee_optnone(i32 [[TMP1]])
+; CHECK-NEXT: [[SUM:%.*]] = add i32 [[TMP2]], [[TMP1]]
+; CHECK-NEXT: ret i32 [[SUM]]
+;
+ %1 = call i32 @callee_simple(i32 %a)
+ %2 = call i32 @callee_alwaysinline(i32 %1)
+ %3 = call i32 @callee_noinline(i32 %2)
+ %4 = call i32 @callee_optnone(i32 %3)
+ %sum = add i32 %4, %3
+ ret i32 %sum
+}
+
+; Check that non-viable calls are not inlined
+
+; Test recursive function is not inlined
+define i32 @recursive(i32 %n) {
+entry:
+ %cmp = icmp eq i32 %n, 0
+ br i1 %cmp, label %base, label %recurse
+
+base:
+ ret i32 0
+
+recurse:
+ %dec = sub i32 %n, 1
+ %rec = call i32 @recursive(i32 %dec)
+ %add = add i32 %rec, 1
+ ret i32 %add
+}
+
+define i32 @call_recursive(i32 %x) {
+; CHECK-LABEL: define i32 @call_recursive(
+; CHECK-SAME: i32 [[X:%.*]]) {
+; CHECK-NEXT: [[R:%.*]] = call i32 @recursive(i32 [[X]])
+; CHECK-NEXT: ret i32 [[R]]
+;
+ %r = call i32 @recursive(i32 %x)
+ ret i32 %r
+}
+
+; Test indirectbr prevents inlining
+define void @has_indirectbr(ptr %ptr, i32 %cond) {
+entry:
+ switch i32 %cond, label %default [
+ i32 0, label %target0
+ i32 1, label %target1
+ ]
+
+target0:
+ br label %end
+
+target1:
+ br label %end
+
+default:
+ br label %end
+
+end:
+ indirectbr ptr %ptr, [label %target0, label %target1]
+ ret void
+}
+
+define void @call_indirectbr(ptr %p, i32 %c) {
+; CHECK-LABEL: define void @call_indirectbr(
+; CHECK-SAME: ptr [[P:%.*]], i32 [[C:%.*]]) {
+; CHECK-NEXT: call void @has_indirectbr(ptr [[P]], i32 [[C]])
+; CHECK-NEXT: ret void
+;
+ call void @has_indirectbr(ptr %p, i32 %c)
+ ret void
+}
+
>From 7e8ff2afa9ddfe1d7c42bb58cc9523006c34396b Mon Sep 17 00:00:00 2001
From: Shaoce SUN <sunshaoce at outlook.com>
Date: Tue, 19 Aug 2025 01:52:24 +0800
Subject: [PATCH 062/112] [RISCV][GISel] Optimize +0.0 to use fcvt.d.w for s64
on rv32 (#153978)
Resolve the TODO: on RV32, when constructing the double-precision
constant `+0.0` for `s64`, `BuildPairF64Pseudo` can be optimized to use
the `fcvt.d.w` instruction to generate the result directly.
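
The reason this works can be checked with a short sketch: `fcvt.d.w` converts a signed 32-bit integer to a double, and the integer 0 (read from `x0`) converts to `+0.0`, whose IEEE 754 bit pattern is all zeros. The helper names below are hypothetical; `is_positive_zero_bits` only mirrors the `Imm.isNonNegative() && Imm.isZero()` guard in the patch, and Python's `float(0)` stands in for the hardware conversion.

```python
import struct


def double_bits(x: float) -> int:
    """Return the raw 64-bit IEEE 754 pattern of a double as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_positive_zero_bits(bits: int) -> bool:
    # Mirrors the guard in the patch on the bitcast immediate:
    # sign bit clear and every bit zero.
    sign_clear = (bits >> 63) == 0
    return sign_clear and bits == 0


# Converting integer 0 to a double yields +0.0 (all bits zero).
print(hex(double_bits(float(0))))
# -0.0 differs only in the sign bit, so it must not take the fast path.
print(hex(double_bits(-0.0)))
```

Since `-0.0` has the sign bit set, it fails the guard and still goes through the existing split-and-build path.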
---
.../RISCV/GISel/RISCVInstructionSelector.cpp | 15 +++++-
.../CodeGen/RISCV/GlobalISel/double-arith.ll | 48 ++++---------------
.../instruction-select/fp-constant.mir | 6 +--
3 files changed, 24 insertions(+), 45 deletions(-)
diff --git a/llvm/lib/Target/RISCV/GISel/RISCVInstructionSelector.cpp b/llvm/lib/Target/RISCV/GISel/RISCVInstructionSelector.cpp
index f83c2b6da8923..51ea3fc5f6774 100644
--- a/llvm/lib/Target/RISCV/GISel/RISCVInstructionSelector.cpp
+++ b/llvm/lib/Target/RISCV/GISel/RISCVInstructionSelector.cpp
@@ -736,7 +736,6 @@ bool RISCVInstructionSelector::select(MachineInstr &MI) {
}
case TargetOpcode::G_FCONSTANT: {
// TODO: Use constant pool for complex constants.
- // TODO: Optimize +0.0 to use fcvt.d.w for s64 on rv32.
Register DstReg = MI.getOperand(0).getReg();
const APFloat &FPimm = MI.getOperand(1).getFPImm()->getValueAPF();
APInt Imm = FPimm.bitcastToAPInt();
@@ -753,8 +752,22 @@ bool RISCVInstructionSelector::select(MachineInstr &MI) {
if (!FMV.constrainAllUses(TII, TRI, RBI))
return false;
} else {
+ // s64 on rv32
assert(Size == 64 && !Subtarget->is64Bit() &&
"Unexpected size or subtarget");
+
+ if (Imm.isNonNegative() && Imm.isZero()) {
+ // Optimize +0.0 to use fcvt.d.w
+ MachineInstrBuilder FCVT =
+ MIB.buildInstr(RISCV::FCVT_D_W, {DstReg}, {Register(RISCV::X0)})
+ .addImm(RISCVFPRndMode::RNE);
+ if (!FCVT.constrainAllUses(TII, TRI, RBI))
+ return false;
+
+ MI.eraseFromParent();
+ return true;
+ }
+
// Split into two pieces and build through the stack.
Register GPRRegHigh = MRI->createVirtualRegister(&RISCV::GPRRegClass);
Register GPRRegLow = MRI->createVirtualRegister(&RISCV::GPRRegClass);
diff --git a/llvm/test/CodeGen/RISCV/GlobalISel/double-arith.ll b/llvm/test/CodeGen/RISCV/GlobalISel/double-arith.ll
index cb2037f5fb027..4eb7646d13a39 100644
--- a/llvm/test/CodeGen/RISCV/GlobalISel/double-arith.ll
+++ b/llvm/test/CodeGen/RISCV/GlobalISel/double-arith.ll
@@ -395,13 +395,9 @@ define double @fmadd_d(double %a, double %b, double %c) nounwind {
define double @fmsub_d(double %a, double %b, double %c) nounwind {
; RV32IFD-LABEL: fmsub_d:
; RV32IFD: # %bb.0:
-; RV32IFD-NEXT: addi sp, sp, -16
-; RV32IFD-NEXT: sw zero, 8(sp)
-; RV32IFD-NEXT: sw zero, 12(sp)
-; RV32IFD-NEXT: fld fa5, 8(sp)
+; RV32IFD-NEXT: fcvt.d.w fa5, zero
; RV32IFD-NEXT: fadd.d fa5, fa2, fa5
; RV32IFD-NEXT: fmsub.d fa0, fa0, fa1, fa5
-; RV32IFD-NEXT: addi sp, sp, 16
; RV32IFD-NEXT: ret
;
; RV64IFD-LABEL: fmsub_d:
@@ -478,14 +474,10 @@ define double @fmsub_d(double %a, double %b, double %c) nounwind {
define double @fnmadd_d(double %a, double %b, double %c) nounwind {
; RV32IFD-LABEL: fnmadd_d:
; RV32IFD: # %bb.0:
-; RV32IFD-NEXT: addi sp, sp, -16
-; RV32IFD-NEXT: sw zero, 8(sp)
-; RV32IFD-NEXT: sw zero, 12(sp)
-; RV32IFD-NEXT: fld fa5, 8(sp)
+; RV32IFD-NEXT: fcvt.d.w fa5, zero
; RV32IFD-NEXT: fadd.d fa4, fa0, fa5
; RV32IFD-NEXT: fadd.d fa5, fa2, fa5
; RV32IFD-NEXT: fnmadd.d fa0, fa4, fa1, fa5
-; RV32IFD-NEXT: addi sp, sp, 16
; RV32IFD-NEXT: ret
;
; RV64IFD-LABEL: fnmadd_d:
@@ -590,14 +582,10 @@ define double @fnmadd_d(double %a, double %b, double %c) nounwind {
define double @fnmadd_d_2(double %a, double %b, double %c) nounwind {
; RV32IFD-LABEL: fnmadd_d_2:
; RV32IFD: # %bb.0:
-; RV32IFD-NEXT: addi sp, sp, -16
-; RV32IFD-NEXT: sw zero, 8(sp)
-; RV32IFD-NEXT: sw zero, 12(sp)
-; RV32IFD-NEXT: fld fa5, 8(sp)
+; RV32IFD-NEXT: fcvt.d.w fa5, zero
; RV32IFD-NEXT: fadd.d fa4, fa1, fa5
; RV32IFD-NEXT: fadd.d fa5, fa2, fa5
; RV32IFD-NEXT: fnmadd.d fa0, fa4, fa0, fa5
-; RV32IFD-NEXT: addi sp, sp, 16
; RV32IFD-NEXT: ret
;
; RV64IFD-LABEL: fnmadd_d_2:
@@ -772,13 +760,9 @@ define double @fnmadd_nsz(double %a, double %b, double %c) nounwind {
define double @fnmsub_d(double %a, double %b, double %c) nounwind {
; RV32IFD-LABEL: fnmsub_d:
; RV32IFD: # %bb.0:
-; RV32IFD-NEXT: addi sp, sp, -16
-; RV32IFD-NEXT: sw zero, 8(sp)
-; RV32IFD-NEXT: sw zero, 12(sp)
-; RV32IFD-NEXT: fld fa5, 8(sp)
+; RV32IFD-NEXT: fcvt.d.w fa5, zero
; RV32IFD-NEXT: fadd.d fa5, fa0, fa5
; RV32IFD-NEXT: fnmsub.d fa0, fa5, fa1, fa2
-; RV32IFD-NEXT: addi sp, sp, 16
; RV32IFD-NEXT: ret
;
; RV64IFD-LABEL: fnmsub_d:
@@ -851,13 +835,9 @@ define double @fnmsub_d(double %a, double %b, double %c) nounwind {
define double @fnmsub_d_2(double %a, double %b, double %c) nounwind {
; RV32IFD-LABEL: fnmsub_d_2:
; RV32IFD: # %bb.0:
-; RV32IFD-NEXT: addi sp, sp, -16
-; RV32IFD-NEXT: sw zero, 8(sp)
-; RV32IFD-NEXT: sw zero, 12(sp)
-; RV32IFD-NEXT: fld fa5, 8(sp)
+; RV32IFD-NEXT: fcvt.d.w fa5, zero
; RV32IFD-NEXT: fadd.d fa5, fa1, fa5
; RV32IFD-NEXT: fnmsub.d fa0, fa5, fa0, fa2
-; RV32IFD-NEXT: addi sp, sp, 16
; RV32IFD-NEXT: ret
;
; RV64IFD-LABEL: fnmsub_d_2:
@@ -976,14 +956,10 @@ define double @fmadd_d_contract(double %a, double %b, double %c) nounwind {
define double @fmsub_d_contract(double %a, double %b, double %c) nounwind {
; RV32IFD-LABEL: fmsub_d_contract:
; RV32IFD: # %bb.0:
-; RV32IFD-NEXT: addi sp, sp, -16
-; RV32IFD-NEXT: sw zero, 8(sp)
-; RV32IFD-NEXT: sw zero, 12(sp)
-; RV32IFD-NEXT: fld fa5, 8(sp)
+; RV32IFD-NEXT: fcvt.d.w fa5, zero
; RV32IFD-NEXT: fadd.d fa5, fa2, fa5
; RV32IFD-NEXT: fmul.d fa4, fa0, fa1
; RV32IFD-NEXT: fsub.d fa0, fa4, fa5
-; RV32IFD-NEXT: addi sp, sp, 16
; RV32IFD-NEXT: ret
;
; RV64IFD-LABEL: fmsub_d_contract:
@@ -1069,17 +1045,13 @@ define double @fmsub_d_contract(double %a, double %b, double %c) nounwind {
define double @fnmadd_d_contract(double %a, double %b, double %c) nounwind {
; RV32IFD-LABEL: fnmadd_d_contract:
; RV32IFD: # %bb.0:
-; RV32IFD-NEXT: addi sp, sp, -16
-; RV32IFD-NEXT: sw zero, 8(sp)
-; RV32IFD-NEXT: sw zero, 12(sp)
-; RV32IFD-NEXT: fld fa5, 8(sp)
+; RV32IFD-NEXT: fcvt.d.w fa5, zero
; RV32IFD-NEXT: fadd.d fa4, fa0, fa5
; RV32IFD-NEXT: fadd.d fa3, fa1, fa5
; RV32IFD-NEXT: fadd.d fa5, fa2, fa5
; RV32IFD-NEXT: fmul.d fa4, fa4, fa3
; RV32IFD-NEXT: fneg.d fa4, fa4
; RV32IFD-NEXT: fsub.d fa0, fa4, fa5
-; RV32IFD-NEXT: addi sp, sp, 16
; RV32IFD-NEXT: ret
;
; RV64IFD-LABEL: fnmadd_d_contract:
@@ -1204,14 +1176,10 @@ define double @fnmadd_d_contract(double %a, double %b, double %c) nounwind {
define double @fnmsub_d_contract(double %a, double %b, double %c) nounwind {
; RV32IFD-LABEL: fnmsub_d_contract:
; RV32IFD: # %bb.0:
-; RV32IFD-NEXT: addi sp, sp, -16
-; RV32IFD-NEXT: sw zero, 8(sp)
-; RV32IFD-NEXT: sw zero, 12(sp)
-; RV32IFD-NEXT: fld fa5, 8(sp)
+; RV32IFD-NEXT: fcvt.d.w fa5, zero
; RV32IFD-NEXT: fadd.d fa4, fa0, fa5
; RV32IFD-NEXT: fadd.d fa5, fa1, fa5
; RV32IFD-NEXT: fnmsub.d fa0, fa4, fa5, fa2
-; RV32IFD-NEXT: addi sp, sp, 16
; RV32IFD-NEXT: ret
;
; RV64IFD-LABEL: fnmsub_d_contract:
diff --git a/llvm/test/CodeGen/RISCV/GlobalISel/instruction-select/fp-constant.mir b/llvm/test/CodeGen/RISCV/GlobalISel/instruction-select/fp-constant.mir
index e82d4bcec48b1..4db80c6c1141f 100644
--- a/llvm/test/CodeGen/RISCV/GlobalISel/instruction-select/fp-constant.mir
+++ b/llvm/test/CodeGen/RISCV/GlobalISel/instruction-select/fp-constant.mir
@@ -164,10 +164,8 @@ body: |
; RV32-LABEL: name: double_positive_zero
; RV32: liveins: $x10
; RV32-NEXT: {{ $}}
- ; RV32-NEXT: [[COPY:%[0-9]+]]:gpr = COPY $x0
- ; RV32-NEXT: [[COPY1:%[0-9]+]]:gpr = COPY $x0
- ; RV32-NEXT: [[BuildPairF64Pseudo:%[0-9]+]]:fpr64 = BuildPairF64Pseudo [[COPY1]], [[COPY]]
- ; RV32-NEXT: $f10_d = COPY [[BuildPairF64Pseudo]]
+ ; RV32-NEXT: [[FCVT_D_W:%[0-9]+]]:fpr64 = FCVT_D_W $x0, 0
+ ; RV32-NEXT: $f10_d = COPY [[FCVT_D_W]]
; RV32-NEXT: PseudoRET implicit $f10_d
;
; RV64-LABEL: name: double_positive_zero
>From d49aab10bd424f67a0df0d70f653f8deeb498a16 Mon Sep 17 00:00:00 2001
From: Brox Chen <guochen2 at amd.com>
Date: Mon, 18 Aug 2025 14:01:19 -0400
Subject: [PATCH 063/112] Revert "[AMDGPU][True16][CodeGen] use vgpr16 for
zext patterns (#1538… (#154163)
This reverts commit 7c53c6162bd43d952546a3ef7d019babd5244c29.
This patch hit an issue in a HIP test. Reverting for now; it will be reopened later.
---
llvm/lib/Target/AMDGPU/SIInstructions.td | 22 -
llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll | 2 +-
.../CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll | 11901 ++++++++--------
.../CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll | 1148 +-
.../CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll | 1320 +-
.../CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll | 2886 ++--
.../CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll | 240 +-
.../CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll | 5414 +++----
.../CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll | 637 +-
.../CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll | 594 +-
.../AMDGPU/amdgpu-llvm-debuginfo-analyzer.ll | 1 -
.../atomic_optimizations_global_pointer.ll | 64 +-
llvm/test/CodeGen/AMDGPU/bf16.ll | 14 +-
.../buffer-fat-pointer-atomicrmw-fadd.ll | 42 +-
.../buffer-fat-pointer-atomicrmw-fmax.ll | 42 +-
.../buffer-fat-pointer-atomicrmw-fmin.ll | 42 +-
.../CodeGen/AMDGPU/calling-conventions.ll | 100 +-
llvm/test/CodeGen/AMDGPU/clamp-modifier.ll | 4 +-
llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll | 42 +-
.../test/CodeGen/AMDGPU/dynamic_stackalloc.ll | 5 +-
.../CodeGen/AMDGPU/flat-atomicrmw-fadd.ll | 106 +-
.../CodeGen/AMDGPU/flat-atomicrmw-fmax.ll | 110 +-
.../CodeGen/AMDGPU/flat-atomicrmw-fmin.ll | 110 +-
.../CodeGen/AMDGPU/flat-atomicrmw-fsub.ll | 106 +-
llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll | 2 +-
.../AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll | 6 +-
llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll | 6 +-
llvm/test/CodeGen/AMDGPU/function-args.ll | 251 +-
.../AMDGPU/gfx-callable-argument-types.ll | 222 +-
.../CodeGen/AMDGPU/global-atomicrmw-fadd.ll | 106 +-
.../CodeGen/AMDGPU/global-atomicrmw-fmax.ll | 110 +-
.../CodeGen/AMDGPU/global-atomicrmw-fmin.ll | 110 +-
.../CodeGen/AMDGPU/global-atomicrmw-fsub.ll | 106 +-
llvm/test/CodeGen/AMDGPU/idot4u.ll | 41 +-
.../CodeGen/AMDGPU/integer-mad-patterns.ll | 28 +-
.../CodeGen/AMDGPU/local-atomicrmw-fadd.ll | 60 +-
.../CodeGen/AMDGPU/local-atomicrmw-fmax.ll | 68 +-
.../CodeGen/AMDGPU/local-atomicrmw-fmin.ll | 68 +-
.../CodeGen/AMDGPU/local-atomicrmw-fsub.ll | 60 +-
llvm/test/CodeGen/AMDGPU/mad-mix-lo.ll | 31 +-
llvm/test/CodeGen/AMDGPU/mad.u16.ll | 7 +-
llvm/test/CodeGen/AMDGPU/preserve-hi16.ll | 54 +-
.../CodeGen/AMDGPU/shrink-add-sub-constant.ll | 6 +-
llvm/test/CodeGen/AMDGPU/vector-reduce-add.ll | 126 +-
.../test/CodeGen/AMDGPU/vector-reduce-umin.ll | 78 +-
45 files changed, 14018 insertions(+), 12480 deletions(-)
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 6488fa3dacfb3..bd5dfa92a8e43 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -3056,8 +3056,6 @@ def : GCNPat<
}
} // AddedComplexity = 1
-foreach p = [NotHasTrue16BitInsts, UseFakeTrue16Insts] in
-let True16Predicate = p in {
def : GCNPat<
(i32 (DivergentUnaryFrag<zext> i16:$src)),
(V_AND_B32_e64 (S_MOV_B32 (i32 0xffff)), $src)
@@ -3073,26 +3071,6 @@ def : GCNPat<
def : GCNPat<
(i32 (zext (i16 (bitconvert fp16_zeros_high_16bits:$src)))),
(COPY VSrc_b16:$src)>;
-}
-
-let True16Predicate = UseRealTrue16Insts in {
-def : GCNPat<
- (i32 (DivergentUnaryFrag<zext> i16:$src)),
- (REG_SEQUENCE VGPR_32, $src, lo16, (V_MOV_B16_t16_e64 0, (i16 0), 0), hi16)
->;
-
-def : GCNPat<
- (i64 (DivergentUnaryFrag<zext> i16:$src)),
- (REG_SEQUENCE VReg_64,
- (REG_SEQUENCE VGPR_32, $src, lo16, (V_MOV_B16_t16_e64 0, (i16 0), 0), hi16), sub0,
- (S_MOV_B32 (i32 0)), sub1)
->;
-
-def : GCNPat<
- (i32 (zext (i16 (bitconvert fp16_zeros_high_16bits:$src)))),
- (REG_SEQUENCE VGPR_32, $src, lo16, (V_MOV_B16_t16_e64 0, (i16 0), 0), hi16)
->;
-}
def : GCNPat <
(i32 (trunc i64:$a)),
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
index 637aaf7529364..01854c8560ce2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
@@ -164,7 +164,7 @@ define zeroext i16 @v_mul_i16_zeroext(i16 zeroext %num, i16 zeroext %den) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: v_mul_i16_zeroext:
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll
index d03d6a8940b2f..0d5f538215f18 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll
@@ -6309,64 +6309,64 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
@@ -6394,50 +6394,50 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
@@ -6498,50 +6498,50 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v1
; GFX11-TRUE16-NEXT: .LBB12_4: ; %end
@@ -6549,266 +6549,307 @@ define <128 x i8> @bitcast_v32i32_to_v128i8(<32 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v1.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v66, 0xffff, v66
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v55, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v66, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v4.l, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v55, v39
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v55, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v7.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v135.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v8.l, v33.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v53, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v131.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v53, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v53
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v53
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v118.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v51, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v51, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v49.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v50, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v114.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v50, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v113.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v49, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v50
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v18.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v50
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v49, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v19.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v101.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v48, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v49
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v48, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v49
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v21.l, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v98.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v69.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v22.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v97.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v24.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v25.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v84.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v27.l, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v28.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v80.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v30.l, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v39
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v31.l, v31.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v68.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v32.l, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v39
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
@@ -15372,63 +15413,63 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:384
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:380
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:376
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:372
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:368
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:364
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:360
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:356
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:352
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:348
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:344
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:372
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:368
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:364
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:360
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:356
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:352
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:348
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:344
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:340
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v52, off, s32 offset:336
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:332
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v53, off, s32 offset:328
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:324
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:324
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v53, off, s32 offset:320
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:316
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:316
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v54, off, s32 offset:312
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:308
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v54, off, s32 offset:304
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:300
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v55, off, s32 offset:296
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:292
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:288
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:284
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:280
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:276
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:272
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:268
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:264
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:292
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:288
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:284
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:280
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:276
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:272
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:268
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:264
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v39, off, s32 offset:260
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:256
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:256
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v48, off, s32 offset:252
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:248
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v48, off, s32 offset:244
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:240
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:240
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v49, off, s32 offset:236
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:232
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:228
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:224
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:220
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:216
-; GFX11-TRUE16-NEXT: scratch_load_b32 v114, off, s32 offset:388
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:232
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:228
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:224
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:220
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:216
+; GFX11-TRUE16-NEXT: scratch_load_b32 v103, off, s32 offset:388
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v82, off, s32 offset:8
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:16
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v83, off, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v99, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:96
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v101, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:112
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v160, off, s32 offset:120
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v160, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v161, off, s32 offset:136
@@ -15442,143 +15483,144 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v164, off, s32 offset:192
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v165, off, s32 offset:200
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v165, off, s32 offset:208
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:212
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:204
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:196
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:188
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:180
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:172
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:164
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:212
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:204
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:196
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:188
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:180
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:172
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:164
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v84, off, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v86, off, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v96, off, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v103, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v103, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v113, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v102, off, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v112, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v114, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v115, off, s32 offset:12
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v115, off, s32 offset:4
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v116, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v117.l, v30.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v118.h, v28.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.l, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.h, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v132.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v128.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v129.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.h, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v145.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v144.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v131.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v147.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v148.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v150.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v151.l, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v146.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v134.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.h, 8, v29.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(62)
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v50.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(54)
-; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v114
+; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v103
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(53)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v81.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.l, 8, v80.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(52)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v82.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.h, 8, v82.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(51)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v82.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(50)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v83.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(49)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v84.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(48)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v85.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v119.h, 8, v85.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(47)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v86.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(46)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v87.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v87.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(45)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v96.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(44)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.l, 8, v97.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v115.h, 8, v97.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(43)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v98.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v98.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(42)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v112.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.h, 8, v99.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(41)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v100.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.l, 8, v99.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(40)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.l, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.h, 8, v101.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(39)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.h, 8, v102.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v101.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(38)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v100.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.h, 8, v160.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(37)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v160.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v160.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(36)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.h, 8, v161.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(35)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v161.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v161.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(34)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.l, 8, v162.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.h, 8, v162.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v162.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(32)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v163.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v96.h, 8, v163.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.h, 8, v163.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v163.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(30)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v164.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v86.h, 8, v164.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.h, 8, v164.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v164.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(28)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.l, 8, v165.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v84.h, 8, v165.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v165.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v81.h, 8, v71.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v71.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.l, 8, v70.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v67.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v67.h, 8, v66.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v165.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v65.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.h, 8, v54.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v53.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v52.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v51.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v50.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v65.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v54.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v53.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v52.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v49.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
; GFX11-TRUE16-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX11-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -15592,660 +15634,746 @@ define <32 x i32> @bitcast_v128i8_to_v32i32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB14_3: ; %cmp.false
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v149.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v146.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v146.l
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v0.l, v151.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v151.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v0.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v150.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v145.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v144.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v149, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v1.h, v150.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v132.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v131.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v148.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v130.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v149, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v2.l, v148.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v151.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v148.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v150.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v146.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v151.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v150.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v147.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v0.h, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v149.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v144.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v133.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v148.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v144.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v145.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v1.h, v147.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v134.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v134.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v145.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v130.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v149, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v3.l, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v147.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v135.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v119.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v118.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v149, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v4.l, v144.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v4.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v133.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v117.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v131.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v2.l, v146.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v131.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v135.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v3.l, v145.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v128.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v118.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v115.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v149, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v5.l, v135.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v5.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v132.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v129.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v113.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v112.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v149, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v6.l, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v6.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v103.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v128.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v103.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v101.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v149, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v7.l, v131.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v7.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v118.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v100.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v99.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v98.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v149, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v8.l, v129.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v116.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v114.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v96.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v149, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v9.l, v128.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v9.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v113.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v86.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v149, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v10.l, v117.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v10.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v102.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v84.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v82.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v81.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v149, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v11.l, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v11.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v101.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v99.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v80.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v80.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v149, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v12.l, v114.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v12.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v116.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v4.l, v135.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v129.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v134.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v5.l, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v132.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v114.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v6.l, v132.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v130.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v112.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v7.l, v130.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v129.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v102.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v8.l, v128.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v112.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v119.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v100.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v9.l, v118.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v102.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v117.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v97.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v10.l, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v100.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v115.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v87.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v11.l, v114.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v98.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v11.h, v113.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v85.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v12.l, v113.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v103.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v83.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v13.l, v103.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v101.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v81.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v14.l, v101.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v14.h, v99.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v71.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v15.l, v99.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v15.h, v98.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v69.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v16.l, v97.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v71.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v16.h, v96.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v67.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v17.l, v87.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v18.l
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v69.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v97.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v69.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v68.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v149, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v13.l, v112.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v13.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v87.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v67.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v17.h, v86.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v64.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v18.l, v85.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v67.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v18.h, v84.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v51.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v19.l, v83.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v65.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v65.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v149, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v14.l, v102.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v14.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v85.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v83.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v50.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v149, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v15.l, v100.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v15.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v82.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v149, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v16.l, v98.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v16.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v71.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v149, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v17.l, v97.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v17.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v70.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v68.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v149, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v18.l, v87.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v18.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v66.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v149, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v19.l, v85.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v19.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v149, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v20.l, v83.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v20.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v54.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v149, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v21.l, v81.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v21.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v149, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v22.l, v71.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v22.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v19.h, v82.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v20.l, v82.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v20.h, v80.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v21.l, v80.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v21.h, v70.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v22.l, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v22.h, v68.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v23.l, v68.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v23.h, v66.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v24.l, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v24.h, v65.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v27, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v25.l, v64.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v25.h, v55.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v28, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v26.l, v55.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v26.h, v54.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v27.l, v54.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v149, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v23.l, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v23.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v51.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v149, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v24.l, v67.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v24.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v149, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v25.l, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v25.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v149, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v26.l, v64.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v26.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v149, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v27.l, v54.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v27.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v149, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v28.l, v53.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v28.l, v149.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v28.h, v53.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v28.l, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v28.h, v52.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v33.l, v29.h, v51.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v34
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v29.l, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v149, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v29.l, v52.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v29.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v149, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v30.l, v51.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v30.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v149, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v31.l, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v31.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v30.l, v50.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v32.l, v49.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v149, v31
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB14_2
; GFX11-TRUE16-NEXT: .LBB14_4: ; %cmp.true
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v149.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v149.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v146.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v146.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v151.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v148.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v150.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v146.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v147.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v151.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v150.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v145.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v134.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v31, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v134.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v148.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v148.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v144.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v149.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v31, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v31.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v147.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v147.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v132.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v149.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v147.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v134.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v133.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v146.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v148.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v145.l, v3.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v131.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v31, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v130.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v130.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v144.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v145.l, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v145.h, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v131.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v31, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v135.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v135.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v119.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v31, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v118.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v117.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v133.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v133.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v128.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v135.h, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v135.l, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v129.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.l, v32.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v134.l, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v118.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v133.l, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v31, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v131.h, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v132.l, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v132.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v115.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v31, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v113.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v112.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v129.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v129.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v116.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v132.l, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v117.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v130.h, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v114.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v130.l, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v31, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v128.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v128.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v103.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v103.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v31, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v101.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v100.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v117.h, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v118.l, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v129.l, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v112.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v128.l, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v112.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v119.h, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v102.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v118.l, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v102.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v31, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v116.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v116.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v99.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v98.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v31, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v96.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v96.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v114.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v114.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v117.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v100.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v116.l, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v100.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v115.h, v11.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v97.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v114.h, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v98.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v31, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v112.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v113.l, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v86.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v31, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v84.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v84.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v102.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v102.h, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v113.h, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v87.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v113.l, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v96.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v103.h, v13.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v85.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v103.l, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v31, v16
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v100.h, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v101.l, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v82.h, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v101.h, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, v83.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v101.l, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v84.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, 0x300, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v99.h, v15.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v81.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v31, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v80.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v80.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v98.h, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v99.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v99.l, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v81.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v31, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, 0x300, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v97.l, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v97.h, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v98.l, v16.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, v71.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v97.l, v16.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, v71.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v16.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, 0x300, v16.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v96.h, v17.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v31, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, 0x300, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v68.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v67.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v87.l, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v87.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v31, v20
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v85.l, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v85.h, v18.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v86.h, v18.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, v67.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v85.l, v18.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, v67.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v18.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, 0x300, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v84.h, v19.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v64.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v83.h, v19.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v65.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v65.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v31, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, 0x300, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v50.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v83.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v83.h, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v31, v22
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, 0x300, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v81.h, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v82.l, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v49.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v31, v23
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, 0x300, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v48.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v71.l, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v71.h, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v82.h, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, v51.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v82.l, v20.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, v52.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v20.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, 0x300, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v80.h, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v80.l, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v31, v24
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, 0x300, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v70.l, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v70.h, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v39.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v31, v25
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, 0x300, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v38.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v67.h, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v68.l, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v70.h, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, v39.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v70.l, v22.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, v48.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v24
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v22.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v68.h, v23.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v68.l, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v31, v26
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v66.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v66.h, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v37.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v31, v27
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v36.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v64.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v64.h, v25.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v66.l, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, v37.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v66.h, v24.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, v38.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v64.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v26, v31
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v65.l, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v25.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v31, v28
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v55.l, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v31, v29
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v34.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v53.h, v27.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v54.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v27, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v55.l, v26.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, v36.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v55.h, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, v35.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v28
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, 0x300, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v54.l, v27.h
; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v31, v30
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v34.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v53.l, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v33.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v53.h, v28.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, v34.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, v33.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, 0x300, v28.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v31, v34
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v53.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v30.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v51.h, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v52.l, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v31, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v50.h, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v51.l, v30.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v31, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.h, 0x300, v32.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v31, v32
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v50.h, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v51.l, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v50.l, v30.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.l, 0x300, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v30.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v49.h, v30.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v32.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -42028,64 +42156,64 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
@@ -42113,50 +42241,50 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
@@ -42200,50 +42328,50 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v1
; GFX11-TRUE16-NEXT: .LBB36_4: ; %end
@@ -42251,266 +42379,307 @@ define <128 x i8> @bitcast_v32f32_to_v128i8(<32 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v1.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v66, 0xffff, v66
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v55, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v66, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v4.l, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v55, v39
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v55, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v7.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v135.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v8.l, v33.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v53, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v131.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v53, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v53
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v53
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v118.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v51, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v51, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v49.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v50, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v114.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v50, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v113.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v49, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v50
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v18.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v50
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v49, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v19.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v101.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v48, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v49
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v48, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v49
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v21.l, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v98.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v69.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v22.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v97.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v24.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v25.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v84.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v27.l, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v28.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v80.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v30.l, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v39
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v31.l, v31.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v68.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v32.l, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v39
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
@@ -52041,63 +52210,63 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:384
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:380
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:376
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:372
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:368
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:364
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:360
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:356
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:352
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:348
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:344
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:372
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:368
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:364
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:360
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:356
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:352
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:348
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:344
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:340
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v52, off, s32 offset:336
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:332
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v53, off, s32 offset:328
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:324
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:324
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v53, off, s32 offset:320
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:316
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:316
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v54, off, s32 offset:312
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:308
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v54, off, s32 offset:304
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:300
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v55, off, s32 offset:296
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:292
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:288
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:284
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:280
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:276
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:272
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:268
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:264
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:292
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:288
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:284
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:280
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:276
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:272
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:268
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:264
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v39, off, s32 offset:260
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:256
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:256
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v48, off, s32 offset:252
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:248
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v48, off, s32 offset:244
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:240
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:240
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v49, off, s32 offset:236
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:232
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:228
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:224
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:220
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:216
-; GFX11-TRUE16-NEXT: scratch_load_b32 v114, off, s32 offset:388
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:232
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:228
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:224
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:220
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:216
+; GFX11-TRUE16-NEXT: scratch_load_b32 v103, off, s32 offset:388
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v82, off, s32 offset:8
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:16
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v83, off, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v99, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:96
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v101, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:112
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v160, off, s32 offset:120
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v160, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v161, off, s32 offset:136
@@ -52111,143 +52280,144 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v164, off, s32 offset:192
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v165, off, s32 offset:200
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v165, off, s32 offset:208
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:212
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:204
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:196
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:188
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:180
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:172
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:164
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:212
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:204
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:196
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:188
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:180
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:172
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:164
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v84, off, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v86, off, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v96, off, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v103, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v103, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v113, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v102, off, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v112, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v114, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v115, off, s32 offset:12
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v115, off, s32 offset:4
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v116, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v117.l, v30.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v118.h, v28.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.l, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.h, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v132.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v128.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v129.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.h, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v145.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v144.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v131.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v147.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v148.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v150.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v151.l, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v146.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v134.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.h, 8, v29.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(62)
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v50.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(54)
-; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v114
+; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v103
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(53)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v81.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.l, 8, v80.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(52)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v82.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.h, 8, v82.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(51)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v82.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(50)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v83.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(49)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v84.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(48)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v85.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v119.h, 8, v85.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(47)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v86.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(46)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v87.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v87.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(45)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v96.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(44)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.l, 8, v97.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v115.h, 8, v97.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(43)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v98.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v98.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(42)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v112.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.h, 8, v99.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(41)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v100.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.l, 8, v99.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(40)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.l, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.h, 8, v101.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(39)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.h, 8, v102.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v101.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(38)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v100.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.h, 8, v160.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(37)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v160.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v160.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(36)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.h, 8, v161.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(35)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v161.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v161.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(34)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.l, 8, v162.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.h, 8, v162.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v162.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(32)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v163.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v96.h, 8, v163.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.h, 8, v163.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v163.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(30)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v164.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v86.h, 8, v164.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.h, 8, v164.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v164.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(28)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.l, 8, v165.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v84.h, 8, v165.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v165.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v81.h, 8, v71.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v71.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.l, 8, v70.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v67.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v67.h, 8, v66.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v165.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v65.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.h, 8, v54.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v53.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v52.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v51.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v50.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v65.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v54.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v53.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v52.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v49.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
; GFX11-TRUE16-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX11-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -52261,660 +52431,746 @@ define <32 x float> @bitcast_v128i8_to_v32f32(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB38_3: ; %cmp.false
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v149.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v146.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v146.l
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v0.l, v151.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v151.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v0.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v150.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v145.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v144.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v149, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v1.h, v150.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v132.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v131.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v148.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v130.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v149, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v2.l, v148.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v151.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v148.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v150.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v146.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v151.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v150.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v147.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v0.h, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v149.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v144.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v133.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v148.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v144.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v145.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v1.h, v147.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v134.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v134.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v145.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v130.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v149, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v3.l, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v147.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v135.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v119.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v118.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v149, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v4.l, v144.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v4.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v133.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v117.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v131.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v2.l, v146.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v131.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v135.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v3.l, v145.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v128.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v118.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v115.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v149, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v5.l, v135.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v5.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v132.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v129.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v113.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v112.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v149, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v6.l, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v6.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v103.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v128.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v103.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v101.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v149, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v7.l, v131.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v7.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v118.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v100.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v99.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v98.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v149, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v8.l, v129.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v116.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v114.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v96.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v149, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v9.l, v128.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v9.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v113.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v86.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v149, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v10.l, v117.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v10.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v102.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v84.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v82.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v81.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v149, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v11.l, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v11.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v101.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v99.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v80.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v80.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v149, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v12.l, v114.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v12.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v116.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v4.l, v135.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v129.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v134.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v5.l, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v132.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v114.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v6.l, v132.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v130.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v112.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v7.l, v130.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v129.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v102.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v8.l, v128.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v112.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v119.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v100.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v9.l, v118.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v102.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v117.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v97.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v10.l, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v100.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v115.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v87.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v11.l, v114.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v98.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v11.h, v113.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v85.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v12.l, v113.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v103.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v83.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v13.l, v103.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v101.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v81.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v14.l, v101.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v14.h, v99.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v71.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v15.l, v99.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v15.h, v98.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v69.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v16.l, v97.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v71.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v16.h, v96.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v67.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v17.l, v87.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v18.l
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v69.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v97.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v69.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v68.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v149, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v13.l, v112.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v13.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v87.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v67.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v17.h, v86.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v64.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v18.l, v85.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v67.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v18.h, v84.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v51.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v19.l, v83.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v65.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v65.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v149, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v14.l, v102.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v14.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v85.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v83.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v50.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v149, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v15.l, v100.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v15.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v82.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v149, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v16.l, v98.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v16.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v71.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v149, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v17.l, v97.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v17.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v70.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v68.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v149, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v18.l, v87.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v18.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v66.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v149, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v19.l, v85.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v19.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v149, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v20.l, v83.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v20.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v54.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v149, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v21.l, v81.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v21.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v149, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v22.l, v71.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v22.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v19.h, v82.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v20.l, v82.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v20.h, v80.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v21.l, v80.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v21.h, v70.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v22.l, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v22.h, v68.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v23.l, v68.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v23.h, v66.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v24.l, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v24.h, v65.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v27, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v25.l, v64.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v25.h, v55.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v28, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v26.l, v55.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v26.h, v54.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v27.l, v54.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v149, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v23.l, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v23.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v51.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v149, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v24.l, v67.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v24.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v149, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v25.l, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v25.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v149, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v26.l, v64.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v26.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v149, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v27.l, v54.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v27.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v149, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v28.l, v53.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v28.l, v149.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v28.h, v53.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v28.l, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v28.h, v52.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v33.l, v29.h, v51.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v34
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v29.l, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v149, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v29.l, v52.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v29.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v149, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v30.l, v51.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v30.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v149, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v31.l, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v31.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v30.l, v50.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v32.l, v49.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v149, v31
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB38_2
; GFX11-TRUE16-NEXT: .LBB38_4: ; %cmp.true
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v149.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v149.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v146.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v146.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v151.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v148.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v150.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v146.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v147.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v151.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v150.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v145.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v134.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v31, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v134.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v148.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v148.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v144.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v149.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v31, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v31.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v147.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v147.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v132.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v149.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v147.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v134.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v133.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v146.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v148.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v145.l, v3.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v131.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v31, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v130.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v130.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v144.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v145.l, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v145.h, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v131.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v31, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v135.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v135.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v119.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v31, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v118.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v117.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v133.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v133.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v128.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v135.h, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v135.l, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v129.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.l, v32.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v134.l, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v118.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v133.l, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v31, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v131.h, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v132.l, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v132.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v115.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v31, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v113.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v112.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v129.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v129.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v116.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v132.l, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v117.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v130.h, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v114.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v130.l, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v31, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v128.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v128.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v103.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v103.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v31, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v101.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v100.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v117.h, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v118.l, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v129.l, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v112.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v128.l, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v112.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v119.h, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v102.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v118.l, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v102.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v31, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v116.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v116.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v99.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v98.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v31, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v96.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v96.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v114.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v114.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v117.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v100.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v116.l, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v100.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v115.h, v11.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v97.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v114.h, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v98.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v31, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v112.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v113.l, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v86.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v31, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v84.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v84.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v102.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v102.h, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v113.h, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v87.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v113.l, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v96.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v103.h, v13.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v85.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v103.l, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v31, v16
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v100.h, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v101.l, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v82.h, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v101.h, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, v83.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v101.l, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v84.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, 0x300, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v99.h, v15.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v81.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v31, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v80.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v80.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v98.h, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v99.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v99.l, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v81.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v31, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, 0x300, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v97.l, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v97.h, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v98.l, v16.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, v71.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v97.l, v16.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, v71.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v16.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, 0x300, v16.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v96.h, v17.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v31, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, 0x300, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v68.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v67.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v87.l, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v87.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v31, v20
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v85.l, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v85.h, v18.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v86.h, v18.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, v67.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v85.l, v18.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, v67.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v18.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, 0x300, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v84.h, v19.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v64.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v83.h, v19.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v65.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v65.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v31, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, 0x300, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v50.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v83.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v83.h, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v31, v22
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, 0x300, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v81.h, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v82.l, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v49.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v31, v23
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, 0x300, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v48.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v71.l, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v71.h, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v82.h, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, v51.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v82.l, v20.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, v52.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v20.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, 0x300, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v80.h, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v80.l, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v31, v24
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, 0x300, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v70.l, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v70.h, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v39.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v31, v25
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, 0x300, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v38.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v67.h, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v68.l, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v70.h, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, v39.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v70.l, v22.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, v48.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v24
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v22.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v68.h, v23.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v68.l, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v31, v26
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v66.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v66.h, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v37.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v31, v27
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v36.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v64.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v64.h, v25.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v66.l, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, v37.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v66.h, v24.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, v38.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v64.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v26, v31
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v65.l, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v25.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v31, v28
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v55.l, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v31, v29
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v34.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v53.h, v27.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v54.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v27, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v55.l, v26.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, v36.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v55.h, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, v35.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v28
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, 0x300, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v54.l, v27.h
; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v31, v30
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v34.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v53.l, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v33.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v53.h, v28.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, v34.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, v33.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, 0x300, v28.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v31, v34
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v53.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v30.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v51.h, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v52.l, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v31, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v50.h, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v51.l, v30.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v31, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.h, 0x300, v32.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v31, v32
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v50.h, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v51.l, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v50.l, v30.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.l, 0x300, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v30.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v49.h, v30.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v32.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -77682,64 +77938,64 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
@@ -77767,50 +78023,50 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
@@ -77879,50 +78135,50 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v1
; GFX11-TRUE16-NEXT: .LBB56_4: ; %end
@@ -77930,266 +78186,307 @@ define <128 x i8> @bitcast_v16i64_to_v128i8(<16 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v1.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v66, 0xffff, v66
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v55, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v66, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v4.l, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v55, v39
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v55, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v7.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v135.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v8.l, v33.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v53, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v131.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v53, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v53
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v53
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v118.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v51, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v51, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v49.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v50, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v114.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v50, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v113.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v49, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v50
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v18.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v50
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v49, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v19.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v101.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v48, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v49
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v48, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v49
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v21.l, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v98.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v69.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v22.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v97.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v24.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v25.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v84.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v27.l, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v28.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v80.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v30.l, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v39
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v31.l, v31.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v68.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v32.l, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v39
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
@@ -86763,63 +87060,63 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:384
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:380
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:376
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:372
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:368
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:364
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:360
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:356
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:352
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:348
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:344
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:372
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:368
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:364
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:360
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:356
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:352
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:348
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:344
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:340
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v52, off, s32 offset:336
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:332
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v53, off, s32 offset:328
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:324
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:324
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v53, off, s32 offset:320
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:316
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:316
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v54, off, s32 offset:312
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:308
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v54, off, s32 offset:304
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:300
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v55, off, s32 offset:296
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:292
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:288
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:284
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:280
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:276
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:272
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:268
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:264
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:292
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:288
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:284
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:280
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:276
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:272
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:268
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:264
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v39, off, s32 offset:260
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:256
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:256
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v48, off, s32 offset:252
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:248
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v48, off, s32 offset:244
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:240
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:240
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v49, off, s32 offset:236
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:232
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:228
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:224
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:220
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:216
-; GFX11-TRUE16-NEXT: scratch_load_b32 v114, off, s32 offset:388
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:232
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:228
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:224
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:220
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:216
+; GFX11-TRUE16-NEXT: scratch_load_b32 v103, off, s32 offset:388
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v82, off, s32 offset:8
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:16
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v83, off, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v99, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:96
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v101, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:112
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v160, off, s32 offset:120
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v160, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v161, off, s32 offset:136
@@ -86833,143 +87130,144 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v164, off, s32 offset:192
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v165, off, s32 offset:200
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v165, off, s32 offset:208
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:212
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:204
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:196
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:188
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:180
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:172
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:164
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:212
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:204
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:196
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:188
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:180
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:172
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:164
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v84, off, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v86, off, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v96, off, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v103, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v103, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v113, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v102, off, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v112, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v114, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v115, off, s32 offset:12
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v115, off, s32 offset:4
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v116, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v117.l, v30.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v118.h, v28.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.l, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.h, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v132.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v128.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v129.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.h, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v145.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v144.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v131.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v147.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v148.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v150.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v151.l, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v146.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v134.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.h, 8, v29.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(62)
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v50.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(54)
-; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v114
+; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v103
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(53)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v81.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.l, 8, v80.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(52)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v82.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.h, 8, v82.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(51)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v82.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(50)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v83.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(49)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v84.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(48)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v85.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v119.h, 8, v85.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(47)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v86.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(46)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v87.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v87.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(45)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v96.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(44)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.l, 8, v97.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v115.h, 8, v97.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(43)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v98.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v98.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(42)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v112.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.h, 8, v99.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(41)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v100.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.l, 8, v99.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(40)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.l, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.h, 8, v101.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(39)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.h, 8, v102.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v101.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(38)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v100.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.h, 8, v160.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(37)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v160.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v160.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(36)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.h, 8, v161.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(35)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v161.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v161.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(34)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.l, 8, v162.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.h, 8, v162.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v162.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(32)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v163.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v96.h, 8, v163.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.h, 8, v163.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v163.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(30)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v164.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v86.h, 8, v164.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.h, 8, v164.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v164.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(28)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.l, 8, v165.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v84.h, 8, v165.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v165.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v81.h, 8, v71.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v71.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.l, 8, v70.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v67.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v67.h, 8, v66.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v165.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v65.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.h, 8, v54.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v53.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v52.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v51.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v50.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v65.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v54.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v53.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v52.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v49.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
; GFX11-TRUE16-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX11-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -86983,660 +87281,746 @@ define <16 x i64> @bitcast_v128i8_to_v16i64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB58_3: ; %cmp.false
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v149.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v146.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v146.l
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v0.l, v151.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v151.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v0.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v150.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v145.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v144.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v149, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v1.h, v150.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v132.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v131.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v148.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v130.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v149, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v2.l, v148.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v151.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v148.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v150.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v146.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v151.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v150.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v147.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v0.h, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v149.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v144.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v133.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v148.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v144.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v145.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v1.h, v147.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v134.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v134.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v145.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v130.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v149, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v3.l, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v147.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v135.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v119.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v118.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v149, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v4.l, v144.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v4.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v133.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v117.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v131.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v2.l, v146.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v131.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v135.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v3.l, v145.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v128.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v118.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v115.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v149, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v5.l, v135.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v5.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v132.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v129.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v113.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v112.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v149, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v6.l, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v6.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v103.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v128.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v103.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v101.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v149, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v7.l, v131.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v7.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v118.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v100.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v99.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v98.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v149, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v8.l, v129.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v116.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v114.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v96.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v149, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v9.l, v128.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v9.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v113.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v86.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v149, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v10.l, v117.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v10.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v102.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v84.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v82.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v81.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v149, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v11.l, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v11.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v101.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v99.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v80.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v80.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v149, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v12.l, v114.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v12.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v116.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v4.l, v135.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v129.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v134.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v5.l, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v132.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v114.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v6.l, v132.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v130.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v112.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v7.l, v130.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v129.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v102.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v8.l, v128.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v112.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v119.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v100.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v9.l, v118.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v102.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v117.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v97.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v10.l, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v100.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v115.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v87.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v11.l, v114.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v98.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v11.h, v113.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v85.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v12.l, v113.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v103.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v83.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v13.l, v103.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v101.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v81.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v14.l, v101.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v14.h, v99.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v71.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v15.l, v99.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v15.h, v98.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v69.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v16.l, v97.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v71.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v16.h, v96.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v67.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v17.l, v87.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v18.l
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v69.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v97.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v69.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v68.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v149, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v13.l, v112.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v13.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v87.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v67.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v17.h, v86.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v64.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v18.l, v85.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v67.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v18.h, v84.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v51.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v19.l, v83.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v65.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v65.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v149, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v14.l, v102.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v14.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v85.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v83.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v50.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v149, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v15.l, v100.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v15.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v82.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v149, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v16.l, v98.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v16.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v71.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v149, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v17.l, v97.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v17.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v70.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v68.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v149, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v18.l, v87.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v18.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v66.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v149, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v19.l, v85.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v19.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v149, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v20.l, v83.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v20.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v54.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v149, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v21.l, v81.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v21.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v149, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v22.l, v71.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v22.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v19.h, v82.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v20.l, v82.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v20.h, v80.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v21.l, v80.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v21.h, v70.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v22.l, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v22.h, v68.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v23.l, v68.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v23.h, v66.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v24.l, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v24.h, v65.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v27, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v25.l, v64.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v25.h, v55.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v28, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v26.l, v55.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v26.h, v54.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v27.l, v54.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v149, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v23.l, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v23.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v51.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v149, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v24.l, v67.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v24.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v149, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v25.l, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v25.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v149, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v26.l, v64.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v26.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v149, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v27.l, v54.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v27.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v149, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v28.l, v53.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v28.l, v149.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v28.h, v53.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v28.l, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v28.h, v52.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v33.l, v29.h, v51.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v34
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v29.l, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v149, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v29.l, v52.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v29.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v149, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v30.l, v51.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v30.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v149, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v31.l, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v31.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v30.l, v50.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v32.l, v49.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v149, v31
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB58_2
; GFX11-TRUE16-NEXT: .LBB58_4: ; %cmp.true
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v149.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v149.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v146.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v146.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v151.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v148.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v150.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v146.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v147.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v151.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v150.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v145.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v134.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v31, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v134.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v148.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v148.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v144.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v149.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v31, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v31.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v147.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v147.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v132.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v149.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v147.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v134.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v133.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v146.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v148.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v145.l, v3.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v131.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v31, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v130.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v130.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v144.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v145.l, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v145.h, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v131.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v31, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v135.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v135.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v119.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v31, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v118.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v117.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v133.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v133.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v128.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v135.h, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v135.l, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v129.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.l, v32.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v134.l, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v118.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v133.l, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v31, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v131.h, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v132.l, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v132.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v115.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v31, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v113.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v112.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v129.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v129.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v116.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v132.l, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v117.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v130.h, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v114.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v130.l, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v31, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v128.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v128.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v103.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v103.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v31, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v101.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v100.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v117.h, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v118.l, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v129.l, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v112.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v128.l, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v112.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v119.h, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v102.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v118.l, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v102.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v31, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v116.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v116.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v99.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v98.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v31, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v96.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v96.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v114.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v114.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v117.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v100.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v116.l, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v100.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v115.h, v11.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v97.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v114.h, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v98.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v31, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v112.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v113.l, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v86.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v31, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v84.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v84.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v102.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v102.h, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v113.h, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v87.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v113.l, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v96.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v103.h, v13.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v85.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v103.l, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v31, v16
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v100.h, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v101.l, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v82.h, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v101.h, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, v83.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v101.l, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v84.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, 0x300, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v99.h, v15.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v81.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v31, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v80.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v80.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v98.h, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v99.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v99.l, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v81.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v31, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, 0x300, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v97.l, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v97.h, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v98.l, v16.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, v71.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v97.l, v16.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, v71.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v16.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, 0x300, v16.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v96.h, v17.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v31, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, 0x300, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v68.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v67.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v87.l, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v87.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v31, v20
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v85.l, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v85.h, v18.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v86.h, v18.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, v67.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v85.l, v18.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, v67.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v18.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, 0x300, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v84.h, v19.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v64.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v83.h, v19.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v65.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v65.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v31, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, 0x300, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v50.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v83.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v83.h, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v31, v22
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, 0x300, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v81.h, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v82.l, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v49.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v31, v23
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, 0x300, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v48.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v71.l, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v71.h, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v82.h, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, v51.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v82.l, v20.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, v52.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v20.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, 0x300, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v80.h, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v80.l, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v31, v24
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, 0x300, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v70.l, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v70.h, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v39.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v31, v25
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, 0x300, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v38.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v67.h, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v68.l, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v70.h, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, v39.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v70.l, v22.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, v48.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v24
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v22.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v68.h, v23.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v68.l, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v31, v26
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v66.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v66.h, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v37.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v31, v27
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v36.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v64.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v64.h, v25.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v66.l, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, v37.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v66.h, v24.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, v38.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v64.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v26, v31
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v65.l, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v25.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v31, v28
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v55.l, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v31, v29
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v34.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v53.h, v27.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v54.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v27, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v55.l, v26.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, v36.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v55.h, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, v35.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v28
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, 0x300, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v54.l, v27.h
; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v31, v30
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v34.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v53.l, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v33.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v53.h, v28.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, v34.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, v33.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, 0x300, v28.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v31, v34
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v53.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v30.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v51.h, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v52.l, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v31, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v50.h, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v51.l, v30.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v31, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.h, 0x300, v32.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v31, v32
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v50.h, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v51.l, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v50.l, v30.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.l, 0x300, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v30.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v49.h, v30.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v32.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -111416,64 +111800,64 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
@@ -111501,50 +111885,50 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
@@ -111588,50 +111972,50 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 24, v32
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v30
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 24, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v28
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v27
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v26
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v24
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 24, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v22
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v21
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v20
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 24, v18
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v9
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v1
; GFX11-TRUE16-NEXT: .LBB72_4: ; %end
@@ -111639,266 +112023,307 @@ define <128 x i8> @bitcast_v16f64_to_v128i8(<16 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v1.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v66, 0xffff, v66
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v55, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v66, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v4.l, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v55, v39
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v65
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v55, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v55, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v55, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v7.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v135.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v8.l, v33.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v53, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v53, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v131.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v53, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v53
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v53
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v118.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v51, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v51, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v49.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v50, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v50, 0xffff, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v114.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v50, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v113.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v49, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v50
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v18.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v50
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v49, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v19.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v101.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v48, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v49
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v81.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v48, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v49
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v21.l, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v98.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v69.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v22.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v97.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v24.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v25.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v84.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v27.l, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v28.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v80.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v30.l, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v39
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v31.l, v31.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v68.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v39
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v39.h, v32.l, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v39
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
@@ -121414,63 +121839,63 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:384
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:380
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:376
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:372
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:368
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:364
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:360
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:356
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:352
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:348
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:344
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:372
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:368
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:364
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:360
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:356
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v50, off, s32 offset:352
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:348
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v51, off, s32 offset:344
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:340
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v52, off, s32 offset:336
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:332
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v53, off, s32 offset:328
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:324
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:324
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v53, off, s32 offset:320
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:316
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:316
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v54, off, s32 offset:312
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:308
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v54, off, s32 offset:304
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:300
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v55, off, s32 offset:296
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:292
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:288
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:284
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:280
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:276
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:272
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:268
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:264
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:292
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:288
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:284
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:280
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v39, off, s32 offset:276
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:272
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:268
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:264
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v39, off, s32 offset:260
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:256
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:256
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v48, off, s32 offset:252
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:248
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v48, off, s32 offset:244
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:240
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:240
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v49, off, s32 offset:236
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:232
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v49, off, s32 offset:228
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:224
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v50, off, s32 offset:220
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:216
-; GFX11-TRUE16-NEXT: scratch_load_b32 v114, off, s32 offset:388
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v70, off, s32 offset:232
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v51, off, s32 offset:228
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v70, off, s32 offset:224
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v52, off, s32 offset:220
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:216
+; GFX11-TRUE16-NEXT: scratch_load_b32 v103, off, s32 offset:388
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v82, off, s32 offset:8
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:16
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v83, off, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v85, off, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v87, off, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v97, off, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v99, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:96
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v101, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:112
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v160, off, s32 offset:120
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v160, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v161, off, s32 offset:136
@@ -121484,143 +121909,144 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v164, off, s32 offset:192
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v165, off, s32 offset:200
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v165, off, s32 offset:208
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v55, off, s32 offset:212
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:204
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:196
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:188
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v68, off, s32 offset:180
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:172
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:164
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v80, off, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v82, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:212
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:204
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32 offset:196
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:188
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v69, off, s32 offset:180
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v69, off, s32 offset:172
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v71, off, s32 offset:164
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v71, off, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v81, off, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v81, off, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v83, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v84, off, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v84, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v85, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v86, off, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v86, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v87, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v96, off, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v96, off, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v98, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v99, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v101, off, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v103, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v103, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v113, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v97, off, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v98, off, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v100, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v100, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v102, off, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v102, off, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v112, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v112, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v114, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v115, off, s32 offset:12
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v115, off, s32 offset:4
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v116, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v117.l, v30.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v118.h, v28.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.l, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v130.h, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v132.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v128.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v129.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v134.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.h, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v145.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v144.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v131.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v147.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v148.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v150.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v151.l, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v151.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v150.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v147.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v149.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v146.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v148.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v145.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v135.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v133.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v134.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.h, 8, v29.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(62)
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v50.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(54)
-; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v114
+; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v103
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(53)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v132.l, 8, v81.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.l, 8, v80.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(52)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v82.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v130.h, 8, v82.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(51)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v82.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(50)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.l, 8, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v129.l, 8, v83.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(49)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v128.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v84.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(48)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v85.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v119.h, 8, v85.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(47)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v118.l, 8, v87.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v86.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(46)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.l, 8, v87.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v117.h, 8, v87.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(45)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v116.h, 8, v97.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v96.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(44)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.l, 8, v97.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v115.h, 8, v97.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(43)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v114.h, 8, v98.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v98.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(42)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v112.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.h, 8, v99.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(41)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v113.l, 8, v100.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.l, 8, v99.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(40)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.l, 8, v101.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v103.h, 8, v101.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(39)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v102.h, 8, v102.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v101.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(38)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v100.h, 8, v160.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.h, 8, v160.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(37)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v101.l, 8, v160.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v160.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(36)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.h, 8, v161.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.h, 8, v161.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(35)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v99.l, 8, v161.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v161.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(34)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.l, 8, v162.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v98.l, 8, v162.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v97.h, 8, v162.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v162.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(32)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.l, 8, v163.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v96.h, 8, v163.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v87.h, 8, v163.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v163.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(30)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.l, 8, v164.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v86.h, 8, v164.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v85.h, 8, v164.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v164.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(28)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.l, 8, v165.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v84.h, 8, v165.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v83.h, 8, v165.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v81.h, 8, v71.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v71.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.l, 8, v70.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v71.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v67.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v67.h, 8, v66.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.l, 8, v165.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v82.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v80.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v70.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v68.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.l, 8, v65.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v66.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.h, 8, v54.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v53.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v52.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v51.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v50.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v64.l, 8, v55.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v65.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v54.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v53.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v52.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v51.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v49.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
; GFX11-TRUE16-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX11-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
@@ -121634,660 +122060,746 @@ define <16 x double> @bitcast_v128i8_to_v16f64(<128 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB74_3: ; %cmp.false
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v149.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v149.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v146.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v146.l
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v0.l, v151.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v151.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v0.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v150.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v145.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v144.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v149, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v1.h, v150.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v132.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v131.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v148.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v130.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v149, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v2.l, v148.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v151.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v148.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v150.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v146.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v151.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v150.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v147.h
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v0.h, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v149.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v144.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v133.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v148.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v144.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v145.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v1.h, v147.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v134.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v134.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v145.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v130.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v149, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v3.l, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v147.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v3.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v135.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v119.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v118.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v149, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v4.l, v144.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v4.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v133.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v117.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v131.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v2.l, v146.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v131.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v135.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v3.l, v145.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v128.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v118.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v115.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v149, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v5.l, v135.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v5.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v132.l
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v129.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v113.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v112.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v149, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v6.l, v133.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v6.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v103.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v128.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v103.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v101.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v149, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v7.l, v131.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v7.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v118.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v100.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v99.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v98.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v149, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v8.l, v129.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v116.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v114.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v96.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v149, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v9.l, v128.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v9.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v113.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v86.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v149, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v10.l, v117.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v10.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v102.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v84.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v82.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v81.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v149, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v11.l, v116.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v11.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v101.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v99.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v80.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v80.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v149, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v12.l, v114.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v12.l, v149.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v116.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v4.l, v135.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v129.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v134.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v5.l, v133.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v132.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v114.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v6.l, v132.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v130.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v112.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v7.l, v130.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v115.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v129.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v102.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v8.l, v128.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v112.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v119.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v100.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v9.l, v118.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v102.h
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v117.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v97.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v10.l, v116.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v100.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v115.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v87.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v11.l, v114.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v98.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v11.h, v113.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v85.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v12.l, v113.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v103.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v83.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v13.l, v103.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v86.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v101.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v81.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v14.l, v101.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v84.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v14.h, v99.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v71.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v15.l, v99.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v15.h, v98.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v69.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v16.l, v97.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v71.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v16.h, v96.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v67.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v17.l, v87.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v18.l
; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v69.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v97.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v69.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v68.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v149, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v13.l, v112.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v13.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v87.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v67.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v17.h, v86.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v64.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v18.l, v85.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v67.h
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v18.h, v84.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v51.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v19.l, v83.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v65.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v65.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v149, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v14.l, v102.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v14.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v85.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v83.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v50.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v149, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v15.l, v100.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v15.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v82.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v149, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v16.l, v98.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v16.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v71.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v149, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v17.l, v97.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v17.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v70.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v68.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v149, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v18.l, v87.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v18.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v66.h
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v149, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v19.l, v85.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v19.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v149, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v20.l, v83.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v20.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v55.l
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v54.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v149, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v21.l, v81.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v21.l, v149.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v149, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v22.l, v71.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v22.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v19.h, v82.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v20.l, v82.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v20.h, v80.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v21.l, v80.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v21.h, v70.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v22.l, v70.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v22.h, v68.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v23.l, v68.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v23.h, v66.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v24.l, v66.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v24.h, v65.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v27, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v28
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v25.l, v64.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v25.h, v55.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v28, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v26.l, v55.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v26.h, v54.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v27.l, v54.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v35.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v149, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v23.l, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v23.l, v149.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v51.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v149, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v24.l, v67.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v24.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v149, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v25.l, v66.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v25.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v149, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v26.l, v64.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v26.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v149, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v27.l, v54.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v27.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v149, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v28.l, v53.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v28.l, v149.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v28.h, v53.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v34.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v28.l, v53.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v28.h, v52.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v33.l, v29.h, v51.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v34
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v29.l, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v149, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v29.l, v52.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v29.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v149, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v30.l, v51.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v30.l, v149.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v149, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v149.l, v31.l, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v31.l, v149.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v30.l, v50.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v32.l, v49.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v149, v31
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB74_2
; GFX11-TRUE16-NEXT: .LBB74_4: ; %cmp.true
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v149.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v149.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v146.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v146.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v151.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v148.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v150.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v146.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v147.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v151.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v150.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v145.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v134.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v31, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v134.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v148.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v148.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v144.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v151.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v150.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v144.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v149.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v31, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v31.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v147.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v147.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v132.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v149.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v147.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v134.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v133.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v146.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v148.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v145.l, v3.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v131.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v31, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v130.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v130.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v144.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v145.l, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v145.h, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v131.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v31, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v135.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v135.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v119.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v31, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v118.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v117.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v133.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v133.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v128.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v135.h, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v135.l, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v129.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.l, v32.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v134.l, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v118.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v133.l, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v119.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v31, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v131.h, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v132.l, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v132.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v115.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v31, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v113.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v112.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v129.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v129.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v116.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v132.l, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v117.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v130.h, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v114.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v130.l, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v115.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v31, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v128.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v128.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v103.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v103.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v31, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v101.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v100.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v117.h, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v118.l, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v129.l, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v112.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v128.l, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v112.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v119.h, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v102.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v118.l, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v102.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v31, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v116.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v116.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v99.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v98.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v31, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v96.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v96.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v114.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v114.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v117.h, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v100.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v116.l, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v100.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v115.h, v11.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v97.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v114.h, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v98.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v31, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v112.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v113.l, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v86.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v31, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v84.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v84.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v102.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v102.h, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v113.h, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v87.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v14, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v113.l, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v96.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v103.h, v13.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v85.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v103.l, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v86.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v31, v16
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v100.h, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v101.l, v14.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v82.h, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v101.h, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, v83.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v101.l, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v84.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, 0x300, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v99.h, v15.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v81.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v31, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v80.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.h, v80.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v98.h, v15.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v99.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v99.l, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v16.l, v81.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v31, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, 0x300, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v97.l, v16.l
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v97.h, v16.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v98.l, v16.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, v71.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v18, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v97.l, v16.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, v71.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v16.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, 0x300, v16.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v96.h, v17.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v69.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v31, v19
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v17.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, 0x300, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v68.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, v67.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v87.l, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v87.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v19, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v87.l, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.l, v69.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v31, v20
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v85.l, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v85.h, v18.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v17.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, 0x300, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v86.h, v18.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.h, v67.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v20, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v85.l, v18.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v19.l, v67.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v18.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, 0x300, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v84.h, v19.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v64.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v21, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v83.h, v19.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v65.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v65.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v31, v21
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v19.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, 0x300, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v20.h, v50.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v83.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v83.h, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v31, v22
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, 0x300, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v81.h, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v82.l, v20.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v49.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v31, v23
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v21.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, 0x300, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v48.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v71.l, v21.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v71.h, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v19.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v19.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v82.h, v20.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, v51.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v82.l, v20.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, v52.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v20.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, 0x300, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v80.h, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.h, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v23, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v80.l, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v22.l, v49.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v31, v24
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, 0x300, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v70.l, v22.l
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v70.h, v22.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v39.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v31, v25
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v23.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, 0x300, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v38.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v67.h, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v68.l, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v24, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v21.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, 0x300, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v70.h, v22.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.h, v39.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v24, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v70.l, v22.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v23.l, v48.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v24
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v22.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v22.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v68.h, v23.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.h, v38.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v25, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v68.l, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, v39.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v31, v26
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v66.l, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v66.h, v24.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v37.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v31, v27
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v25.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, 0x300, v25.h
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v36.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v64.l, v25.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v64.h, v25.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v23.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v66.l, v24.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.h, v37.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v26, v31
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v66.h, v24.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, v38.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v26
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v24.l, 0x300, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v64.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v26, v31
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v65.l, v25.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.l, v37.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v26.h, v36.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v27, 0xffff, v27
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v25.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v25.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v31, v28
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v55.l, v26.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v31, v29
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v27.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, 0x300, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v34.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v53.h, v27.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v54.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v27, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v55.l, v26.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, v36.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v55.h, v26.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, v35.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v28
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, 0x300, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.h, v34.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v29, v31
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v54.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v54.l, v27.h
; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v31, v30
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v34.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v53.l, v28.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v33.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v28.l, v35.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.l, 0x300, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v29, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v53.h, v28.h
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v29, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, v34.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.h, v33.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, 0x300, v28.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v31, v34
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v29.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v53.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v30.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.l, v32.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v30.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v51.h, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v52.l, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v52.h, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v31, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.h, 0x300, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v50.h, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v51.l, v30.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v31, v33
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.l, 0x300, v32.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v32.h, 0x300, v32.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v31, v32
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v50.h, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v29.l, 0x300, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v51.l, v30.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v33
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v50.l, v30.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v33.l, 0x300, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v31
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v30.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v33
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v49.h, v30.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v31
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v31.h, 0x300, v32.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v31
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -159577,162 +160089,159 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v40, s32 offset:168
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v41, s32 offset:164
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v42, s32 offset:160
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v43, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v44, s32 offset:152
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v45, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v46, s32 offset:144
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v47, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v56, s32 offset:136
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v57, s32 offset:132
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v58, s32 offset:128
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v59, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v60, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v61, s32 offset:116
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v62, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v63, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v72, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v73, s32 offset:100
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v74, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v75, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v76, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v77, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v78, s32 offset:80
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v79, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v88, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v89, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v90, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v91, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v92, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v93, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v94, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v95, s32 offset:44
-; GFX11-TRUE16-NEXT: s_clause 0x7
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v104, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v105, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v106, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v107, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v108, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v109, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v110, s32 offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b32 off, v111, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v40, s32 offset:156
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v41, s32 offset:152
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v42, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v43, s32 offset:144
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v44, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v45, s32 offset:136
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v46, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v47, s32 offset:128
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v56, s32 offset:124
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v57, s32 offset:120
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v58, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v59, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v60, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v61, s32 offset:104
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v62, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v63, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v72, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v73, s32 offset:88
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v74, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v75, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v76, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v77, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v78, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v79, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v88, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v89, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v90, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v91, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v92, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v93, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v94, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v95, s32 offset:32
+; GFX11-TRUE16-NEXT: s_clause 0x4
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v104, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v105, s32 offset:24
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v106, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v107, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_store_b32 off, v108, s32 offset:12
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_load_b32 v33, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_b32 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: scratch_load_b32 v31, off, s32
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr111_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr106_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr105_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr104_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr108_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr95_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr93_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr166_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr107_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr105_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr164_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr106_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr94_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr90_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr180_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr91_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr165_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr88_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr78_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr75_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr47_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr76_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr43_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr74_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr177_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr63_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr179_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr72_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr178_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr60_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr59_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr73_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr57_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr58_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr41_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr47_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr44_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr45_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr92_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr40_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr59_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr182_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr62_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr180_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr108_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr176_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr91_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr165_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr56_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr41_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr42_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr89_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr163_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr110_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr107_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr43_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr61_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr183_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr57_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr167_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr104_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr176_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr78_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr166_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr77_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr95_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr93_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr109_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr94_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr90_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr92_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr79_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr77_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr75_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr74_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr62_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr72_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr61_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr58_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr56_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr63_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr60_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr46_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr42_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr183_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr45_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr40_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr181_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr182_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr177_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr163_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr179_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr167_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr164_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v33
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
@@ -159741,142 +160250,143 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB90_2
; GFX11-TRUE16-NEXT: ; %bb.1: ; %cmp.false
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v176, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v180, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v182, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v40, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v45, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v47, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v57, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v60, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v63, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v74, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v76, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v78, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v88, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v93, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v95, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v104, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v167, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v183, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v43, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v42, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v56, 8, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v58, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v59, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v72, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v76, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v75, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v88, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v91, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v90, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v94, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v106, 8, v3
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v105, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v106, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v111, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v167, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v179, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v181, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v183, 24, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v42, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v46, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v56, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v58, 8, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v61, 8, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v72, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v75, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v77, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v79, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v90, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v94, 8, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v131.h, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v128.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v129.h, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v135.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v166.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v150.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v151.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v43.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v177.h, v8.l
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v107, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v108, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v177, 8, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v182, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v181, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v40, 8, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v45, 8, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v46, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v60, 8, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v63, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v62, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v74, 8, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v79, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v146.h, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v133.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v132.h, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v164.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v148.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v144.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v180.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v165.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v161.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v47.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v179.h, v8.l
; GFX11-TRUE16-NEXT: v_mov_b16_e64 v178.h, v8.h
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v73.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v41.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v44.h, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v92.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v59.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v62.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v108.h, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v91.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v89.h, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v110.h, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v107.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v109.h, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v82.h, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v85.h, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v83.h, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v84.h, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v96.h, v21.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v86.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v87.h, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v99.h, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v97.h, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v98.h, v24.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v102.h, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v100.h, v26.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v101.h, v26.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v113.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v103.h, v28.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v112.h, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v116.h, v29.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v114.h, v30.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v115.h, v30.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v119.h, v31.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v117.h, v32.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v118.h, v32.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v44.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v41.h, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v89.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v61.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v57.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v104.h, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v78.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v77.h, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v95.h, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v93.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v92.h, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v71.h, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v70.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v84.h, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.h, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v86.h, v21.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v83.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v82.h, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v97.h, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v87.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v85.h, v24.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v101.h, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v98.h, v26.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v96.h, v26.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v112.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v100.h, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v99.h, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v113.h, v29.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v103.h, v30.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v102.h, v30.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v116.h, v31.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v115.h, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v114.h, v32.h
; GFX11-TRUE16-NEXT: .LBB90_2: ; %Flow
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB90_4
; GFX11-TRUE16-NEXT: ; %bb.3: ; %cmp.true
; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff0000, v18
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v18, 16, v18
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v20
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v18, 0x40c00000, v18
; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v18, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v18
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v18, v18
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add3_u32 v37, v37, v18, 0x7fff
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v80, v37, v39 :: v_dual_add_f32 v33, 0x40c00000, v33
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v70, v37, v39 :: v_dual_add_f32 v33, 0x40c00000, v33
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfe_u32 v36, v33, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
; GFX11-TRUE16-NEXT: v_add3_u32 v36, v36, v33, 0x7fff
; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff0000, v17
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v80.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v70.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v81, v36, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v55, v36, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_dual_add_f32 v34, 0x40c00000, v34 :: v_dual_lshlrev_b32 v17, 16, v17
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v17, 0x40c00000, v17
@@ -159889,500 +160399,498 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_add3_u32 v18, v48, v34, 0x7fff
; GFX11-TRUE16-NEXT: v_add3_u32 v37, v50, v17, 0x7fff
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v82, v37, v51 :: v_dual_and_b32 v35, 0xffff0000, v20
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_lshlrev_b32 v20, 16, v20
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v71, v37, v51 :: v_dual_lshlrev_b32 v20, 16, v20
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_add_f32 v20, 0x40c00000, v20
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v34, v34
; GFX11-TRUE16-NEXT: v_and_b32_e32 v51, 0xffff0000, v11
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v82.h
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v20, 0x40c00000, v20
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v71.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v36, v35, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v20
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v17, v18, v49, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfi_b32 v18, 0xffff, v33, v81
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v35
+; GFX11-TRUE16-NEXT: v_bfi_b32 v18, 0xffff, v33, v55
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v20, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v20
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v20, v20
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v35
; GFX11-TRUE16-NEXT: v_bfi_b32 v17, 0xffff, v34, v17
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v36, v35, 0x7fff
-; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v20, 0x7fff
; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v19
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v11, 16, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v79, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v90, 8, v18
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v83, v33, v37, vcc_lo
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_lshlrev_b32 v19, 16, v19
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v19, 16, v19
+; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v20, 0x7fff
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v62, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v74, 8, v18
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_add_f32 v19, 0x40c00000, v19
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v81, v33, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v22
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v19, 0x40c00000, v19 :: v_dual_lshlrev_b32 v22, 16, v22
-; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v84, v34, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v22, 16, v22
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v19, 16, 1
+; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v36, 16, 1
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v80, v34, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v19
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v22, 0x40c00000, v22
+; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v19, 0x7fff
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v19, v19
; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v94, 8, v17
-; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v19, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v22, 0x40c00000, v22 :: v_dual_cndmask_b32 v85, v33, v37
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v81.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v79, 8, v17
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v84, v33, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v22, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v85.h
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v35, 0x40c00000, v35
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v84.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v20, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v35, 16, 1
; GFX11-TRUE16-NEXT: v_add3_u32 v20, v33, v22, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v33, 0x400000, v22
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v22, v22
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v86, v20, v33 :: v_dual_add_f32 v35, 0x40c00000, v35
-; GFX11-TRUE16-NEXT: v_bfi_b32 v20, 0xffff, v34, v84
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v86.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v35, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v35
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v72, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v75, 8, v20
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v19, v35, 0x7fff
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v21
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v21, 16, v21
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v87, v19, v39, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v83, v20, v33, vcc_lo
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
+; GFX11-TRUE16-NEXT: v_bfi_b32 v20, 0xffff, v34, v80
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v82, v19, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_bfi_b32 v19, 0xffff, v37, v36
; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v24
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v21
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v33, 0x40c00000, v38 :: v_dual_lshlrev_b32 v24, 16, v24
-; GFX11-TRUE16-NEXT: v_bfi_b32 v22, 0xffff, v22, v87
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v24, 16, v24
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v83.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v46, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v60, 8, v20
; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
-; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v21, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v24, 0x40c00000, v24
+; GFX11-TRUE16-NEXT: v_bfi_b32 v22, 0xffff, v22, v82
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v63, 8, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v36
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v21
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v21, 16, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v181, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v40, 8, v22
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v38
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v21
; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v33, 16, 1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v21, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v21
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v24, 0x40c00000, v24
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v21, 0x7fff
-; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v33, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v56, 24, v22
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v96, v34, v37 :: v_dual_and_b32 v37, 0xffff0000, v23
+; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v33, 0x7fff
+; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v21, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v86, v34, v37 :: v_dual_and_b32 v37, 0xffff0000, v23
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v23, 16, v23
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v24, 16, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v23, 16, v23
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v58, 8, v22
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v21, v35, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v24
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v24, 0x7fff
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v23, 0x40c00000, v23
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v24
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v24, v24
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v96.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v77, 8, v19
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v97, v34, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v37, 0x40c00000, v37
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v86.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v87, v34, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v26
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v26, 16, v26
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v23, 0x40c00000, v23 :: v_dual_lshlrev_b32 v26, 16, v26
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT: v_bfi_b32 v21, 0xffff, v35, v21
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v26, 0x40c00000, v26
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v23, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v98, v33, v39, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfe_u32 v24, v37, 16, 1
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v85, v33, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_or_b32_e32 v36, 0x400000, v23
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v23, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v45, 8, v21
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v23, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v97, v34, v36, vcc_lo
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v37, 0x40c00000, v37 :: v_dual_add_f32 v34, 0x40c00000, v38
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v97.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v24, v37, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v37
-; GFX11-TRUE16-NEXT: v_add3_u32 v24, v24, v37, 0x7fff
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v26, 0x40c00000, v26
-; GFX11-TRUE16-NEXT: v_bfi_b32 v21, 0xffff, v35, v21
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v99, v34, v36, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v34, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v61, 8, v21
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v99.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v23, v24, v39, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v34, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
-; GFX11-TRUE16-NEXT: v_bfi_b32 v23, 0xffff, v36, v23
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v25
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v97.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v25, 16, v25
+; GFX11-TRUE16-NEXT: v_add3_u32 v24, v24, v37, 0x7fff
+; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v34, 16, 1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v23, v24, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v37, v34, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v26
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v26, v26
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_bfi_b32 v23, 0xffff, v36, v23
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v25
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v87.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v182, 8, v23
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
-; GFX11-TRUE16-NEXT: v_bfi_b32 v24, 0xffff, v33, v98
+; GFX11-TRUE16-NEXT: v_bfi_b32 v24, 0xffff, v33, v85
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v26, 16, 1
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v25, 0x40c00000, v25
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v26, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v46, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v183, 24, v24
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 24, v24
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v26, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v26, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v42, 8, v24
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v100, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v177, 8, v24
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v98, v33, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v34, v34
-; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v25, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v25, 16, v25
; GFX11-TRUE16-NEXT: v_add3_u32 v26, v26, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v101, v35, v38, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v25, 0x7fff
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v25, v25
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v96, v35, v38 :: v_dual_add_f32 v25, 0x40c00000, v25
; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v28
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v28, 16, v28
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v100.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v102, v33, v37, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_lshlrev_b32 v28, 16, v28
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v102.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v28, 0x40c00000, v28
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v98.h
+; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v25, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v25
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_add_f32 v28, 0x40c00000, v28
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v25, v25
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v25, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v25, v35, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v26, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v27
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v27, 16, v27
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v35
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v101, v33, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v28, 16, 1
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v28, v28
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
; GFX11-TRUE16-NEXT: v_add3_u32 v25, v25, v35, 0x7fff
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v35
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v27, 0x40c00000, v27
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v26, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v26, v33, v28, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v33, 0x400000, v28
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v103, v26, v33, vcc_lo
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v27
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v28, v28
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v27, 16, v27
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v101.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v100, v26, v33, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX11-TRUE16-NEXT: v_bfi_b32 v26, 0xffff, v34, v101
-; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v27, 16, 1
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v103.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v112, v25, v39, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v27, 0x40c00000, v27
+; GFX11-TRUE16-NEXT: v_bfi_b32 v26, 0xffff, v34, v96
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v100.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v99, v25, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_bfi_b32 v25, 0xffff, v37, v36
-; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v27, 0x7fff
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v27
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v27, v27
; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v30
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v38
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v30, 16, v30
+; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v27, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v27
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v33, 16, 1
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v27, v27
+; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v27, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
-; GFX11-TRUE16-NEXT: v_bfi_b32 v28, 0xffff, v28, v112
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v113, v34, v37, vcc_lo
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff0000, v29
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v30, 16, v30
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v30, 0x40c00000, v30
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v33, 0x7fff
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v36
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v112, v34, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_add_f32 v37, 0x40c00000, v37
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v30, 0x40c00000, v30 :: v_dual_lshlrev_b32 v29, 16, v29
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v27, v35, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v36
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_bfi_b32 v28, 0xffff, v28, v99
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v26
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v27, v35, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v36, 0x7fff
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff0000, v29
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v29, 16, v29
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v30, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v30
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v30, v30
-; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v29, 0x40c00000, v29
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v112.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v28
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v30, 0x7fff
-; GFX11-TRUE16-NEXT: v_bfe_u32 v30, v37, 16, 1
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v113.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v28
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v114, v34, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v25
+; GFX11-TRUE16-NEXT: v_bfi_b32 v27, 0xffff, v35, v27
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(1)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v32
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v103, v34, v38 :: v_dual_and_b32 v38, 0xffff0000, v32
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v29, 0x40c00000, v29
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v32, 16, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v27
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v29, 16, 1
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v102, v33, v39 :: v_dual_add_f32 v37, 0x40c00000, v37
; GFX11-TRUE16-NEXT: v_or_b32_e32 v36, 0x400000, v29
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v115, v33, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v29, v29
-; GFX11-TRUE16-NEXT: v_add3_u32 v30, v30, v37, 0x7fff
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v29, 0x7fff
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v103.h
+; GFX11-TRUE16-NEXT: v_bfe_u32 v30, v37, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v37
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.l, v114.h
-; GFX11-TRUE16-NEXT: v_bfi_b32 v27, 0xffff, v35, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v167, 24, v26
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v116, v34, v36, vcc_lo
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v32, 0x40c00000, v32
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v113, v34, v36, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
+; GFX11-TRUE16-NEXT: v_add3_u32 v30, v30, v37, 0x7fff
; GFX11-TRUE16-NEXT: v_add_f32_e32 v34, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v179, 8, v26
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v116.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v113.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v29, v30, v39, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v34, 16, 1
-; GFX11-TRUE16-NEXT: v_bfi_b32 v30, 0xffff, v33, v115
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v181, 8, v25
+; GFX11-TRUE16-NEXT: v_bfi_b32 v30, 0xffff, v33, v102
+; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v32, 16, 1
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v32, v32
; GFX11-TRUE16-NEXT: v_bfi_b32 v29, 0xffff, v36, v29
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v31
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v32, 0x40c00000, v32 :: v_dual_lshlrev_b32 v31, 16, v31
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v37, v34, 0x7fff
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v30
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v31, 0x40c00000, v31
-; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v32, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v32
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v32, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v29
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v32, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v117, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v31
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v30
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v115, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_lshlrev_b32 v31, 16, v31
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v34, v34
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v29
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v115.h
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v31, 0x40c00000, v31
+; GFX11-TRUE16-NEXT: v_bfe_u32 v32, v36, 16, 1
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v114, v35, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v31, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v31
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v117.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_4) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v31, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v118, v35, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v2
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v31, v31
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v35, 0x40c00000, v35
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_bfe_u32 v32, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v119, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_add3_u32 v32, v32, v36, 0x7fff
+; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v31, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v116, v33, v37 :: v_dual_and_b32 v35, 0xffff0000, v2
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_lshlrev_b32 v2, 16, v2
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v116.h
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v2, 0x40c00000, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v31, v35, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v35
-; GFX11-TRUE16-NEXT: v_add3_u32 v32, v32, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v119.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add3_u32 v31, v31, v35, 0x7fff
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v2, 16, v2
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v32, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v2, 0x40c00000, v2 :: v_dual_lshlrev_b32 v1, 16, v1
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v1, 0x40c00000, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, 16, v1
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v2, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v2, v2
+; GFX11-TRUE16-NEXT: v_add3_u32 v31, v31, v35, 0x7fff
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v35
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v1, 0x40c00000, v1
; GFX11-TRUE16-NEXT: v_add3_u32 v32, v33, v2, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v33, 0x400000, v2
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v128, v32, v33, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v133, v32, v33, vcc_lo
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v38
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX11-TRUE16-NEXT: v_bfi_b32 v32, 0xffff, v34, v118
+; GFX11-TRUE16-NEXT: v_bfi_b32 v32, 0xffff, v34, v114
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v1, 16, 1
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v128.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v129, v31, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v132, v31, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_bfi_b32 v31, 0xffff, v37, v36
-; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v1, 0x7fff
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v1
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v1, v1
; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, 16, v4
; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v33, 16, 1
+; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v1, 0x7fff
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v1
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v1, v1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v131, v34, v37, vcc_lo
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v4, 0x40c00000, v4
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v33, 0x7fff
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff0000, v3
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_lshlrev_b32 v3, 16, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v2.l, v133.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 24, v32
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v146, v34, v37 :: v_dual_and_b32 v37, 0xffff0000, v3
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v4, 0x40c00000, v4
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, 16, v3
+; GFX11-TRUE16-NEXT: v_bfi_b32 v2, 0xffff, v2, v132
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v37, 0x40c00000, v37
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v4, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v1, v35, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v1, v35, v38 :: v_dual_add_f32 v36, 0x40c00000, v36
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v4
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v4, v4
-; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v36, 16, 1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v4, 0x7fff
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v3, 0x40c00000, v3
+; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v36, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v35.l, v131.h
-; GFX11-TRUE16-NEXT: v_bfi_b32 v2, 0xffff, v2, v129
+; GFX11-TRUE16-NEXT: v_bfe_u32 v4, v37, 16, 1
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v148, v34, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v3, 16, 1
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v133, v34, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v6
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v3, 0x40c00000, v3
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v6, 16, v6
-; GFX11-TRUE16-NEXT: v_bfi_b32 v1, 0xffff, v35, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v105, 24, v2
-; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v3, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v135, v33, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_or_b32_e32 v36, 0x400000, v3
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v3, v3
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v6, 0x40c00000, v6
+; GFX11-TRUE16-NEXT: v_add3_u32 v4, v4, v37, 0x7fff
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v3, 0x7fff
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v106, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v111, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v32
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v146, v34, v36 :: v_dual_add_f32 v37, 0x40c00000, v37
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v31
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v36.l, v146.h
-; GFX11-TRUE16-NEXT: v_bfe_u32 v4, v37, 16, 1
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v6
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v144, v33, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v3, v3
; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v37
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v33.l, v148.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v35.l, v146.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v105, 24, v2
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v164, v34, v36, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add3_u32 v4, v4, v37, 0x7fff
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v3, v4, v39 :: v_dual_add_f32 v34, 0x40c00000, v38
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v34, 0x40c00000, v38
+; GFX11-TRUE16-NEXT: v_bfi_b32 v1, 0xffff, v35, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v107, 8, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v36.l, v164.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v3, v4, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v34, 16, 1
; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff0000, v7
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v7, 16, v7
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
; GFX11-TRUE16-NEXT: v_bfi_b32 v3, 0xffff, v36, v3
; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff0000, v5
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v33.l, v133.h
-; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v34, 16, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, 16, v5
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v6, v6
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
-; GFX11-TRUE16-NEXT: v_bfi_b32 v4, 0xffff, v33, v135
-; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v6, 16, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v6, 16, v6
+; GFX11-TRUE16-NEXT: v_bfi_b32 v4, 0xffff, v33, v144
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v37, v34, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v36, 0x40c00000, v36 :: v_dual_add_f32 v5, 0x40c00000, v5
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v7, 0x40c00000, v7
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v90, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v94, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v106, 8, v3
+; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v6, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v6
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v5, 0x40c00000, v5
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v34
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v6, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v108, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v32
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v6, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v6, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v93, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v95, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v104, 8, v3
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v150, v33, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v31
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v165, v33, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v34, v34
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v5, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v5
; GFX11-TRUE16-NEXT: v_add3_u32 v6, v6, v36, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v151, v35, v38, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v34.l, v165.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v161, v35, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v33, v33, v5, 0x7fff
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v5, v5
; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff0000, v8
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v8, 16, v8
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v34.l, v150.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v166, v33, v37, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v35, 0x40c00000, v35 :: v_dual_lshlrev_b32 v8, 16, v8
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v180, v33, v37 :: v_dual_add_f32 v35, 0x40c00000, v35
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v8, 0x40c00000, v8
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v37.l, v166.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v8, 0x40c00000, v8
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v37.l, v180.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v5, v35, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v6, v38, vcc_lo
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v35
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v33, v8, 16, 1
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v6, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v8, v8
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v35
; GFX11-TRUE16-NEXT: v_add3_u32 v5, v5, v35, 0x7fff
; GFX11-TRUE16-NEXT: v_add3_u32 v6, v33, v8, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v33, 0x400000, v8
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v177, v6, v33, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v179, v6, v33, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v33, 0x40c00000, v39
-; GFX11-TRUE16-NEXT: v_bfi_b32 v6, 0xffff, v34, v151
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v178, v5, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v7, 16, 1
+; GFX11-TRUE16-NEXT: v_bfi_b32 v6, 0xffff, v34, v161
+; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v179.h
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v178, v5, v38 :: v_dual_add_f32 v33, 0x40c00000, v39
; GFX11-TRUE16-NEXT: v_bfi_b32 v5, 0xffff, v37, v36
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v7, 0x40c00000, v7 :: v_dual_lshlrev_b32 v36, 16, v10
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff0000, v9
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v36, 16, v10
+; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v7, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v34, v33, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
-; GFX11-TRUE16-NEXT: v_mov_b16_e64 v8.l, v177.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
-; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v7, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v7
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v36, 0x40c00000, v36
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v33
; GFX11-TRUE16-NEXT: v_add3_u32 v34, v34, v33, 0x7fff
; GFX11-TRUE16-NEXT: v_bfi_b32 v8, 0xffff, v8, v178
-; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v7, 0x7fff
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff0000, v10
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v47, v35, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_bfe_u32 v7, v36, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v76, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v60, 24, v8
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v43, v35, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v33, v33
-; GFX11-TRUE16-NEXT: v_add3_u32 v7, v7, v36, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v36
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v63, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v78, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v59, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v72, 8, v8
+; GFX11-TRUE16-NEXT: v_add3_u32 v7, v7, v36, 0x7fff
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v33, v34, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v36, v36
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v47.h
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v75, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v88, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v91, 8, v5
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v44, v7, v37, vcc_lo
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v7, 16, v9
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v9, 0x40c00000, v39
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v7
+; GFX11-TRUE16-NEXT: v_bfe_u32 v36, v9, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v9
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v49, 0x400000, v7
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff0000, v10
+; GFX11-TRUE16-NEXT: v_add3_u32 v36, v36, v9, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v10, 0x40c00000, v10
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v43.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v88, 8, v5
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v41, v7, v37, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v10, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v10
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v10, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v10, 0x7fff
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v10, 16, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v44, v35, v38 :: v_dual_and_b32 v39, 0xffff0000, v9
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v41, v35, v38 :: v_dual_lshlrev_b32 v10, 16, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v37, 0x40c00000, v10
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v41.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v44.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v38, v37, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v50, 0x400000, v37
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
-; GFX11-TRUE16-NEXT: v_bfi_b32 v10, 0xffff, v35, v44
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_bfi_b32 v10, 0xffff, v35, v41
+; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v7, 16, 1
; GFX11-TRUE16-NEXT: v_add3_u32 v38, v38, v37, 0x7fff
; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff0000, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v45, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v47, 8, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v59, v38, v50, vcc_lo
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v12, 0x40c00000, v12 :: v_dual_lshlrev_b32 v7, 16, v9
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v37, 0x40c00000, v51
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v42, 24, v10
+; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v7, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v61, v38, v50 :: v_dual_add_f32 v12, 0x40c00000, v12
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v14
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v37, 0x40c00000, v51 :: v_dual_lshlrev_b32 v14, 16, v14
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v7
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v14, 16, v14
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v61.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v48, v12, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v52, 0x400000, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v14, 0x40c00000, v14
-; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v7, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v49, 0x400000, v7
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
-; GFX11-TRUE16-NEXT: v_add3_u32 v48, v48, v12, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_add3_u32 v35, v35, v7, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v59.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v73, v35, v49, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v12, v12
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v9, 0x40c00000, v39
; GFX11-TRUE16-NEXT: v_bfe_u32 v35, v37, 16, 1
-; GFX11-TRUE16-NEXT: v_bfe_u32 v49, v14, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v62, v48, v52, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_bfe_u32 v36, v9, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v39, 0x400000, v9
+; GFX11-TRUE16-NEXT: v_add3_u32 v48, v48, v12, 0x7fff
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v14, 0x40c00000, v14 :: v_dual_lshlrev_b32 v11, 16, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v56, 8, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v57, v48, v52, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v9, v9
-; GFX11-TRUE16-NEXT: v_bfi_b32 v12, 0xffff, v7, v62
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add3_u32 v36, v36, v9, 0x7fff
+; GFX11-TRUE16-NEXT: v_bfe_u32 v49, v14, 16, 1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_bfi_b32 v12, 0xffff, v7, v57
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v9, v36, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v11
; GFX11-TRUE16-NEXT: v_add3_u32 v11, v35, v37, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v35, 0x400000, v37
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v180, 24, v12
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v9, v36, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v73.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v39, v7, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v182, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v167, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v183, 8, v12
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v11, v11, v35, vcc_lo
; GFX11-TRUE16-NEXT: v_add_f32_e32 v35, 0x40c00000, v38
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v37, v39, v7, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v7
; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff0000, v13
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
; GFX11-TRUE16-NEXT: v_bfe_u32 v48, v35, 16, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v13, 16, v13
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v7, 0x40c00000, v39 :: v_dual_cndmask_b32 v92, v37, v38
+; GFX11-TRUE16-NEXT: v_bfi_b32 v9, 0xffff, v36, v9
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v39
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v89, v37, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v37, v48, v35, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v35
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v35, v35
@@ -160390,18 +160898,18 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_or_b32_e32 v48, 0x400000, v14
; GFX11-TRUE16-NEXT: v_bfe_u32 v49, v7, 16, 1
; GFX11-TRUE16-NEXT: v_add_f32_e32 v13, 0x40c00000, v13
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v89, v37, v38, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v77, v37, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v14, v14
; GFX11-TRUE16-NEXT: v_or_b32_e32 v35, 0x400000, v7
; GFX11-TRUE16-NEXT: v_add3_u32 v14, v49, v7, 0x7fff
; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff0000, v16
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v16, 16, v16
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v91, v39, v48, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v78, v39, v48, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
; GFX11-TRUE16-NEXT: v_bfe_u32 v39, v13, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
; GFX11-TRUE16-NEXT: v_add_f32_e32 v16, 0x40c00000, v16
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v73.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v78.h
; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v7, v14, v35 :: v_dual_add_f32 v14, 0x40c00000, v37
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v37, 16, v15
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v39, v13, 0x7fff
@@ -160411,7 +160919,7 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_add_f32_e32 v37, 0x40c00000, v37
; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff0000, v15
; GFX11-TRUE16-NEXT: v_or_b32_e32 v49, 0x400000, v16
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v108, v35, v39, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v104, v35, v39, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v13, v13, v16, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v39, v37, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v16, v16
@@ -160419,366 +160927,405 @@ define <128 x i8> @bitcast_v64bf16_to_v128i8(<64 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_add_f32_e32 v15, 0x40c00000, v15
; GFX11-TRUE16-NEXT: v_or_b32_e32 v51, 0x400000, v37
; GFX11-TRUE16-NEXT: v_add3_u32 v39, v39, v37, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v107, v13, v49, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v93, v13, v49, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v37, v37
; GFX11-TRUE16-NEXT: v_add3_u32 v35, v48, v14, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v48, 0x400000, v14
; GFX11-TRUE16-NEXT: v_bfe_u32 v50, v15, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, 0x400000, v15
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v110, v39, v51, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v95, v39, v51, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v14, v14
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v108.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v104.h
; GFX11-TRUE16-NEXT: v_add3_u32 v13, v50, v15, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v91.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v92.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v109, v35, v48, vcc_lo
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v89.h
+; GFX11-TRUE16-NEXT: v_bfi_b32 v14, 0xffff, v38, v77
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v92, v35, v48, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v15, v15
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v110.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v107.h
-; GFX11-TRUE16-NEXT: v_bfi_b32 v14, 0xffff, v38, v89
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v95.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v93.h
; GFX11-TRUE16-NEXT: v_bfi_b32 v11, 0xffff, v39, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v14
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v13, v13, v16, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfi_b32 v9, 0xffff, v36, v9
-; GFX11-TRUE16-NEXT: v_bfi_b32 v16, 0xffff, v35, v109
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v14
+; GFX11-TRUE16-NEXT: v_bfi_b32 v16, 0xffff, v35, v92
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
; GFX11-TRUE16-NEXT: v_bfi_b32 v15, 0xffff, v15, v13
; GFX11-TRUE16-NEXT: v_bfi_b32 v13, 0xffff, v37, v7
; GFX11-TRUE16-NEXT: v_bfi_b32 v7, 0xffff, v34, v33
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v176, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v40, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v57, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v74, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v43, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v58, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v76, 8, v7
; GFX11-TRUE16-NEXT: .LBB90_4: ; %end
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v131.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v111.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v146.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v108.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v129.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v1.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v128.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v133.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v107.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v2.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v68.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v132.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v2.h, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v106.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v105.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v69.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v2.l, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v3.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v146.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v104.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v95.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v2
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v135.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v3.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v133.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v93.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v68.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v5.l, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v166.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v88.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v5.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v6, v4
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v78.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v150.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v151.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v76.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v66, v6, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v43.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v7.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v74.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v164.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v1.h, v2.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v105.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v2.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v94.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v91.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v148.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v8, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v180.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v90.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v144.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v14
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v4.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v165.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v88.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v8, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v5.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v47.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v76.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v58.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v14
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v75.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v161.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v179.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v72.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v6.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v7.l
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v67.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v62.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v67, v6, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v177.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v7.l, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v63.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v178.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v60.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v180.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v68, v6, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v73.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v7.l, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v57.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v69, v6, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v41.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v47.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v44.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v45.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v6, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v92.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v40.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v6, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v59.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v182.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v6, v9
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v89.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v108.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v176.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.l, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v6, v10
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v163.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v91.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v165.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v6, v11
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v110.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v161.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v38.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v109.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v6, v12
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v147.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.l, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v107.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.l, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v6, v13
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v82.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v94.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v65.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v6, v14
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v79.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.l, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v80.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v90.l
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.l, v16.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v6, v15
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v85.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v77.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v84.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v6, v16
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v72.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v83.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v75.l
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v6, v17
-; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v96.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v61.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v87.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v6, v18
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v56.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.l, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v86.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v58.l
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.l, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v6, v19
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v99.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v46.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v98.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v6, v20
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v183.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v97.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v42.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.l, v22.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v6, v21
-; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v102.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v181.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v101.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v6, v22
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v167.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.l, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v100.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v179.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.l, v24.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v6, v23
-; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v113.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v164.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v112.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v6, v24
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v160.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.l, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v103.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v162.l
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.l, v26.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v6, v25
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v116.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v148.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v34.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v115.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v6, v26
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.l, v27.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v114.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v6, v27
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v119.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v118.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v6, v28
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v130.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v6.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v117.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v8.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v73.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v7.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v178.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v59.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v14
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v56.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v8.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v44.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v43.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v89.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v41.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v42.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v16, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v10.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v61.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v183.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v16, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v11.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v104.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v176.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v166.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v16, v14
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v167.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v57.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v78.h
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
-; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[66:69], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v6, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v30.l, v30.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v6.h
+; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v12.l, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v12.h, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v16, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v160.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v77.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v95.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v150.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v64, v18, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v93.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v149.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v65, v18, v14
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v79.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v92.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v134.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v13.h, v15.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v66, v18, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v70.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v74.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v67, v18, v14
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v46.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v15.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v84.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v63.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v16, v14
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v55.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v62.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v81.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v60.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v50.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v13.h, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v20, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v86.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v45.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v19, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v18.h
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v13.h, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v83.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v40.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v19, v14
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v38.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v97.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v182.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v82.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v181.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v13.h, v19.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v22, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v24
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v87.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v177.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v22, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v13.h, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v21.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v101.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v163.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v22, v14
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v85.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v162.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v98.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v151.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v26, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.h, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v26, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v28
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v112.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v96.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v145.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v25, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v26
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v100.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v135.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v24.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v25, v14
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v113.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v131.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v26
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v99.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.l, 8, v130.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v13.h, v25.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v28, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v30
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v103.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v26.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v129.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v28, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v30
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v13.h, v27.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v116.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v128.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v27.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v28, v14
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v102.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.l, 8, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v13.h
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v32, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v115.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v118.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v28.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v13.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v13.h, v28.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v32, v14
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v31, 0xffff, v34
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v114.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v30, 0xffff, v30
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v31, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.l, v13.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v6, v5
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v30, v14
; GFX11-TRUE16-NEXT: s_clause 0x5
-; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[7:10], off offset:32
-; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[11:14], off offset:48
+; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[64:67], off offset:48
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[15:18], off offset:64
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[19:22], off offset:80
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[23:26], off offset:96
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[27:30], off offset:112
; GFX11-TRUE16-NEXT: s_clause 0x1f
-; GFX11-TRUE16-NEXT: scratch_load_b32 v111, off, s32 offset:12
-; GFX11-TRUE16-NEXT: scratch_load_b32 v110, off, s32 offset:16
-; GFX11-TRUE16-NEXT: scratch_load_b32 v109, off, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_load_b32 v108, off, s32 offset:24
-; GFX11-TRUE16-NEXT: scratch_load_b32 v107, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_b32 v106, off, s32 offset:32
-; GFX11-TRUE16-NEXT: scratch_load_b32 v105, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_b32 v104, off, s32 offset:40
-; GFX11-TRUE16-NEXT: scratch_load_b32 v95, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_b32 v94, off, s32 offset:48
-; GFX11-TRUE16-NEXT: scratch_load_b32 v93, off, s32 offset:52
-; GFX11-TRUE16-NEXT: scratch_load_b32 v92, off, s32 offset:56
-; GFX11-TRUE16-NEXT: scratch_load_b32 v91, off, s32 offset:60
-; GFX11-TRUE16-NEXT: scratch_load_b32 v90, off, s32 offset:64
-; GFX11-TRUE16-NEXT: scratch_load_b32 v89, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_b32 v88, off, s32 offset:72
-; GFX11-TRUE16-NEXT: scratch_load_b32 v79, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_b32 v78, off, s32 offset:80
-; GFX11-TRUE16-NEXT: scratch_load_b32 v77, off, s32 offset:84
-; GFX11-TRUE16-NEXT: scratch_load_b32 v76, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_b32 v75, off, s32 offset:92
-; GFX11-TRUE16-NEXT: scratch_load_b32 v74, off, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_load_b32 v73, off, s32 offset:100
-; GFX11-TRUE16-NEXT: scratch_load_b32 v72, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_b32 v63, off, s32 offset:108
-; GFX11-TRUE16-NEXT: scratch_load_b32 v62, off, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_load_b32 v61, off, s32 offset:116
-; GFX11-TRUE16-NEXT: scratch_load_b32 v60, off, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_load_b32 v59, off, s32 offset:124
-; GFX11-TRUE16-NEXT: scratch_load_b32 v58, off, s32 offset:128
-; GFX11-TRUE16-NEXT: scratch_load_b32 v57, off, s32 offset:132
-; GFX11-TRUE16-NEXT: scratch_load_b32 v56, off, s32 offset:136
-; GFX11-TRUE16-NEXT: s_clause 0x7
-; GFX11-TRUE16-NEXT: scratch_load_b32 v47, off, s32 offset:140
-; GFX11-TRUE16-NEXT: scratch_load_b32 v46, off, s32 offset:144
-; GFX11-TRUE16-NEXT: scratch_load_b32 v45, off, s32 offset:148
-; GFX11-TRUE16-NEXT: scratch_load_b32 v44, off, s32 offset:152
-; GFX11-TRUE16-NEXT: scratch_load_b32 v43, off, s32 offset:156
-; GFX11-TRUE16-NEXT: scratch_load_b32 v42, off, s32 offset:160
-; GFX11-TRUE16-NEXT: scratch_load_b32 v41, off, s32 offset:164
-; GFX11-TRUE16-NEXT: scratch_load_b32 v40, off, s32 offset:168
+; GFX11-TRUE16-NEXT: scratch_load_b32 v108, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_b32 v107, off, s32 offset:16
+; GFX11-TRUE16-NEXT: scratch_load_b32 v106, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_b32 v105, off, s32 offset:24
+; GFX11-TRUE16-NEXT: scratch_load_b32 v104, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_b32 v95, off, s32 offset:32
+; GFX11-TRUE16-NEXT: scratch_load_b32 v94, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_b32 v93, off, s32 offset:40
+; GFX11-TRUE16-NEXT: scratch_load_b32 v92, off, s32 offset:44
+; GFX11-TRUE16-NEXT: scratch_load_b32 v91, off, s32 offset:48
+; GFX11-TRUE16-NEXT: scratch_load_b32 v90, off, s32 offset:52
+; GFX11-TRUE16-NEXT: scratch_load_b32 v89, off, s32 offset:56
+; GFX11-TRUE16-NEXT: scratch_load_b32 v88, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_b32 v79, off, s32 offset:64
+; GFX11-TRUE16-NEXT: scratch_load_b32 v78, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_b32 v77, off, s32 offset:72
+; GFX11-TRUE16-NEXT: scratch_load_b32 v76, off, s32 offset:76
+; GFX11-TRUE16-NEXT: scratch_load_b32 v75, off, s32 offset:80
+; GFX11-TRUE16-NEXT: scratch_load_b32 v74, off, s32 offset:84
+; GFX11-TRUE16-NEXT: scratch_load_b32 v73, off, s32 offset:88
+; GFX11-TRUE16-NEXT: scratch_load_b32 v72, off, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_load_b32 v63, off, s32 offset:96
+; GFX11-TRUE16-NEXT: scratch_load_b32 v62, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_b32 v61, off, s32 offset:104
+; GFX11-TRUE16-NEXT: scratch_load_b32 v60, off, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_load_b32 v59, off, s32 offset:112
+; GFX11-TRUE16-NEXT: scratch_load_b32 v58, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_b32 v57, off, s32 offset:120
+; GFX11-TRUE16-NEXT: scratch_load_b32 v56, off, s32 offset:124
+; GFX11-TRUE16-NEXT: scratch_load_b32 v47, off, s32 offset:128
+; GFX11-TRUE16-NEXT: scratch_load_b32 v46, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_b32 v45, off, s32 offset:136
+; GFX11-TRUE16-NEXT: s_clause 0x4
+; GFX11-TRUE16-NEXT: scratch_load_b32 v44, off, s32 offset:140
+; GFX11-TRUE16-NEXT: scratch_load_b32 v43, off, s32 offset:144
+; GFX11-TRUE16-NEXT: scratch_load_b32 v42, off, s32 offset:148
+; GFX11-TRUE16-NEXT: scratch_load_b32 v41, off, s32 offset:152
+; GFX11-TRUE16-NEXT: scratch_load_b32 v40, off, s32 offset:156
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -184755,69 +185302,69 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_b32 v33, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_b32 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: scratch_load_b32 v31, off, s32
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr166_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr165_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr164_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr163_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v33
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
@@ -184828,69 +185375,69 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; %bb.1: ; %cmp.false
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v17
; GFX11-TRUE16-NEXT: .LBB94_2: ; %Flow
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB94_4
@@ -184899,364 +185446,405 @@ define <128 x i8> @bitcast_v64f16_to_v128i8(<64 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_pk_add_f16 v32, 0x200, v32 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_pk_add_f16 v31, 0x200, v31 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v8, 0x200, v8 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v7, 0x200, v7 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v30, 0x200, v30 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v29, 0x200, v29 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v6, 0x200, v6 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v5, 0x200, v5 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v10, 0x200, v10 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v9, 0x200, v9 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v28, 0x200, v28 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v27, 0x200, v27 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v8, 0x200, v8 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v7, 0x200, v7 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v26, 0x200, v26 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v25, 0x200, v25 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v6, 0x200, v6 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v5, 0x200, v5 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v16, 0x200, v16 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v15, 0x200, v15 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v24, 0x200, v24 op_sel_hi:[0,1]
+; GFX11-TRUE16-NEXT: v_pk_add_f16 v23, 0x200, v23 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v4, 0x200, v4 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v3, 0x200, v3 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v10, 0x200, v10 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v12, 0x200, v12 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v14, 0x200, v14 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v13, 0x200, v13 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v11, 0x200, v11 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v9, 0x200, v9 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v18, 0x200, v18 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v17, 0x200, v17 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v20, 0x200, v20 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v19, 0x200, v19 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v22, 0x200, v22 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v21, 0x200, v21 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v24, 0x200, v24 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v23, 0x200, v23 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v26, 0x200, v26 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v25, 0x200, v25 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v2, 0x200, v2 op_sel_hi:[0,1]
; GFX11-TRUE16-NEXT: v_pk_add_f16 v1, 0x200, v1 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v16, 0x200, v16 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_pk_add_f16 v15, 0x200, v15 op_sel_hi:[0,1]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v17
; GFX11-TRUE16-NEXT: .LBB94_4: ; %end
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v166.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v163.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v165.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v164.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v163.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v69.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v1.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v68.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v68.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v68, 0xffff, v68
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v54, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v68, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v67.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v67
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v4.l, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v67.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v54, v51
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v67
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v66
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v66
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v7.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v131.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v118.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v116.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v51
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v97.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v49, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v49, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v84.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v48, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v145.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v134.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v18.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v19.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v128.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v83.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v21.l, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v114.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v22.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v113.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v81.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v24.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v51
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v25.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v86.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v27.l, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v28.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v81.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v30.l, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v51
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v31.l, v31.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v55.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v32.l, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v51
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
@@ -207467,69 +208055,69 @@ define <128 x i8> @bitcast_v64i16_to_v128i8(<64 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_b32 v33, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_b32 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: scratch_load_b32 v31, off, s32
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr166_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr165_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr164_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr163_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr162_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr161_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr151_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr149_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr160_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr150_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr148_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr146_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr147_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr144_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr132_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr131_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr130_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr118_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr119_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr116_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr145_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr134_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr135_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr133_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr129_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr128_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr117_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr114_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr115_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(2)
; GFX11-TRUE16-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v33
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
@@ -207540,69 +208128,69 @@ define <128 x i8> @bitcast_v64i16_to_v128i8(<64 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; %bb.1: ; %cmp.false
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v17
; GFX11-TRUE16-NEXT: .LBB98_2: ; %Flow
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB98_4
@@ -207611,364 +208199,405 @@ define <128 x i8> @bitcast_v64i16_to_v128i8(<64 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_pk_add_u16 v32, v32, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_pk_add_u16 v31, v31, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v8, v8, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v7, v7, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v30, v30, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v29, v29, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v6, v6, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v5, v5, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v10, v10, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v9, v9, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v28, v28, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v27, v27, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v8, v8, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v7, v7, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v26, v26, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v25, v25, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v6, v6, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v5, v5, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v16, v16, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v15, v15, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v24, v24, 3 op_sel_hi:[1,0]
+; GFX11-TRUE16-NEXT: v_pk_add_u16 v23, v23, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v4, v4, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v3, v3, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v10, v10, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v12, v12, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v14, v14, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v13, v13, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v11, v11, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v9, v9, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v18, v18, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v17, v17, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v20, v20, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v19, v19, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v22, v22, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v21, v21, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v24, v24, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v23, v23, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v26, v26, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v25, v25, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v2, v2, 3 op_sel_hi:[1,0]
; GFX11-TRUE16-NEXT: v_pk_add_u16 v1, v1, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v16, v16, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_pk_add_u16 v15, v15, 3 op_sel_hi:[1,0]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[33:34], 24, v[31:32]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[5:6]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[13:14]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[11:12]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[64:65], 24, v[9:10]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[69:70], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[34:35], 24, v[29:30]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[35:36], 24, v[27:28]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[15:16]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[70:71], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[15:16]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[66:67], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[36:37], 24, v[25:26]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[48:49], 24, v[23:24]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[51:52], 24, v[21:22]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[54:55], 24, v[19:20]
-; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[65:66], 24, v[17:18]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[49:50], 24, v[13:14]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[52:53], 24, v[11:12]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[67:68], 24, v[3:4]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[37:38], 24, v[23:24]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[68:69], 24, v[1:2]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[38:39], 24, v[21:22]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[50:51], 24, v[19:20]
+; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[53:54], 24, v[17:18]
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 24, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 8, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v119, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 8, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v131, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 24, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 8, v6
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 24, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v164, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v165, 8, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v166, 8, v1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 24, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v32
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 8, v31
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v30
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v29
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v28
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 8, v27
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v26
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 8, v25
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v116, 24, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v118, 8, v24
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v23
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v130, 24, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 8, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 8, v21
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 24, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 8, v20
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v19
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 24, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 8, v18
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v17
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v132, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v144, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v147, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v146, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v148, 8, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v150, 8, v5
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v149, 24, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v151, 8, v4
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v161, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v160, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v162, 8, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v163, 8, v1
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 8, v32
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v31
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v30
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 8, v29
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v85, 8, v28
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v86, 8, v27
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 24, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v26
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v25
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v24
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v115, 8, v23
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v114, 24, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v117, 8, v22
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v128, 8, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v129, 24, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v133, 8, v20
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v135, 8, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v134, 24, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v145, 8, v18
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v17
; GFX11-TRUE16-NEXT: .LBB98_4: ; %end
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v166.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v70.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v163.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v165.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v162.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v34.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v164.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v39, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v2.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v163.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v69.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v39, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v3.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v162.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v161.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v39, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v4.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v39.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v161.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v1.h, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v68.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v160.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v151.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v68.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v39, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v5.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v149.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v68, 0xffff, v68
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v54, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v68, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v67.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v149.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v67
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v4.l, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v67.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v148.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v54, v51
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v39, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v6.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v145.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v5.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v147.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v5.h, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v67
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v39, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v7.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v133.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v39, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v8.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v131.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v64.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v39, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v9.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v146.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v144.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v66
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v66
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v7.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v131.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v8.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v130.l
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v39, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v10.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v39, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v11.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v113.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v39, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v12.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v103.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v54, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v119.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v65
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v118.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v52.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v39, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v13.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v99.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v39, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v14.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v97.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v10.l, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v116.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v54, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v64
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v11.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v11.h, v12.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v39, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v15.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v85.l
; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v17.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v39, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v16.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v65.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v18.h, 0xff, v18.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v16
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v17.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v17.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v160.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v150.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v102.l
; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_and_b16 v19.h, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v17
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v18.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v18.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v148.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v54.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v52, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v54
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v51
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v99.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v13.l, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v97.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v49, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v14.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v96.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v18
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v19.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v19.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v146.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v144.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v19
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v20.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v20.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v134.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v51.l
; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_and_b16 v22.h, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v20
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v21.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.h, v21.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v132.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v130.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v49, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v48, 0xffff, v52
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v84.l
+; GFX11-TRUE16-NEXT: v_or_b16 v17.l, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v39
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v53.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v48, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v17.h, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v145.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v39, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v17.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v135.l
+; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v17.h, v18.l
; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v23.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v39, v21
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v22.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v128.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v39, v22
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v23.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v23.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v118.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v116.l
; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v25.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v39, v23
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v24.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v114.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, v39, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v19.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v134.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v133.l
; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v39, v24
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v25.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v102.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v27.h, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v39, v25
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v26.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v26.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v100.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v18.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v19.l, 0xff, v19.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v50.l
+; GFX11-TRUE16-NEXT: v_or_b16 v20.l, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v129.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v18, v39, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v19.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_and_b16 v20.h, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v128.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, v39, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v39, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v20.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v117.l
+; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v20.h, v21.l
; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v39, v26
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v27.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.h, v27.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v96.l
; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v39, v27
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v28.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v28.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v86.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v34.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_and_b16 v30.h, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v39, v28
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v29.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v29.h, v29.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v84.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v83.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v20, v39, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v22.l, v22.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v21.l, 0xff, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v115.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v48
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v21.l, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v114.l
+; GFX11-TRUE16-NEXT: v_or_b16 v23.l, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v37.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_and_b16 v31.h, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v39, v29
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v30.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v30.h, v30.h, v34.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v82.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, v38, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v22.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v23.l, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_and_b16 v23.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v113.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, v38, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v39
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v23.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v23.h, v24.l
; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v32.h, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v39, v30
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v31.l, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v31.h, v31.h, v33.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.l, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v81.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v39, v31
-; GFX11-TRUE16-NEXT: v_or_b16 v39.l, v32.l, v33.l
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v32.h, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, v39.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v39, v32
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, v37, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v25.l, v25.l, v33.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v100.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v98.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v37, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v24.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v25.l, 0xff, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v26.l, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v38
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, v37, v51
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v87.l
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v25.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v26.l
+; GFX11-TRUE16-NEXT: v_and_b16 v26.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v26.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v86.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, v36, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v36, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v26.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v85.l
+; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v26.h, v27.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v26, v36, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.l, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v27.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v83.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v28.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v37
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v27.l, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v82.l
+; GFX11-TRUE16-NEXT: v_or_b16 v29.l, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v34.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v27, v35, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v35, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v28.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.l, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v29.l, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_and_b16 v29.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.l, 8, v81.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v28, v35, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v36
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v29.l, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_or_b16 v30.l, v29.h, v30.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v29, v34, v51
+; GFX11-TRUE16-NEXT: v_or_b16 v31.l, v31.l, v33.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v34, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v31.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v30.l, v30.h
+; GFX11-TRUE16-NEXT: v_and_b16 v31.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v31.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v32.l, v32.l, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v35
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v30, v34, v51
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v31.l, v31.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.l, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v32.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v32.h, 8, v55.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v31, v33, v51
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v33, 0xffff, v34
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v51.h, v32.l, v32.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v32, v33, v51
; GFX11-TRUE16-NEXT: s_clause 0x5
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[9:12], off offset:32
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[13:16], off offset:48
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
index 21ec3ee1996a6..3e96ab1d597d6 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
@@ -4118,19 +4118,19 @@ define <4 x i32> @bitcast_v16i8_to_v4i32(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -4144,95 +4144,103 @@ define <4 x i32> @bitcast_v16i8_to_v4i32(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB26_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB26_2
; GFX11-TRUE16-NEXT: .LBB26_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -8584,19 +8592,19 @@ define <4 x float> @bitcast_v16i8_to_v4f32(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -8610,95 +8618,103 @@ define <4 x float> @bitcast_v16i8_to_v4f32(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB50_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB50_2
; GFX11-TRUE16-NEXT: .LBB50_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -12666,19 +12682,19 @@ define <2 x i64> @bitcast_v16i8_to_v2i64(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -12692,95 +12708,103 @@ define <2 x i64> @bitcast_v16i8_to_v2i64(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB70_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB70_2
; GFX11-TRUE16-NEXT: .LBB70_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -16358,19 +16382,19 @@ define <2 x double> @bitcast_v16i8_to_v2f64(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -16384,95 +16408,103 @@ define <2 x double> @bitcast_v16i8_to_v2f64(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB86_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB86_2
; GFX11-TRUE16-NEXT: .LBB86_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -19779,19 +19811,19 @@ define <8 x i16> @bitcast_v16i8_to_v8i16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -19805,95 +19837,103 @@ define <8 x i16> @bitcast_v16i8_to_v8i16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB98_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB98_2
; GFX11-TRUE16-NEXT: .LBB98_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -22685,19 +22725,19 @@ define <8 x half> @bitcast_v16i8_to_v8f16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -22711,95 +22751,103 @@ define <8 x half> @bitcast_v16i8_to_v8f16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB106_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB106_2
; GFX11-TRUE16-NEXT: .LBB106_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -24896,19 +24944,19 @@ define <8 x bfloat> @bitcast_v16i8_to_v8bf16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v12.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v15.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v16
@@ -24922,95 +24970,103 @@ define <8 x bfloat> @bitcast_v16i8_to_v8bf16(<16 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB110_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v8.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v1.h, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v1.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v2.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v3.l, v4.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v5.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v11, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v2.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v11, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v3.l, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v11.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v11, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v9
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB110_2
; GFX11-TRUE16-NEXT: .LBB110_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v10.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v9.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v12.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v8.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v9.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v7.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v6.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v9, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v5.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v6.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v9, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v9.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.h, v2.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v9, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v4, v11
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll
index 38302a75fe26d..f8ffaa456c2b3 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll
@@ -6296,31 +6296,32 @@ define <8 x i32> @bitcast_v32i8_to_v8i32(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v19.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v23.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v22.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v31.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v32
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB26_3
@@ -6332,175 +6333,194 @@ define <8 x i32> @bitcast_v32i8_to_v8i32(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB26_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v0.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v1.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v2.l, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v3.l, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v4.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v0.h, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v1.h, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v2.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v11.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v12, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v4.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v5.l, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v9.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v6.l, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v7.l, v8.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v5.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v8.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v6.l, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v7.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v19
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB26_2
; GFX11-TRUE16-NEXT: .LBB26_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v21.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v19.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v20.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v17.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v16.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v18.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v15.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v19.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v18.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v17.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v15.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v17.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v13.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v14.h, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.l, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v15.h, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v16.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v12.h, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v13.l, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v11.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v11.h, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v14.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v13.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v12.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v12.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v18.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v11.h, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v10.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v9.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v9.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v10.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v21
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v11.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v8.h, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v9.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v9.h, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v8.h, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v21
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v21
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v8, v21
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -13315,31 +13335,32 @@ define <8 x float> @bitcast_v32i8_to_v8f32(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v19.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v23.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v22.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v31.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v32
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB50_3
@@ -13351,175 +13372,194 @@ define <8 x float> @bitcast_v32i8_to_v8f32(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB50_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v0.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v1.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v2.l, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v3.l, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v4.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v0.h, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v1.h, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v2.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v11.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v12, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v4.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v5.l, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v9.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v6.l, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v7.l, v8.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v5.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v8.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v6.l, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v7.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v19
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB50_2
; GFX11-TRUE16-NEXT: .LBB50_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v21.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v19.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v20.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v17.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v16.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v18.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v15.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v19.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v18.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v17.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v15.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v17.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v13.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v14.h, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.l, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v15.h, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v16.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v12.h, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v13.l, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v11.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v11.h, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v14.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v13.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v12.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v12.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v18.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v11.h, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v10.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v9.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v9.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v10.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v21
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v11.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v8.h, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v9.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v9.h, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v8.h, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v21
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v21
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v8, v21
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -19852,31 +19892,32 @@ define <4 x i64> @bitcast_v32i8_to_v4i64(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v19.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v23.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v22.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v31.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v32
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB70_3
@@ -19888,175 +19929,194 @@ define <4 x i64> @bitcast_v32i8_to_v4i64(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB70_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v0.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v1.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v2.l, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v3.l, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v4.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v0.h, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v1.h, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v2.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v11.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v12, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v4.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v5.l, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v9.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v6.l, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v7.l, v8.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v5.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v8.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v6.l, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v7.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v19
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB70_2
; GFX11-TRUE16-NEXT: .LBB70_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v21.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v19.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v20.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v17.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v16.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v18.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v15.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v19.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v18.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v17.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v15.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v17.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v13.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v14.h, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.l, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v15.h, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v16.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v12.h, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v13.l, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v11.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v11.h, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v14.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v13.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v12.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v12.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v18.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v11.h, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v10.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v9.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v9.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v10.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v21
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v11.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v8.h, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v9.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v9.h, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v8.h, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v21
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v21
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v8, v21
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -25879,31 +25939,32 @@ define <4 x double> @bitcast_v32i8_to_v4f64(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v19.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v12.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v23.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v22.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v21.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v31.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v32
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB86_3
@@ -25915,175 +25976,194 @@ define <4 x double> @bitcast_v32i8_to_v4f64(<32 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB86_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v19.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v0.l, v19.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v19.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v1.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v2.l, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v3.l, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v4.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v19.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v0.h, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v12.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v1.h, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v2.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v11.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v3.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v12, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v4.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v5.l, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v9.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v6.l, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v19.h, v7.l, v8.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v5.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v8.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v6.l, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v21.l, v7.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v19
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB86_2
; GFX11-TRUE16-NEXT: .LBB86_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v21.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v19.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v20.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v17.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v16.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v18.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v21.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v15.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v19.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v19.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v18.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v17.h, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v15.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v17.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v13.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v12.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v21, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v14.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v14.h, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.l, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v15.h, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v16.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v21, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v21.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v12.h, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v13.l, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v21, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v11.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v11.h, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v14.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v13.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v14.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v12.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v12.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v18.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v11.h, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v21
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v21, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v10.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v21, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v9.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v9.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v10.h, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v21
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v11.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v10.l, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v21, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v8.h, v6.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v21, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v21.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v9.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v9.h, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v8.h, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v21
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v8.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v21
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v21.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v8, v21
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll
index 436b1a038b274..0cefbc1c2dee5 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll
@@ -2966,20 +2966,20 @@ define <40 x i8> @bitcast_v10i32_to_v40i8(<10 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -2995,17 +2995,17 @@ define <40 x i8> @bitcast_v10i32_to_v40i8(<10 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB12_2: ; %Flow
@@ -3029,17 +3029,17 @@ define <40 x i8> @bitcast_v10i32_to_v40i8(<10 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB12_4: ; %end
@@ -3047,93 +3047,105 @@ define <40 x i8> @bitcast_v10i32_to_v40i8(<10 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v10i32_to_v40i8:
@@ -5026,49 +5038,48 @@ define <10 x i32> @bitcast_v40i8_to_v10i32(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:24
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v25.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v23.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v21.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v35.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v30.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v29.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v30.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v29.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v28.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(8)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v33.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(7)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v33.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(6)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v34.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(5)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v34.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(4)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v35.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v36
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB14_3
@@ -5081,217 +5092,245 @@ define <10 x i32> @bitcast_v40i8_to_v10i32(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB14_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v25.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v0.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v27, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v1.h, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v27, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v2.l, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v27, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v3.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v27, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v4.l, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v27, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v5.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v27, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v6.l, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v0.h, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v19.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v1.h, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v2.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v3.l, v15.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v28.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v4.l, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v25
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v5.l, v13.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v13.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v6.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v11.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v7.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v25
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v8.l, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v10.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v27, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v7.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v27.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v25
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v9.l, v10.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v27, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v8.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v27.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v27, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v9.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v27.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v27, v9
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v25
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB14_2
; GFX11-TRUE16-NEXT: .LBB14_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v26.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v25.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v25.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v22.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v21.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v22.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v23.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v24.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v25.l, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v23.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v23.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v19.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v19.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v25, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v16.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v20.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v21.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v18.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v24.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v23.l, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v20.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v21.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v25.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v17.h, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v18.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v25, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v15.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v15.h, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v21.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v19.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v16.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v17.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v27
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v19.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v27
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.h, v3.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v27
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v17.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v18.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v25, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v14.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v14.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v25, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v13.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v13.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v14.h, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v15.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v14.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v13.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v25, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v12.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v12.h, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v13.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v12.h, v6.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v27
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v25, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v27
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v12.l, v7.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v31.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v11.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v11.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v25, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v10.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.h, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v25, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v25, v9
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v11.l, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v11.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v31.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v10.h, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v27
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v8.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v10.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v27
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v27
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -9912,20 +9951,20 @@ define <40 x i8> @bitcast_v10f32_to_v40i8(<10 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -9941,17 +9980,17 @@ define <40 x i8> @bitcast_v10f32_to_v40i8(<10 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB32_2: ; %Flow
@@ -9971,17 +10010,17 @@ define <40 x i8> @bitcast_v10f32_to_v40i8(<10 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[13:14], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[14:15], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB32_4: ; %end
@@ -9989,93 +10028,105 @@ define <40 x i8> @bitcast_v10f32_to_v40i8(<10 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v10f32_to_v40i8:
@@ -11986,49 +12037,48 @@ define <10 x float> @bitcast_v40i8_to_v10f32(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:24
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v25.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v23.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v21.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.h, v17.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v35.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v30.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v29.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v30.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v29.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.h, 8, v28.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v29.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(8)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v33.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(7)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v33.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v33.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(6)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v34.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v34.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(5)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v34.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v34.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(4)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v35.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v36
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB34_3
@@ -12041,217 +12091,245 @@ define <10 x float> @bitcast_v40i8_to_v10f32(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB34_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v25.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v25.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v21.h
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v0.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v23.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v19.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v27, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v1.h, v23.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v27, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v2.l, v20.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v27, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v3.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v14.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v27, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v4.l, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v27, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v5.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v27, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v6.l, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v27.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v23.h
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v0.h, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v19.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v18.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v17.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v1.h, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v2.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v3.l, v15.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v28.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v4.l, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v14.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v25
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v5.l, v13.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v13.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v6.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v25
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v11.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v7.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v25
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v8.l, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v10.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v27, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v7.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v27.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v25
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v9.l, v10.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v27, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v8.l, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v27.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v27, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v27.l, v9.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v27.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v27, v9
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v25
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB34_2
; GFX11-TRUE16-NEXT: .LBB34_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v26.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v25.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v25.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v22.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v21.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v22.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v23.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v24.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v25.l, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v23.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v23.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v19.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v19.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v17.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v25, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v16.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v20.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v21.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v18.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v24.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v23.l, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v20.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v21.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v25.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v17.h, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v18.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v16.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v18.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v25, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v15.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v15.h, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v21.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v19.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v16.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v17.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v27
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v19.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v27
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v15.h, v3.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v27
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v17.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v18.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v25, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v14.l, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v14.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v25, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v13.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v13.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v14.h, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v15.l, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v27
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v14.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v13.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v25, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v12.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v12.h, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v13.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v12.h, v6.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v27
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v25, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v27
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v12.l, v7.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v31.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v11.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v11.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v25, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v10.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.h, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v25, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v25.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v25, v9
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v11.l, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v11.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v31.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v10.h, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v27
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v8.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v10.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v27
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v27.h, 0x300, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v27
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -16280,20 +16358,20 @@ define <40 x i8> @bitcast_v20i16_to_v40i8(<20 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -16309,17 +16387,17 @@ define <40 x i8> @bitcast_v20i16_to_v40i8(<20 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB48_2: ; %Flow
@@ -16343,17 +16421,17 @@ define <40 x i8> @bitcast_v20i16_to_v40i8(<20 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB48_4: ; %end
@@ -16361,93 +16439,105 @@ define <40 x i8> @bitcast_v20i16_to_v40i8(<20 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v20i16_to_v40i8:
@@ -22389,20 +22479,20 @@ define <40 x i8> @bitcast_v20f16_to_v40i8(<20 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -22418,17 +22508,17 @@ define <40 x i8> @bitcast_v20f16_to_v40i8(<20 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB60_2: ; %Flow
@@ -22452,17 +22542,17 @@ define <40 x i8> @bitcast_v20f16_to_v40i8(<20 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB60_4: ; %end
@@ -22470,93 +22560,105 @@ define <40 x i8> @bitcast_v20f16_to_v40i8(<20 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v20f16_to_v40i8:
@@ -28757,51 +28859,50 @@ define <5 x double> @bitcast_v40i8_to_v5f64(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:24
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.h, v29.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v27.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v25.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, v21.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v16.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v38.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v38.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(8)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v36.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(7)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v36.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v36.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(6)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v37.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(5)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v37.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(4)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v38.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v49
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB72_3
@@ -28814,216 +28915,245 @@ define <5 x double> @bitcast_v40i8_to_v5f64(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB72_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v34.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v0.l, v34.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v10, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v1.h, v33.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v10, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v2.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v10, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v3.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v26.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v10, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v4.l, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v10, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v5.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v6.l, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v34.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v0.h, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v27.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v1.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v2.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v3.l, v23.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v28.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v4.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v21.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v5.l, v19.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v19.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v6.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v7.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v16.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v11, v10
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v8.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v7.l, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v10.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v10.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v9.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v9.l, v16.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v9
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v10
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB72_2
; GFX11-TRUE16-NEXT: .LBB72_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v35.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v34.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v30.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v29.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v30.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v33.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v34.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v34.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v33.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v33.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v27.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v27.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v25.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v10, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v28.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v29.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v26.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v34.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v33.l, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v28.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v29.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v10, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v25.h, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v26.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v21.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v10, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v23.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v23.h, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v29.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v27.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v25.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v27.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v23.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v23.h, v3.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v11
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v25.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v10, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v21.h, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v22.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v10, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v19.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v19.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v21.h, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v22.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v21.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v19.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v18.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v18.h, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v19.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v18.h, v6.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v11
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v11
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v18.l, v7.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v31.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v17.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v17.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v16.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v16.h, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v9
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v17.l, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v17.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v31.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v16.h, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v8.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v16.l, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v11
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -30778,20 +30908,20 @@ define <40 x i8> @bitcast_v5f64_to_v40i8(<5 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -30807,17 +30937,17 @@ define <40 x i8> @bitcast_v5f64_to_v40i8(<5 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB74_2: ; %Flow
@@ -30836,17 +30966,17 @@ define <40 x i8> @bitcast_v5f64_to_v40i8(<5 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB74_4: ; %end
@@ -30854,93 +30984,105 @@ define <40 x i8> @bitcast_v5f64_to_v40i8(<5 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v5f64_to_v40i8:
@@ -32868,51 +33010,50 @@ define <5 x i64> @bitcast_v40i8_to_v5i64(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:24
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:32
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:20
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.h, v29.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v27.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v25.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.l, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.h, v21.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v16.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.h, v4.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v34.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v33.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v21.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v39.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v38.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v39.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v39.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v38.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(8)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v36.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v36.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(7)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v36.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v36.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(6)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v37.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(5)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v37.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(4)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v38.l
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v49
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB76_3
@@ -32925,216 +33066,245 @@ define <5 x i64> @bitcast_v40i8_to_v5i64(<40 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB76_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v34.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v29.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v0.l, v34.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v34.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v33.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v27.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v10, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v1.h, v33.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v20.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v10, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v2.l, v28.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v10.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v10, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v3.l, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v26.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v10, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v4.l, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v10, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v5.l, v21.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v31.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v6.l, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v34.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v33.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v0.h, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v23.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v27.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v1.h, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v2.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v20.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v22.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v3.l, v23.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v28.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_4) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v4.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v21.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v5.l, v19.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v19.l
+; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v6.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v7.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v31.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v16.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v11, v10
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v8.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v31.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v7.l, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v10.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v10.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v9.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v9.l, v16.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v9
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v10
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB76_2
; GFX11-TRUE16-NEXT: .LBB76_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v35.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v34.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v30.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v29.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v30.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v33.h, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v34.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v34.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v33.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v33.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v27.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v27.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v25.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v10, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v28.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v29.l, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v26.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v34.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v33.l, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v28.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v29.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v10, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v25.h, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v26.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v21.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v10, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v22.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v23.l, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v23.h, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v29.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v27.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v25.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v27.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v23.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v23.h, v3.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v11
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v25.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v20.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v20.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v3.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v10, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v21.h, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v22.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v10, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v19.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v19.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v5, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v21.h, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v22.h, v4.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v22.l, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v21.l, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v19.h, v5.h
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v18.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v18.h, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v19.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v18.h, v6.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v7, v11
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v30.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v11
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v18.l, v7.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v31.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v17.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v17.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v16.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v16.h, v8.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v9
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v17.l, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v17.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v31.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v16.h, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v11
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v8.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v16.l, v8.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v10, v11
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -34904,20 +35074,20 @@ define <40 x i8> @bitcast_v5i64_to_v40i8(<5 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr15_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr14_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr13_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr12_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr11_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
@@ -34933,17 +35103,17 @@ define <40 x i8> @bitcast_v5i64_to_v40i8(<5 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB78_2: ; %Flow
@@ -34970,17 +35140,17 @@ define <40 x i8> @bitcast_v5i64_to_v40i8(<5 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[15:16], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v17, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v18, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v20, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v19, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v21, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v23, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v22, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v24, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v16, 8, v1
; GFX11-TRUE16-NEXT: .LBB78_4: ; %end
@@ -34988,93 +35158,105 @@ define <40 x i8> @bitcast_v5i64_to_v40i8(<5 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v1.l, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v15, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v2.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v15, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v3.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v1.h, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v15, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v4.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v28, 0xffff, v29
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v16, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v11.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v28, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v14, v15
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v15, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v5.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v15.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v4.l, v11.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v15, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v7.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v20.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v15, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v8.l, v11.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v11.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v15, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v11.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v9.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v15.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v15, v11
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v9.l, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v15.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v10.l, v10.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v14, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v13, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v7.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v12, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v11.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v18.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v11.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v15
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v17.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v11, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v11, v15
; GFX11-TRUE16-NEXT: s_clause 0x2
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
-; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[11:12], off offset:32
+; GFX11-TRUE16-NEXT: scratch_store_b64 v0, v[10:11], off offset:32
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: bitcast_v5i64_to_v40i8:
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll
index 8e30ee659a260..48c9b8775a474 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll
@@ -2257,8 +2257,8 @@ define i32 @bitcast_v4i8_to_i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -2273,17 +2273,19 @@ define i32 @bitcast_v4i8_to_i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB22_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB22_2
; GFX11-TRUE16-NEXT: .LBB22_4: ; %cmp.true
@@ -2293,14 +2295,16 @@ define i32 @bitcast_v4i8_to_i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -4502,8 +4506,8 @@ define float @bitcast_v4i8_to_f32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -4518,17 +4522,19 @@ define float @bitcast_v4i8_to_f32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB42_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB42_2
; GFX11-TRUE16-NEXT: .LBB42_4: ; %cmp.true
@@ -4538,14 +4544,16 @@ define float @bitcast_v4i8_to_f32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -6459,8 +6467,8 @@ define <2 x i16> @bitcast_v4i8_to_v2i16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -6475,17 +6483,19 @@ define <2 x i16> @bitcast_v4i8_to_v2i16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB58_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB58_2
; GFX11-TRUE16-NEXT: .LBB58_4: ; %cmp.true
@@ -6495,14 +6505,16 @@ define <2 x i16> @bitcast_v4i8_to_v2i16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -8104,8 +8116,8 @@ define <2 x half> @bitcast_v4i8_to_v2f16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -8120,17 +8132,19 @@ define <2 x half> @bitcast_v4i8_to_v2f16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB70_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB70_2
; GFX11-TRUE16-NEXT: .LBB70_4: ; %cmp.true
@@ -8140,14 +8154,16 @@ define <2 x half> @bitcast_v4i8_to_v2f16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -9463,8 +9479,8 @@ define <2 x bfloat> @bitcast_v4i8_to_v2bf16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -9479,17 +9495,19 @@ define <2 x bfloat> @bitcast_v4i8_to_v2bf16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB78_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB78_2
; GFX11-TRUE16-NEXT: .LBB78_4: ; %cmp.true
@@ -9499,14 +9517,16 @@ define <2 x bfloat> @bitcast_v4i8_to_v2bf16(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -10173,8 +10193,8 @@ define <1 x i32> @bitcast_v4i8_to_v1i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v3.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v4
@@ -10189,17 +10209,19 @@ define <1 x i32> @bitcast_v4i8_to_v1i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB82_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v2, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr1_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB82_2
; GFX11-TRUE16-NEXT: .LBB82_4: ; %cmp.true
@@ -10209,14 +10231,16 @@ define <1 x i32> @bitcast_v4i8_to_v1i32(<4 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll
index 35d135b123969..5aac06a7f3a2b 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll
@@ -8768,32 +8768,32 @@ define <64 x i8> @bitcast_v16i32_to_v64i8(<16 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -8812,26 +8812,26 @@ define <64 x i8> @bitcast_v16i32_to_v64i8(<16 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB24_2: ; %Flow
@@ -8864,26 +8864,26 @@ define <64 x i8> @bitcast_v16i32_to_v64i8(<16 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB24_4: ; %end
@@ -8891,135 +8891,156 @@ define <64 x i8> @bitcast_v16i32_to_v64i8(<16 x i32> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -12449,15 +12470,15 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:124
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:108
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:92
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_b32 v81, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_b32 v82, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:16
@@ -12471,82 +12492,84 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:84
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:60
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:52
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v29.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.l, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v80.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v65.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v65.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v65.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(25)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v66.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(23)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v66.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(21)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v67.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(20)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v67.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(19)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(18)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v68.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(17)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v69.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(16)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v69.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v69.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(15)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v70.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(14)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v70.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(13)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(12)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v71.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(11)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v81
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v82
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB26_3
; GFX11-TRUE16-NEXT: ; %bb.1: ; %Flow
@@ -12558,338 +12581,384 @@ define <16 x i32> @bitcast_v64i8_to_v16i32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB26_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v53.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v52.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v0.l, v54.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v55.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v54.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v49.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v64, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v1.h, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v51.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v64, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v2.l, v51.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v64, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v3.l, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v64, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v4.l, v39.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v54.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v55.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v51.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v54.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v53.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v53.l
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v0.h, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v52.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v50.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v51.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v1.h, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v2.l, v49.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v3.l, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v64, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v5.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v64, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v6.l, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v64, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v7.l, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v64, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v8.l, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v20.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v64, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v9.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v64, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v10.l, v21.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v31.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v4.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v29.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v5.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v6.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v7.l, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v8.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v9.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v21.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v10.l, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v11.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v12.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v13.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v31.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v14.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v64, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v11.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v16.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v64, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v12.l, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v64, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v13.l, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v54
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v15.l, v16.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v64, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v14.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v64.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v64, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v15.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v64, v15
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v54
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB26_2
; GFX11-TRUE16-NEXT: .LBB26_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v53.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v52.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v52.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v54.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v55.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v51.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v53.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v55.l, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v54.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v53.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v49.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v49.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v52, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v39.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v51.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v51.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v53.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v50.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v52.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v52, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v52.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v50.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v50.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v29.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v28.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v52, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v26.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v39.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v48.l, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v52.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v50.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v39.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v30.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v55
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v49.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v51.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v55
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v48.l, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v27.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v49.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v27.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v52, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v29.h, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v30.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v52, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v27.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v27.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v39.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v25.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v29.l, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v28.h, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v52, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v25.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v25.h, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v26.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v38.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v52, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v37.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v23.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v23.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v38.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v25.h, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v24.l, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v23.h, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v52, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v22.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v36.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v52, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v35.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v21.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v21.h, v9.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v22.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v36.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v23.l, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v37.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v21.h, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v55
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v52, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v20.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v34.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v52, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v33.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v19.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v19.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v20.h, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v35.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v21.l, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v34.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v19.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v18.h, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v19.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v18.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v55
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v33.l, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v32.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v13.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v31.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v17.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v17.h, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v52, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.h, v14.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v52, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v52, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v17.l, v13.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v17.h, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v31.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v16.h, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v55
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v14.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v55
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v16, v55
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -23519,32 +23588,32 @@ define <64 x i8> @bitcast_v16f32_to_v64i8(<16 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -23563,26 +23632,26 @@ define <64 x i8> @bitcast_v16f32_to_v64i8(<16 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB48_2: ; %Flow
@@ -23607,26 +23676,26 @@ define <64 x i8> @bitcast_v16f32_to_v64i8(<16 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB48_4: ; %end
@@ -23634,135 +23703,156 @@ define <64 x i8> @bitcast_v16f32_to_v64i8(<16 x float> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -27323,15 +27413,15 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:124
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:108
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:92
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_b32 v81, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_b32 v82, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:16
@@ -27345,82 +27435,84 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:84
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:60
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:52
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v29.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.l, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v80.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v65.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v65.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v65.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(25)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v66.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(23)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v66.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(21)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v67.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(20)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v67.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(19)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(18)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v68.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(17)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v69.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(16)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v69.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v69.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(15)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v70.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(14)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v70.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(13)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(12)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v71.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(11)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v81
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v82
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB50_3
; GFX11-TRUE16-NEXT: ; %bb.1: ; %Flow
@@ -27432,338 +27524,384 @@ define <16 x float> @bitcast_v64i8_to_v16f32(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB50_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v53.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v52.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v0.l, v54.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v55.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v54.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v49.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v64, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v1.h, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v51.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v64, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v2.l, v51.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v64, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v3.l, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v64, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v4.l, v39.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v54.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v55.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v51.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v54.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v53.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v53.l
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v0.h, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v52.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v50.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v51.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v1.h, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v2.l, v49.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v3.l, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v64, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v5.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v64, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v6.l, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v64, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v7.l, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v64, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v8.l, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v20.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v64, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v9.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v64, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v10.l, v21.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v31.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v4.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v29.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v5.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v6.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v7.l, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v8.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v9.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v21.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v10.l, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v11.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v12.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v13.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v31.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v14.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v64, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v11.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v16.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v64, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v12.l, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v64, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v13.l, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v54
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v15.l, v16.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v64, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v14.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v64.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v64, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v15.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v64, v15
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v54
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB50_2
; GFX11-TRUE16-NEXT: .LBB50_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v53.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v52.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v52.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v54.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v55.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v51.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v53.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v55.l, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v54.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v53.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v49.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v49.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v52, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v39.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v51.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v51.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v53.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v50.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v52.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v52, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v52.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v50.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v50.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v29.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v28.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v52, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v26.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v39.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v48.l, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v52.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v50.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v39.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v30.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v55
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v49.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v51.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v55
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v48.l, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v27.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v49.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v27.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v52, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v29.h, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v30.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v52, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v27.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v27.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v39.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v25.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v29.l, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v28.h, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v52, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v25.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v25.h, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v26.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v38.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v52, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v37.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v23.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v23.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v38.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v25.h, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v24.l, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v23.h, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v52, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v22.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v36.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v52, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v35.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v21.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v21.h, v9.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v22.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v36.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v23.l, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v37.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v21.h, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v55
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v52, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v20.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v34.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v52, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v33.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v19.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v19.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v20.h, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v35.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v21.l, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v34.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v19.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v18.h, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v19.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v18.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v55
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v33.l, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v32.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v13.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v31.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v17.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v17.h, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v52, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.h, v14.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v52, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v52, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v17.l, v13.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v17.h, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v31.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v16.h, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v55
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v14.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v55
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v16, v55
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -37778,32 +37916,32 @@ define <64 x i8> @bitcast_v8i64_to_v64i8(<8 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -37822,26 +37960,26 @@ define <64 x i8> @bitcast_v8i64_to_v64i8(<8 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB68_2: ; %Flow
@@ -37879,26 +38017,26 @@ define <64 x i8> @bitcast_v8i64_to_v64i8(<8 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB68_4: ; %end
@@ -37906,135 +38044,156 @@ define <64 x i8> @bitcast_v8i64_to_v64i8(<8 x i64> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -41469,15 +41628,15 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:124
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:108
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:92
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_b32 v81, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_b32 v82, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:16
@@ -41491,82 +41650,84 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:84
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:60
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:52
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v29.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.l, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v80.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v65.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v65.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v65.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(25)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v66.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(23)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v66.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(21)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v67.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(20)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v67.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(19)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(18)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v68.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(17)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v69.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(16)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v69.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v69.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(15)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v70.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(14)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v70.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(13)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(12)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v71.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(11)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v81
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v82
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB70_3
; GFX11-TRUE16-NEXT: ; %bb.1: ; %Flow
@@ -41578,338 +41739,384 @@ define <8 x i64> @bitcast_v64i8_to_v8i64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB70_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v53.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v52.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v0.l, v54.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v55.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v54.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v49.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v64, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v1.h, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v51.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v64, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v2.l, v51.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v64, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v3.l, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v64, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v4.l, v39.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v54.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v55.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v51.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v54.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v53.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v53.l
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v0.h, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v52.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v50.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v51.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v1.h, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v2.l, v49.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v3.l, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v64, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v5.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v64, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v6.l, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v64, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v7.l, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v64, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v8.l, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v20.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v64, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v9.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v64, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v10.l, v21.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v31.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v4.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v29.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v5.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v6.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v7.l, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v8.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v9.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v21.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v10.l, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v11.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v12.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v13.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v31.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v14.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v64, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v11.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v16.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v64, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v12.l, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v64, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v13.l, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v54
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v15.l, v16.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v64, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v14.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v64.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v64, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v15.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v64, v15
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v54
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB70_2
; GFX11-TRUE16-NEXT: .LBB70_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v53.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v52.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v52.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v54.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v55.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v51.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v53.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v55.l, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v54.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v53.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v49.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v49.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v52, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v39.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v51.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v51.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v53.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v50.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v52.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v52, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v52.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v50.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v50.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v29.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v28.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v52, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v26.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v39.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v48.l, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v52.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v50.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v39.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v30.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v55
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v49.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v51.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v55
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v48.l, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v27.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v49.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v27.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v52, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v29.h, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v30.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v52, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v27.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v27.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v39.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v25.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v29.l, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v28.h, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v52, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v25.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v25.h, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v26.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v38.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v52, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v37.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v23.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v23.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v38.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v25.h, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v24.l, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v23.h, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v52, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v22.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v36.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v52, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v35.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v21.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v21.h, v9.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v22.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v36.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v23.l, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v37.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v21.h, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v55
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v52, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v20.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v34.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v52, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v33.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v19.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v19.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v20.h, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v35.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v21.l, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v34.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v19.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v18.h, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v19.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v18.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v55
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v33.l, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v32.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v13.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v31.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v17.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v17.h, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v52, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.h, v14.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v52, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v52, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v17.l, v13.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v17.h, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v31.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v16.h, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v55
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v14.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v55
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v16, v55
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -51088,32 +51295,32 @@ define <64 x i8> @bitcast_v8f64_to_v64i8(<8 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -51132,26 +51339,26 @@ define <64 x i8> @bitcast_v8f64_to_v64i8(<8 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB84_2: ; %Flow
@@ -51176,26 +51383,26 @@ define <64 x i8> @bitcast_v8f64_to_v64i8(<8 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB84_4: ; %end
@@ -51203,135 +51410,156 @@ define <64 x i8> @bitcast_v8f64_to_v64i8(<8 x double> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -54761,15 +54989,15 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v64, off, s32 offset:128
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v31, off, s32 offset:124
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v64, off, s32 offset:120
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:116
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:116
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v65, off, s32 offset:112
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v32, off, s32 offset:108
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v31, off, s32 offset:108
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v65, off, s32 offset:104
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:100
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:100
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v66, off, s32 offset:96
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v33, off, s32 offset:92
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v32, off, s32 offset:92
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v66, off, s32 offset:88
-; GFX11-TRUE16-NEXT: scratch_load_b32 v81, off, s32 offset:132
+; GFX11-TRUE16-NEXT: scratch_load_b32 v82, off, s32 offset:132
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v67, off, s32
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v67, off, s32 offset:8
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v68, off, s32 offset:16
@@ -54783,82 +55011,84 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v80, off, s32 offset:80
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v33, off, s32 offset:84
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v34, off, s32 offset:76
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:68
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:60
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v35, off, s32 offset:68
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v34, off, s32 offset:60
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v35, off, s32 offset:52
; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v36, off, s32 offset:44
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:36
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:28
-; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:20
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v37, off, s32 offset:36
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v36, off, s32 offset:28
+; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:20
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: scratch_load_d16_b16 v38, off, s32 offset:12
+; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v37, off, s32 offset:12
; GFX11-TRUE16-NEXT: scratch_load_d16_hi_b16 v38, off, s32 offset:4
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v80.h, v29.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.l, v27.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.h, v25.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, v24.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.l, v20.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v18.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v50.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v51.h, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.l, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v0.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v55.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v54.l, 8, v7.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.l, 8, v13.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v27.h, 8, v27.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 8, v80.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v53.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v50.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v52.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v51.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v48.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v49.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v39.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 8, v25.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v81.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.h, 8, v80.h
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(33)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(31)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v64.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(29)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v65.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v65.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(27)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v65.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v65.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(25)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v66.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v66.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(23)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v66.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v66.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(21)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v67.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(20)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v67.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v24.l, 8, v67.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(19)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.h, 8, v68.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v68.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(18)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 8, v68.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(17)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 8, v69.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v69.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(16)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v69.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 8, v69.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(15)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.h, 8, v70.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v70.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(14)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v70.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 8, v70.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(13)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 8, v71.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v71.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(12)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.l, 8, v71.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.l, 8, v71.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(11)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v19.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v81
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v80.l
+; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v82
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX11-TRUE16-NEXT: s_cbranch_execnz .LBB86_3
; GFX11-TRUE16-NEXT: ; %bb.1: ; %Flow
@@ -54870,338 +55100,384 @@ define <8 x double> @bitcast_v64i8_to_v8f64(<64 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB86_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v53.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v64.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v52.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v52.l
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v0.l, v54.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v55.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v54.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v49.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v64, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v1.h, v53.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v28.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v51.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v64, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v2.l, v51.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v48.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v39.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v48.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v64, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v3.l, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v50.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v30.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v64, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v4.l, v39.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v27.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v54.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v55.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v51.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v54.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v53.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v53.l
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v0.h, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v52.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v50.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v51.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.l, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v1.h, v50.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v39.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v27.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v2.l, v49.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v6, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v3.l, v48.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v24.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v28.l
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v38.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v38.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v64, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v5.l, v29.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v23.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v37.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v64, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v6.l, v27.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v22.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v36.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v64, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v7.l, v25.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v21.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v34.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v64, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v8.l, v23.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v20.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v19.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v33.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v64, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v9.l, v22.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v64.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v64, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v10.l, v21.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v31.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v4.l, v29.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v4.h, v29.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v5.l, v28.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v5.h, v26.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v38.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v6.l, v25.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v6.h, v24.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v37.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v7.l, v23.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v37.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v8.l, v22.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v35.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v11, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v9.l, v21.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v36.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v9.h, v21.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v12, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v10.l, v20.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v10.h, v20.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v11.l, v19.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v34.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v19.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v33.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v12.l, v18.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v12.h, v18.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v54
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v13.h, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v13.l, v17.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v31.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v54
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v14.l, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v31.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v64, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v11.l, v20.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v64.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v16.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v64, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v12.l, v19.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v64, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v13.l, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v64.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v17, v54
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v54.h, v15.l, v16.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v64, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v14.l, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v64.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v64, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v64.l, v15.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v64.h
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr16_lo16
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v64, v15
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v54
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB86_2
; GFX11-TRUE16-NEXT: .LBB86_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v55.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v53.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v52.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v52.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v54.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v55.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v51.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v53.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v55.l, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v54.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v53.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v49.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v49.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v48.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v52, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v39.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v51.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v51.h, v1.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v48.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v54.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v53.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v50.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v52.l, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v52, v4
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v52.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v50.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v50.h, v2.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v29.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v28.h, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v52, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v3.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v26.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v24.h, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v39.h, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v48.l, v3.h
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v52.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v50.h, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, v39.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, v30.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v55
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v49.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v51.l, v1.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v55
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v48.l, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.h, v27.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v49.l, v3.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, v27.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v3.h
; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v52, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v29.h, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v30.h, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v24.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v26.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v52, v7
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v5.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, 0x300, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v28.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v30.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v27.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v27.h, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x300, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v39.l, v4.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v29.h, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, v25.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v4.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, 0x300, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v29.l, v5.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, v28.l, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v28.h, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.l, v26.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v52, v8
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v25.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v25.h, v6.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v5.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v26.h, v6.h
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v38.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v52, v9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v7.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, 0x300, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v37.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v23.l, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v23.h, v7.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, v38.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v8, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v25.h, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, v30.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v6.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v24.l, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.h, v37.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v23.h, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, v38.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v52, v10
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v8.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v22.h, v8.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v36.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v52, v11
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v9.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, 0x300, v9.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v35.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v21.l, v9.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v21.h, v9.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v7.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v22.h, v8.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, v36.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v23.l, v8.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, v37.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v8.l, 0x300, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v21.h, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v10, v55
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v22.l, v9.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.l, v36.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v10.h, v35.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v9.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.l, 0x300, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v52, v12
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v20.h, v10.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v34.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v52, v13
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v11.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, 0x300, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v33.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v19.l, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v19.h, v11.h
-; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v11, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v20.h, v10.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, v35.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v21.l, v10.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.h, v34.h, 3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, 0x300, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.h, v33.h, 3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v13, v55
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v20.l, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v12.l, v34.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v19.h, v11.h
; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v52, v14
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v12.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v18.h, v12.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v32.h, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v11.l, 0x300, v11.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v11.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v19.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v11.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v18.h, v12.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v13, v55
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, v33.l, 3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, 0x300, v12.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v12.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.h, v32.h, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v32.l, 3
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v52, v15
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v13.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v18.h, 0x300, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v15, v55
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v18.l, v13.l
; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.l, v31.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v14.h, v31.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v17.l, v13.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v17.h, v13.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v52, v18
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.h, v14.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v52, v17
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v52.l, 0x300, v15.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.h, 0x300, v15.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v52.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v52, v15
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v15
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v17.l, v13.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v13.l, 0x300, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v17.h, v14.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v15.l, v31.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v13.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v16.h, v14.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v17.l, 0x300, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v55
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v14.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v16.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v17
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v55
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v55.h, 0x300, v15.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v16, v55
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -64297,32 +64573,32 @@ define <64 x i8> @bitcast_v32i16_to_v64i8(<32 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -64341,26 +64617,26 @@ define <64 x i8> @bitcast_v32i16_to_v64i8(<32 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB96_2: ; %Flow
@@ -64393,26 +64669,26 @@ define <64 x i8> @bitcast_v32i16_to_v64i8(<32 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB96_4: ; %end
@@ -64420,135 +64696,156 @@ define <64 x i8> @bitcast_v32i16_to_v64i8(<32 x i16> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -76404,32 +76701,32 @@ define <64 x i8> @bitcast_v32f16_to_v64i8(<32 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr25_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_lo16
@@ -76448,26 +76745,26 @@ define <64 x i8> @bitcast_v32f16_to_v64i8(<32 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB104_2: ; %Flow
@@ -76500,26 +76797,26 @@ define <64 x i8> @bitcast_v32f16_to_v64i8(<32 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v26, 24, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v27, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v29, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v28, 24, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v30, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v32, 8, v13
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v31, 24, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v33, 8, v12
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v34, 8, v11
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v35, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v36, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v8
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v39, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v52, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v25, 8, v1
; GFX11-TRUE16-NEXT: .LBB104_4: ; %end
@@ -76527,135 +76824,156 @@ define <64 x i8> @bitcast_v32f16_to_v64i8(<32 x half> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v17.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v64.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v55.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v2.l, v18.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v25, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v54.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v53.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v52.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v4.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v51.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v54, 0xffff, v55
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v25, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v17.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v54, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v52.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v51.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v23, v24
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v17.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v50.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff, v25
; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v6.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v39.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v38.l
; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v8.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v37.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v25.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v49.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v23, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v25
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v39.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v22, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v23
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v10.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v35.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v10.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v34.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v19.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v12.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v21, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v9.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v36.l
; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v12.h, v18.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v31.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v18.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v14.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v9.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v22
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v21, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v35.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v34.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v20, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v33.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v14.h, v18.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v28.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, 0xff, v16.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v17.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v26.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v20, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v17.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v32.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v21
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v18.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v19, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v20
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v17.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v18, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v15.l, v17.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v14.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v28.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v16.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v26.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
@@ -85374,59 +85692,59 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr28_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr113_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr24_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr27_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr112_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr26_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr102_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr103_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr23_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr30_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr101_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr29_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr100_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr99_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr22_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr32_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr33_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr98_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr31_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr96_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr97_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr21_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr35_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr34_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr84_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr36_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr20_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr39_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr80_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr70_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr71_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr19_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr52_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr65_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr55_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr64_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr18_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr69_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr54_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr67_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr87_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr68_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr53_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr66_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr50_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr85_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr51_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr17_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr83_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr49_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr86_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr48_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr82_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr38_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr81_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr37_lo16
; GFX11-TRUE16-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX11-TRUE16-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX11-TRUE16-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -85439,302 +85757,307 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[21:22], 24, v[7:8]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[22:23], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[23:24], 24, v[3:4]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v15
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 24, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v14
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v65, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v66, 8, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 8, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v10
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v67, 8, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v10
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v10
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v9
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v9
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v5
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v4
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v3
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v2
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v2
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v1
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v28.h, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v38.h, v7.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v27.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v26.h, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v30.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v29.h, v4.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v33.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v31.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v49.h, v7.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v35.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v36.h, v8.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v55.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v37.h, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v10.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v70.h, v11.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v53.h, v12.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v85.h, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v69.h, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v67.h, v14.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v87.h, v15.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v83.h, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v86.h, v16.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v34.h, v8.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v65.h, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v39.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v48.h, v10.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v71.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v54.h, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v52.h, v12.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v86.h, v13.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v68.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v66.h, v14.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v85.h, v15.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v82.h, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v81.h, v16.h
; GFX11-TRUE16-NEXT: .LBB108_2: ; %Flow
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB108_4
; GFX11-TRUE16-NEXT: ; %bb.3: ; %cmp.true
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff0000, v1
; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff0000, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, 16, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v17, 16, v2
; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff0000, v2
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v18, 0x40c00000, v18 :: v_dual_lshlrev_b32 v1, 16, v1
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v49, 0xffff0000, v11
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v2, 0x40c00000, v2 :: v_dual_lshlrev_b32 v11, 16, v11
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v1, 0x40c00000, v1 :: v_dual_add_f32 v4, 0x40c00000, v4
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v17, 0x40c00000, v17
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_bfe_u32 v21, v2, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v4, 0x40c00000, v4 :: v_dual_add_f32 v17, 0x40c00000, v17
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v2, 0x40c00000, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v17, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v17
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX11-TRUE16-NEXT: v_add3_u32 v21, v21, v2, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_bfe_u32 v21, v2, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v2
; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v17, 0x7fff
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff0000, v1
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, 16, v1
+; GFX11-TRUE16-NEXT: v_add3_u32 v21, v21, v2, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v27, v20, v22, vcc_lo
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v18, 0x40c00000, v18 :: v_dual_add_f32 v1, 0x40c00000, v1
+; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v2, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v27.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v17, v18, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v26, v20, v22, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v1, 16, 1
-; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v2, v2
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v18
-; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v18, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v26.h
-; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v1, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v27, v21, v23, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v26, v21, v23, vcc_lo
; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v1, v1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v18
+; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v1, 0x7fff
; GFX11-TRUE16-NEXT: v_add_f32_e32 v19, 0x40c00000, v19
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_bfi_b32 v2, 0xffff, v2, v27
+; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v18, 0x7fff
+; GFX11-TRUE16-NEXT: v_bfi_b32 v2, 0xffff, v2, v26
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v28, v20, v21, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v18, v18
; GFX11-TRUE16-NEXT: v_bfe_u32 v18, v4, 16, 1
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff0000, v3
; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v19
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 24, v2
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v28.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v1, v17, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v18, v18, v4, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v4
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v4, v4
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff0000, v3
; GFX11-TRUE16-NEXT: v_bfe_u32 v17, v19, 16, 1
-; GFX11-TRUE16-NEXT: v_bfi_b32 v1, 0xffff, v20, v1
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v20, 0xffff0000, v5
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, 16, v5
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v29, v18, v22, vcc_lo
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v2
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v30, v18, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff0000, v6
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, 16, v3
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v21, 0x40c00000, v21 :: v_dual_lshlrev_b32 v6, 16, v6
; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v19, 0x7fff
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v19, v19
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v21, 0x40c00000, v21 :: v_dual_lshlrev_b32 v6, 16, v6
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v3, 0x40c00000, v3 :: v_dual_add_f32 v20, 0x40c00000, v20
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v30, v17, v23, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v3, 0x40c00000, v3
; GFX11-TRUE16-NEXT: v_bfe_u32 v4, v21, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v21
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_cndmask_b32 v29, v17, v23
; GFX11-TRUE16-NEXT: v_bfe_u32 v18, v3, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, 0x400000, v3
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v3, v3
; GFX11-TRUE16-NEXT: v_add3_u32 v4, v4, v21, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v29.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v21
; GFX11-TRUE16-NEXT: v_add3_u32 v18, v18, v3, 0x7fff
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v6, 0x40c00000, v6 :: v_dual_add_f32 v5, 0x40c00000, v5
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 24, v2
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v112, 8, v2
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v31, v18, v19, vcc_lo
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v17.l, v30.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v32, v18, v19, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v18, 0x40c00000, v22
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v1
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v31.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v3, v4, v23, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_bfe_u32 v21, v18, 16, 1
-; GFX11-TRUE16-NEXT: v_bfi_b32 v4, 0xffff, v17, v30
+; GFX11-TRUE16-NEXT: v_bfi_b32 v1, 0xffff, v20, v1
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v32.h
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v3, v4, v23 :: v_dual_add_f32 v18, 0x40c00000, v22
+; GFX11-TRUE16-NEXT: v_bfi_b32 v4, 0xffff, v17, v29
; GFX11-TRUE16-NEXT: v_bfe_u32 v17, v6, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v6, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfi_b32 v3, 0xffff, v19, v3
-; GFX11-TRUE16-NEXT: v_add3_u32 v19, v21, v18, 0x7fff
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v6
-; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v6, 0x7fff
+; GFX11-TRUE16-NEXT: v_bfe_u32 v21, v18, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v18
-; GFX11-TRUE16-NEXT: v_bfe_u32 v6, v20, 16, 1
+; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v6, 0x7fff
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v100, 24, v4
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v101, 8, v4
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v32, v17, v21, vcc_lo
+; GFX11-TRUE16-NEXT: v_add3_u32 v19, v21, v18, 0x7fff
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v103, 8, v3
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v113, 8, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v33, v17, v21, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v18, v18
-; GFX11-TRUE16-NEXT: v_add3_u32 v6, v6, v20, 0x7fff
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v102, 8, v3
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v33, v19, v22, vcc_lo
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v31, v19, v22 :: v_dual_and_b32 v20, 0xffff0000, v5
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v20, 0x40c00000, v20 :: v_dual_lshlrev_b32 v5, 16, v5
; GFX11-TRUE16-NEXT: v_and_b32_e32 v19, 0xffff0000, v8
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v8, 16, v8
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v5, 0x40c00000, v5 :: v_dual_lshlrev_b32 v8, 16, v8
+; GFX11-TRUE16-NEXT: v_bfe_u32 v6, v20, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v20
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v8, 0x40c00000, v8
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v33.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v17, v5, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v5
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v5, v5
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v20
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v8, 0x40c00000, v8
+; GFX11-TRUE16-NEXT: v_add3_u32 v6, v6, v20, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add3_u32 v17, v17, v5, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v32.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v34, v17, v21 :: v_dual_add_f32 v19, 0x40c00000, v19
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v17, v21, vcc_lo
; GFX11-TRUE16-NEXT: v_bfe_u32 v17, v8, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v20, v20
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v34.h
-; GFX11-TRUE16-NEXT: v_bfe_u32 v5, v19, 16, 1
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v19, 0x40c00000, v19
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v36.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v20, v6, v22, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfe_u32 v5, v19, 16, 1
; GFX11-TRUE16-NEXT: v_add3_u32 v6, v17, v8, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v17, 0x400000, v8
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v8, v8
-; GFX11-TRUE16-NEXT: v_add3_u32 v5, v5, v19, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v19
+; GFX11-TRUE16-NEXT: v_add3_u32 v5, v5, v19, 0x7fff
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v35, v6, v17, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v19, v19
; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff0000, v7
-; GFX11-TRUE16-NEXT: v_bfi_b32 v6, 0xffff, v18, v33
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v36, v5, v22, vcc_lo
+; GFX11-TRUE16-NEXT: v_bfi_b32 v6, 0xffff, v18, v31
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v34, v5, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_bfi_b32 v5, 0xffff, v21, v20
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v20, 16, v10
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v7, 16, v7
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v35.h
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 24, v6
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 24, v6
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v98, 8, v6
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_dual_add_f32 v20, 0x40c00000, v20 :: v_dual_add_f32 v7, 0x40c00000, v7
-; GFX11-TRUE16-NEXT: v_bfi_b32 v8, 0xffff, v8, v36
+; GFX11-TRUE16-NEXT: v_bfi_b32 v8, 0xffff, v8, v34
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v99, 8, v5
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v7, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v7
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v82, 24, v8
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v83, 24, v8
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v87, 8, v8
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v19, v7, 0x7fff
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff0000, v10
; GFX11-TRUE16-NEXT: v_bfe_u32 v7, v20, 16, 1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v38, v19, v21, vcc_lo
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v49, v19, v21, vcc_lo
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v17, 0x40c00000, v23 :: v_dual_add_f32 v10, 0x40c00000, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_add3_u32 v7, v7, v20, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v21, 0x400000, v20
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff0000, v10
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v17, 0x40c00000, v23 :: v_dual_add_f32 v10, 0x40c00000, v10
; GFX11-TRUE16-NEXT: v_bfe_u32 v18, v17, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v17
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v17, v17
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v10, 16, 1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_add3_u32 v18, v18, v17, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v19, v10, 0x7fff
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v17, v18, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v20, v20
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v10
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v38.h
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v37, v7, v21, vcc_lo
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v49.h
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v39, v7, v21, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v10, v10
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v10, 16, v12
; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff0000, v12
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v7, 16, v9
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v39, v19, v22, vcc_lo
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v48, v19, v22 :: v_dual_lshlrev_b32 v7, 16, v9
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v10
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v37.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v39.h
; GFX11-TRUE16-NEXT: v_add_f32_e32 v12, 0x40c00000, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
; GFX11-TRUE16-NEXT: v_bfe_u32 v22, v21, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v48, 0x400000, v21
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v37, 0x400000, v21
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX11-TRUE16-NEXT: v_bfi_b32 v10, 0xffff, v19, v39
+; GFX11-TRUE16-NEXT: v_bfi_b32 v10, 0xffff, v19, v48
; GFX11-TRUE16-NEXT: v_bfe_u32 v24, v12, 16, 1
; GFX11-TRUE16-NEXT: v_add3_u32 v22, v22, v21, 0x7fff
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff0000, v9
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v7
; GFX11-TRUE16-NEXT: v_or_b32_e32 v50, 0x400000, v12
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v49
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v70, 24, v10
; GFX11-TRUE16-NEXT: v_add3_u32 v24, v24, v12, 0x7fff
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v52, v22, v48 :: v_dual_add_f32 v9, 0x40c00000, v23
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff0000, v14
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v7, 0x40c00000, v7 :: v_dual_lshlrev_b32 v14, 16, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v71, 24, v10
-; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v9, 16, 1
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v9
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v14, 0x40c00000, v14
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v54, v22, v37, vcc_lo
; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v7, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, 0x400000, v7
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
-; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v9, 0x7fff
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v10
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v22, 0xffff0000, v14
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v14, 16, v14
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v19, v7, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v52.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v55, v19, v25, vcc_lo
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v38, 0xffff0000, v11
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v54.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v14, 0x40c00000, v14 :: v_dual_lshlrev_b32 v11, 16, v11
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v65, v19, v25, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v12, v12
-; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v21, 16, 1
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff0000, v9
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v38
; GFX11-TRUE16-NEXT: v_bfe_u32 v25, v14, 16, 1
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v53, v24, v50, vcc_lo
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v80, 8, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v52, v24, v50 :: v_dual_add_f32 v9, 0x40c00000, v23
+; GFX11-TRUE16-NEXT: v_bfe_u32 v19, v21, 16, 1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_bfi_b32 v12, 0xffff, v7, v52
+; GFX11-TRUE16-NEXT: v_bfe_u32 v20, v9, 16, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v23, 0x400000, v9
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v9, v9
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_bfi_b32 v12, 0xffff, v7, v53
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v9, v20, v23, vcc_lo
; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v11
; GFX11-TRUE16-NEXT: v_add3_u32 v11, v19, v21, 0x7fff
+; GFX11-TRUE16-NEXT: v_add3_u32 v20, v20, v9, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, 0x400000, v21
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v55, 24, v12
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v67, 8, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v9, v20, v23, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v21, v21
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v55.h
; GFX11-TRUE16-NEXT: v_bfe_u32 v23, v7, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v65, 24, v12
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v66, 8, v12
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v20.l, v65.h
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v11, v11, v19, vcc_lo
; GFX11-TRUE16-NEXT: v_add_f32_e32 v19, 0x40c00000, v22
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v21, v23, v7, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v7
; GFX11-TRUE16-NEXT: v_and_b32_e32 v23, 0xffff0000, v13
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
; GFX11-TRUE16-NEXT: v_bfe_u32 v24, v19, 16, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v13, 16, v13
-; GFX11-TRUE16-NEXT: v_bfi_b32 v9, 0xffff, v20, v9
-; GFX11-TRUE16-NEXT: v_dual_add_f32 v7, 0x40c00000, v23 :: v_dual_cndmask_b32 v70, v21, v22
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_add_f32_e32 v7, 0x40c00000, v23
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v71, v21, v22, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v21, v24, v19, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v22, 0x400000, v19
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v19, v19
; GFX11-TRUE16-NEXT: v_add3_u32 v23, v25, v14, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, 0x400000, v14
; GFX11-TRUE16-NEXT: v_bfe_u32 v25, v7, 16, 1
-; GFX11-TRUE16-NEXT: v_add_f32_e32 v13, 0x40c00000, v13
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v67, v21, v22, vcc_lo
+; GFX11-TRUE16-NEXT: v_dual_add_f32 v13, 0x40c00000, v13 :: v_dual_cndmask_b32 v66, v21, v22
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v14, v14
; GFX11-TRUE16-NEXT: v_or_b32_e32 v19, 0x400000, v7
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
; GFX11-TRUE16-NEXT: v_add3_u32 v14, v25, v7, 0x7fff
; GFX11-TRUE16-NEXT: v_and_b32_e32 v21, 0xffff0000, v16
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v16, 16, v16
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v69, v23, v24, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v68, v23, v24, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v7, v7
; GFX11-TRUE16-NEXT: v_bfe_u32 v23, v13, 16, 1
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v81, 8, v9
+; GFX11-TRUE16-NEXT: v_bfi_b32 v9, 0xffff, v20, v9
; GFX11-TRUE16-NEXT: v_add_f32_e32 v16, 0x40c00000, v16
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v69.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v22.l, v68.h
; GFX11-TRUE16-NEXT: v_dual_cndmask_b32 v7, v14, v19 :: v_dual_add_f32 v14, 0x40c00000, v21
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v21, 16, v15
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v23, v13, 0x7fff
@@ -85744,42 +86067,42 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_add_f32_e32 v21, 0x40c00000, v21
; GFX11-TRUE16-NEXT: v_and_b32_e32 v15, 0xffff0000, v15
; GFX11-TRUE16-NEXT: v_or_b32_e32 v25, 0x400000, v16
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v85, v19, v23, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v86, v19, v23, vcc_lo
; GFX11-TRUE16-NEXT: v_add3_u32 v13, v13, v16, 0x7fff
; GFX11-TRUE16-NEXT: v_bfe_u32 v23, v21, 16, 1
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v16, v16
; GFX11-TRUE16-NEXT: v_bfe_u32 v24, v14, 16, 1
; GFX11-TRUE16-NEXT: v_add_f32_e32 v15, 0x40c00000, v15
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v49, 0x400000, v21
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v38, 0x400000, v21
; GFX11-TRUE16-NEXT: v_add3_u32 v23, v23, v21, 0x7fff
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v83, v13, v25, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v82, v13, v25, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v21, v21
; GFX11-TRUE16-NEXT: v_add3_u32 v19, v24, v14, 0x7fff
; GFX11-TRUE16-NEXT: v_or_b32_e32 v24, 0x400000, v14
-; GFX11-TRUE16-NEXT: v_bfe_u32 v48, v15, 16, 1
+; GFX11-TRUE16-NEXT: v_bfe_u32 v37, v15, 16, 1
; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, 0x400000, v15
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v87, v23, v49, vcc_lo
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v85, v23, v38, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v14, v14
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v85.h
-; GFX11-TRUE16-NEXT: v_add3_u32 v13, v48, v15, 0x7fff
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v70.h
-; GFX11-TRUE16-NEXT: v_bfi_b32 v14, 0xffff, v22, v67
-; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v86, v19, v24, vcc_lo
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v21.l, v86.h
+; GFX11-TRUE16-NEXT: v_add3_u32 v13, v37, v15, 0x7fff
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v23.l, v71.h
+; GFX11-TRUE16-NEXT: v_bfi_b32 v14, 0xffff, v22, v66
+; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v81, v19, v24, vcc_lo
; GFX11-TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v15, v15
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v83.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v87.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v82.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v85.h
; GFX11-TRUE16-NEXT: v_bfi_b32 v11, 0xffff, v23, v11
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 24, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 24, v14
; GFX11-TRUE16-NEXT: v_cndmask_b32_e32 v13, v13, v16, vcc_lo
-; GFX11-TRUE16-NEXT: v_bfi_b32 v16, 0xffff, v19, v86
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v54, 8, v14
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v68, 8, v11
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_bfi_b32 v16, 0xffff, v19, v81
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v53, 8, v14
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v69, 8, v11
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v84, 8, v9
; GFX11-TRUE16-NEXT: v_bfi_b32 v15, 0xffff, v15, v13
; GFX11-TRUE16-NEXT: v_bfi_b32 v13, 0xffff, v21, v7
; GFX11-TRUE16-NEXT: v_bfi_b32 v7, 0xffff, v18, v17
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v48, 24, v16
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v49, 8, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v37, 24, v16
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v38, 8, v16
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[17:18], 24, v[15:16]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[18:19], 24, v[13:14]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[19:20], 24, v[11:12]
@@ -85788,142 +86111,159 @@ define <64 x i8> @bitcast_v32bf16_to_v64i8(<32 x bfloat> %a, i32 %b) {
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[22:23], 24, v[5:6]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[23:24], 24, v[3:4]
; GFX11-TRUE16-NEXT: v_lshrrev_b64 v[24:25], 24, v[1:2]
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v50, 8, v15
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v51, 8, v15
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v64, 8, v13
-; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v96, 8, v7
+; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v97, 8, v7
; GFX11-TRUE16-NEXT: .LBB108_4: ; %end
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v28.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v113.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v27.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v112.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v24.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v27.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v1.l, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v26.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v112.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v103.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v26.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v2.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v2.h, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v103.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v24.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v32.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v1.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v102.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.l, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v30.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v101.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v24
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v5.l
; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v3.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v24, v1
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v2.l, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v3.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v31.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v102.l
-; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v3.h, v4.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v101.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v24, v2
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v30.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v3.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v24.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v3.l, v2.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v8, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v6
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v36.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v3.h, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v29.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v99.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v100.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v24, v3
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v4.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v5.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v34.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v99.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v5.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v3.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v8, v24
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v98.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v24, v4
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v33.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v5.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v32.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v97.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v24, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v6.l, v6.h
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v7.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v38.h
+; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v4.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v33.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v97.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v6.l, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v49.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v8, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v5.l, v5.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v6.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v31.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v96.l
-; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v7.h, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v84.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v24, v6
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v36.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v7.l, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v35.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v82.l
-; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v20.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v24, v7
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v8.l, v8.h
-; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v9.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v55.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v81.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.h, v9.h, v10.h
+; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v6.h, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v10, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v6.l, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v35.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v87.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v10, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v10, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v7.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v65.h
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v7.h, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v84.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v69.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v10, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v83.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v34.h
+; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v39.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 8, v80.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v24, v8
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v39.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v9.l, v10.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v37.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v71.l
-; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v11.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v19.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v24, v9
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v10.l, v10.h
-; GFX11-TRUE16-NEXT: v_or_b16 v10.h, v11.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v70.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v68.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.h, v11.h, v12.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v66.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v24, v10
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v53.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v11.l, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v24.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v8.l, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v8.h, v10.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v71.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v9.l, v9.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v48.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.l, 8, v70.l
+; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v10.h, v11.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v14, v24
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v67.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v14, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v10.l, v12.l
+; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v54.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v11.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.l, 0xff, v11.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v64.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v12.h, 0xff, v86.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v10, v14, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v16
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v11.l, v11.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v12.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v52.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v65.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v55.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v12.h, v13.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v16, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v19
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v18.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v11, v24, v11
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v12.l, v12.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v13.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v85.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v64.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v13.h, v14.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.h, 8, v54.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v24, v12
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v67.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v13.l, v14.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v69.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v51.l
-; GFX11-TRUE16-NEXT: v_and_b16 v15.h, 0xff, v15.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v24, v13
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v14.l, v14.h
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v15.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v87.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v12.l, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v13.l, 0xff, v13.h
+; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v68.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v14.l, 8, v53.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v12, v16, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v16, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v13.l, v14.h
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v85.h
+; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v13.h, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.l, 8, v51.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v13, v16, v24
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.l, 8, v50.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v15.h, v16.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v49.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v24, v14
-; GFX11-TRUE16-NEXT: v_and_b16 v17.l, 0xff, v86.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v15.l, v16.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.l, v24.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v83.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 8, v48.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v14.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.l, 0xff, v66.h
+; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v14.h, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, 0xff, v82.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v38.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v18, 0xffff, v18
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v14.l, v16.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v19.l, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v15.l, 0xff, v15.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v15.h, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v14.h, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v14, v18, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v19
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 8, v37.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v15.l, v15.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v18.l, v16.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, 0xff, v81.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v24, v15
-; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v16.l, v16.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v17.l, v17.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v16.l, v24.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v24, v16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v15, v17, v24
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v17, 0xffff, v18
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v16.l, v16.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v16, v17, v24
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[1:4], off
; GFX11-TRUE16-NEXT: scratch_store_b128 v0, v[5:8], off offset:16
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll
index 4c485768bcbbf..6fe66655de3d6 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll
@@ -3065,12 +3065,13 @@ define i64 @bitcast_v8i8_to_i64(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -3084,53 +3085,61 @@ define i64 @bitcast_v8i8_to_i64(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB26_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB26_2
; GFX11-TRUE16-NEXT: .LBB26_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -6205,12 +6214,13 @@ define double @bitcast_v8i8_to_f64(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -6224,53 +6234,61 @@ define double @bitcast_v8i8_to_f64(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB50_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB50_2
; GFX11-TRUE16-NEXT: .LBB50_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -9045,12 +9063,13 @@ define <2 x i32> @bitcast_v8i8_to_v2i32(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -9064,53 +9083,61 @@ define <2 x i32> @bitcast_v8i8_to_v2i32(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB70_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB70_2
; GFX11-TRUE16-NEXT: .LBB70_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -11576,12 +11603,13 @@ define <2 x float> @bitcast_v8i8_to_v2f32(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -11595,53 +11623,61 @@ define <2 x float> @bitcast_v8i8_to_v2f32(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB86_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB86_2
; GFX11-TRUE16-NEXT: .LBB86_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -13793,12 +13829,13 @@ define <4 x i16> @bitcast_v8i8_to_v4i16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -13812,53 +13849,61 @@ define <4 x i16> @bitcast_v8i8_to_v4i16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB98_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB98_2
; GFX11-TRUE16-NEXT: .LBB98_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -15610,12 +15655,13 @@ define <4 x half> @bitcast_v8i8_to_v4f16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -15629,53 +15675,61 @@ define <4 x half> @bitcast_v8i8_to_v4f16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB106_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB106_2
; GFX11-TRUE16-NEXT: .LBB106_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -16912,12 +16966,13 @@ define <4 x bfloat> @bitcast_v8i8_to_v4bf16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v5.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v7.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v8
@@ -16931,53 +16986,61 @@ define <4 x bfloat> @bitcast_v8i8_to_v4bf16(<8 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
; GFX11-TRUE16-NEXT: .LBB110_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v3.h
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.h, v4.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v0.l, v2.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.l, v3.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v1.h, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v5.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v5, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v2.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr2_lo16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v4
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB110_2
; GFX11-TRUE16-NEXT: .LBB110_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v5.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v3.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v2.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v3.l, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v4.h
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v3.h, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.l, v1.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.h, 0x300, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v4.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v3.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v4.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.h, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v2, v5
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll
index 879e8520d8e18..e5245f7bd71d3 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll
@@ -1102,16 +1102,15 @@ define <3 x i32> @bitcast_v12i8_to_v3i32(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v11.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v12
@@ -1126,74 +1125,80 @@ define <3 x i32> @bitcast_v12i8_to_v3i32(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB6_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v0.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v4.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v3.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v5
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v1.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v2.l, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v7.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v2.l, v3.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v7
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB6_2
; GFX11-TRUE16-NEXT: .LBB6_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v7.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v6.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v5.h, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v5.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v5.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v4.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v3.h, v0.l
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v3.h, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v1
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v7
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -4236,16 +4241,15 @@ define <3 x float> @bitcast_v12i8_to_v3f32(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v8.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.h, 8, v8.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v11.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v12
@@ -4260,74 +4264,80 @@ define <3 x float> @bitcast_v12i8_to_v3f32(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB22_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v7.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v5.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v5.l
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v4.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.h
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v0.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v5.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v4.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v4, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v3.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v5
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v1.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v3.h
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v7.l, v2.l, v3.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v7.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v2.l, v3.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr3_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v7
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB22_2
; GFX11-TRUE16-NEXT: .LBB22_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v7.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v7.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v6.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v6.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v6.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v5.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v5.h, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v7.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v5.h, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v5.l, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v4.l, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v4.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v7, v5
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v7.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v4.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v3.h, v0.l
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v3.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v3.h, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v7, v1
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v7.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v7, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v4, v7
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v7
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -6875,16 +6885,16 @@ define <6 x bfloat> @bitcast_v12i8_to_v6bf16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v9.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v11.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v12
@@ -6899,74 +6909,80 @@ define <6 x bfloat> @bitcast_v12i8_to_v6bf16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB36_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v1.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v4.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v2.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v4.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v5
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v2.l, v4.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v7
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB36_2
; GFX11-TRUE16-NEXT: .LBB36_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v8.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v6.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v6.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.h, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.h, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v4.h, v0.l
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v4.h, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v1
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v9
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -8635,16 +8651,16 @@ define <6 x half> @bitcast_v12i8_to_v6f16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v9.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v11.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v12
@@ -8659,74 +8675,80 @@ define <6 x half> @bitcast_v12i8_to_v6f16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB40_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v1.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v4.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v2.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v4.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v5
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v2.l, v4.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v7
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB40_2
; GFX11-TRUE16-NEXT: .LBB40_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v8.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v6.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v6.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.h, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.h, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v4.h, v0.l
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v4.h, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v1
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v9
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -10043,16 +10065,16 @@ define <6 x i16> @bitcast_v12i8_to_v6i16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v9.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v5.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.h, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.h, v2.l
; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v10.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v9.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v5.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v10.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v9.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v11.l
; GFX11-TRUE16-NEXT: s_mov_b32 s0, exec_lo
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
; GFX11-TRUE16-NEXT: v_cmpx_ne_u32_e32 0, v12
@@ -10067,74 +10089,80 @@ define <6 x i16> @bitcast_v12i8_to_v6i16(<12 x i8> %a, i32 %b) {
; GFX11-TRUE16-NEXT: .LBB44_3: ; %cmp.false
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v8.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v7.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v6.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v6.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v1.l, v5.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v7.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v7.l
+; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v5.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v10.l
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr9_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr8_lo16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr10_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v0
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v1.h, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v2.h, v4.h
-; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v5
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v2.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v1.l, v4.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v5
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr5_hi16
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v6, v7
+; GFX11-TRUE16-NEXT: v_or_b16 v7.h, v2.l, v4.l
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_lo16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr6_hi16
; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr4_lo16
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v7
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_hi16
+; GFX11-TRUE16-NEXT: ; implicit-def: $vgpr7_lo16
; GFX11-TRUE16-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX11-TRUE16-NEXT: s_cbranch_execz .LBB44_2
; GFX11-TRUE16-NEXT: .LBB44_4: ; %cmp.true
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v9.l, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v8.h, 3
; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, v7.h, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v7.l, 3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, v8.l, 3
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v6.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.l, v8.l, 3
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v6.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.h, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v3.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, v10.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v6.h, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v5.l, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x300, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v5.h, v1.h
-; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v2.h, 0xff, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v6
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v1.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.h, 0x300, v1.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.l, v3.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.l, 0
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x300, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v0.h
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v1.l, 0x300, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v4.h, v0.l
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v4.h, v2.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v1
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v3.l, 0x300, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v2.h, 0x300, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v3.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v5, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v1.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v3, v9
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v9.h, 0x300, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v9
; GFX11-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-llvm-debuginfo-analyzer.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-llvm-debuginfo-analyzer.ll
index d6922bc09ff0a..89fc6c062c29d 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-llvm-debuginfo-analyzer.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-llvm-debuginfo-analyzer.ll
@@ -1,4 +1,3 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc %s -o %t.o -mcpu=gfx1030 -filetype=obj -O0
; RUN: llvm-debuginfo-analyzer %t.o --print=all --attribute=all | FileCheck %s
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 1d3368b036d0d..4cc39d93854a0 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -9022,12 +9022,13 @@ define amdgpu_kernel void @uniform_or_i16(ptr addrspace(1) %result, ptr addrspac
; GFX1164-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s7, v0
; GFX1164-TRUE16-NEXT: .LBB15_2:
; GFX1164-TRUE16-NEXT: s_or_b64 exec, exec, s[4:5]
-; GFX1164-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1164-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1164-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX1164-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1164-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1164-TRUE16-NEXT: v_cndmask_b16 v0.l, s6, 0, vcc
+; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1164-TRUE16-NEXT: v_or_b16 v0.l, s2, v0.l
; GFX1164-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1164-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
@@ -9100,12 +9101,13 @@ define amdgpu_kernel void @uniform_or_i16(ptr addrspace(1) %result, ptr addrspac
; GFX1132-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s6, v0
; GFX1132-TRUE16-NEXT: .LBB15_2:
; GFX1132-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s5
-; GFX1132-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1132-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1132-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX1132-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1132-TRUE16-NEXT: v_cndmask_b16 v0.l, s4, 0, vcc_lo
+; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_or_b16 v0.l, s2, v0.l
; GFX1132-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1132-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
@@ -9178,12 +9180,13 @@ define amdgpu_kernel void @uniform_or_i16(ptr addrspace(1) %result, ptr addrspac
; GFX1264-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s7, v0
; GFX1264-TRUE16-NEXT: .LBB15_2:
; GFX1264-TRUE16-NEXT: s_or_b64 exec, exec, s[4:5]
-; GFX1264-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1264-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1264-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX1264-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1264-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1264-TRUE16-NEXT: v_cndmask_b16 v0.l, s6, 0, vcc
+; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1264-TRUE16-NEXT: v_or_b16 v0.l, s2, v0.l
; GFX1264-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1264-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], null
@@ -9256,12 +9259,13 @@ define amdgpu_kernel void @uniform_or_i16(ptr addrspace(1) %result, ptr addrspac
; GFX1232-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s6, v0
; GFX1232-TRUE16-NEXT: .LBB15_2:
; GFX1232-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s5
-; GFX1232-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1232-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1232-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX1232-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1232-TRUE16-NEXT: v_cndmask_b16 v0.l, s4, 0, vcc_lo
+; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_or_b16 v0.l, s2, v0.l
; GFX1232-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1232-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], null
@@ -9658,11 +9662,12 @@ define amdgpu_kernel void @uniform_add_i16(ptr addrspace(1) %result, ptr addrspa
; GFX1164-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s11, v2
; GFX1164-TRUE16-NEXT: .LBB16_4: ; %Flow
; GFX1164-TRUE16-NEXT: s_or_b64 exec, exec, s[8:9]
-; GFX1164-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1164-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1164-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX1164-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1164-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
+; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1164-TRUE16-NEXT: v_mad_u16 v0.l, s10, v4.l, s2
; GFX1164-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1164-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
@@ -9784,11 +9789,12 @@ define amdgpu_kernel void @uniform_add_i16(ptr addrspace(1) %result, ptr addrspa
; GFX1132-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s2, v2
; GFX1132-TRUE16-NEXT: .LBB16_4: ; %Flow
; GFX1132-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s9
-; GFX1132-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1132-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1132-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX1132-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
+; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_mad_u16 v0.l, s8, v4.l, s2
; GFX1132-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1132-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
@@ -9910,12 +9916,13 @@ define amdgpu_kernel void @uniform_add_i16(ptr addrspace(1) %result, ptr addrspa
; GFX1264-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s11, v2
; GFX1264-TRUE16-NEXT: .LBB16_4: ; %Flow
; GFX1264-TRUE16-NEXT: s_or_b64 exec, exec, s[8:9]
-; GFX1264-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1264-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1264-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX1264-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1264-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1264-TRUE16-NEXT: s_wait_alu 0xf1ff
+; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1264-TRUE16-NEXT: v_mad_u16 v0.l, s10, v4.l, s2
; GFX1264-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1264-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], null
@@ -10041,12 +10048,13 @@ define amdgpu_kernel void @uniform_add_i16(ptr addrspace(1) %result, ptr addrspa
; GFX1232-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s2, v2
; GFX1232-TRUE16-NEXT: .LBB16_4: ; %Flow
; GFX1232-TRUE16-NEXT: s_or_b32 exec_lo, exec_lo, s9
-; GFX1232-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1232-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1232-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX1232-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_readfirstlane_b32 s2, v0
; GFX1232-TRUE16-NEXT: s_wait_alu 0xf1ff
+; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_mad_u16 v0.l, s8, v4.l, s2
; GFX1232-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX1232-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], null
@@ -10726,15 +10734,15 @@ define amdgpu_kernel void @uniform_fadd_f16(ptr addrspace(1) %result, ptr addrsp
; GFX1164-TRUE16-NEXT: s_mov_b64 s[2:3], 0
; GFX1164-TRUE16-NEXT: .LBB18_1: ; %atomicrmw.start
; GFX1164-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1164-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s9, v1
-; GFX1164-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1164-TRUE16-NEXT: v_add_f16_e32 v0.l, s8, v0.l
; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1164-TRUE16-NEXT: v_lshlrev_b32_e32 v0, s9, v0
+; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX1164-TRUE16-NEXT: v_and_or_b32 v0, v1, s10, v0
; GFX1164-TRUE16-NEXT: v_mov_b32_e32 v3, v1
-; GFX1164-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1164-TRUE16-NEXT: v_mov_b32_e32 v2, v0
; GFX1164-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[2:3], off, s[4:7], 0 glc
; GFX1164-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -10820,14 +10828,14 @@ define amdgpu_kernel void @uniform_fadd_f16(ptr addrspace(1) %result, ptr addrsp
; GFX1132-TRUE16-NEXT: s_mov_b32 s6, -1
; GFX1132-TRUE16-NEXT: .LBB18_1: ; %atomicrmw.start
; GFX1132-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s2, v1
-; GFX1132-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1132-TRUE16-NEXT: v_add_f16_e32 v0.l, s8, v0.l
; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1132-TRUE16-NEXT: v_lshlrev_b32_e32 v0, s2, v0
+; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_and_or_b32 v0, v1, s3, v0
-; GFX1132-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1132-TRUE16-NEXT: v_dual_mov_b32 v3, v1 :: v_dual_mov_b32 v2, v0
; GFX1132-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[2:3], off, s[4:7], 0 glc
; GFX1132-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -10912,15 +10920,15 @@ define amdgpu_kernel void @uniform_fadd_f16(ptr addrspace(1) %result, ptr addrsp
; GFX1264-TRUE16-NEXT: s_mov_b64 s[2:3], 0
; GFX1264-TRUE16-NEXT: .LBB18_1: ; %atomicrmw.start
; GFX1264-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1264-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s9, v1
-; GFX1264-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1264-TRUE16-NEXT: v_add_f16_e32 v0.l, s8, v0.l
; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1264-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1264-TRUE16-NEXT: v_lshlrev_b32_e32 v0, s9, v0
+; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX1264-TRUE16-NEXT: v_and_or_b32 v0, v1, s10, v0
; GFX1264-TRUE16-NEXT: v_mov_b32_e32 v3, v1
-; GFX1264-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1264-TRUE16-NEXT: v_mov_b32_e32 v2, v0
; GFX1264-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[2:3], off, s[4:7], null th:TH_ATOMIC_RETURN scope:SCOPE_SYS
; GFX1264-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -11006,14 +11014,14 @@ define amdgpu_kernel void @uniform_fadd_f16(ptr addrspace(1) %result, ptr addrsp
; GFX1232-TRUE16-NEXT: s_mov_b32 s6, -1
; GFX1232-TRUE16-NEXT: .LBB18_1: ; %atomicrmw.start
; GFX1232-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
-; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_lshrrev_b32_e32 v0, s2, v1
-; GFX1232-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX1232-TRUE16-NEXT: v_add_f16_e32 v0.l, s8, v0.l
; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1232-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1232-TRUE16-NEXT: v_lshlrev_b32_e32 v0, s2, v0
+; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_and_or_b32 v0, v1, s3, v0
-; GFX1232-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1232-TRUE16-NEXT: v_dual_mov_b32 v3, v1 :: v_dual_mov_b32 v2, v0
; GFX1232-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[2:3], off, s[4:7], null th:TH_ATOMIC_RETURN scope:SCOPE_SYS
; GFX1232-TRUE16-NEXT: s_wait_loadcnt 0x0
diff --git a/llvm/test/CodeGen/AMDGPU/bf16.ll b/llvm/test/CodeGen/AMDGPU/bf16.ll
index 10e523d1a0cf1..505ddc8c3b575 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16.ll
@@ -37774,10 +37774,9 @@ define bfloat @v_uitofp_i16_to_bf16(i16 %x) {
; GFX11TRUE16-LABEL: v_uitofp_i16_to_bf16:
; GFX11TRUE16: ; %bb.0:
; GFX11TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
+; GFX11TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11TRUE16-NEXT: v_cvt_f32_u32_e32 v0, v1
+; GFX11TRUE16-NEXT: v_cvt_f32_u32_e32 v0, v0
; GFX11TRUE16-NEXT: v_bfe_u32 v1, v0, 16, 1
; GFX11TRUE16-NEXT: v_or_b32_e32 v2, 0x400000, v0
; GFX11TRUE16-NEXT: v_cmp_u_f32_e32 vcc_lo, v0, v0
@@ -40751,11 +40750,12 @@ define amdgpu_ps i32 @s_select_bf16(bfloat inreg %a, bfloat inreg %b, i32 %c) {
;
; GFX11TRUE16-LABEL: s_select_bf16:
; GFX11TRUE16: ; %bb.0:
-; GFX11TRUE16-NEXT: v_mov_b16_e32 v1.l, s0
; GFX11TRUE16-NEXT: v_cmp_eq_u32_e32 vcc_lo, 0, v0
-; GFX11TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
-; GFX11TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11TRUE16-NEXT: v_cndmask_b16 v0.l, s1, v1.l, vcc_lo
+; GFX11TRUE16-NEXT: v_mov_b16_e32 v0.l, s0
+; GFX11TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11TRUE16-NEXT: v_cndmask_b16 v0.l, s1, v0.l, vcc_lo
+; GFX11TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11TRUE16-NEXT: v_readfirstlane_b32 s0, v0
; GFX11TRUE16-NEXT: ; return to shader part epilog
;
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
index 0ceb9019eb990..f4b432dce8c8a 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
@@ -3443,14 +3443,15 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v1.l, v1.l, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -3568,13 +3569,14 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v1.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -3882,14 +3884,15 @@ define void @buffer_fat_ptr_agent_atomic_fadd_noret_f16__offset__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v1.l, v1.l, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -4004,13 +4007,14 @@ define void @buffer_fat_ptr_agent_atomic_fadd_noret_f16__offset__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v1.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -4324,14 +4328,15 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
; GFX12-TRUE16-NEXT: ; Child Loop BB15_4 Depth 2
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v6, v4, v7
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v6.h, 0
; GFX12-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v6.l, v6.l, v5.l
-; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v6, v4, v6
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v6, v4, v6
; GFX12-TRUE16-NEXT: v_and_or_b32 v6, v7, v11, v6
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v9, v7 :: v_dual_mov_b32 v8, v6
; GFX12-TRUE16-NEXT: .LBB15_4: ; Parent Loop BB15_3 Depth=1
; GFX12-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
@@ -4551,14 +4556,15 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
; GFX11-TRUE16-NEXT: ; Child Loop BB15_4 Depth 2
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v6, v4, v7
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.h, 0
; GFX11-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v6.l, v6.l, v5.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v6, v4, v6
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v6, v4, v6
; GFX11-TRUE16-NEXT: v_and_or_b32 v6, v7, v11, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v9, v7 :: v_dual_mov_b32 v8, v6
; GFX11-TRUE16-NEXT: .LBB15_4: ; Parent Loop BB15_3 Depth=1
; GFX11-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
index cad4c39eaf39f..6f1675edbe58a 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
@@ -2512,16 +2512,16 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v1.l, v1.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v1.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -2640,19 +2640,20 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: buffer_load_b32 v2, v5, s[0:3], 0 offen
; GFX11-TRUE16-NEXT: s_not_b32 s6, s5
; GFX11-TRUE16-NEXT: s_mov_b32 s5, 0
+; GFX11-TRUE16-NEXT: .p2align 6
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v1.l, v1.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v1.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -2972,16 +2973,16 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f16__offset__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v1.l, v1.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v1.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -3097,19 +3098,20 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f16__offset__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: buffer_load_b32 v2, v3, s[0:3], 0 offen
; GFX11-TRUE16-NEXT: s_not_b32 s6, s5
; GFX11-TRUE16-NEXT: s_mov_b32 s5, 0
+; GFX11-TRUE16-NEXT: .p2align 6
; GFX11-TRUE16-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v1.l, v1.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v1.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -3435,16 +3437,16 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__waterfall__amdgpu
; GFX12-TRUE16-NEXT: ; Child Loop BB12_4 Depth 2
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v9, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v4.h, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v9, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v11, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v8, v6 :: v_dual_mov_b32 v7, v5
; GFX12-TRUE16-NEXT: .LBB12_4: ; Parent Loop BB12_3 Depth=1
; GFX12-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
@@ -3670,16 +3672,16 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__waterfall__amdgpu
; GFX11-TRUE16-NEXT: ; Child Loop BB12_4 Depth 2
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v9, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v4.h, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v9, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v11, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v8, v6 :: v_dual_mov_b32 v7, v5
; GFX11-TRUE16-NEXT: .LBB12_4: ; Parent Loop BB12_3 Depth=1
; GFX11-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll
index 6275afd2c6994..acb27be1846b9 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll
@@ -2512,16 +2512,16 @@ define half @buffer_fat_ptr_agent_atomic_fmin_ret_f16__offset__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v1.l, v1.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v1.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -2640,19 +2640,20 @@ define half @buffer_fat_ptr_agent_atomic_fmin_ret_f16__offset__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: buffer_load_b32 v2, v5, s[0:3], 0 offen
; GFX11-TRUE16-NEXT: s_not_b32 s6, s5
; GFX11-TRUE16-NEXT: s_mov_b32 s5, 0
+; GFX11-TRUE16-NEXT: .p2align 6
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v1.l, v1.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v1.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_mov_b32 v3, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[3:4], v5, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -2972,16 +2973,16 @@ define void @buffer_fat_ptr_agent_atomic_fmin_noret_f16__offset__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v1.l, v1.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v1.l, v0.h, v0.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
; GFX12-TRUE16-NEXT: s_wait_alu 0xfffe
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX12-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
@@ -3097,19 +3098,20 @@ define void @buffer_fat_ptr_agent_atomic_fmin_noret_f16__offset__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: buffer_load_b32 v2, v3, s[0:3], 0 offen
; GFX11-TRUE16-NEXT: s_not_b32 s6, s5
; GFX11-TRUE16-NEXT: s_mov_b32 s5, 0
+; GFX11-TRUE16-NEXT: .p2align 6
; GFX11-TRUE16-NEXT: .LBB11_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v1, s4, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v1.l, v1.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v1.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, s4, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, v2, s6, v1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_mov_b32 v4, v1
; GFX11-TRUE16-NEXT: buffer_atomic_cmpswap_b32 v[4:5], v3, s[0:3], 0 offen glc
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
@@ -3435,16 +3437,16 @@ define half @buffer_fat_ptr_agent_atomic_fmin_ret_f16__offset__waterfall__amdgpu
; GFX12-TRUE16-NEXT: ; Child Loop BB12_4 Depth 2
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v9, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v4.h, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v9, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v11, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_dual_mov_b32 v8, v6 :: v_dual_mov_b32 v7, v5
; GFX12-TRUE16-NEXT: .LBB12_4: ; Parent Loop BB12_3 Depth=1
; GFX12-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
@@ -3670,16 +3672,16 @@ define half @buffer_fat_ptr_agent_atomic_fmin_ret_f16__offset__waterfall__amdgpu
; GFX11-TRUE16-NEXT: ; Child Loop BB12_4 Depth 2
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v9, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: s_mov_b32 s2, exec_lo
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v4.h, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v9, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v11, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v8, v6 :: v_dual_mov_b32 v7, v5
; GFX11-TRUE16-NEXT: .LBB12_4: ; Parent Loop BB12_3 Depth=1
; GFX11-TRUE16-NEXT: ; => This Inner Loop Header: Depth=2
diff --git a/llvm/test/CodeGen/AMDGPU/calling-conventions.ll b/llvm/test/CodeGen/AMDGPU/calling-conventions.ll
index 2db7b28c7de97..ff80250bfc880 100644
--- a/llvm/test/CodeGen/AMDGPU/calling-conventions.ll
+++ b/llvm/test/CodeGen/AMDGPU/calling-conventions.ll
@@ -2745,15 +2745,6 @@ define amdgpu_cs void @amdgpu_cs_v32i1(<32 x i1> %arg0) {
;
; GFX11-TRUE16-LABEL: amdgpu_cs_v32i1:
; GFX11-TRUE16: ; %bb.0:
-; GFX11-TRUE16-NEXT: v_and_b16 v26.l, v26.l, 1
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 1, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v24.l, v24.l, 1
-; GFX11-TRUE16-NEXT: v_and_b16 v20.h, v22.l, 1
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 1, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v20.l, v20.l, 1
-; GFX11-TRUE16-NEXT: v_and_b16 v17.h, v18.l, 1
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 1, v17.l
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, v16.l, 1
; GFX11-TRUE16-NEXT: v_and_b16 v10.l, v10.l, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 1, v9.l
; GFX11-TRUE16-NEXT: v_and_b16 v8.l, v8.l, 1
@@ -2763,18 +2754,6 @@ define amdgpu_cs void @amdgpu_cs_v32i1(<32 x i1> %arg0) {
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, v2.l, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 1, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, v0.l, 1
-; GFX11-TRUE16-NEXT: v_and_b16 v30.l, v30.l, 1
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 1, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v28.l, v28.l, 1
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 3, v27.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v26.l, 2, v26.l
-; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v24.l, v25.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 3, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v20.h, 2, v20.h
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v20.l, v21.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.h, 3, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 2, v17.h
-; GFX11-TRUE16-NEXT: v_or_b16 v16.l, v16.l, v17.l
; GFX11-TRUE16-NEXT: v_and_b16 v14.l, v14.l, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 1, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, v12.l, 1
@@ -2787,15 +2766,15 @@ define amdgpu_cs void @amdgpu_cs_v32i1(<32 x i1> %arg0) {
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 3, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 2, v1.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 3, v31.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.h, 2, v30.l
-; GFX11-TRUE16-NEXT: v_or_b16 v28.l, v28.l, v29.l
-; GFX11-TRUE16-NEXT: v_or_b16 v25.h, v25.h, v26.l
-; GFX11-TRUE16-NEXT: v_and_b16 v21.h, v22.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v18.l, v22.l, v20.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.h, v16.h, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v17.h, v18.h, v17.h
-; GFX11-TRUE16-NEXT: v_and_b16 v16.l, v16.l, 3
+; GFX11-TRUE16-NEXT: v_and_b16 v25.h, v26.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.l, 1, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, v24.l, 1
+; GFX11-TRUE16-NEXT: v_and_b16 v22.l, v22.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v21.l, 1, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v20.l, v20.l, 1
+; GFX11-TRUE16-NEXT: v_and_b16 v18.l, v18.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 1, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v16.l, v16.l, 1
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 3, v15.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.h, 2, v14.l
; GFX11-TRUE16-NEXT: v_or_b16 v8.h, v12.l, v13.l
@@ -2805,42 +2784,65 @@ define amdgpu_cs void @amdgpu_cs_v32i1(<32 x i1> %arg0) {
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, v0.h, 3
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v2.h, v1.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, v0.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v26.h, v28.h, v29.h
-; GFX11-TRUE16-NEXT: v_and_b16 v24.h, v28.l, 3
-; GFX11-TRUE16-NEXT: v_or_b16 v19.l, v21.h, v25.h
-; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v16.h, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.l, v17.h
+; GFX11-TRUE16-NEXT: v_and_b16 v30.l, v30.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v29.l, 1, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v28.l, v28.l, 1
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.h, 3, v27.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v25.h, 2, v25.h
+; GFX11-TRUE16-NEXT: v_or_b16 v24.l, v24.l, v25.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v23.l, 3, v23.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v22.l, 2, v22.l
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v20.l, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.h, 3, v19.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v18.l, 2, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v15.h, v16.l, v17.l
; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v12.h, v10.h
; GFX11-TRUE16-NEXT: v_and_b16 v3.l, v8.h, 3
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.l, v6.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v0.h, v2.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.h
-; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v24.h, v26.h
-; GFX11-TRUE16-NEXT: v_and_b16 v14.h, v19.l, 15
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v16.h, 4, v16.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, v15.h, 15
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v30.h, 3, v31.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v28.h, 2, v30.l
+; GFX11-TRUE16-NEXT: v_or_b16 v24.h, v28.l, v29.l
+; GFX11-TRUE16-NEXT: v_or_b16 v22.h, v22.h, v25.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.l, v24.l, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v18.h, v23.l, v22.l
+; GFX11-TRUE16-NEXT: v_and_b16 v14.h, v16.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v16.h, v17.h, v18.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, v15.h, 3
; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v3.l, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.l, v1.l, 15
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 4, v0.h
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, v0.l, 15
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v17.l, 12, v24.h
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v14.h
+; GFX11-TRUE16-NEXT: v_or_b16 v28.h, v30.h, v28.h
+; GFX11-TRUE16-NEXT: v_and_b16 v24.h, v24.h, 3
+; GFX11-TRUE16-NEXT: v_or_b16 v20.h, v24.l, v22.h
+; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v14.h, v18.h
; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v16.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 12, v2.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v17.l, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v1.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v23.h, v24.h, v28.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, v20.h, 15
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 4, v2.h
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, v1.h, 15
; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v2.l, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 12, v23.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v1.h, v1.h, v2.h
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.h, v0.h
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v2.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: global_store_b32 v[0:1], v0, off
; GFX11-TRUE16-NEXT: s_endpgm
;
diff --git a/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll b/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
index ccdc0b1bf43c4..b9caf8e80bcdf 100644
--- a/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
+++ b/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
@@ -1561,10 +1561,10 @@ define amdgpu_kernel void @v_no_clamp_add_src_v2f16_f16_src(ptr addrspace(1) %ou
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v1, 2, v1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: global_load_d16_b16 v0, v0, s[2:3]
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v0.l, 1.0, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: v_pk_max_f16 v0, v0, v0 clamp
; GFX11-TRUE16-NEXT: global_store_b32 v1, v0, s[0:1]
; GFX11-TRUE16-NEXT: s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll b/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
index 26f204f29f5a4..b5bc09a1684ee 100644
--- a/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
+++ b/llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
@@ -946,9 +946,9 @@ define double @v_uitofp_i8_to_f64(i8 %arg0) nounwind {
; GFX11-TRUE16-LABEL: v_uitofp_i8_to_f64:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: v_cvt_f64_u32_e32 v[0:1], v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1770,38 +1770,40 @@ define amdgpu_kernel void @load_v4i8_to_v4f32_2_uses(ptr addrspace(1) noalias %o
; GFX11-TRUE16-LABEL: load_v4i8_to_v4f32_2_uses:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_load_b64 s[0:1], s[4:5], 0x34
-; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0x3ff, v0
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_dual_mov_b32 v5, 0 :: v_dual_and_b32 v0, 0x3ff, v0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v0, 2, v0
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v5.h
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: global_load_b32 v4, v0, s[0:1]
; GFX11-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v4.l, 9
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 9
-; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff00, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff00, v4.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff00, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff00, v4.h
; GFX11-TRUE16-NEXT: v_cvt_f32_ubyte3_e32 v3, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.h, v4.h, 9
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, 0x900, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v0.h
-; GFX11-TRUE16-NEXT: v_cvt_f32_ubyte2_e32 v2, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.l, v0.h
; GFX11-TRUE16-NEXT: v_cvt_f32_ubyte1_e32 v1, v4
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v5.l, 0x900, v0.l
-; GFX11-TRUE16-NEXT: v_add_nc_u16 v7.h, 0x900, v0.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_add_nc_u16 v6.h, 0x900, v0.l
+; GFX11-TRUE16-NEXT: v_cvt_f32_ubyte2_e32 v2, v4
; GFX11-TRUE16-NEXT: v_cvt_f32_ubyte0_e32 v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v5, v7
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v7, v6
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: global_store_b128 v6, v[0:3], s[0:1]
-; GFX11-TRUE16-NEXT: global_store_b32 v6, v4, s[2:3]
+; GFX11-TRUE16-NEXT: global_store_b128 v5, v[0:3], s[0:1]
+; GFX11-TRUE16-NEXT: global_store_b32 v5, v4, s[2:3]
; GFX11-TRUE16-NEXT: s_endpgm
;
; GFX11-FAKE16-LABEL: load_v4i8_to_v4f32_2_uses:
diff --git a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
index c5db7a33f70e0..b0439b1f7968f 100644
--- a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
+++ b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
@@ -2536,13 +2536,12 @@ define void @test_dynamic_stackalloc_device_divergent_non_standard_size_i16(i16
; GFX11-SDAG-LABEL: test_dynamic_stackalloc_device_divergent_non_standard_size_i16:
; GFX11-SDAG: ; %bb.0:
; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-SDAG-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-SDAG-NEXT: v_mov_b16_e32 v1.l, v0.l
+; GFX11-SDAG-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-SDAG-NEXT: s_mov_b32 s4, s33
; GFX11-SDAG-NEXT: s_mov_b32 s1, exec_lo
; GFX11-SDAG-NEXT: s_mov_b32 s0, 0
; GFX11-SDAG-NEXT: s_mov_b32 s33, s32
-; GFX11-SDAG-NEXT: v_lshl_add_u32 v0, v1, 2, 15
+; GFX11-SDAG-NEXT: v_lshl_add_u32 v0, v0, 2, 15
; GFX11-SDAG-NEXT: s_add_i32 s32, s32, 16
; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-NEXT: v_and_b32_e32 v0, 0x7fff0, v0
diff --git a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll
index 22dd66118837f..8c7d5cffe39d9 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll
@@ -8410,12 +8410,13 @@ define half @flat_agent_atomic_fadd_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8528,12 +8529,13 @@ define half @flat_agent_atomic_fadd_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -8783,12 +8785,13 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8905,12 +8908,13 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -9167,12 +9171,13 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -9290,12 +9295,13 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -9551,11 +9557,11 @@ define void @flat_agent_atomic_fadd_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -9665,11 +9671,11 @@ define void @flat_agent_atomic_fadd_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -9911,11 +9917,11 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10029,11 +10035,11 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -10282,11 +10288,11 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10401,11 +10407,11 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -10645,8 +10651,8 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10729,8 +10735,8 @@ define void @flat_agent_atomic_fadd_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -10919,9 +10925,10 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -11007,9 +11014,10 @@ define half @flat_agent_atomic_fadd_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -11212,12 +11220,13 @@ define half @flat_system_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -11336,12 +11345,13 @@ define half @flat_system_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -11600,11 +11610,11 @@ define void @flat_system_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -11720,11 +11730,11 @@ define void @flat_system_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
diff --git a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll
index 1dc45179c74ce..56ad91dd59ffb 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll
@@ -6043,14 +6043,14 @@ define half @flat_agent_atomic_fmax_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v2.h, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6168,14 +6168,14 @@ define half @flat_agent_atomic_fmax_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -6438,14 +6438,14 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6570,14 +6570,14 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -6847,14 +6847,14 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6980,14 +6980,14 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -7254,12 +7254,13 @@ define void @flat_agent_atomic_fmax_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v3.l, v3.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7375,12 +7376,13 @@ define void @flat_agent_atomic_fmax_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v3.l, v3.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -7636,12 +7638,13 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7764,12 +7767,13 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -8032,12 +8036,13 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8161,12 +8166,13 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -8418,11 +8424,11 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8513,11 +8519,11 @@ define half @flat_agent_atomic_fmax_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -8722,9 +8728,10 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8813,9 +8820,10 @@ define void @flat_agent_atomic_fmax_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -9027,14 +9035,14 @@ define half @flat_system_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -9161,14 +9169,14 @@ define half @flat_system_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -9440,12 +9448,13 @@ define void @flat_system_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -9570,12 +9579,13 @@ define void @flat_system_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
diff --git a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll
index 5d26293e7009b..f0083bd23660a 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll
@@ -6043,14 +6043,14 @@ define half @flat_agent_atomic_fmin_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v2.h, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6168,14 +6168,14 @@ define half @flat_agent_atomic_fmin_ret_f16__amdgpu_no_fine_grained_memory(ptr %
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -6438,14 +6438,14 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6570,14 +6570,14 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -6847,14 +6847,14 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6980,14 +6980,14 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_neg__amdgpu_no_fine_grain
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -7254,12 +7254,13 @@ define void @flat_agent_atomic_fmin_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v3.l, v3.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7375,12 +7376,13 @@ define void @flat_agent_atomic_fmin_noret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v3.l, v3.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -7636,12 +7638,13 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7764,12 +7767,13 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -8032,12 +8036,13 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8161,12 +8166,13 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -8418,11 +8424,11 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8513,11 +8519,11 @@ define half @flat_agent_atomic_fmin_ret_f16__offset12b_pos__align4__amdgpu_no_fi
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -8722,9 +8728,10 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8813,9 +8820,10 @@ define void @flat_agent_atomic_fmin_noret_f16__offset12b__align4_pos__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -9027,14 +9035,14 @@ define half @flat_system_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -9161,14 +9169,14 @@ define half @flat_system_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_grai
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
@@ -9440,12 +9448,13 @@ define void @flat_system_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -9570,12 +9579,13 @@ define void @flat_system_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[3:4], v[5:6] glc
diff --git a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll
index d12a7f9731586..3ee0bb2122abe 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll
@@ -5855,12 +5855,13 @@ define half @flat_agent_atomic_fsub_ret_f16(ptr %ptr, half %val) #0 {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5973,12 +5974,13 @@ define half @flat_agent_atomic_fsub_ret_f16(ptr %ptr, half %val) #0 {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -6228,12 +6230,13 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_pos(ptr %ptr, half %val)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6350,12 +6353,13 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_pos(ptr %ptr, half %val)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -6612,12 +6616,13 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_neg(ptr %ptr, half %val)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6735,12 +6740,13 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_neg(ptr %ptr, half %val)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -6996,11 +7002,11 @@ define void @flat_agent_atomic_fsub_noret_f16(ptr %ptr, half %val) #0 {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7110,11 +7116,11 @@ define void @flat_agent_atomic_fsub_noret_f16(ptr %ptr, half %val) #0 {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -7356,11 +7362,11 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b_pos(ptr %ptr, half %val
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7474,11 +7480,11 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b_pos(ptr %ptr, half %val
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -7727,11 +7733,11 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b_neg(ptr %ptr, half %val
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7846,11 +7852,11 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b_neg(ptr %ptr, half %val
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
@@ -8090,9 +8096,10 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_pos__align4(ptr %ptr, hal
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8178,9 +8185,10 @@ define half @flat_agent_atomic_fsub_ret_f16__offset12b_pos__align4(ptr %ptr, hal
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -8374,8 +8382,8 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b__align4_pos(ptr %ptr, h
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8458,8 +8466,8 @@ define void @flat_agent_atomic_fsub_noret_f16__offset12b__align4_pos(ptr %ptr, h
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] offset:2046 glc
@@ -8657,12 +8665,13 @@ define half @flat_system_atomic_fsub_ret_f16__offset12b_pos(ptr %ptr, half %val)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -8781,12 +8790,13 @@ define half @flat_system_atomic_fsub_ret_f16__offset12b_pos(ptr %ptr, half %val)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v5, v[0:1], v[5:6] glc
@@ -9045,11 +9055,11 @@ define void @flat_system_atomic_fsub_noret_f16__offset12b_pos(ptr %ptr, half %va
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -9165,11 +9175,11 @@ define void @flat_system_atomic_fsub_noret_f16__offset12b_pos(ptr %ptr, half %va
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
diff --git a/llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll b/llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll
index 899cc89405440..9c4901eb19f37 100644
--- a/llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll
+++ b/llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll
@@ -4238,7 +4238,7 @@ define amdgpu_ps i32 @s_mul_32_f16(half inreg %x, half inreg %y) {
; GFX11-GISEL-TRUE16-LABEL: s_mul_32_f16:
; GFX11-GISEL-TRUE16: ; %bb.0:
; GFX11-GISEL-TRUE16-NEXT: v_mul_f16_e64 v0.l, 0x5000, s0
-; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-GISEL-TRUE16-NEXT: v_readfirstlane_b32 s0, v0
; GFX11-GISEL-TRUE16-NEXT: ; return to shader part epilog
;
diff --git a/llvm/test/CodeGen/AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll b/llvm/test/CodeGen/AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll
index a859cc91b7fde..f09c25767648f 100644
--- a/llvm/test/CodeGen/AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll
+++ b/llvm/test/CodeGen/AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll
@@ -644,10 +644,11 @@ define double @fmul_pow_mul_max_pow2(i16 %cnt) nounwind {
; GFX11-TRUE16-LABEL: fmul_pow_mul_max_pow2:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, v0.l, 2
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: v_cvt_f64_u32_e32 v[0:1], v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_mul_f64 v[0:1], 0x40080000, v[0:1]
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1193,12 +1194,13 @@ define double @fmul_pow_shl_cnt_safe(i16 %cnt) nounwind {
; GFX11-TRUE16-LABEL: fmul_pow_shl_cnt_safe:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, v0.l, 1
; GFX11-TRUE16-NEXT: s_mov_b32 s0, 0xff5f3992
; GFX11-TRUE16-NEXT: s_mov_b32 s1, 0x7befffff
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: v_cvt_f64_u32_e32 v[0:1], v0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_mul_f64 v[0:1], v[0:1], s[0:1]
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
index 40d2765395543..c52fb6197e3e3 100644
--- a/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
@@ -4372,13 +4372,14 @@ define amdgpu_kernel void @fptrunc_f32_to_f16_zext_i32(
; GFX11-GISEL-TRUE16-LABEL: fptrunc_f32_to_f16_zext_i32:
; GFX11-GISEL-TRUE16: ; %bb.0: ; %entry
; GFX11-GISEL-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: s_load_b32 s2, s[2:3], 0x0
; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.l, s2
; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s2, -1
+; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-GISEL-TRUE16-NEXT: buffer_store_b32 v0, off, s[0:3], 0
; GFX11-GISEL-TRUE16-NEXT: s_endpgm
;
@@ -4606,13 +4607,14 @@ define amdgpu_kernel void @fptrunc_fabs_f32_to_f16_zext_i32(
; GFX11-GISEL-TRUE16-LABEL: fptrunc_fabs_f32_to_f16_zext_i32:
; GFX11-GISEL-TRUE16: ; %bb.0: ; %entry
; GFX11-GISEL-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: s_load_b32 s2, s[2:3], 0x0
; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: v_cvt_f16_f32_e64 v0.l, |s2|
; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s2, -1
+; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-GISEL-TRUE16-NEXT: buffer_store_b32 v0, off, s[0:3], 0
; GFX11-GISEL-TRUE16-NEXT: s_endpgm
;
diff --git a/llvm/test/CodeGen/AMDGPU/function-args.ll b/llvm/test/CodeGen/AMDGPU/function-args.ll
index 3c41cc43a089e..95e28a37f5ee1 100644
--- a/llvm/test/CodeGen/AMDGPU/function-args.ll
+++ b/llvm/test/CodeGen/AMDGPU/function-args.ll
@@ -1107,19 +1107,21 @@ define void @void_func_v4i8(<4 x i8> %arg0) #0 {
; GFX11-TRUE16-LABEL: void_func_v4i8:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 0
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.h, v0.h
; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v2.l
; GFX11-TRUE16-NEXT: s_mov_b32 s2, -1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v2
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: buffer_store_b32 v0, off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1188,20 +1190,22 @@ define void @void_func_v5i8(<5 x i8> %arg0) #0 {
; GFX11-TRUE16-LABEL: void_func_v5i8:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, 0
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 4
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, v2.l
; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v1.l
; GFX11-TRUE16-NEXT: s_mov_b32 s2, -1
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
; GFX11-TRUE16-NEXT: buffer_store_b8 v4, off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 0
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v3, v2
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: buffer_store_b32 v0, off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1281,27 +1285,29 @@ define void @void_func_v8i8(<8 x i8> %arg0) #0 {
; GFX11-TRUE16-LABEL: void_func_v8i8:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, 0
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v5.h, v4.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v6.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v7.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v6.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v4.h
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 0
-; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v4, v6
-; GFX11-TRUE16-NEXT: v_or_b16 v6.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v0.h, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v6.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: s_mov_b32 s2, -1
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v0, v6
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v4
+; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v3
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v0, v4
; GFX11-TRUE16-NEXT: buffer_store_b64 v[1:2], off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1416,44 +1422,47 @@ define void @void_func_v16i8(<16 x i8> %arg0) #0 {
; GFX11-TRUE16-LABEL: void_func_v16i8:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v13.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v12.h, 8, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v12.l, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, 0
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v13.h, v12.h
-; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v13.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v13.l, 8, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v14.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v9.l, 8, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v12, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v10.l, v9.h
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v10.h, v9.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v8.l, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v12.l, v12.h
+; GFX11-TRUE16-NEXT: v_and_b16 v10.l, 0xff, v10.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v8.h, v13.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.h, 8, v11.l
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v8.l, v9.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v12.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v5.l
; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v5.h, v4.h
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v5.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v14.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.l, v8.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v13
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.l, v4.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v9, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v10.l, v8.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v11
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v8, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 0
; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v4, v14
-; GFX11-TRUE16-NEXT: v_or_b16 v14.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v14.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v4, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v2
; GFX11-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v0, v14
-; GFX11-TRUE16-NEXT: buffer_store_b128 v[5:8], off, s[0:3], 0
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v0, v12
+; GFX11-TRUE16-NEXT: buffer_store_b128 v[6:9], off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: void_func_v16i8:
@@ -1649,77 +1658,83 @@ define void @void_func_v32i8(<32 x i8> %arg0) #0 {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: scratch_load_d16_u8 v31, off, s32
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, 0
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v14.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v13.l
; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.h, v32.l
+; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v3.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v3.h, v2.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v32.l, 0
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, v3.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.h, 8, v11.l
; GFX11-TRUE16-NEXT: v_and_b16 v5.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v7.h, v6.h
; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v3.h, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v32.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v4.l, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v13
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v7.l
; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v7.h, v6.h
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v12, v32
+; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_and_b16 v9.h, 0xff, v28.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.l, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v12, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v13, v32
; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v5.h, v4.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.h, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v10.l, v4.l, v5.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v9, v32
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v11.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_and_b16 v11.h, 0xff, v24.l
+; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v9.h, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v13, 0xffff, v14
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v12, v32
; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v6.l, v7.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v11.h, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v29.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b16 v11.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v10, v32
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v0.h, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v32.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_and_b16 v5.l, 0xff, v26.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.l, 8, v25.l
-; GFX11-TRUE16-NEXT: v_and_b16 v7.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v6.h, v5.h
+; GFX11-TRUE16-NEXT: v_and_b16 v9.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v6.h, 8, v21.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v20.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v11.h, v11.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v5
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v13, v32
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v8.h, v8.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v10.l, 8, v27.l
+; GFX11-TRUE16-NEXT: v_and_b16 v10.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v6.l, v0.h, v6.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v8.l, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v11, v32
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v14.h, v32.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.h, 8, v21.l
-; GFX11-TRUE16-NEXT: v_and_b16 v6.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: v_or_b16 v14.l, v7.h, v6.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v15.h, v32.l
-; GFX11-TRUE16-NEXT: v_and_b16 v8.h, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v4.h, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v11, 0xffff, v6
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 16
-; GFX11-TRUE16-NEXT: v_or_b16 v15.l, v6.h, v5.h
; GFX11-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
; GFX11-TRUE16-NEXT: s_mov_b32 s2, -1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v31.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v4.h, v7.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v13, v32
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v5.l, v4.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v19.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v17.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v14, v32
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v4.h, v8.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v8.h, v5.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v15, v32
-; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v4.h, v4.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v31.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v9.l, v5.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v5.l, 8, v23.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v8.l, 8, v19.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v7, v32
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v10.h, v10.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v4.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v9, v32
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v4.h, v5.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v10
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v11, v32
+; GFX11-TRUE16-NEXT: v_or_b16 v32.h, v4.l, v8.l
; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v9, v32
; GFX11-TRUE16-NEXT: buffer_store_b128 v[4:7], off, s[0:3], 0
; GFX11-TRUE16-NEXT: s_mov_b64 s[0:1], 0
diff --git a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
index 919464a936740..2fdc1a8854863 100644
--- a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
+++ b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
@@ -4896,22 +4896,23 @@ define amdgpu_gfx void @test_call_external_void_func_v4i8_ret() #0 {
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 16, v0
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, 24, v0
; GFX11-TRUE16-NEXT: s_swappc_b64 s[30:31], s[0:1]
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v1.l
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
; GFX11-TRUE16-NEXT: v_readlane_b32 s31, v42, 1
; GFX11-TRUE16-NEXT: v_readlane_b32 s30, v42, 0
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, 0
; GFX11-TRUE16-NEXT: s_mov_b32 s32, s33
; GFX11-TRUE16-NEXT: v_readlane_b32 s0, v42, 2
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v1, v0
; GFX11-TRUE16-NEXT: global_store_b32 v[40:41], v0, off
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_b32 v41, off, s33
@@ -5155,29 +5156,30 @@ define amdgpu_gfx void @test_call_external_void_func_v5i8_ret() #0 {
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v0, v5
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v6
; GFX11-TRUE16-NEXT: s_swappc_b64 s[30:31], s[0:1]
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v1.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: v_or_b16 v3.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, v2.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.l, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v3.h, v1.l, v0.h
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v0, 4
-; GFX11-TRUE16-NEXT: v_mov_b32_e32 v1, 0
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v2
+; GFX11-TRUE16-NEXT: v_readlane_b32 s31, v42, 1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_dual_mov_b32 v1, 0 :: v_dual_and_b32 v2, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_readlane_b32 s30, v42, 0
+; GFX11-TRUE16-NEXT: s_mov_b32 s32, s33
+; GFX11-TRUE16-NEXT: v_readlane_b32 s0, v42, 2
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v2, v3
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: global_store_b8 v[0:1], v4, off
; GFX11-TRUE16-NEXT: global_store_b32 v[40:41], v2, off
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_b32 v41, off, s33
; GFX11-TRUE16-NEXT: scratch_load_b32 v40, off, s33 offset:4
-; GFX11-TRUE16-NEXT: v_readlane_b32 s31, v42, 1
-; GFX11-TRUE16-NEXT: v_readlane_b32 s30, v42, 0
-; GFX11-TRUE16-NEXT: s_mov_b32 s32, s33
-; GFX11-TRUE16-NEXT: v_readlane_b32 s0, v42, 2
; GFX11-TRUE16-NEXT: s_or_saveexec_b32 s1, -1
; GFX11-TRUE16-NEXT: scratch_load_b32 v42, off, s33 offset:8 ; 4-byte Folded Reload
; GFX11-TRUE16-NEXT: s_mov_b32 exec_lo, s1
@@ -5439,34 +5441,36 @@ define amdgpu_gfx void @test_call_external_void_func_v8i8_ret() #0 {
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v7, 24, v1
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v1, v8
; GFX11-TRUE16-NEXT: s_swappc_b64 s[30:31], s[0:1]
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v7.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v5.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.l, 0
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v1.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v7.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, 0
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v5.l, v3.h, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, v4.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v1.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v4
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.l, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v5, v4
-; GFX11-TRUE16-NEXT: v_or_b16 v4.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v4.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v0, v4
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v3, v5
+; GFX11-TRUE16-NEXT: v_or_b16 v5.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v4
; GFX11-TRUE16-NEXT: v_readlane_b32 s31, v42, 1
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v0, v5
; GFX11-TRUE16-NEXT: v_readlane_b32 s30, v42, 0
+; GFX11-TRUE16-NEXT: s_mov_b32 s32, s33
+; GFX11-TRUE16-NEXT: v_readlane_b32 s0, v42, 2
; GFX11-TRUE16-NEXT: global_store_b64 v[40:41], v[1:2], off
; GFX11-TRUE16-NEXT: s_clause 0x1
; GFX11-TRUE16-NEXT: scratch_load_b32 v41, off, s33
; GFX11-TRUE16-NEXT: scratch_load_b32 v40, off, s33 offset:4
-; GFX11-TRUE16-NEXT: s_mov_b32 s32, s33
-; GFX11-TRUE16-NEXT: v_readlane_b32 s0, v42, 2
; GFX11-TRUE16-NEXT: s_or_saveexec_b32 s1, -1
; GFX11-TRUE16-NEXT: scratch_load_b32 v42, off, s33 offset:8 ; 4-byte Folded Reload
; GFX11-TRUE16-NEXT: s_mov_b32 exec_lo, s1
@@ -5906,77 +5910,85 @@ define amdgpu_gfx void @test_call_external_void_func_v32i8_ret() #0 {
; GFX11-TRUE16-NEXT: v_dual_mov_b32 v17, v32 :: v_dual_mov_b32 v18, v33
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v19, v34
; GFX11-TRUE16-NEXT: s_swappc_b64 s[30:31], s[0:1]
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v15.l
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v14.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v13.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v12.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v12.l, 0
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v13.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v12.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v15.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v8.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.l, 0
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b16 v12.l, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v14.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.h, 8, v9.l
+; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v9, 0xffff, v12
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v0.h, v2.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b16 v8.l, v3.h, v1.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v5.l
+; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.l
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v11.l
-; GFX11-TRUE16-NEXT: v_or_b16 v13.l, v3.h, v2.h
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v13.h, v12.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v10.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v9.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v8.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v9.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v13, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v9, v13
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v8, 0xffff, v8
+; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v3.h, v2.h
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v1.h, v0.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v7.l
-; GFX11-TRUE16-NEXT: v_or_b16 v9.l, v3.h, v2.h
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v2.h, 8, v5.l
-; GFX11-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v9, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v6.l, v4.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v2.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v4, v8, v13
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v1.h, v0.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v6
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v3.l
-; GFX11-TRUE16-NEXT: v_or_b16 v4.l, v3.h, v2.h
-; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, v12.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v4, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v1.h, v0.h
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v31.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v30.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v29.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v29.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v28.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v5, v2, v12
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v4.l, 8, v17.l
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v27.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v1.h, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v6, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v1.l, v0.h
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v25.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v24.l
-; GFX11-TRUE16-NEXT: v_and_b16 v4.h, 0xff, v16.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v3, v2, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v23.l
-; GFX11-TRUE16-NEXT: v_or_b16 v2.l, v1.h, v1.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v22.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v21.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v31.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v30.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v2, v13
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v1.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v26.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v6.l, v0.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v27.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v7.l, v0.l
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v21.l
; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v20.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v2, v2, v12
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, 8, v19.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT: v_or_b16 v1.l, v1.h, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, v12.l
-; GFX11-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v18.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v1, v1, v12
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT: v_or_b16 v12.h, v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v4.h, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v12.l
-; GFX11-TRUE16-NEXT: v_or_b32_e32 v0, v0, v12
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_4)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v9, v6, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v17.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v16.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v7, 0xffff, v7
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v23.l
+; GFX11-TRUE16-NEXT: v_and_b16 v6.l, 0xff, v22.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v0.l
+; GFX11-TRUE16-NEXT: v_or_b16 v0.l, v1.h, v1.l
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v8, v7, v13
+; GFX11-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v18.l
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v6.l, v0.h
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v6, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v19.l
+; GFX11-TRUE16-NEXT: v_mov_b16_e32 v10.l, v0.l
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v7, v6, v13
+; GFX11-TRUE16-NEXT: v_or_b16 v13.h, v1.l, v0.h
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v10
+; GFX11-TRUE16-NEXT: v_or_b32_e32 v6, v0, v13
; GFX11-TRUE16-NEXT: s_clause 0x1
-; GFX11-TRUE16-NEXT: global_store_b128 v[42:43], v[0:3], off
-; GFX11-TRUE16-NEXT: global_store_b128 v[40:41], v[5:8], off
+; GFX11-TRUE16-NEXT: global_store_b128 v[42:43], v[6:9], off
+; GFX11-TRUE16-NEXT: global_store_b128 v[40:41], v[2:5], off
; GFX11-TRUE16-NEXT: s_clause 0x3
; GFX11-TRUE16-NEXT: scratch_load_b32 v43, off, s33
; GFX11-TRUE16-NEXT: scratch_load_b32 v42, off, s33 offset:4
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll
index 9c1f9d21b9da3..1f74fbdc46e98 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll
@@ -8275,12 +8275,13 @@ define half @global_agent_atomic_fadd_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8393,12 +8394,13 @@ define half @global_agent_atomic_fadd_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -8698,12 +8700,13 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8820,12 +8823,13 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -9134,12 +9138,13 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -9257,12 +9262,13 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -9570,11 +9576,11 @@ define void @global_agent_atomic_fadd_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -9684,11 +9690,11 @@ define void @global_agent_atomic_fadd_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -9979,11 +9985,11 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10097,11 +10103,11 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -10400,11 +10406,11 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10519,11 +10525,11 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -10813,9 +10819,10 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -10901,9 +10908,10 @@ define half @global_agent_atomic_fadd_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -11136,8 +11144,8 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -11220,8 +11228,8 @@ define void @global_agent_atomic_fadd_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -11456,12 +11464,13 @@ define half @global_system_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -11580,12 +11589,13 @@ define half @global_system_atomic_fadd_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -11896,11 +11906,11 @@ define void @global_system_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -12016,11 +12026,11 @@ define void @global_system_atomic_fadd_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll
index f7cc0709109f9..faa74fef2be2f 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll
@@ -4467,14 +4467,14 @@ define half @global_agent_atomic_fmax_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v2.h, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -4592,14 +4592,14 @@ define half @global_agent_atomic_fmax_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -4912,14 +4912,14 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5044,14 +5044,14 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -5373,14 +5373,14 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5506,14 +5506,14 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -5832,12 +5832,13 @@ define void @global_agent_atomic_fmax_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v3.l, v3.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5953,12 +5954,13 @@ define void @global_agent_atomic_fmax_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v3.l, v3.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -6263,12 +6265,13 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6391,12 +6394,13 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -6709,12 +6713,13 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6838,12 +6843,13 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -7145,11 +7151,11 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7240,11 +7246,11 @@ define half @global_agent_atomic_fmax_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -7488,9 +7494,10 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v2.h, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7579,9 +7586,10 @@ define void @global_agent_atomic_fmax_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -7830,14 +7838,14 @@ define half @global_system_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -7964,14 +7972,14 @@ define half @global_system_atomic_fmax_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -8295,12 +8303,13 @@ define void @global_system_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -8425,12 +8434,13 @@ define void @global_system_atomic_fmax_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll
index b81af1fc9233d..a46b0129b79e6 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll
@@ -4467,14 +4467,14 @@ define half @global_agent_atomic_fmin_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v2.h, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -4592,14 +4592,14 @@ define half @global_agent_atomic_fmin_ret_f16__amdgpu_no_fine_grained_memory(ptr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -4912,14 +4912,14 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5044,14 +5044,14 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -5373,14 +5373,14 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5506,14 +5506,14 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_neg__amdgpu_no_fine_gra
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -5832,12 +5832,13 @@ define void @global_agent_atomic_fmin_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v3.l, v3.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5953,12 +5954,13 @@ define void @global_agent_atomic_fmin_noret_f16__amdgpu_no_fine_grained_memory(p
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v3.l, v3.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -6263,12 +6265,13 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6391,12 +6394,13 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -6709,12 +6713,13 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6838,12 +6843,13 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b_neg__amdgpu_no_fine_g
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -7145,11 +7151,11 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7240,11 +7246,11 @@ define half @global_agent_atomic_fmin_ret_f16__offset12b_pos__align4__amdgpu_no_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -7488,9 +7494,10 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.h, v4.l, v4.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, v2.h, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7579,9 +7586,10 @@ define void @global_agent_atomic_fmin_noret_f16__offset12b__align4_pos__amdgpu_n
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.h, v4.l, v4.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, v2.h, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -7830,14 +7838,14 @@ define half @global_system_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -7964,14 +7972,14 @@ define half @global_system_atomic_fmin_ret_f16__offset12b_pos__amdgpu_no_fine_gr
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
@@ -8295,12 +8303,13 @@ define void @global_system_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v0.h, v5.l, v5.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v5.l, v0.h, v0.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -8425,12 +8434,13 @@ define void @global_system_atomic_fmin_noret_f16__offset12b_pos__amdgpu_no_fine_
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v1, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v0.h, v5.l, v5.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v5.l, v0.h, v0.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v1, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v2, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[3:4], v[5:6], off glc
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll
index b8762d13e1327..053efdcb76261 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll
@@ -5221,12 +5221,13 @@ define half @global_agent_atomic_fsub_ret_f16(ptr addrspace(1) %ptr, half %val)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5339,12 +5340,13 @@ define half @global_agent_atomic_fsub_ret_f16(ptr addrspace(1) %ptr, half %val)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -5644,12 +5646,13 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_pos(ptr addrspace(1) %p
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -5766,12 +5769,13 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_pos(ptr addrspace(1) %p
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -6080,12 +6084,13 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_neg(ptr addrspace(1) %p
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6203,12 +6208,13 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_neg(ptr addrspace(1) %p
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -6516,11 +6522,11 @@ define void @global_agent_atomic_fsub_noret_f16(ptr addrspace(1) %ptr, half %val
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -6630,11 +6636,11 @@ define void @global_agent_atomic_fsub_noret_f16(ptr addrspace(1) %ptr, half %val
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -6925,11 +6931,11 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b_pos(ptr addrspace(1)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7043,11 +7049,11 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b_pos(ptr addrspace(1)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -7346,11 +7352,11 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b_neg(ptr addrspace(1)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7465,11 +7471,11 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b_neg(ptr addrspace(1)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
@@ -7759,9 +7765,10 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_pos__align4(ptr addrspa
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -7847,9 +7854,10 @@ define half @global_agent_atomic_fsub_ret_f16__offset12b_pos__align4(ptr addrspa
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -8082,8 +8090,8 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b__align4_pos(ptr addrs
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
@@ -8166,8 +8174,8 @@ define void @global_agent_atomic_fsub_noret_f16__offset12b__align4_pos(ptr addrs
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v4.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, 0xffff0000, v4, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off offset:2046 glc
@@ -8402,12 +8410,13 @@ define half @global_system_atomic_fsub_ret_f16__offset12b_pos(ptr addrspace(1) %
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -8526,12 +8535,13 @@ define half @global_system_atomic_fsub_ret_f16__offset12b_pos(ptr addrspace(1) %
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v6, v5
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v5, v3, v6
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v5.h, 0
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v5.l, v5.l, v2.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v5, 0xffff, v5
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v5, v3, v5
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v5, v6, v4, v5
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v5, v[0:1], v[5:6], off glc
@@ -8842,11 +8852,11 @@ define void @global_system_atomic_fsub_noret_f16__offset12b_pos(ptr addrspace(1)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_loadcnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX12-TRUE16-NEXT: global_wb scope:SCOPE_SYS
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
@@ -8962,11 +8972,11 @@ define void @global_system_atomic_fsub_noret_f16__offset12b_pos(ptr addrspace(1)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v5, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_sub_f16_e32 v3.l, v3.l, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v5, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v6, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: global_atomic_cmpswap_b32 v3, v[0:1], v[3:4], off glc
diff --git a/llvm/test/CodeGen/AMDGPU/idot4u.ll b/llvm/test/CodeGen/AMDGPU/idot4u.ll
index 305461ed6b208..7ebd69204d87f 100644
--- a/llvm/test/CodeGen/AMDGPU/idot4u.ll
+++ b/llvm/test/CodeGen/AMDGPU/idot4u.ll
@@ -1693,11 +1693,12 @@ define amdgpu_kernel void @notdot4_mixedtypes(ptr addrspace(1) %src1,
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v3.l, v7.l
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v1.l, v0.l
-; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-DL-TRUE16-NEXT: v_perm_b32 v1, v5, v5, 0xc0c0302
-; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v2.l, v3.l, v0.l
; GFX11-DL-TRUE16-NEXT: v_perm_b32 v2, v4, v4, 0xc0c0302
+; GFX11-DL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_dot4_u32_u8 v0, v2, v1, v0
; GFX11-DL-TRUE16-NEXT: global_store_b16 v6, v0, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_endpgm
@@ -2723,32 +2724,32 @@ define amdgpu_kernel void @udot4_acc8_vecMul(ptr addrspace(1) %src1,
; GFX11-DL-TRUE16-NEXT: global_load_b32 v4, v0, s[2:3]
; GFX11-DL-TRUE16-NEXT: global_load_d16_u8 v0, v5, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(2)
-; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 24, v3
+; GFX11-DL-TRUE16-NEXT: v_lshrrev_b16 v0.h, 8, v3.l
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(1)
+; GFX11-DL-TRUE16-NEXT: v_lshrrev_b16 v1.l, 8, v4.l
+; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 24, v3
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v6, 24, v4
-; GFX11-DL-TRUE16-NEXT: v_lshrrev_b16 v0.h, 8, v3.l
-; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v1.l, v3.h, v4.h
-; GFX11-DL-TRUE16-NEXT: v_lshrrev_b16 v1.h, 8, v4.l
; GFX11-DL-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v3.l, v4.l, v0.l
-; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v2.l, v2.l, v6.l
+; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v0.h, v0.h, v1.l
+; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v1.l, v3.h, v4.h
+; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v1.h, v2.l, v6.l
; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v6.l, 0
-; GFX11-DL-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
-; GFX11-DL-TRUE16-NEXT: v_mul_lo_u16 v0.h, v0.h, v1.h
; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-DL-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v2.l
-; GFX11-DL-TRUE16-NEXT: v_mov_b16_e32 v7.h, v6.l
-; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-DL-TRUE16-NEXT: v_lshlrev_b16 v7.l, 8, v0.h
-; GFX11-DL-TRUE16-NEXT: v_or_b16 v6.h, v1.l, v2.l
-; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-DL-TRUE16-NEXT: v_or_b32_e32 v1, v7, v6
+; GFX11-DL-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v0.h
+; GFX11-DL-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v1.l
+; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-DL-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.h
+; GFX11-DL-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-DL-TRUE16-NEXT: v_or_b16 v6.h, v0.h, v1.l
; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
-; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
-; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
+; GFX11-DL-TRUE16-NEXT: v_or_b32_e32 v2, v2, v6
+; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-DL-TRUE16-NEXT: v_lshrrev_b32_e32 v2, 8, v2
+; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v2.l
+; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_mad_u16 v0.l, v3.h, v4.h, v0.l
-; GFX11-DL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-DL-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX11-DL-TRUE16-NEXT: global_store_b8 v5, v0, s[4:5]
; GFX11-DL-TRUE16-NEXT: s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
index 31b6b533866d4..742d87f099ce4 100644
--- a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
+++ b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
@@ -1715,9 +1715,9 @@ define zeroext i16 @clpeak_umad_pat_i16(i16 zeroext %x, i16 zeroext %y) {
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.h, v0.l, v1.l, v0.l
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v0.l, v0.h
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v0.h, v0.l
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-SDAG-FAKE16-LABEL: clpeak_umad_pat_i16:
@@ -1745,7 +1745,8 @@ define zeroext i16 @clpeak_umad_pat_i16(i16 zeroext %x, i16 zeroext %y) {
; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
; GFX11-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v0.h
-; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-GISEL-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-GISEL-FAKE16-LABEL: clpeak_umad_pat_i16:
@@ -1776,9 +1777,9 @@ define zeroext i16 @clpeak_umad_pat_i16(i16 zeroext %x, i16 zeroext %y) {
; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.h, v0.l, v1.l, v0.l
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v0.l, v0.h
-; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v0.h, v0.l
-; GFX1200-SDAG-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1200-SDAG-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1200-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX1200-SDAG-FAKE16-LABEL: clpeak_umad_pat_i16:
@@ -1814,7 +1815,8 @@ define zeroext i16 @clpeak_umad_pat_i16(i16 zeroext %x, i16 zeroext %y) {
; GFX1200-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1200-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
; GFX1200-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v0.h
-; GFX1200-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1200-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1200-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1200-GISEL-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX1200-GISEL-FAKE16-LABEL: clpeak_umad_pat_i16:
@@ -9361,9 +9363,9 @@ define zeroext i16 @clpeak_umad_pat_i16_x2(i16 zeroext %x, i16 zeroext %y) {
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.h, v0.l, v0.h, v0.l
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v0.l, v0.h
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v0.h, v0.l
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-SDAG-FAKE16-LABEL: clpeak_umad_pat_i16_x2:
@@ -9407,7 +9409,8 @@ define zeroext i16 @clpeak_umad_pat_i16_x2(i16 zeroext %x, i16 zeroext %y) {
; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
; GFX11-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v0.h
-; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-GISEL-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-GISEL-FAKE16-LABEL: clpeak_umad_pat_i16_x2:
@@ -9454,9 +9457,9 @@ define zeroext i16 @clpeak_umad_pat_i16_x2(i16 zeroext %x, i16 zeroext %y) {
; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.h, v0.l, v0.h, v0.l
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.h, v0.l, v0.h
-; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1200-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1200-SDAG-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v0.h, v0.l
-; GFX1200-SDAG-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1200-SDAG-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1200-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX1200-SDAG-FAKE16-LABEL: clpeak_umad_pat_i16_x2:
@@ -9508,7 +9511,8 @@ define zeroext i16 @clpeak_umad_pat_i16_x2(i16 zeroext %x, i16 zeroext %y) {
; GFX1200-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1200-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
; GFX1200-GISEL-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v0.h
-; GFX1200-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX1200-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1200-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX1200-GISEL-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX1200-GISEL-FAKE16-LABEL: clpeak_umad_pat_i16_x2:
diff --git a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll
index c1a32aafbc71e..a42c71c4849bd 100644
--- a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll
@@ -1259,12 +1259,13 @@ define half @local_atomic_fadd_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, 4.0, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -1370,12 +1371,13 @@ define half @local_atomic_fadd_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, 4.0, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -1644,12 +1646,13 @@ define half @local_atomic_fadd_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, 4.0, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -1760,12 +1763,13 @@ define half @local_atomic_fadd_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, 4.0, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -2040,12 +2044,13 @@ define void @local_atomic_fadd_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v4.l, 4.0, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -2148,12 +2153,13 @@ define void @local_atomic_fadd_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v4.l, 4.0, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -2413,11 +2419,11 @@ define void @local_atomic_fadd_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v4.l, 4.0, v4.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2525,11 +2531,11 @@ define void @local_atomic_fadd_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v4.l, 4.0, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2789,9 +2795,10 @@ define half @local_atomic_fadd_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v1.l, 4.0, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -2875,9 +2882,10 @@ define half @local_atomic_fadd_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v1.l, 4.0, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -3087,8 +3095,8 @@ define void @local_atomic_fadd_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v2.l, 4.0, v1.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
; GFX12-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
@@ -3169,8 +3177,8 @@ define void @local_atomic_fadd_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v2.l, 4.0, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
; GFX11-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
diff --git a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll
index 739e86d1928b1..8351d28057564 100644
--- a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll
@@ -803,14 +803,14 @@ define half @local_atomic_fmax_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v3.l, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, 4.0, v3.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -918,14 +918,14 @@ define half @local_atomic_fmax_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v3.l, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, 4.0, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -1199,14 +1199,14 @@ define half @local_atomic_fmax_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v3.l, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, 4.0, v3.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -1319,14 +1319,14 @@ define half @local_atomic_fmax_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v3.l, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, 4.0, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -1606,14 +1606,14 @@ define void @local_atomic_fmax_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, v4.l, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, 4.0, v4.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -1718,14 +1718,14 @@ define void @local_atomic_fmax_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, v4.l, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, 4.0, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -1990,12 +1990,13 @@ define void @local_atomic_fmax_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, 4.0, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2106,12 +2107,13 @@ define void @local_atomic_fmax_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, 4.0, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2377,11 +2379,11 @@ define half @local_atomic_fmax_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v1.l, v2.l, v2.l
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v1.l, 4.0, v1.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -2467,11 +2469,11 @@ define half @local_atomic_fmax_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v1.l, v2.l, v2.l
; GFX11-TRUE16-NEXT: v_max_f16_e32 v1.l, 4.0, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -2686,9 +2688,10 @@ define void @local_atomic_fmax_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.l, v1.l, v1.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.l, 4.0, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
@@ -2772,9 +2775,10 @@ define void @local_atomic_fmax_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.l, v1.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.l, 4.0, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
diff --git a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll
index 6da80262951e5..0c4aca88b3781 100644
--- a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll
@@ -803,14 +803,14 @@ define half @local_atomic_fmin_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v3.l, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, 4.0, v3.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -918,14 +918,14 @@ define half @local_atomic_fmin_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v3.l, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, 4.0, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -1199,14 +1199,14 @@ define half @local_atomic_fmin_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v3.l, v3.l, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v3.l, 4.0, v3.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -1319,14 +1319,14 @@ define half @local_atomic_fmin_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v3.l, v3.l, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v3.l, 4.0, v3.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -1606,14 +1606,14 @@ define void @local_atomic_fmin_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, v4.l, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v4.l, 4.0, v4.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -1718,14 +1718,14 @@ define void @local_atomic_fmin_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, v4.l, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v4.l, 4.0, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -1990,12 +1990,13 @@ define void @local_atomic_fmin_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v4.l, v4.l, v4.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v4.l, 4.0, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2106,12 +2107,13 @@ define void @local_atomic_fmin_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v4.l, v4.l, v4.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v4.l, 4.0, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2377,11 +2379,11 @@ define half @local_atomic_fmin_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v1.l, v2.l, v2.l
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v1.l, 4.0, v1.l
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -2467,11 +2469,11 @@ define half @local_atomic_fmin_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v1.l, v2.l, v2.l
; GFX11-TRUE16-NEXT: v_min_f16_e32 v1.l, 4.0, v1.l
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -2686,9 +2688,10 @@ define void @local_atomic_fmin_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_max_num_f16_e32 v2.l, v1.l, v1.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_min_num_f16_e32 v2.l, 4.0, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
@@ -2772,9 +2775,10 @@ define void @local_atomic_fmin_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_f16_e32 v2.l, v1.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_min_f16_e32 v2.l, 4.0, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
diff --git a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll
index 786989cc9fb57..37310b614c0db 100644
--- a/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll
@@ -1721,12 +1721,13 @@ define half @local_atomic_fsub_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, -4.0, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -1832,12 +1833,13 @@ define half @local_atomic_fsub_ret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v0, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, -4.0, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v0, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v1, v3, v4
@@ -2106,12 +2108,13 @@ define half @local_atomic_fsub_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v3.l, -4.0, v3.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -2222,12 +2225,13 @@ define half @local_atomic_fsub_ret_f16__offset(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v4, v3
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v3, v1, v4
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v3.l, -4.0, v3.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v3, 0xffff, v3
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v3, v1, v3
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v3, v4, v2, v3
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v3, v0, v3, v4
@@ -2502,12 +2506,13 @@ define void @local_atomic_fsub_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX12-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v4.l, -4.0, v4.l
; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -2610,12 +2615,13 @@ define void @local_atomic_fsub_noret_f16(ptr addrspace(3) %ptr) nounwind {
; GFX11-TRUE16-NEXT: .LBB10_1: ; %atomicrmw.start
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v0, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
; GFX11-TRUE16-NEXT: v_add_f16_e32 v4.l, -4.0, v4.l
; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v0, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v2, v3, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v1, v4, v2
@@ -2875,11 +2881,11 @@ define void @local_atomic_fsub_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v4.l, -4.0, v4.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -2987,11 +2993,11 @@ define void @local_atomic_fsub_noret_f16__offset(ptr addrspace(3) %ptr) nounwind
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b32_e32 v4, v1, v3
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v4.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v4.l, -4.0, v4.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_lshlrev_b32_e32 v4, v1, v4
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v4, v3, v2, v4
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v4, v0, v4, v3
@@ -3251,9 +3257,10 @@ define half @local_atomic_fsub_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_add_f16_e32 v1.l, -4.0, v2.l
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -3337,9 +3344,10 @@ define half @local_atomic_fsub_ret_f16__offset__align4(ptr addrspace(3) %ptr) no
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mov_b32_e32 v2, v1
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v1.l, -4.0, v2.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-TRUE16-NEXT: v_and_or_b32 v1, 0xffff0000, v2, v1
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v1, v0, v1, v2 offset:65534
@@ -3549,8 +3557,8 @@ define void @local_atomic_fsub_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX12-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX12-TRUE16-NEXT: s_wait_dscnt 0x0
; GFX12-TRUE16-NEXT: v_add_f16_e32 v2.l, -4.0, v1.l
-; GFX12-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
; GFX12-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX12-TRUE16-NEXT: s_wait_storecnt 0x0
; GFX12-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
@@ -3631,8 +3639,8 @@ define void @local_atomic_fsub_noret_f16__offset__align4(ptr addrspace(3) %ptr)
; GFX11-TRUE16-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX11-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v2.l, -4.0, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v2.h, 0
-; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
; GFX11-TRUE16-NEXT: v_and_or_b32 v2, 0xffff0000, v1, v2
; GFX11-TRUE16-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-TRUE16-NEXT: ds_cmpstore_rtn_b32 v2, v0, v2, v1 offset:65534
diff --git a/llvm/test/CodeGen/AMDGPU/mad-mix-lo.ll b/llvm/test/CodeGen/AMDGPU/mad-mix-lo.ll
index eab92668c536b..811e25587d3d5 100644
--- a/llvm/test/CodeGen/AMDGPU/mad-mix-lo.ll
+++ b/llvm/test/CodeGen/AMDGPU/mad-mix-lo.ll
@@ -2382,22 +2382,13 @@ define <4 x half> @v_mad_mix_v4f32_clamp_precvt(<4 x half> %src0, <4 x half> %sr
}
define i32 @mixlo_zext(float %src0, float %src1, float %src2) #0 {
-; SDAG-GFX1100-TRUE16-LABEL: mixlo_zext:
-; SDAG-GFX1100-TRUE16: ; %bb.0:
-; SDAG-GFX1100-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; SDAG-GFX1100-TRUE16-NEXT: v_fma_mixlo_f16 v1, v0, v1, v2
-; SDAG-GFX1100-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
-; SDAG-GFX1100-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2)
-; SDAG-GFX1100-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.l
-; SDAG-GFX1100-TRUE16-NEXT: s_setpc_b64 s[30:31]
-;
-; SDAG-GFX1100-FAKE16-LABEL: mixlo_zext:
-; SDAG-GFX1100-FAKE16: ; %bb.0:
-; SDAG-GFX1100-FAKE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; SDAG-GFX1100-FAKE16-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2
-; SDAG-GFX1100-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; SDAG-GFX1100-FAKE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; SDAG-GFX1100-FAKE16-NEXT: s_setpc_b64 s[30:31]
+; GFX1100-LABEL: mixlo_zext:
+; GFX1100: ; %bb.0:
+; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1100-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX1100-NEXT: s_setpc_b64 s[30:31]
;
; GFX900-LABEL: mixlo_zext:
; GFX900: ; %bb.0:
@@ -2427,14 +2418,6 @@ define i32 @mixlo_zext(float %src0, float %src1, float %src2) #0 {
; SDAG-CI-NEXT: v_cvt_f16_f32_e32 v0, v2
; SDAG-CI-NEXT: s_setpc_b64 s[30:31]
;
-; GISEL-GFX1100-LABEL: mixlo_zext:
-; GISEL-GFX1100: ; %bb.0:
-; GISEL-GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GISEL-GFX1100-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2
-; GISEL-GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GISEL-GFX1100-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GISEL-GFX1100-NEXT: s_setpc_b64 s[30:31]
-;
; GISEL-CI-LABEL: mixlo_zext:
; GISEL-CI: ; %bb.0:
; GISEL-CI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/mad.u16.ll b/llvm/test/CodeGen/AMDGPU/mad.u16.ll
index fbf8011fd40c9..ef80323a98ec0 100644
--- a/llvm/test/CodeGen/AMDGPU/mad.u16.ll
+++ b/llvm/test/CodeGen/AMDGPU/mad.u16.ll
@@ -179,7 +179,8 @@ define i32 @v_mad_u16_zext(i16 %arg0, i16 %arg1, i16 %arg2) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v1.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: v_mad_u16_zext:
@@ -221,9 +222,9 @@ define i64 @v_mad_u16_zext64(i16 %arg0, i16 %arg1, i16 %arg2) {
; GFX11-TRUE16-LABEL: v_mad_u16_zext64:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-TRUE16-NEXT: v_mad_u16 v0.l, v0.l, v1.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b32_e32 v1, 0
+; GFX11-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-TRUE16-NEXT: v_dual_mov_b32 v1, 0 :: v_dual_and_b32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: v_mad_u16_zext64:
diff --git a/llvm/test/CodeGen/AMDGPU/preserve-hi16.ll b/llvm/test/CodeGen/AMDGPU/preserve-hi16.ll
index 79910af5c0434..3ce09475c0949 100644
--- a/llvm/test/CodeGen/AMDGPU/preserve-hi16.ll
+++ b/llvm/test/CodeGen/AMDGPU/preserve-hi16.ll
@@ -374,7 +374,7 @@ define i32 @shl_i16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshlrev_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: shl_i16_zext_i32:
@@ -412,7 +412,7 @@ define i32 @lshr_i16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_lshrrev_b16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: lshr_i16_zext_i32:
@@ -450,7 +450,7 @@ define i32 @ashr_i16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_ashrrev_i16 v0.l, v1.l, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: ashr_i16_zext_i32:
@@ -488,7 +488,7 @@ define i32 @add_u16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: add_u16_zext_i32:
@@ -526,7 +526,7 @@ define i32 @sub_u16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_sub_nc_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: sub_u16_zext_i32:
@@ -564,7 +564,7 @@ define i32 @mul_lo_u16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_mul_lo_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: mul_lo_u16_zext_i32:
@@ -602,7 +602,7 @@ define i32 @min_u16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: min_u16_zext_i32:
@@ -641,7 +641,7 @@ define i32 @min_i16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_min_i16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: min_i16_zext_i32:
@@ -680,7 +680,7 @@ define i32 @max_u16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_u16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: max_u16_zext_i32:
@@ -719,7 +719,7 @@ define i32 @max_i16_zext_i32(i16 %x, i16 %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_max_i16 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: max_i16_zext_i32:
@@ -758,7 +758,7 @@ define i32 @zext_fadd_f16(half %x, half %y) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_add_f16_e32 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: zext_fadd_f16:
@@ -797,10 +797,8 @@ define i32 @zext_fma_f16(half %x, half %y, half %z) {
; GFX11-TRUE16-LABEL: zext_fma_f16:
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, v0.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v2.l
-; GFX11-TRUE16-NEXT: v_fmac_f16_e32 v0.l, v0.h, v1.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_fmac_f16_e32 v2.l, v0.l, v1.l
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v2
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: zext_fma_f16:
@@ -840,7 +838,7 @@ define i32 @zext_div_fixup_f16(half %x, half %y, half %z) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_div_fixup_f16 v0.l, v0.l, v1.l, v2.l
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: zext_div_fixup_f16:
@@ -882,7 +880,7 @@ define i32 @zext_fptrunc_f16(float %x) {
; GFX11-TRUE16: ; %bb.0:
; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.l, v0
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-FAKE16-LABEL: zext_fptrunc_f16:
@@ -926,20 +924,12 @@ define i32 @zext_fptrunc_fma_f16(float %x, float %y, float %z) {
; GFX10-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX10-NEXT: s_setpc_b64 s[30:31]
;
-; GFX11-TRUE16-LABEL: zext_fptrunc_fma_f16:
-; GFX11-TRUE16: ; %bb.0:
-; GFX11-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT: v_fma_mixlo_f16 v1, v0, v1, v2
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
-; GFX11-TRUE16-NEXT: v_mov_b16_e32 v0.l, v1.l
-; GFX11-TRUE16-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX11-FAKE16-LABEL: zext_fptrunc_fma_f16:
-; GFX11-FAKE16: ; %bb.0:
-; GFX11-FAKE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-FAKE16-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2
-; GFX11-FAKE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
-; GFX11-FAKE16-NEXT: s_setpc_b64 s[30:31]
+; GFX11-LABEL: zext_fptrunc_fma_f16:
+; GFX11: ; %bb.0:
+; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2
+; GFX11-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX11-NEXT: s_setpc_b64 s[30:31]
%fma = call float @llvm.fma.f32(float %x, float %y, float %z)
%fptrunc = fptrunc float %fma to half
%cast = bitcast half %fptrunc to i16
@@ -950,5 +940,3 @@ define i32 @zext_fptrunc_fma_f16(float %x, float %y, float %z) {
declare half @llvm.amdgcn.div.fixup.f16(half, half, half)
declare half @llvm.fma.f16(half, half, half)
declare float @llvm.fma.f32(float, float, float)
-;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
-; GFX11: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll b/llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll
index 91c88ec5e718c..21aa40d69998e 100644
--- a/llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll
+++ b/llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll
@@ -1528,9 +1528,10 @@ define amdgpu_kernel void @v_test_i16_x_sub_64_zext_to_i32(ptr addrspace(1) %out
; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b32_e32 v1, 2, v1
; GFX11-SDAG-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-SDAG-TRUE16-NEXT: global_load_d16_b16 v0, v0, s[2:3]
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-SDAG-TRUE16-NEXT: v_sub_nc_u16 v0.l, v0.l, 64
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-SDAG-TRUE16-NEXT: global_store_b32 v1, v0, s[0:1]
; GFX11-SDAG-TRUE16-NEXT: s_endpgm
;
@@ -1559,9 +1560,10 @@ define amdgpu_kernel void @v_test_i16_x_sub_64_zext_to_i32(ptr addrspace(1) %out
; GFX11-GISEL-TRUE16-NEXT: v_lshlrev_b32_e32 v1, 2, v1
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: global_load_d16_b16 v0, v0, s[2:3]
-; GFX11-GISEL-TRUE16-NEXT: v_mov_b16_e32 v0.h, 0
; GFX11-GISEL-TRUE16-NEXT: s_waitcnt vmcnt(0)
; GFX11-GISEL-TRUE16-NEXT: v_add_nc_u16 v0.l, 0xffc0, v0.l
+; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-TRUE16-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX11-GISEL-TRUE16-NEXT: global_store_b32 v1, v0, s[0:1]
; GFX11-GISEL-TRUE16-NEXT: s_endpgm
;
diff --git a/llvm/test/CodeGen/AMDGPU/vector-reduce-add.ll b/llvm/test/CodeGen/AMDGPU/vector-reduce-add.ll
index 334215125f58a..30ed6ae5484c6 100644
--- a/llvm/test/CodeGen/AMDGPU/vector-reduce-add.ll
+++ b/llvm/test/CodeGen/AMDGPU/vector-reduce-add.ll
@@ -300,15 +300,17 @@ define i8 @test_vector_reduce_add_v4i8(<4 x i8> %v) {
; GFX11-SDAG-TRUE16-LABEL: test_vector_reduce_add_v4i8:
; GFX11-SDAG-TRUE16: ; %bb.0: ; %entry
; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v3.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v3.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v2.l
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v0.h
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v2.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -346,15 +348,17 @@ define i8 @test_vector_reduce_add_v4i8(<4 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: s_wait_samplecnt 0x0
; GFX12-SDAG-TRUE16-NEXT: s_wait_bvhcnt 0x0
; GFX12-SDAG-TRUE16-NEXT: s_wait_kmcnt 0x0
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v3.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v3.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v2.l
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v0.h
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v2.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
-; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v2
+; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -514,19 +518,21 @@ define i8 @test_vector_reduce_add_v8i8(<8 x i8> %v) {
; GFX11-SDAG-TRUE16-LABEL: test_vector_reduce_add_v8i8:
; GFX11-SDAG-TRUE16: ; %bb.0: ; %entry
; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v2.l, v6.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v3.l, v7.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v5.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v4.l
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v1.h
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v2.l, v6.l
+; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v0.h
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v2.l
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v1.h
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v0.h
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v2.l
; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
@@ -575,19 +581,21 @@ define i8 @test_vector_reduce_add_v8i8(<8 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: s_wait_samplecnt 0x0
; GFX12-SDAG-TRUE16-NEXT: s_wait_bvhcnt 0x0
; GFX12-SDAG-TRUE16-NEXT: s_wait_kmcnt 0x0
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v2.l, v6.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v3.l, v7.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v5.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v4.l
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v1.h
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v2.l, v6.l
+; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v0.h
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v2.l
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v1.h
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v0.h
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_4)
-; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v2.l
; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
@@ -824,25 +832,28 @@ define i8 @test_vector_reduce_add_v16i8(<16 x i8> %v) {
; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v5.l, v13.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v9.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v5.l, v7.l, v15.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v6.l, v14.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.h, v7.l, v15.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.l, v3.l, v11.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.h, v6.l, v14.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.l, v2.l, v10.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.h, v3.l, v11.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.l, v4.l, v12.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v8.l
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v1.h
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v2.l, v0.h
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.l, v2.h, v5.l
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v3.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, v12.l
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v3.l, v3.h
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v8.l
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v0.h
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v2.l, v2.h
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.h
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v2.l
-; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v0.h
+; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v0.h
+; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v2.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v2.l
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
; GFX11-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
@@ -911,25 +922,28 @@ define i8 @test_vector_reduce_add_v16i8(<16 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v5.l, v13.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v9.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v5.l, v7.l, v15.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v6.l, v14.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.h, v7.l, v15.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.l, v3.l, v11.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.h, v6.l, v14.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.l, v2.l, v10.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.h, v3.l, v11.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v3.l, v4.l, v12.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v8.l
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v1.h
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v2.l, v0.h
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v2.l, v2.h, v5.l
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v3.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.h, v4.l, v12.l
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v3.l, v3.h
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v8.l
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_3)
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.h, v1.l, v0.h
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v2.l, v2.h
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.h
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v1.l, v1.l, v2.l
-; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v0.h
+; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v2.l, 8, v0.h
+; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v2.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v1.l
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
+; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v2.l
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
; GFX12-SDAG-TRUE16-NEXT: v_add_nc_u16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/vector-reduce-umin.ll b/llvm/test/CodeGen/AMDGPU/vector-reduce-umin.ll
index 1d3b42ee43b0f..aab0e76410ccb 100644
--- a/llvm/test/CodeGen/AMDGPU/vector-reduce-umin.ll
+++ b/llvm/test/CodeGen/AMDGPU/vector-reduce-umin.ll
@@ -374,12 +374,13 @@ define i8 @test_vector_reduce_umin_v4i8(<4 x i8> %v) {
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v1.l, v1.l, v1.h
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v0.h, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -426,12 +427,13 @@ define i8 @test_vector_reduce_umin_v4i8(<4 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v2.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v1.l, v1.l, v1.h
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v0.h, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -622,20 +624,22 @@ define i8 @test_vector_reduce_umin_v8i8(<8 x i8> %v) {
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v7.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v0.h, v1.l, v1.h
-; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v4.l
-; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v0.h, v0.h, v3.l, v3.h
-; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v1.l, v1.l, v1.h
+; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v0.h
-; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v2.l, v1.h
+; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v0.h
+; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v1.l, v1.l, v3.l, v3.h
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v3.l, v1.l
+; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
+; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v1.h, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v3
+; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -699,20 +703,22 @@ define i8 @test_vector_reduce_umin_v8i8(<8 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v3.h, 0xff, v7.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v3.l, 0xff, v3.l
+; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.h, 0xff, v4.l
; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
-; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v2.l, 0xff, v2.l
-; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v0.h, v1.l, v1.h
-; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v4.l
-; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v6.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v0.h, v0.h, v3.l, v3.h
-; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v3.h, 0
+; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v1.l, v1.l, v1.h
+; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.h, 0xff, v2.l
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_3)
-; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v3.l, 8, v0.h
-; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v2.l, v1.h
+; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v0.h
+; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v1.l, v1.l, v3.l, v3.h
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v1.l, 8, v1.l
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v3.l, v1.l
+; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v1.l, 0xff, v6.l
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v2, 0xffff, v3
+; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v1.h, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v3
+; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v2
; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1041,12 +1047,14 @@ define i8 @test_vector_reduce_umin_v16i8(<16 x i8> %v) {
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX11-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v1.h, v1.l
; GFX11-SDAG-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v0.h
-; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
+; GFX11-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
+; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-SDAG-TRUE16-NEXT: v_or_b16 v1.l, v0.l, v0.h
-; GFX11-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.l
-; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX11-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
; GFX11-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
@@ -1168,12 +1176,14 @@ define i8 @test_vector_reduce_umin_v16i8(<16 x i8> %v) {
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX12-SDAG-TRUE16-NEXT: v_min3_u16 v0.l, v0.l, v1.h, v1.l
; GFX12-SDAG-TRUE16-NEXT: v_lshlrev_b16 v0.h, 8, v0.h
-; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.h, 0
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v0.l, v0.l, v0.h
+; GFX12-SDAG-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
+; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v0.l
; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-SDAG-TRUE16-NEXT: v_or_b16 v1.l, v0.l, v0.h
-; GFX12-SDAG-TRUE16-NEXT: v_and_b16 v0.l, 0xff, v1.l
-; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-SDAG-TRUE16-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX12-SDAG-TRUE16-NEXT: v_lshrrev_b32_e32 v1, 8, v1
+; GFX12-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX12-SDAG-TRUE16-NEXT: v_min_u16 v0.l, v0.l, v1.l
; GFX12-SDAG-TRUE16-NEXT: s_setpc_b64 s[30:31]
;
>From 8d256733a05ceeda8b854cc7665724c425236673 Mon Sep 17 00:00:00 2001
From: Jordan Rupprecht <rupprecht at google.com>
Date: Mon, 18 Aug 2025 13:07:05 -0500
Subject: [PATCH 064/112] [bazel] Port #151175: VectorFromElementsLowering
(#154169)
---
utils/bazel/llvm-project-overlay/mlir/BUILD.bazel | 1 +
1 file changed, 1 insertion(+)
diff --git a/utils/bazel/llvm-project-overlay/mlir/BUILD.bazel b/utils/bazel/llvm-project-overlay/mlir/BUILD.bazel
index 763dbdbaee26f..61c4673b6ac10 100644
--- a/utils/bazel/llvm-project-overlay/mlir/BUILD.bazel
+++ b/utils/bazel/llvm-project-overlay/mlir/BUILD.bazel
@@ -4920,6 +4920,7 @@ cc_library(
":MemRefDialect",
":Support",
":TensorDialect",
+ ":UBDialect",
":VectorDialect",
"//llvm:Support",
],
>From 064f02dac0c81c19350a74415b3245f42fed09dc Mon Sep 17 00:00:00 2001
From: Kyle Wang <ec1wng at gmail.com>
Date: Mon, 18 Aug 2025 11:16:32 -0700
Subject: [PATCH 065/112] [VectorCombine] Preserve scoped alias metadata
(#153714)
Currently, when a load is scalarized, its `!alias.scope` and `!noalias`
metadata are dropped. This PR preserves them when they are present.
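
As a rough illustration of the arithmetic involved (not the code in this
patch), the sketch below models the byte offset passed when adjusting the
metadata for a constant index, and the alignment a scalar load can keep at
that offset. The helper names `elementByteOffset` and `scalarLoadAlign` are
hypothetical, and the alignment rule (largest power of two dividing both the
vector alignment and the offset) is an assumption about what
`computeAlignmentAfterScalarization` does for constant indices. For a
`<4 x i8>` load with `align 4` it gives 4, 1, 2, 1 for elements 0 to 3, the
same values that appear in the CHECK lines of the new test.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: byte offset of element `index` when each element
// occupies `elemStoreSize` bytes (models Idx * getTypeStoreSize(ElemType)).
std::uint64_t elementByteOffset(std::uint64_t index,
                                std::uint64_t elemStoreSize) {
  return index * elemStoreSize;
}

// Hypothetical helper: alignment that still holds at `offset` bytes past a
// pointer aligned to `vecAlign`. Assumption: this is the largest power of
// two that divides both values.
std::uint64_t scalarLoadAlign(std::uint64_t vecAlign, std::uint64_t offset) {
  assert(vecAlign != 0 && (vecAlign & (vecAlign - 1)) == 0 &&
         "alignment must be a power of two");
  std::uint64_t v = vecAlign | offset;
  return v & (~v + 1); // isolate the lowest set bit
}
```

Only constant indices get the adjusted metadata in the patch; for a variable
index the byte offset is unknown, so no such offset can be computed.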
---
.../Transforms/Vectorize/VectorCombine.cpp | 16 ++++--
llvm/test/Transforms/VectorCombine/alias.ll | 56 +++++++++++++++++++
2 files changed, 68 insertions(+), 4 deletions(-)
create mode 100644 llvm/test/Transforms/VectorCombine/alias.ll
diff --git a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
index 4e2a5c78e0ac8..1275d53a075b5 100644
--- a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
+++ b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
@@ -1812,6 +1812,8 @@ bool VectorCombine::scalarizeLoadExtract(Instruction &I) {
// erased in the correct order.
Worklist.push(LI);
+ Type *ElemType = VecTy->getElementType();
+
// Replace extracts with narrow scalar loads.
for (User *U : LI->users()) {
auto *EI = cast<ExtractElementInst>(U);
@@ -1825,13 +1827,19 @@ bool VectorCombine::scalarizeLoadExtract(Instruction &I) {
Builder.SetInsertPoint(EI);
Value *GEP =
Builder.CreateInBoundsGEP(VecTy, Ptr, {Builder.getInt32(0), Idx});
- auto *NewLoad = cast<LoadInst>(Builder.CreateLoad(
- VecTy->getElementType(), GEP, EI->getName() + ".scalar"));
+ auto *NewLoad = cast<LoadInst>(
+ Builder.CreateLoad(ElemType, GEP, EI->getName() + ".scalar"));
- Align ScalarOpAlignment = computeAlignmentAfterScalarization(
- LI->getAlign(), VecTy->getElementType(), Idx, *DL);
+ Align ScalarOpAlignment =
+ computeAlignmentAfterScalarization(LI->getAlign(), ElemType, Idx, *DL);
NewLoad->setAlignment(ScalarOpAlignment);
+ if (auto *ConstIdx = dyn_cast<ConstantInt>(Idx)) {
+ size_t Offset = ConstIdx->getZExtValue() * DL->getTypeStoreSize(ElemType);
+ AAMDNodes OldAAMD = LI->getAAMetadata();
+ NewLoad->setAAMetadata(OldAAMD.adjustForAccess(Offset, ElemType, *DL));
+ }
+
replaceValue(*EI, *NewLoad, false);
}
diff --git a/llvm/test/Transforms/VectorCombine/alias.ll b/llvm/test/Transforms/VectorCombine/alias.ll
new file mode 100644
index 0000000000000..459956cd997d8
--- /dev/null
+++ b/llvm/test/Transforms/VectorCombine/alias.ll
@@ -0,0 +1,56 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -passes=vector-combine -S | FileCheck %s --check-prefixes=CHECK
+
+define <4 x i32> @quux(ptr addrspace(3) %arg) {
+; CHECK-LABEL: define <4 x i32> @quux(
+; CHECK-SAME: ptr addrspace(3) [[ARG:%.*]]) {
+; CHECK-NEXT: [[BB:.*:]]
+; CHECK-NEXT: [[EXTRACTELEMENT:%.*]] = load i8, ptr addrspace(3) [[ARG]], align 4, !tbaa [[TBAA0:![0-9]+]], !alias.scope [[META0:![0-9]+]], !noalias [[META0]]
+; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds <4 x i8>, ptr addrspace(3) [[ARG]], i32 0, i64 1
+; CHECK-NEXT: [[EXTRACTELEMENT1:%.*]] = load i8, ptr addrspace(3) [[TMP0]], align 1, !tbaa [[TBAA0]], !alias.scope [[META0]], !noalias [[META0]]
+; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i8>, ptr addrspace(3) [[ARG]], i32 0, i64 2
+; CHECK-NEXT: [[EXTRACTELEMENT2:%.*]] = load i8, ptr addrspace(3) [[TMP1]], align 2, !tbaa [[TBAA0]], !alias.scope [[META0]], !noalias [[META0]]
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds <4 x i8>, ptr addrspace(3) [[ARG]], i32 0, i64 3
+; CHECK-NEXT: [[EXTRACTELEMENT3:%.*]] = load i8, ptr addrspace(3) [[TMP2]], align 1, !tbaa [[TBAA0]], !alias.scope [[META0]], !noalias [[META0]]
+; CHECK-NEXT: [[ZEXT:%.*]] = zext i8 [[EXTRACTELEMENT]] to i32
+; CHECK-NEXT: [[ZEXT4:%.*]] = zext i8 [[EXTRACTELEMENT1]] to i32
+; CHECK-NEXT: [[ZEXT5:%.*]] = zext i8 [[EXTRACTELEMENT2]] to i32
+; CHECK-NEXT: [[ZEXT6:%.*]] = zext i8 [[EXTRACTELEMENT3]] to i32
+; CHECK-NEXT: [[INSERTELEMENT:%.*]] = insertelement <4 x i32> poison, i32 [[ZEXT]], i64 0
+; CHECK-NEXT: [[INSERTELEMENT7:%.*]] = insertelement <4 x i32> [[INSERTELEMENT]], i32 [[ZEXT4]], i64 1
+; CHECK-NEXT: [[INSERTELEMENT8:%.*]] = insertelement <4 x i32> [[INSERTELEMENT7]], i32 [[ZEXT5]], i64 2
+; CHECK-NEXT: [[INSERTELEMENT9:%.*]] = insertelement <4 x i32> [[INSERTELEMENT8]], i32 [[ZEXT6]], i64 3
+; CHECK-NEXT: ret <4 x i32> [[INSERTELEMENT9]]
+;
+bb:
+ %load = load <4 x i8>, ptr addrspace(3) %arg, align 4, !alias.scope !0, !noalias !0, !tbaa !5
+ %extractelement = extractelement <4 x i8> %load, i64 0
+ %extractelement1 = extractelement <4 x i8> %load, i64 1
+ %extractelement2 = extractelement <4 x i8> %load, i64 2
+ %extractelement3 = extractelement <4 x i8> %load, i64 3
+ %zext = zext i8 %extractelement to i32
+ %zext4 = zext i8 %extractelement1 to i32
+ %zext5 = zext i8 %extractelement2 to i32
+ %zext6 = zext i8 %extractelement3 to i32
+ %insertelement = insertelement <4 x i32> poison, i32 %zext, i64 0
+ %insertelement7 = insertelement <4 x i32> %insertelement, i32 %zext4, i64 1
+ %insertelement8 = insertelement <4 x i32> %insertelement7, i32 %zext5, i64 2
+ %insertelement9 = insertelement <4 x i32> %insertelement8, i32 %zext6, i64 3
+ ret <4 x i32> %insertelement9
+}
+
+!0 = !{!1}
+!1 = distinct !{!1, !2}
+!2 = distinct !{!2}
+!3 = !{!"Simple C/C++ TBAA"}
+!4 = !{!"omnipotent char", !3, i64 0}
+!5 = !{!"i8", !4, i64 0}
+;.
+; CHECK: [[TBAA0]] = !{[[META3:![0-9]+]], [[META3]], i64 0, i64 0}
+; CHECK: [[META3]] = !{!"i8", [[META4:![0-9]+]]}
+; CHECK: [[META4]] = !{!"omnipotent char", [[META5:![0-9]+]], i64 0}
+; CHECK: [[META5]] = !{!"Simple C/C++ TBAA"}
+; CHECK: [[META0]] = !{[[META1:![0-9]+]]}
+; CHECK: [[META1]] = distinct !{[[META1]], [[META2:![0-9]+]]}
+; CHECK: [[META2]] = distinct !{[[META2]]}
+;.
\ No newline at end of file
>From ade755d62b70eae9dfc460f19f0da7ab80e9a1fd Mon Sep 17 00:00:00 2001
From: Thurston Dang <thurston at google.com>
Date: Mon, 18 Aug 2025 11:31:15 -0700
Subject: [PATCH 066/112] [msan] Add Instrumentation for Avx512 Instructions:
pmaddw, pmaddubs (#153919)
This applies the pmadd handler (recently improved in https://github.com/llvm/llvm-project/pull/153353) to the AVX512
equivalents of the pmaddw and pmaddubs intrinsics:
<16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16>, <32 x i16>)
<32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8>, <64 x i8>)
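
For readers unfamiliar with the handler, the following is a scalar model of
the shadow computation as it appears in the updated CHECK lines below; it is
a sketch inferred from that test output, not the MemorySanitizer
implementation. The function name `pmaddShadowModel` and the use of
`std::vector` are illustrative assumptions. A lane is treated as poisoned if
both operands are uninitialized, or if one operand is uninitialized and the
other is non-zero (so multiplying by an initialized zero stays clean).
Adjacent lanes are then OR-ed pairwise, matching `ReductionFactor=2`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model (assumption, inferred from the CHECK lines): returns one flag
// per output element; true means that element's shadow is all-ones.
// a, b   : operand values, one entry per input lane
// sa, sb : operand shadows, non-zero means "uninitialized bits present"
std::vector<bool> pmaddShadowModel(const std::vector<std::int16_t> &a,
                                   const std::vector<std::int16_t> &b,
                                   const std::vector<std::uint16_t> &sa,
                                   const std::vector<std::uint16_t> &sb) {
  assert(a.size() == b.size() && a.size() == sa.size() &&
         a.size() == sb.size() && a.size() % 2 == 0);
  std::vector<bool> out(a.size() / 2, false);
  for (std::size_t i = 0; i < a.size(); ++i) {
    bool saSet = sa[i] != 0, sbSet = sb[i] != 0;
    bool aNonZero = a[i] != 0, bNonZero = b[i] != 0;
    // Mirrors the three `and` terms combined with `or` in the test output.
    bool lanePoisoned =
        (saSet && sbSet) || (aNonZero && sbSet) || (saSet && bNonZero);
    // Two adjacent input lanes feed one output element.
    out[i / 2] = out[i / 2] || lanePoisoned;
  }
  return out;
}
```

In this model an output element is either fully clean or fully poisoned,
which corresponds to the final `icmp ne` followed by `sext` on the widened
vector in the expected IR.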
---
.../Instrumentation/MemorySanitizer.cpp | 18 +++
.../X86/avx512bw-intrinsics-upgrade.ll | 114 ++++++++++--------
.../X86/avx512bw-intrinsics.ll | 113 +++++++++--------
3 files changed, 142 insertions(+), 103 deletions(-)
diff --git a/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp b/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
index 6b394f5338687..7865a90707400 100644
--- a/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
+++ b/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
@@ -5486,14 +5486,32 @@ struct MemorySanitizerVisitor : public InstVisitor<MemorySanitizerVisitor> {
// Multiply and Add Packed Words
// < 4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16>, <8 x i16>)
// < 8 x i32> @llvm.x86.avx2.pmadd.wd(<16 x i16>, <16 x i16>)
+ // <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16>, <32 x i16>)
//
// Multiply and Add Packed Signed and Unsigned Bytes
// < 8 x i16> @llvm.x86.ssse3.pmadd.ub.sw.128(<16 x i8>, <16 x i8>)
// <16 x i16> @llvm.x86.avx2.pmadd.ub.sw(<32 x i8>, <32 x i8>)
+ // <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8>, <64 x i8>)
+ //
+ // These intrinsics are auto-upgraded into non-masked forms:
+ // < 4 x i32> @llvm.x86.avx512.mask.pmaddw.d.128
+ // (<8 x i16>, <8 x i16>, <4 x i32>, i8)
+ // < 8 x i32> @llvm.x86.avx512.mask.pmaddw.d.256
+ // (<16 x i16>, <16 x i16>, <8 x i32>, i8)
+ // <16 x i32> @llvm.x86.avx512.mask.pmaddw.d.512
+ // (<32 x i16>, <32 x i16>, <16 x i32>, i16)
+ // < 8 x i16> @llvm.x86.avx512.mask.pmaddubs.w.128
+ // (<16 x i8>, <16 x i8>, <8 x i16>, i8)
+ // <16 x i16> @llvm.x86.avx512.mask.pmaddubs.w.256
+ // (<32 x i8>, <32 x i8>, <16 x i16>, i16)
+ // <32 x i16> @llvm.x86.avx512.mask.pmaddubs.w.512
+ // (<64 x i8>, <64 x i8>, <32 x i16>, i32)
case Intrinsic::x86_sse2_pmadd_wd:
case Intrinsic::x86_avx2_pmadd_wd:
+ case Intrinsic::x86_avx512_pmaddw_d_512:
case Intrinsic::x86_ssse3_pmadd_ub_sw_128:
case Intrinsic::x86_avx2_pmadd_ub_sw:
+ case Intrinsic::x86_avx512_pmaddubs_w_512:
handleVectorPmaddIntrinsic(I, /*ReductionFactor=*/2);
break;
diff --git a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512bw-intrinsics-upgrade.ll b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512bw-intrinsics-upgrade.ll
index abbbb040edf1b..51dad35a1edbc 100644
--- a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512bw-intrinsics-upgrade.ll
+++ b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512bw-intrinsics-upgrade.ll
@@ -7,8 +7,6 @@
; - llvm.x86.avx512.dbpsadbw.512
; - llvm.x86.avx512.packssdw.512, llvm.x86.avx512.packsswb.512
; - llvm.x86.avx512.packusdw.512, llvm.x86.avx512.packuswb.512
-; - llvm.x86.avx512.pmaddubs.w.512
-; - llvm.x86.avx512.pmaddw.d.512
;
; Heuristically handled:
; - llvm.sadd.sat.v32i16, llvm.sadd.sat.v64i8
@@ -4931,18 +4929,21 @@ define <32 x i16> @test_int_x86_avx512_pmaddubs_w_512(<64 x i8> %x0, <64 x i8> %
; CHECK-NEXT: [[TMP1:%.*]] = load <64 x i8>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <64 x i8>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[TMP3:%.*]] = bitcast <64 x i8> [[TMP1]] to i512
-; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i512 [[TMP3]], 0
-; CHECK-NEXT: [[TMP4:%.*]] = bitcast <64 x i8> [[TMP2]] to i512
-; CHECK-NEXT: [[_MSCMP1:%.*]] = icmp ne i512 [[TMP4]], 0
-; CHECK-NEXT: [[_MSOR:%.*]] = or i1 [[_MSCMP]], [[_MSCMP1]]
-; CHECK-NEXT: br i1 [[_MSOR]], label [[TMP5:%.*]], label [[TMP6:%.*]], !prof [[PROF1]]
-; CHECK: 5:
-; CHECK-NEXT: call void @__msan_warning_noreturn() #[[ATTR7]]
-; CHECK-NEXT: unreachable
-; CHECK: 6:
-; CHECK-NEXT: [[TMP7:%.*]] = call <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8> [[X0:%.*]], <64 x i8> [[X1:%.*]])
-; CHECK-NEXT: store <32 x i16> zeroinitializer, ptr @__msan_retval_tls, align 8
+; CHECK-NEXT: [[TMP3:%.*]] = icmp ne <64 x i8> [[TMP1]], zeroinitializer
+; CHECK-NEXT: [[TMP4:%.*]] = icmp ne <64 x i8> [[TMP2]], zeroinitializer
+; CHECK-NEXT: [[TMP5:%.*]] = icmp ne <64 x i8> [[X0:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ne <64 x i8> [[X1:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = and <64 x i1> [[TMP3]], [[TMP4]]
+; CHECK-NEXT: [[TMP8:%.*]] = and <64 x i1> [[TMP5]], [[TMP4]]
+; CHECK-NEXT: [[TMP9:%.*]] = and <64 x i1> [[TMP3]], [[TMP6]]
+; CHECK-NEXT: [[TMP10:%.*]] = or <64 x i1> [[TMP17]], [[TMP8]]
+; CHECK-NEXT: [[TMP11:%.*]] = or <64 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP12:%.*]] = sext <64 x i1> [[TMP11]] to <64 x i8>
+; CHECK-NEXT: [[TMP13:%.*]] = bitcast <64 x i8> [[TMP12]] to <32 x i16>
+; CHECK-NEXT: [[TMP14:%.*]] = icmp ne <32 x i16> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP16:%.*]] = sext <32 x i1> [[TMP14]] to <32 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = call <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8> [[X0]], <64 x i8> [[X1]])
+; CHECK-NEXT: store <32 x i16> [[TMP16]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <32 x i16> [[TMP7]]
;
%res = call <32 x i16> @llvm.x86.avx512.mask.pmaddubs.w.512(<64 x i8> %x0, <64 x i8> %x1, <32 x i16> %x2, i32 -1)
@@ -4956,22 +4957,25 @@ define <32 x i16> @test_int_x86_avx512_mask_pmaddubs_w_512(<64 x i8> %x0, <64 x
; CHECK-NEXT: [[TMP3:%.*]] = load i32, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 192) to ptr), align 8
; CHECK-NEXT: [[TMP4:%.*]] = load <32 x i16>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[TMP5:%.*]] = bitcast <64 x i8> [[TMP1]] to i512
-; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i512 [[TMP5]], 0
-; CHECK-NEXT: [[TMP6:%.*]] = bitcast <64 x i8> [[TMP2]] to i512
-; CHECK-NEXT: [[_MSCMP1:%.*]] = icmp ne i512 [[TMP6]], 0
-; CHECK-NEXT: [[_MSOR:%.*]] = or i1 [[_MSCMP]], [[_MSCMP1]]
-; CHECK-NEXT: br i1 [[_MSOR]], label [[TMP7:%.*]], label [[TMP8:%.*]], !prof [[PROF1]]
-; CHECK: 7:
-; CHECK-NEXT: call void @__msan_warning_noreturn() #[[ATTR7]]
-; CHECK-NEXT: unreachable
-; CHECK: 8:
-; CHECK-NEXT: [[TMP9:%.*]] = call <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8> [[X0:%.*]], <64 x i8> [[X1:%.*]])
+; CHECK-NEXT: [[TMP5:%.*]] = icmp ne <64 x i8> [[TMP1]], zeroinitializer
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ne <64 x i8> [[TMP2]], zeroinitializer
+; CHECK-NEXT: [[TMP7:%.*]] = icmp ne <64 x i8> [[X0:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <64 x i8> [[X1:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP19:%.*]] = and <64 x i1> [[TMP5]], [[TMP6]]
+; CHECK-NEXT: [[TMP20:%.*]] = and <64 x i1> [[TMP7]], [[TMP6]]
+; CHECK-NEXT: [[TMP21:%.*]] = and <64 x i1> [[TMP5]], [[TMP8]]
+; CHECK-NEXT: [[TMP22:%.*]] = or <64 x i1> [[TMP19]], [[TMP20]]
+; CHECK-NEXT: [[TMP23:%.*]] = or <64 x i1> [[TMP22]], [[TMP21]]
+; CHECK-NEXT: [[TMP24:%.*]] = sext <64 x i1> [[TMP23]] to <64 x i8>
+; CHECK-NEXT: [[TMP17:%.*]] = bitcast <64 x i8> [[TMP24]] to <32 x i16>
+; CHECK-NEXT: [[TMP25:%.*]] = icmp ne <32 x i16> [[TMP17]], zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = sext <32 x i1> [[TMP25]] to <32 x i16>
+; CHECK-NEXT: [[TMP9:%.*]] = call <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8> [[X0]], <64 x i8> [[X1]])
; CHECK-NEXT: [[TMP10:%.*]] = bitcast i32 [[TMP3]] to <32 x i1>
; CHECK-NEXT: [[TMP11:%.*]] = bitcast i32 [[X3:%.*]] to <32 x i1>
-; CHECK-NEXT: [[TMP12:%.*]] = select <32 x i1> [[TMP11]], <32 x i16> zeroinitializer, <32 x i16> [[TMP4]]
+; CHECK-NEXT: [[TMP12:%.*]] = select <32 x i1> [[TMP11]], <32 x i16> [[TMP18]], <32 x i16> [[TMP4]]
; CHECK-NEXT: [[TMP13:%.*]] = xor <32 x i16> [[TMP9]], [[X2:%.*]]
-; CHECK-NEXT: [[TMP14:%.*]] = or <32 x i16> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP14:%.*]] = or <32 x i16> [[TMP13]], [[TMP18]]
; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i16> [[TMP14]], [[TMP4]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <32 x i1> [[TMP10]], <32 x i16> [[TMP15]], <32 x i16> [[TMP12]]
; CHECK-NEXT: [[TMP16:%.*]] = select <32 x i1> [[TMP11]], <32 x i16> [[TMP9]], <32 x i16> [[X2]]
@@ -4989,18 +4993,21 @@ define <16 x i32> @test_int_x86_avx512_pmaddw_d_512(<32 x i16> %x0, <32 x i16> %
; CHECK-NEXT: [[TMP1:%.*]] = load <32 x i16>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <32 x i16>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[TMP3:%.*]] = bitcast <32 x i16> [[TMP1]] to i512
-; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i512 [[TMP3]], 0
-; CHECK-NEXT: [[TMP4:%.*]] = bitcast <32 x i16> [[TMP2]] to i512
-; CHECK-NEXT: [[_MSCMP1:%.*]] = icmp ne i512 [[TMP4]], 0
-; CHECK-NEXT: [[_MSOR:%.*]] = or i1 [[_MSCMP]], [[_MSCMP1]]
-; CHECK-NEXT: br i1 [[_MSOR]], label [[TMP5:%.*]], label [[TMP6:%.*]], !prof [[PROF1]]
-; CHECK: 5:
-; CHECK-NEXT: call void @__msan_warning_noreturn() #[[ATTR7]]
-; CHECK-NEXT: unreachable
-; CHECK: 6:
-; CHECK-NEXT: [[TMP7:%.*]] = call <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16> [[X0:%.*]], <32 x i16> [[X1:%.*]])
-; CHECK-NEXT: store <16 x i32> zeroinitializer, ptr @__msan_retval_tls, align 8
+; CHECK-NEXT: [[TMP3:%.*]] = icmp ne <32 x i16> [[TMP1]], zeroinitializer
+; CHECK-NEXT: [[TMP4:%.*]] = icmp ne <32 x i16> [[TMP2]], zeroinitializer
+; CHECK-NEXT: [[TMP5:%.*]] = icmp ne <32 x i16> [[X0:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ne <32 x i16> [[X1:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = and <32 x i1> [[TMP3]], [[TMP4]]
+; CHECK-NEXT: [[TMP8:%.*]] = and <32 x i1> [[TMP5]], [[TMP4]]
+; CHECK-NEXT: [[TMP9:%.*]] = and <32 x i1> [[TMP3]], [[TMP6]]
+; CHECK-NEXT: [[TMP10:%.*]] = or <32 x i1> [[TMP17]], [[TMP8]]
+; CHECK-NEXT: [[TMP11:%.*]] = or <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP12:%.*]] = sext <32 x i1> [[TMP11]] to <32 x i16>
+; CHECK-NEXT: [[TMP13:%.*]] = bitcast <32 x i16> [[TMP12]] to <16 x i32>
+; CHECK-NEXT: [[TMP14:%.*]] = icmp ne <16 x i32> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP16:%.*]] = sext <16 x i1> [[TMP14]] to <16 x i32>
+; CHECK-NEXT: [[TMP7:%.*]] = call <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16> [[X0]], <32 x i16> [[X1]])
+; CHECK-NEXT: store <16 x i32> [[TMP16]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[TMP7]]
;
%res = call <16 x i32> @llvm.x86.avx512.mask.pmaddw.d.512(<32 x i16> %x0, <32 x i16> %x1, <16 x i32> %x2, i16 -1)
@@ -5014,22 +5021,25 @@ define <16 x i32> @test_int_x86_avx512_mask_pmaddw_d_512(<32 x i16> %x0, <32 x i
; CHECK-NEXT: [[TMP3:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 192) to ptr), align 8
; CHECK-NEXT: [[TMP4:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[TMP5:%.*]] = bitcast <32 x i16> [[TMP1]] to i512
-; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i512 [[TMP5]], 0
-; CHECK-NEXT: [[TMP6:%.*]] = bitcast <32 x i16> [[TMP2]] to i512
-; CHECK-NEXT: [[_MSCMP1:%.*]] = icmp ne i512 [[TMP6]], 0
-; CHECK-NEXT: [[_MSOR:%.*]] = or i1 [[_MSCMP]], [[_MSCMP1]]
-; CHECK-NEXT: br i1 [[_MSOR]], label [[TMP7:%.*]], label [[TMP8:%.*]], !prof [[PROF1]]
-; CHECK: 7:
-; CHECK-NEXT: call void @__msan_warning_noreturn() #[[ATTR7]]
-; CHECK-NEXT: unreachable
-; CHECK: 8:
-; CHECK-NEXT: [[TMP9:%.*]] = call <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16> [[X0:%.*]], <32 x i16> [[X1:%.*]])
+; CHECK-NEXT: [[TMP5:%.*]] = icmp ne <32 x i16> [[TMP1]], zeroinitializer
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ne <32 x i16> [[TMP2]], zeroinitializer
+; CHECK-NEXT: [[TMP7:%.*]] = icmp ne <32 x i16> [[X0:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i16> [[X1:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP19:%.*]] = and <32 x i1> [[TMP5]], [[TMP6]]
+; CHECK-NEXT: [[TMP20:%.*]] = and <32 x i1> [[TMP7]], [[TMP6]]
+; CHECK-NEXT: [[TMP21:%.*]] = and <32 x i1> [[TMP5]], [[TMP8]]
+; CHECK-NEXT: [[TMP22:%.*]] = or <32 x i1> [[TMP19]], [[TMP20]]
+; CHECK-NEXT: [[TMP23:%.*]] = or <32 x i1> [[TMP22]], [[TMP21]]
+; CHECK-NEXT: [[TMP24:%.*]] = sext <32 x i1> [[TMP23]] to <32 x i16>
+; CHECK-NEXT: [[TMP17:%.*]] = bitcast <32 x i16> [[TMP24]] to <16 x i32>
+; CHECK-NEXT: [[TMP25:%.*]] = icmp ne <16 x i32> [[TMP17]], zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = sext <16 x i1> [[TMP25]] to <16 x i32>
+; CHECK-NEXT: [[TMP9:%.*]] = call <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16> [[X0]], <32 x i16> [[X1]])
; CHECK-NEXT: [[TMP10:%.*]] = bitcast i16 [[TMP3]] to <16 x i1>
; CHECK-NEXT: [[TMP11:%.*]] = bitcast i16 [[X3:%.*]] to <16 x i1>
-; CHECK-NEXT: [[TMP12:%.*]] = select <16 x i1> [[TMP11]], <16 x i32> zeroinitializer, <16 x i32> [[TMP4]]
+; CHECK-NEXT: [[TMP12:%.*]] = select <16 x i1> [[TMP11]], <16 x i32> [[TMP18]], <16 x i32> [[TMP4]]
; CHECK-NEXT: [[TMP13:%.*]] = xor <16 x i32> [[TMP9]], [[X2:%.*]]
-; CHECK-NEXT: [[TMP14:%.*]] = or <16 x i32> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP14:%.*]] = or <16 x i32> [[TMP13]], [[TMP18]]
; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i32> [[TMP14]], [[TMP4]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <16 x i1> [[TMP10]], <16 x i32> [[TMP15]], <16 x i32> [[TMP12]]
; CHECK-NEXT: [[TMP16:%.*]] = select <16 x i1> [[TMP11]], <16 x i32> [[TMP9]], <16 x i32> [[X2]]
diff --git a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512bw-intrinsics.ll b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512bw-intrinsics.ll
index 00337da67af11..c6c7e002213bd 100644
--- a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512bw-intrinsics.ll
+++ b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512bw-intrinsics.ll
@@ -9,7 +9,6 @@
; - llvm.x86.avx512.mask.pmov.wb.mem.512
; - llvm.x86.avx512.packssdw.512, llvm.x86.avx512.packsswb.512
; - llvm.x86.avx512.packusdw.512, llvm.x86.avx512.packuswb.512
-; - llvm.x86.avx512.pmaddubs.w.512, llvm.x86.avx512.pmaddw.d.512
; - llvm.x86.avx512.psad.bw.512
;
; Heuristically handled:
@@ -2206,18 +2205,21 @@ define <32 x i16> @test_int_x86_avx512_pmaddubs_w_512(<64 x i8> %x0, <64 x i8> %
; CHECK-NEXT: [[TMP1:%.*]] = load <64 x i8>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <64 x i8>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[TMP3:%.*]] = bitcast <64 x i8> [[TMP1]] to i512
-; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i512 [[TMP3]], 0
-; CHECK-NEXT: [[TMP4:%.*]] = bitcast <64 x i8> [[TMP2]] to i512
-; CHECK-NEXT: [[_MSCMP1:%.*]] = icmp ne i512 [[TMP4]], 0
-; CHECK-NEXT: [[_MSOR:%.*]] = or i1 [[_MSCMP]], [[_MSCMP1]]
-; CHECK-NEXT: br i1 [[_MSOR]], label [[TMP5:%.*]], label [[TMP6:%.*]], !prof [[PROF1]]
-; CHECK: 5:
-; CHECK-NEXT: call void @__msan_warning_noreturn() #[[ATTR8]]
-; CHECK-NEXT: unreachable
-; CHECK: 6:
-; CHECK-NEXT: [[TMP7:%.*]] = call <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8> [[X0:%.*]], <64 x i8> [[X1:%.*]])
-; CHECK-NEXT: store <32 x i16> zeroinitializer, ptr @__msan_retval_tls, align 8
+; CHECK-NEXT: [[TMP3:%.*]] = icmp ne <64 x i8> [[TMP1]], zeroinitializer
+; CHECK-NEXT: [[TMP4:%.*]] = icmp ne <64 x i8> [[TMP2]], zeroinitializer
+; CHECK-NEXT: [[TMP5:%.*]] = icmp ne <64 x i8> [[X0:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ne <64 x i8> [[X1:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = and <64 x i1> [[TMP3]], [[TMP4]]
+; CHECK-NEXT: [[TMP8:%.*]] = and <64 x i1> [[TMP5]], [[TMP4]]
+; CHECK-NEXT: [[TMP9:%.*]] = and <64 x i1> [[TMP3]], [[TMP6]]
+; CHECK-NEXT: [[TMP10:%.*]] = or <64 x i1> [[TMP17]], [[TMP8]]
+; CHECK-NEXT: [[TMP11:%.*]] = or <64 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP12:%.*]] = sext <64 x i1> [[TMP11]] to <64 x i8>
+; CHECK-NEXT: [[TMP13:%.*]] = bitcast <64 x i8> [[TMP12]] to <32 x i16>
+; CHECK-NEXT: [[TMP14:%.*]] = icmp ne <32 x i16> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP16:%.*]] = sext <32 x i1> [[TMP14]] to <32 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = call <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8> [[X0]], <64 x i8> [[X1]])
+; CHECK-NEXT: store <32 x i16> [[TMP16]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <32 x i16> [[TMP7]]
;
%1 = call <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8> %x0, <64 x i8> %x1)
@@ -2231,22 +2233,25 @@ define <32 x i16> @test_int_x86_avx512_mask_pmaddubs_w_512(<64 x i8> %x0, <64 x
; CHECK-NEXT: [[TMP3:%.*]] = load i32, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 192) to ptr), align 8
; CHECK-NEXT: [[TMP4:%.*]] = load <32 x i16>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[TMP5:%.*]] = bitcast <64 x i8> [[TMP1]] to i512
-; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i512 [[TMP5]], 0
-; CHECK-NEXT: [[TMP6:%.*]] = bitcast <64 x i8> [[TMP2]] to i512
-; CHECK-NEXT: [[_MSCMP1:%.*]] = icmp ne i512 [[TMP6]], 0
-; CHECK-NEXT: [[_MSOR:%.*]] = or i1 [[_MSCMP]], [[_MSCMP1]]
-; CHECK-NEXT: br i1 [[_MSOR]], label [[TMP7:%.*]], label [[TMP8:%.*]], !prof [[PROF1]]
-; CHECK: 7:
-; CHECK-NEXT: call void @__msan_warning_noreturn() #[[ATTR8]]
-; CHECK-NEXT: unreachable
-; CHECK: 8:
-; CHECK-NEXT: [[TMP9:%.*]] = call <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8> [[X0:%.*]], <64 x i8> [[X1:%.*]])
+; CHECK-NEXT: [[TMP5:%.*]] = icmp ne <64 x i8> [[TMP1]], zeroinitializer
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ne <64 x i8> [[TMP2]], zeroinitializer
+; CHECK-NEXT: [[TMP7:%.*]] = icmp ne <64 x i8> [[X0:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <64 x i8> [[X1:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP19:%.*]] = and <64 x i1> [[TMP5]], [[TMP6]]
+; CHECK-NEXT: [[TMP20:%.*]] = and <64 x i1> [[TMP7]], [[TMP6]]
+; CHECK-NEXT: [[TMP21:%.*]] = and <64 x i1> [[TMP5]], [[TMP8]]
+; CHECK-NEXT: [[TMP22:%.*]] = or <64 x i1> [[TMP19]], [[TMP20]]
+; CHECK-NEXT: [[TMP23:%.*]] = or <64 x i1> [[TMP22]], [[TMP21]]
+; CHECK-NEXT: [[TMP24:%.*]] = sext <64 x i1> [[TMP23]] to <64 x i8>
+; CHECK-NEXT: [[TMP17:%.*]] = bitcast <64 x i8> [[TMP24]] to <32 x i16>
+; CHECK-NEXT: [[TMP25:%.*]] = icmp ne <32 x i16> [[TMP17]], zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = sext <32 x i1> [[TMP25]] to <32 x i16>
+; CHECK-NEXT: [[TMP9:%.*]] = call <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8> [[X0]], <64 x i8> [[X1]])
; CHECK-NEXT: [[TMP10:%.*]] = bitcast i32 [[TMP3]] to <32 x i1>
; CHECK-NEXT: [[TMP11:%.*]] = bitcast i32 [[X3:%.*]] to <32 x i1>
-; CHECK-NEXT: [[TMP12:%.*]] = select <32 x i1> [[TMP11]], <32 x i16> zeroinitializer, <32 x i16> [[TMP4]]
+; CHECK-NEXT: [[TMP12:%.*]] = select <32 x i1> [[TMP11]], <32 x i16> [[TMP18]], <32 x i16> [[TMP4]]
; CHECK-NEXT: [[TMP13:%.*]] = xor <32 x i16> [[TMP9]], [[X2:%.*]]
-; CHECK-NEXT: [[TMP14:%.*]] = or <32 x i16> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP14:%.*]] = or <32 x i16> [[TMP13]], [[TMP18]]
; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i16> [[TMP14]], [[TMP4]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <32 x i1> [[TMP10]], <32 x i16> [[TMP15]], <32 x i16> [[TMP12]]
; CHECK-NEXT: [[TMP16:%.*]] = select <32 x i1> [[TMP11]], <32 x i16> [[TMP9]], <32 x i16> [[X2]]
@@ -2266,18 +2271,21 @@ define <16 x i32> @test_int_x86_avx512_pmaddw_d_512(<32 x i16> %x0, <32 x i16> %
; CHECK-NEXT: [[TMP1:%.*]] = load <32 x i16>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <32 x i16>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[TMP3:%.*]] = bitcast <32 x i16> [[TMP1]] to i512
-; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i512 [[TMP3]], 0
-; CHECK-NEXT: [[TMP4:%.*]] = bitcast <32 x i16> [[TMP2]] to i512
-; CHECK-NEXT: [[_MSCMP1:%.*]] = icmp ne i512 [[TMP4]], 0
-; CHECK-NEXT: [[_MSOR:%.*]] = or i1 [[_MSCMP]], [[_MSCMP1]]
-; CHECK-NEXT: br i1 [[_MSOR]], label [[TMP5:%.*]], label [[TMP6:%.*]], !prof [[PROF1]]
-; CHECK: 5:
-; CHECK-NEXT: call void @__msan_warning_noreturn() #[[ATTR8]]
-; CHECK-NEXT: unreachable
-; CHECK: 6:
-; CHECK-NEXT: [[TMP7:%.*]] = call <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16> [[X0:%.*]], <32 x i16> [[X1:%.*]])
-; CHECK-NEXT: store <16 x i32> zeroinitializer, ptr @__msan_retval_tls, align 8
+; CHECK-NEXT: [[TMP3:%.*]] = icmp ne <32 x i16> [[TMP1]], zeroinitializer
+; CHECK-NEXT: [[TMP4:%.*]] = icmp ne <32 x i16> [[TMP2]], zeroinitializer
+; CHECK-NEXT: [[TMP5:%.*]] = icmp ne <32 x i16> [[X0:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ne <32 x i16> [[X1:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = and <32 x i1> [[TMP3]], [[TMP4]]
+; CHECK-NEXT: [[TMP8:%.*]] = and <32 x i1> [[TMP5]], [[TMP4]]
+; CHECK-NEXT: [[TMP9:%.*]] = and <32 x i1> [[TMP3]], [[TMP6]]
+; CHECK-NEXT: [[TMP10:%.*]] = or <32 x i1> [[TMP17]], [[TMP8]]
+; CHECK-NEXT: [[TMP11:%.*]] = or <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP12:%.*]] = sext <32 x i1> [[TMP11]] to <32 x i16>
+; CHECK-NEXT: [[TMP13:%.*]] = bitcast <32 x i16> [[TMP12]] to <16 x i32>
+; CHECK-NEXT: [[TMP14:%.*]] = icmp ne <16 x i32> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP16:%.*]] = sext <16 x i1> [[TMP14]] to <16 x i32>
+; CHECK-NEXT: [[TMP7:%.*]] = call <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16> [[X0]], <32 x i16> [[X1]])
+; CHECK-NEXT: store <16 x i32> [[TMP16]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[TMP7]]
;
%1 = call <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16> %x0, <32 x i16> %x1)
@@ -2291,22 +2299,25 @@ define <16 x i32> @test_int_x86_avx512_mask_pmaddw_d_512(<32 x i16> %x0, <32 x i
; CHECK-NEXT: [[TMP3:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 192) to ptr), align 8
; CHECK-NEXT: [[TMP4:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[TMP5:%.*]] = bitcast <32 x i16> [[TMP1]] to i512
-; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i512 [[TMP5]], 0
-; CHECK-NEXT: [[TMP6:%.*]] = bitcast <32 x i16> [[TMP2]] to i512
-; CHECK-NEXT: [[_MSCMP1:%.*]] = icmp ne i512 [[TMP6]], 0
-; CHECK-NEXT: [[_MSOR:%.*]] = or i1 [[_MSCMP]], [[_MSCMP1]]
-; CHECK-NEXT: br i1 [[_MSOR]], label [[TMP7:%.*]], label [[TMP8:%.*]], !prof [[PROF1]]
-; CHECK: 7:
-; CHECK-NEXT: call void @__msan_warning_noreturn() #[[ATTR8]]
-; CHECK-NEXT: unreachable
-; CHECK: 8:
-; CHECK-NEXT: [[TMP9:%.*]] = call <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16> [[X0:%.*]], <32 x i16> [[X1:%.*]])
+; CHECK-NEXT: [[TMP5:%.*]] = icmp ne <32 x i16> [[TMP1]], zeroinitializer
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ne <32 x i16> [[TMP2]], zeroinitializer
+; CHECK-NEXT: [[TMP7:%.*]] = icmp ne <32 x i16> [[X0:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i16> [[X1:%.*]], zeroinitializer
+; CHECK-NEXT: [[TMP19:%.*]] = and <32 x i1> [[TMP5]], [[TMP6]]
+; CHECK-NEXT: [[TMP20:%.*]] = and <32 x i1> [[TMP7]], [[TMP6]]
+; CHECK-NEXT: [[TMP21:%.*]] = and <32 x i1> [[TMP5]], [[TMP8]]
+; CHECK-NEXT: [[TMP22:%.*]] = or <32 x i1> [[TMP19]], [[TMP20]]
+; CHECK-NEXT: [[TMP23:%.*]] = or <32 x i1> [[TMP22]], [[TMP21]]
+; CHECK-NEXT: [[TMP24:%.*]] = sext <32 x i1> [[TMP23]] to <32 x i16>
+; CHECK-NEXT: [[TMP17:%.*]] = bitcast <32 x i16> [[TMP24]] to <16 x i32>
+; CHECK-NEXT: [[TMP25:%.*]] = icmp ne <16 x i32> [[TMP17]], zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = sext <16 x i1> [[TMP25]] to <16 x i32>
+; CHECK-NEXT: [[TMP9:%.*]] = call <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16> [[X0]], <32 x i16> [[X1]])
; CHECK-NEXT: [[TMP10:%.*]] = bitcast i16 [[TMP3]] to <16 x i1>
; CHECK-NEXT: [[TMP11:%.*]] = bitcast i16 [[X3:%.*]] to <16 x i1>
-; CHECK-NEXT: [[TMP12:%.*]] = select <16 x i1> [[TMP11]], <16 x i32> zeroinitializer, <16 x i32> [[TMP4]]
+; CHECK-NEXT: [[TMP12:%.*]] = select <16 x i1> [[TMP11]], <16 x i32> [[TMP18]], <16 x i32> [[TMP4]]
; CHECK-NEXT: [[TMP13:%.*]] = xor <16 x i32> [[TMP9]], [[X2:%.*]]
-; CHECK-NEXT: [[TMP14:%.*]] = or <16 x i32> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP14:%.*]] = or <16 x i32> [[TMP13]], [[TMP18]]
; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i32> [[TMP14]], [[TMP4]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <16 x i1> [[TMP10]], <16 x i32> [[TMP15]], <16 x i32> [[TMP12]]
; CHECK-NEXT: [[TMP16:%.*]] = select <16 x i1> [[TMP11]], <16 x i32> [[TMP9]], <16 x i32> [[X2]]
>From 0fb1057e40110e558e0fef8e183e485c4d01311b Mon Sep 17 00:00:00 2001
From: Steven Perron <stevenperron at google.com>
Date: Mon, 18 Aug 2025 14:33:58 -0400
Subject: [PATCH 067/112] [SPIRV] Filter disallowed extensions for env
(#150051)
Not all SPIR-V extensions are allowed in every environment. When we use
the `-spirv-ext=all` option, the backend currently believes that all
extensions can be used.
This commit filters out the extensions on the command line to remove
those that are not known to be allowed for the current environment.
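
As a rough illustration of the filtering described above, here is a minimal standalone sketch. It is not the LLVM code: `Env`, `Ext`, `AllowedEnvs`, `validExtensions`, and `filterExtensions` are hypothetical names, and the extension-to-environment assignments are placeholders chosen only for the example.

```cpp
// Hypothetical, simplified sketch; none of these names are the LLVM API.
#include <cassert>
#include <map>
#include <set>
#include <vector>

enum class Env { OpenCL, Vulkan };
enum class Ext { ExtA, ExtB, ExtC };

// Placeholder table: each extension lists the environments that allow it.
static const std::map<Ext, std::vector<Env>> AllowedEnvs = {
    {Ext::ExtA, {Env::OpenCL}},
    {Ext::ExtB, {Env::Vulkan}},
    {Ext::ExtC, {Env::OpenCL, Env::Vulkan}},
};

// Collect every extension known to be allowed in the current environment.
std::set<Ext> validExtensions(Env Current) {
  std::set<Ext> R;
  for (const auto &[E, Envs] : AllowedEnvs)
    for (Env A : Envs)
      if (A == Current)
        R.insert(E);
  return R;
}

// Keep only the requested extensions that are valid for the environment.
std::set<Ext> filterExtensions(const std::set<Ext> &Requested, Env Current) {
  const std::set<Ext> Valid = validExtensions(Current);
  std::set<Ext> Available;
  for (Ext E : Requested)
    if (Valid.count(E))
      Available.insert(E);
  return Available;
}
```

With this placeholder table, requesting all three extensions for `Env::Vulkan` keeps `ExtB` and `ExtC` and drops `ExtA`. An extension with no table entry is dropped as well, which matches the "not known to be allowed" wording above.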
Alternatives considered: I considered modifying
SPIRVExtensionsParser::parse to use a different list of extensions for
"all" depending on the target triple. However, that does not work because
the target triple is not available, and cannot be made available in a
reasonable way.
Fixes #147717
---------
Co-authored-by: Victor Lomuller <victor at codeplay.com>
---
.../SPIRV/MCTargetDesc/SPIRVBaseInfo.cpp | 26 +
.../Target/SPIRV/MCTargetDesc/SPIRVBaseInfo.h | 8 +
llvm/lib/Target/SPIRV/SPIRVCommandLine.cpp | 23 +-
llvm/lib/Target/SPIRV/SPIRVCommandLine.h | 6 +
llvm/lib/Target/SPIRV/SPIRVSubtarget.cpp | 8 +-
.../lib/Target/SPIRV/SPIRVSymbolicOperands.td | 456 +++++++++++-------
.../enable-all-extensions-avoid-invalid.ll | 16 +
7 files changed, 368 insertions(+), 175 deletions(-)
create mode 100644 llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions-avoid-invalid.ll
diff --git a/llvm/lib/Target/SPIRV/MCTargetDesc/SPIRVBaseInfo.cpp b/llvm/lib/Target/SPIRV/MCTargetDesc/SPIRVBaseInfo.cpp
index 0ed97f5b41c51..d6b6079810471 100644
--- a/llvm/lib/Target/SPIRV/MCTargetDesc/SPIRVBaseInfo.cpp
+++ b/llvm/lib/Target/SPIRV/MCTargetDesc/SPIRVBaseInfo.cpp
@@ -38,8 +38,15 @@ struct CapabilityEntry {
Capability::Capability ReqCapability;
};
+struct EnvironmentEntry {
+ OperandCategory::OperandCategory Category;
+ uint32_t Value;
+ Environment::Environment AllowedEnvironment;
+};
+
using namespace OperandCategory;
using namespace Extension;
+using namespace Environment;
using namespace Capability;
using namespace InstructionSet;
#define GET_SymbolicOperands_DECL
@@ -48,6 +55,8 @@ using namespace InstructionSet;
#define GET_ExtensionEntries_IMPL
#define GET_CapabilityEntries_DECL
#define GET_CapabilityEntries_IMPL
+#define GET_EnvironmentEntries_DECL
+#define GET_EnvironmentEntries_IMPL
#define GET_ExtendedBuiltins_DECL
#define GET_ExtendedBuiltins_IMPL
#include "SPIRVGenTables.inc"
@@ -133,6 +142,23 @@ getSymbolicOperandCapabilities(SPIRV::OperandCategory::OperandCategory Category,
return Capabilities;
}
+EnvironmentList getSymbolicOperandAllowedEnvironments(
+ SPIRV::OperandCategory::OperandCategory Category, uint32_t Value) {
+ EnvironmentList Environments;
+ const SPIRV::EnvironmentEntry *Environment =
+ SPIRV::lookupEnvironmentByCategoryAndValue(Category, Value);
+ auto TableEnd = ArrayRef(SPIRV::EnvironmentEntries).end();
+ while (Environment && Environment->Category == Category &&
+ Environment->Value == Value) {
+ Environments.push_back(static_cast<SPIRV::Environment::Environment>(
+ Environment->AllowedEnvironment));
+ if (++Environment == TableEnd)
+ break;
+ }
+
+ return Environments;
+}
+
CapabilityList
getCapabilitiesEnabledByExtension(SPIRV::Extension::Extension Extension) {
const SPIRV::ExtensionEntry *Entry =
diff --git a/llvm/lib/Target/SPIRV/MCTargetDesc/SPIRVBaseInfo.h b/llvm/lib/Target/SPIRV/MCTargetDesc/SPIRVBaseInfo.h
index b8c467fef8e8e..c2c08f8831307 100644
--- a/llvm/lib/Target/SPIRV/MCTargetDesc/SPIRVBaseInfo.h
+++ b/llvm/lib/Target/SPIRV/MCTargetDesc/SPIRVBaseInfo.h
@@ -37,6 +37,11 @@ namespace Capability {
#include "SPIRVGenTables.inc"
} // namespace Capability
+namespace Environment {
+#define GET_Environment_DECL
+#include "SPIRVGenTables.inc"
+} // namespace Environment
+
namespace SourceLanguage {
#define GET_SourceLanguage_DECL
#include "SPIRVGenTables.inc"
@@ -241,6 +246,7 @@ enum InstFlags {
using CapabilityList = SmallVector<SPIRV::Capability::Capability, 8>;
using ExtensionList = SmallVector<SPIRV::Extension::Extension, 8>;
+using EnvironmentList = SmallVector<SPIRV::Environment::Environment, 8>;
std::string
getSymbolicOperandMnemonic(SPIRV::OperandCategory::OperandCategory Category,
@@ -254,6 +260,8 @@ getSymbolicOperandMaxVersion(SPIRV::OperandCategory::OperandCategory Category,
CapabilityList
getSymbolicOperandCapabilities(SPIRV::OperandCategory::OperandCategory Category,
uint32_t Value);
+EnvironmentList getSymbolicOperandAllowedEnvironments(
+ SPIRV::OperandCategory::OperandCategory Category, uint32_t Value);
CapabilityList
getCapabilitiesEnabledByExtension(SPIRV::Extension::Extension Extension);
ExtensionList
diff --git a/llvm/lib/Target/SPIRV/SPIRVCommandLine.cpp b/llvm/lib/Target/SPIRV/SPIRVCommandLine.cpp
index d9265f498973e..5a5860ac1c24f 100644
--- a/llvm/lib/Target/SPIRV/SPIRVCommandLine.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVCommandLine.cpp
@@ -12,7 +12,8 @@
//===----------------------------------------------------------------------===//
#include "SPIRVCommandLine.h"
-#include "llvm/ADT/StringRef.h"
+#include "MCTargetDesc/SPIRVBaseInfo.h"
+#include "llvm/TargetParser/Triple.h"
#include <algorithm>
#include <map>
@@ -171,3 +172,23 @@ StringRef SPIRVExtensionsParser::checkExtensions(
}
return StringRef();
}
+
+std::set<SPIRV::Extension::Extension>
+SPIRVExtensionsParser::getValidExtensions(const Triple &TT) {
+ std::set<SPIRV::Extension::Extension> R;
+ SPIRV::Environment::Environment CurrentEnvironment =
+ SPIRV::Environment::Environment::EnvOpenCL;
+ if (TT.getOS() == Triple::Vulkan)
+ CurrentEnvironment = SPIRV::Environment::Environment::EnvVulkan;
+
+ for (const auto &[ExtensionName, ExtensionEnum] : SPIRVExtensionMap) {
+ EnvironmentList AllowedEnv = getSymbolicOperandAllowedEnvironments(
+ SPIRV::OperandCategory::OperandCategory::ExtensionOperand,
+ ExtensionEnum);
+
+ if (std::count(AllowedEnv.begin(), AllowedEnv.end(), CurrentEnvironment))
+ R.insert(ExtensionEnum);
+ }
+
+ return R;
+}
diff --git a/llvm/lib/Target/SPIRV/SPIRVCommandLine.h b/llvm/lib/Target/SPIRV/SPIRVCommandLine.h
index 3e3b22bde8603..02e847b322a77 100644
--- a/llvm/lib/Target/SPIRV/SPIRVCommandLine.h
+++ b/llvm/lib/Target/SPIRV/SPIRVCommandLine.h
@@ -21,6 +21,7 @@
namespace llvm {
class StringRef;
+class Triple;
/// Command line parser for toggling SPIR-V extensions.
struct SPIRVExtensionsParser
@@ -42,6 +43,11 @@ struct SPIRVExtensionsParser
static StringRef
checkExtensions(const std::vector<std::string> &ExtNames,
std::set<SPIRV::Extension::Extension> &AllowedExtensions);
+
+ /// Returns the list of extensions that are valid for a particular
+ /// target environment (i.e., OpenCL or Vulkan).
+ static std::set<SPIRV::Extension::Extension>
+ getValidExtensions(const Triple &TT);
};
} // namespace llvm
diff --git a/llvm/lib/Target/SPIRV/SPIRVSubtarget.cpp b/llvm/lib/Target/SPIRV/SPIRVSubtarget.cpp
index cdf3c6224d4c8..690493fb426bc 100644
--- a/llvm/lib/Target/SPIRV/SPIRVSubtarget.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVSubtarget.cpp
@@ -166,7 +166,13 @@ void SPIRVSubtarget::initAvailableExtInstSets() {
void SPIRVSubtarget::initAvailableExtensions(
const std::set<SPIRV::Extension::Extension> &AllowedExtIds) {
AvailableExtensions.clear();
- AvailableExtensions.insert_range(AllowedExtIds);
+ const std::set<SPIRV::Extension::Extension> &ValidExtensions =
+ SPIRVExtensionsParser::getValidExtensions(TargetTriple);
+
+ for (const auto &Ext : AllowedExtIds) {
+ if (ValidExtensions.count(Ext))
+ AvailableExtensions.insert(Ext);
+ }
accountForAMDShaderTrinaryMinmax();
}
diff --git a/llvm/lib/Target/SPIRV/SPIRVSymbolicOperands.td b/llvm/lib/Target/SPIRV/SPIRVSymbolicOperands.td
index 614e83ae9b286..d2824ee2d2caf 100644
--- a/llvm/lib/Target/SPIRV/SPIRVSymbolicOperands.td
+++ b/llvm/lib/Target/SPIRV/SPIRVSymbolicOperands.td
@@ -109,23 +109,59 @@ def CapabilityEntries : GenericTable {
let PrimaryKeyName = "lookupCapabilityByCategoryAndValue";
}
+//===----------------------------------------------------------------------===//
+// Lookup table for matching symbolic operands (category + 32-bit value) to
+// SPIR-V environments. If an operand is allowed in more than one environment,
+// there will be multiple consecutive entries present in the table.
+//===----------------------------------------------------------------------===//
+
+// Forward-declare classes used in EnvironmentEntry
+class Environment;
+
+class EnvironmentEntry<OperandCategory category, bits<32> value,
+ Environment allowedEnvironment> {
+ OperandCategory Category = category;
+ bits<32> Value = value;
+ Environment AllowedEnvironment = allowedEnvironment;
+}
+
+def EnvironmentEntries : GenericTable {
+ let FilterClass = "EnvironmentEntry";
+ let Fields = ["Category", "Value", "AllowedEnvironment"];
+ string TypeOf_Category = "OperandCategory";
+ string TypeOf_AllowedEnvironment = "Environment";
+ let PrimaryKey = ["Category", "Value"];
+ // Function for looking up a (the first) environment by category + value. Next
+ // environment should be consecutive.
+ let PrimaryKeyName = "lookupEnvironmentByCategoryAndValue";
+}
+
//===----------------------------------------------------------------------===//
// Multiclass used to define a SymbolicOperand and at the same time declare
// required extension and capabilities.
//===----------------------------------------------------------------------===//
-multiclass SymbolicOperandWithRequirements<OperandCategory category, bits<32> value, string mnemonic, bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
- assert !ge(!size(mnemonic), 1), "No mnemonic/string representation provided for symbolic operand with value " # value;
- def : SymbolicOperand<category, value, mnemonic, minVersion, maxVersion>;
+multiclass SymbolicOperandWithRequirements<
+ OperandCategory category, bits<32> value, string mnemonic,
+ bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions,
+ list<Capability> reqCapabilities, list<Environment> allowedEnvironments> {
+ assert !ge(!size(mnemonic), 1), "No mnemonic/string representation provided "
+ "for symbolic operand with value "#value;
+ def : SymbolicOperand<category, value, mnemonic, minVersion, maxVersion>;
+
+ assert !le(!size(reqExtensions), 1),
+ "Too many required extensions for a symbolic/named operand: "#mnemonic;
+ if !eq(!size(reqExtensions), 1) then {
+ def : ExtensionEntry<category, value, reqExtensions[0]>;
+ }
- assert !le(!size(reqExtensions), 1), "Too many required extensions for a symbolic/named operand: " # mnemonic;
- if !eq(!size(reqExtensions), 1) then {
- def : ExtensionEntry<category, value, reqExtensions[0]>;
- }
+ foreach capability = reqCapabilities in {
+ def : CapabilityEntry<category, value, capability>;
+ }
- foreach capability = reqCapabilities in {
- def : CapabilityEntry<category, value, capability>;
- }
+ foreach environment = allowedEnvironments in {
+ def : EnvironmentEntry<category, value, environment>;
+ }
}
//===----------------------------------------------------------------------===//
@@ -175,6 +211,20 @@ def CooperativeMatrixOperandsOperand : OperandCategory;
def SpecConstantOpOperandsOperand : OperandCategory;
def MatrixMultiplyAccumulateOperandsOperand : OperandCategory;
+//===----------------------------------------------------------------------===//
+// Definition of the Environments
+//===----------------------------------------------------------------------===//
+
+def Environment : GenericEnum, Operand<i32> {
+ let FilterClass = "Environment";
+ let ValueField = "Value";
+}
+
+class Environment<bits<32> value> { bits<32> Value = value; }
+
+def EnvOpenCL : Environment<0>;
+def EnvVulkan : Environment<1>;
+
//===----------------------------------------------------------------------===//
// Multiclass used to define Extesions enum values and at the same time
// SymbolicOperand entries.
@@ -192,135 +242,146 @@ class Extension<string name, bits<32> value> {
bits<32> Value = value;
}
-multiclass ExtensionOperand<bits<32> value> {
+multiclass ExtensionOperand<bits<32> value,
+ list<Environment> allowedEnvironments> {
def NAME : Extension<NAME, value>;
- defm : SymbolicOperandWithRequirements<ExtensionOperand, value, NAME, 0, 0, [], []>;
-}
-
-defm SPV_AMD_shader_explicit_vertex_parameter : ExtensionOperand<1>;
-defm SPV_AMD_shader_trinary_minmax_extension : ExtensionOperand<2>;
-defm SPV_AMD_gcn_shader : ExtensionOperand<3>;
-defm SPV_KHR_shader_ballot : ExtensionOperand<4>;
-defm SPV_AMD_shader_ballot : ExtensionOperand<5>;
-defm SPV_AMD_gpu_shader_half_float : ExtensionOperand<6>;
-defm SPV_KHR_shader_draw_parameters : ExtensionOperand<7>;
-defm SPV_KHR_subgroup_vote : ExtensionOperand<8>;
-defm SPV_KHR_16bit_storage : ExtensionOperand<9>;
-defm SPV_KHR_device_group : ExtensionOperand<10>;
-defm SPV_KHR_multiview : ExtensionOperand<11>;
-defm SPV_NVX_multiview_per_view_attributes : ExtensionOperand<12>;
-defm SPV_NV_viewport_array2 : ExtensionOperand<13>;
-defm SPV_NV_stereo_view_rendering : ExtensionOperand<14>;
-defm SPV_NV_sample_mask_override_coverage : ExtensionOperand<15>;
-defm SPV_NV_geometry_shader_passthrough : ExtensionOperand<16>;
-defm SPV_AMD_texture_gather_bias_lod : ExtensionOperand<17>;
-defm SPV_KHR_storage_buffer_storage_class : ExtensionOperand<18>;
-defm SPV_KHR_variable_pointers : ExtensionOperand<19>;
-defm SPV_AMD_gpu_shader_int16 : ExtensionOperand<20>;
-defm SPV_KHR_post_depth_coverage : ExtensionOperand<21>;
-defm SPV_KHR_shader_atomic_counter_ops : ExtensionOperand<22>;
-defm SPV_EXT_shader_stencil_export : ExtensionOperand<23>;
-defm SPV_EXT_shader_viewport_index_layer : ExtensionOperand<24>;
-defm SPV_AMD_shader_image_load_store_lod : ExtensionOperand<25>;
-defm SPV_AMD_shader_fragment_mask : ExtensionOperand<26>;
-defm SPV_EXT_fragment_fully_covered : ExtensionOperand<27>;
-defm SPV_AMD_gpu_shader_half_float_fetch : ExtensionOperand<28>;
-defm SPV_GOOGLE_decorate_string : ExtensionOperand<29>;
-defm SPV_GOOGLE_hlsl_functionality1 : ExtensionOperand<30>;
-defm SPV_NV_shader_subgroup_partitioned : ExtensionOperand<31>;
-defm SPV_EXT_descriptor_indexing : ExtensionOperand<32>;
-defm SPV_KHR_8bit_storage : ExtensionOperand<33>;
-defm SPV_KHR_vulkan_memory_model : ExtensionOperand<34>;
-defm SPV_NV_ray_tracing : ExtensionOperand<35>;
-defm SPV_NV_compute_shader_derivatives : ExtensionOperand<36>;
-defm SPV_NV_fragment_shader_barycentric : ExtensionOperand<37>;
-defm SPV_NV_mesh_shader : ExtensionOperand<38>;
-defm SPV_NV_shader_image_footprint : ExtensionOperand<39>;
-defm SPV_NV_shading_rate : ExtensionOperand<40>;
-defm SPV_INTEL_subgroups : ExtensionOperand<41>;
-defm SPV_INTEL_media_block_io : ExtensionOperand<42>;
-defm SPV_EXT_fragment_invocation_density : ExtensionOperand<44>;
-defm SPV_KHR_no_integer_wrap_decoration : ExtensionOperand<45>;
-defm SPV_KHR_float_controls : ExtensionOperand<46>;
-defm SPV_EXT_physical_storage_buffer : ExtensionOperand<47>;
-defm SPV_INTEL_fpga_memory_attributes : ExtensionOperand<48>;
-defm SPV_NV_cooperative_matrix : ExtensionOperand<49>;
-defm SPV_INTEL_shader_integer_functions2 : ExtensionOperand<50>;
-defm SPV_INTEL_fpga_loop_controls : ExtensionOperand<51>;
-defm SPV_EXT_fragment_shader_interlock : ExtensionOperand<52>;
-defm SPV_NV_shader_sm_builtins : ExtensionOperand<53>;
-defm SPV_KHR_shader_clock : ExtensionOperand<54>;
-defm SPV_INTEL_unstructured_loop_controls : ExtensionOperand<55>;
-defm SPV_EXT_demote_to_helper_invocation : ExtensionOperand<56>;
-defm SPV_INTEL_fpga_reg : ExtensionOperand<57>;
-defm SPV_INTEL_blocking_pipes : ExtensionOperand<58>;
-defm SPV_GOOGLE_user_type : ExtensionOperand<59>;
-defm SPV_KHR_physical_storage_buffer : ExtensionOperand<60>;
-defm SPV_INTEL_kernel_attributes : ExtensionOperand<61>;
-defm SPV_KHR_non_semantic_info : ExtensionOperand<62>;
-defm SPV_INTEL_io_pipes : ExtensionOperand<63>;
-defm SPV_KHR_ray_tracing : ExtensionOperand<64>;
-defm SPV_KHR_ray_query : ExtensionOperand<65>;
-defm SPV_INTEL_fpga_memory_accesses : ExtensionOperand<66>;
-defm SPV_INTEL_arbitrary_precision_integers : ExtensionOperand<67>;
-defm SPV_EXT_shader_atomic_float_add : ExtensionOperand<68>;
-defm SPV_KHR_terminate_invocation : ExtensionOperand<69>;
-defm SPV_KHR_fragment_shading_rate : ExtensionOperand<70>;
-defm SPV_EXT_shader_image_int64 : ExtensionOperand<71>;
-defm SPV_INTEL_fp_fast_math_mode : ExtensionOperand<72>;
-defm SPV_INTEL_fpga_cluster_attributes : ExtensionOperand<73>;
-defm SPV_INTEL_loop_fuse : ExtensionOperand<74>;
-defm SPV_EXT_shader_atomic_float_min_max : ExtensionOperand<75>;
-defm SPV_KHR_workgroup_memory_explicit_layout : ExtensionOperand<76>;
-defm SPV_KHR_linkonce_odr : ExtensionOperand<77>;
-defm SPV_KHR_expect_assume : ExtensionOperand<78>;
-defm SPV_INTEL_fpga_dsp_control : ExtensionOperand<79>;
-defm SPV_NV_bindless_texture : ExtensionOperand<80>;
-defm SPV_INTEL_fpga_invocation_pipelining_attributes : ExtensionOperand<81>;
-defm SPV_KHR_subgroup_uniform_control_flow : ExtensionOperand<82>;
-defm SPV_HUAWEI_subpass_shading : ExtensionOperand<83>;
-defm SPV_KHR_integer_dot_product : ExtensionOperand<84>;
-defm SPV_EXT_shader_atomic_float16_add : ExtensionOperand<85>;
-defm SPV_INTEL_runtime_aligned : ExtensionOperand<86>;
-defm SPV_KHR_bit_instructions : ExtensionOperand<87>;
-defm SPV_NV_ray_tracing_motion_blur : ExtensionOperand<88>;
-defm SPV_KHR_uniform_group_instructions : ExtensionOperand<89>;
-defm SPV_KHR_subgroup_rotate : ExtensionOperand<90>;
-defm SPV_INTEL_split_barrier : ExtensionOperand<91>;
-defm SPV_KHR_ray_cull_mask : ExtensionOperand<92>;
-defm SPV_KHR_fragment_shader_barycentric : ExtensionOperand<93>;
-defm SPV_EXT_relaxed_printf_string_address_space : ExtensionOperand<94>;
-defm SPV_EXT_ycbcr_attachments : ExtensionOperand<95>;
-defm SPV_EXT_mesh_shader : ExtensionOperand<96>;
-defm SPV_ARM_core_builtins : ExtensionOperand<97>;
-defm SPV_EXT_opacity_micromap : ExtensionOperand<98>;
-defm SPV_NV_shader_invocation_reorder : ExtensionOperand<99>;
-defm SPV_INTEL_usm_storage_classes : ExtensionOperand<100>;
-defm SPV_INTEL_fpga_latency_control : ExtensionOperand<101>;
-defm SPV_INTEL_fpga_argument_interfaces : ExtensionOperand<102>;
-defm SPV_INTEL_optnone : ExtensionOperand<103>;
-defm SPV_INTEL_function_pointers : ExtensionOperand<104>;
-defm SPV_INTEL_variable_length_array : ExtensionOperand<105>;
-defm SPV_INTEL_bfloat16_conversion : ExtensionOperand<106>;
-defm SPV_INTEL_inline_assembly : ExtensionOperand<107>;
-defm SPV_INTEL_cache_controls : ExtensionOperand<108>;
-defm SPV_INTEL_global_variable_host_access : ExtensionOperand<109>;
-defm SPV_INTEL_global_variable_fpga_decorations : ExtensionOperand<110>;
-defm SPV_KHR_cooperative_matrix : ExtensionOperand<111>;
-defm SPV_EXT_arithmetic_fence : ExtensionOperand<112>;
-defm SPV_EXT_optnone : ExtensionOperand<113>;
-defm SPV_INTEL_joint_matrix : ExtensionOperand<114>;
-defm SPV_INTEL_float_controls2 : ExtensionOperand<115>;
-defm SPV_INTEL_bindless_images : ExtensionOperand<116>;
-defm SPV_INTEL_long_composites : ExtensionOperand<117>;
-defm SPV_INTEL_memory_access_aliasing : ExtensionOperand<118>;
-defm SPV_INTEL_fp_max_error : ExtensionOperand<119>;
-defm SPV_INTEL_ternary_bitwise_function : ExtensionOperand<120>;
-defm SPV_INTEL_subgroup_matrix_multiply_accumulate : ExtensionOperand<121>;
-defm SPV_INTEL_2d_block_io : ExtensionOperand<122>;
-defm SPV_INTEL_int4 : ExtensionOperand<123>;
-defm SPV_KHR_float_controls2 : ExtensionOperand<124>;
-defm SPV_INTEL_tensor_float32_conversion : ExtensionOperand<125>;
+ defm : SymbolicOperandWithRequirements<ExtensionOperand, value, NAME, 0,
+ 0, [], [], allowedEnvironments>;
+}
+
+defm SPV_AMD_shader_explicit_vertex_parameter
+ : ExtensionOperand<1, [EnvVulkan]>;
+defm SPV_AMD_shader_trinary_minmax_extension : ExtensionOperand<2, [EnvVulkan]>;
+defm SPV_AMD_gcn_shader : ExtensionOperand<3, [EnvVulkan]>;
+defm SPV_KHR_shader_ballot : ExtensionOperand<4, [EnvVulkan]>;
+defm SPV_AMD_shader_ballot : ExtensionOperand<5, [EnvVulkan]>;
+defm SPV_AMD_gpu_shader_half_float : ExtensionOperand<6, [EnvVulkan]>;
+defm SPV_KHR_shader_draw_parameters : ExtensionOperand<7, [EnvVulkan]>;
+defm SPV_KHR_subgroup_vote : ExtensionOperand<8, [EnvVulkan]>;
+defm SPV_KHR_16bit_storage : ExtensionOperand<9, [EnvVulkan]>;
+defm SPV_KHR_device_group : ExtensionOperand<10, [EnvVulkan]>;
+defm SPV_KHR_multiview : ExtensionOperand<11, [EnvVulkan]>;
+defm SPV_NVX_multiview_per_view_attributes : ExtensionOperand<12, [EnvVulkan]>;
+defm SPV_NV_viewport_array2 : ExtensionOperand<13, [EnvVulkan]>;
+defm SPV_NV_stereo_view_rendering : ExtensionOperand<14, [EnvVulkan]>;
+defm SPV_NV_sample_mask_override_coverage : ExtensionOperand<15, [EnvVulkan]>;
+defm SPV_NV_geometry_shader_passthrough : ExtensionOperand<16, [EnvVulkan]>;
+defm SPV_AMD_texture_gather_bias_lod : ExtensionOperand<17, [EnvVulkan]>;
+defm SPV_KHR_storage_buffer_storage_class : ExtensionOperand<18, [EnvVulkan]>;
+defm SPV_KHR_variable_pointers : ExtensionOperand<19, [EnvVulkan]>;
+defm SPV_AMD_gpu_shader_int16 : ExtensionOperand<20, [EnvVulkan]>;
+defm SPV_KHR_post_depth_coverage : ExtensionOperand<21, [EnvVulkan]>;
+defm SPV_KHR_shader_atomic_counter_ops : ExtensionOperand<22, []>;
+defm SPV_EXT_shader_stencil_export : ExtensionOperand<23, [EnvVulkan]>;
+defm SPV_EXT_shader_viewport_index_layer : ExtensionOperand<24, [EnvVulkan]>;
+defm SPV_AMD_shader_image_load_store_lod : ExtensionOperand<25, [EnvVulkan]>;
+defm SPV_AMD_shader_fragment_mask : ExtensionOperand<26, [EnvVulkan]>;
+defm SPV_EXT_fragment_fully_covered : ExtensionOperand<27, [EnvVulkan]>;
+defm SPV_AMD_gpu_shader_half_float_fetch : ExtensionOperand<28, [EnvVulkan]>;
+defm SPV_GOOGLE_decorate_string : ExtensionOperand<29, [EnvVulkan]>;
+defm SPV_GOOGLE_hlsl_functionality1 : ExtensionOperand<30, [EnvVulkan]>;
+defm SPV_NV_shader_subgroup_partitioned : ExtensionOperand<31, [EnvVulkan]>;
+defm SPV_EXT_descriptor_indexing : ExtensionOperand<32, [EnvVulkan]>;
+defm SPV_KHR_8bit_storage : ExtensionOperand<33, [EnvVulkan]>;
+defm SPV_KHR_vulkan_memory_model : ExtensionOperand<34, [EnvVulkan]>;
+defm SPV_NV_ray_tracing : ExtensionOperand<35, [EnvVulkan]>;
+defm SPV_NV_compute_shader_derivatives : ExtensionOperand<36, [EnvVulkan]>;
+defm SPV_NV_fragment_shader_barycentric : ExtensionOperand<37, [EnvVulkan]>;
+defm SPV_NV_mesh_shader : ExtensionOperand<38, [EnvVulkan]>;
+defm SPV_NV_shader_image_footprint : ExtensionOperand<39, [EnvVulkan]>;
+defm SPV_NV_shading_rate : ExtensionOperand<40, [EnvVulkan]>;
+defm SPV_INTEL_subgroups : ExtensionOperand<41, [EnvOpenCL]>;
+defm SPV_INTEL_media_block_io : ExtensionOperand<42, [EnvOpenCL]>;
+defm SPV_EXT_fragment_invocation_density : ExtensionOperand<44, [EnvVulkan]>;
+defm SPV_KHR_no_integer_wrap_decoration : ExtensionOperand<45, [EnvOpenCL]>;
+defm SPV_KHR_float_controls : ExtensionOperand<46, [EnvVulkan, EnvOpenCL]>;
+defm SPV_EXT_physical_storage_buffer : ExtensionOperand<47, [EnvVulkan]>;
+defm SPV_INTEL_fpga_memory_attributes : ExtensionOperand<48, [EnvOpenCL]>;
+defm SPV_NV_cooperative_matrix : ExtensionOperand<49, [EnvVulkan]>;
+defm SPV_INTEL_shader_integer_functions2
+ : ExtensionOperand<50, [EnvVulkan, EnvOpenCL]>;
+defm SPV_INTEL_fpga_loop_controls : ExtensionOperand<51, [EnvOpenCL]>;
+defm SPV_EXT_fragment_shader_interlock : ExtensionOperand<52, [EnvVulkan]>;
+defm SPV_NV_shader_sm_builtins : ExtensionOperand<53, [EnvVulkan]>;
+defm SPV_KHR_shader_clock : ExtensionOperand<54, [EnvVulkan, EnvOpenCL]>;
+defm SPV_INTEL_unstructured_loop_controls : ExtensionOperand<55, [EnvOpenCL]>;
+defm SPV_EXT_demote_to_helper_invocation : ExtensionOperand<56, [EnvVulkan]>;
+defm SPV_INTEL_fpga_reg : ExtensionOperand<57, [EnvOpenCL]>;
+defm SPV_INTEL_blocking_pipes : ExtensionOperand<58, [EnvOpenCL]>;
+defm SPV_GOOGLE_user_type : ExtensionOperand<59, [EnvVulkan]>;
+defm SPV_KHR_physical_storage_buffer : ExtensionOperand<60, [EnvVulkan]>;
+defm SPV_INTEL_kernel_attributes : ExtensionOperand<61, [EnvOpenCL]>;
+defm SPV_KHR_non_semantic_info : ExtensionOperand<62, [EnvVulkan, EnvOpenCL]>;
+defm SPV_INTEL_io_pipes : ExtensionOperand<63, [EnvOpenCL]>;
+defm SPV_KHR_ray_tracing : ExtensionOperand<64, [EnvVulkan]>;
+defm SPV_KHR_ray_query : ExtensionOperand<65, [EnvVulkan]>;
+defm SPV_INTEL_fpga_memory_accesses : ExtensionOperand<66, [EnvOpenCL]>;
+defm SPV_INTEL_arbitrary_precision_integers : ExtensionOperand<67, [EnvOpenCL]>;
+defm SPV_EXT_shader_atomic_float_add
+ : ExtensionOperand<68, [EnvVulkan, EnvOpenCL]>;
+defm SPV_KHR_terminate_invocation : ExtensionOperand<69, [EnvVulkan]>;
+defm SPV_KHR_fragment_shading_rate : ExtensionOperand<70, [EnvVulkan]>;
+defm SPV_EXT_shader_image_int64 : ExtensionOperand<71, [EnvVulkan]>;
+defm SPV_INTEL_fp_fast_math_mode : ExtensionOperand<72, [EnvOpenCL]>;
+defm SPV_INTEL_fpga_cluster_attributes : ExtensionOperand<73, [EnvOpenCL]>;
+defm SPV_INTEL_loop_fuse : ExtensionOperand<74, [EnvOpenCL]>;
+defm SPV_EXT_shader_atomic_float_min_max
+ : ExtensionOperand<75, [EnvVulkan, EnvOpenCL]>;
+defm SPV_KHR_workgroup_memory_explicit_layout
+ : ExtensionOperand<76, [EnvVulkan]>;
+defm SPV_KHR_linkonce_odr : ExtensionOperand<77, [EnvOpenCL]>;
+defm SPV_KHR_expect_assume : ExtensionOperand<78, [EnvVulkan, EnvOpenCL]>;
+defm SPV_INTEL_fpga_dsp_control : ExtensionOperand<79, [EnvOpenCL]>;
+defm SPV_NV_bindless_texture : ExtensionOperand<80, [EnvVulkan]>;
+defm SPV_INTEL_fpga_invocation_pipelining_attributes
+ : ExtensionOperand<81, [EnvOpenCL]>;
+defm SPV_KHR_subgroup_uniform_control_flow : ExtensionOperand<82, [EnvVulkan]>;
+defm SPV_HUAWEI_subpass_shading : ExtensionOperand<83, [EnvVulkan]>;
+defm SPV_KHR_integer_dot_product : ExtensionOperand<84, [EnvVulkan, EnvOpenCL]>;
+defm SPV_EXT_shader_atomic_float16_add
+ : ExtensionOperand<85, [EnvVulkan, EnvOpenCL]>;
+defm SPV_INTEL_runtime_aligned : ExtensionOperand<86, [EnvOpenCL]>;
+defm SPV_KHR_bit_instructions : ExtensionOperand<87, [EnvOpenCL]>;
+defm SPV_NV_ray_tracing_motion_blur : ExtensionOperand<88, [EnvVulkan]>;
+defm SPV_KHR_uniform_group_instructions : ExtensionOperand<89, [EnvOpenCL]>;
+defm SPV_KHR_subgroup_rotate : ExtensionOperand<90, [EnvVulkan, EnvOpenCL]>;
+defm SPV_INTEL_split_barrier : ExtensionOperand<91, [EnvOpenCL]>;
+defm SPV_KHR_ray_cull_mask : ExtensionOperand<92, [EnvVulkan]>;
+defm SPV_KHR_fragment_shader_barycentric : ExtensionOperand<93, [EnvVulkan]>;
+defm SPV_EXT_relaxed_printf_string_address_space
+ : ExtensionOperand<94, [EnvOpenCL]>;
+defm SPV_EXT_mesh_shader : ExtensionOperand<96, [EnvVulkan]>;
+defm SPV_ARM_core_builtins : ExtensionOperand<97, [EnvVulkan]>;
+defm SPV_EXT_opacity_micromap : ExtensionOperand<98, [EnvVulkan]>;
+defm SPV_NV_shader_invocation_reorder : ExtensionOperand<99, [EnvVulkan]>;
+defm SPV_INTEL_usm_storage_classes : ExtensionOperand<100, [EnvOpenCL]>;
+defm SPV_INTEL_fpga_latency_control : ExtensionOperand<101, [EnvOpenCL]>;
+defm SPV_INTEL_fpga_argument_interfaces : ExtensionOperand<102, [EnvOpenCL]>;
+defm SPV_INTEL_optnone : ExtensionOperand<103, [EnvOpenCL]>;
+defm SPV_INTEL_function_pointers : ExtensionOperand<104, [EnvOpenCL]>;
+defm SPV_INTEL_variable_length_array : ExtensionOperand<105, [EnvOpenCL]>;
+defm SPV_INTEL_bfloat16_conversion : ExtensionOperand<106, [EnvOpenCL]>;
+defm SPV_INTEL_inline_assembly : ExtensionOperand<107, [EnvOpenCL]>;
+defm SPV_INTEL_cache_controls : ExtensionOperand<108, [EnvOpenCL]>;
+defm SPV_INTEL_global_variable_host_access : ExtensionOperand<109, [EnvOpenCL]>;
+defm SPV_INTEL_global_variable_fpga_decorations
+ : ExtensionOperand<110, [EnvOpenCL]>;
+defm SPV_KHR_cooperative_matrix : ExtensionOperand<111, [EnvVulkan, EnvOpenCL]>;
+defm SPV_EXT_arithmetic_fence : ExtensionOperand<112, [EnvOpenCL]>;
+defm SPV_EXT_optnone : ExtensionOperand<113, [EnvOpenCL]>;
+defm SPV_INTEL_joint_matrix : ExtensionOperand<114, [EnvOpenCL]>;
+defm SPV_INTEL_float_controls2 : ExtensionOperand<115, [EnvOpenCL]>;
+defm SPV_INTEL_bindless_images : ExtensionOperand<116, [EnvOpenCL]>;
+defm SPV_INTEL_long_composites : ExtensionOperand<117, [EnvOpenCL]>;
+defm SPV_INTEL_memory_access_aliasing : ExtensionOperand<118, [EnvOpenCL]>;
+defm SPV_INTEL_fp_max_error : ExtensionOperand<119, [EnvOpenCL]>;
+defm SPV_INTEL_ternary_bitwise_function : ExtensionOperand<120, [EnvOpenCL]>;
+defm SPV_INTEL_subgroup_matrix_multiply_accumulate
+ : ExtensionOperand<121, [EnvOpenCL]>;
+defm SPV_INTEL_2d_block_io : ExtensionOperand<122, [EnvOpenCL]>;
+defm SPV_INTEL_int4 : ExtensionOperand<123, [EnvOpenCL]>;
+defm SPV_KHR_float_controls2 : ExtensionOperand<124, [EnvVulkan, EnvOpenCL]>;
+defm SPV_INTEL_tensor_float32_conversion : ExtensionOperand<125, [EnvOpenCL]>;
//===----------------------------------------------------------------------===//
// Multiclass used to define Capabilities enum values and at the same time
@@ -342,7 +403,9 @@ class Capability<string name, bits<32> value> {
multiclass CapabilityOperand<bits<32> value, bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def NAME : Capability<NAME, value>;
- defm : SymbolicOperandWithRequirements<CapabilityOperand, value, NAME, minVersion, maxVersion, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<CapabilityOperand, value, NAME,
+ minVersion, maxVersion, reqExtensions,
+ reqCapabilities, []>;
}
defm Matrix : CapabilityOperand<0, 0, 0, [], []>;
@@ -551,7 +614,8 @@ class SourceLanguage<string name, bits<32> value> {
multiclass SourceLanguageOperand<bits<32> value> {
def : SourceLanguage<NAME, value>;
- defm : SymbolicOperandWithRequirements<SourceLanguageOperand, value, NAME, 0, 0, [], []>;
+ defm : SymbolicOperandWithRequirements<SourceLanguageOperand, value, NAME, 0,
+ 0, [], [], []>;
}
defm Unknown : SourceLanguageOperand<0>;
@@ -580,7 +644,8 @@ class AddressingModel<string name, bits<32> value> {
multiclass AddressingModelOperand<bits<32> value, list<Capability> reqCapabilities> {
def : AddressingModel<NAME, value>;
- defm : SymbolicOperandWithRequirements<AddressingModelOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<AddressingModelOperand, value, NAME, 0,
+ 0, [], reqCapabilities, []>;
}
defm Logical : AddressingModelOperand<0, []>;
@@ -607,7 +672,8 @@ class ExecutionModel<string name, bits<32> value> {
multiclass ExecutionModelOperand<bits<32> value, list<Capability> reqCapabilities> {
def : ExecutionModel<NAME, value>;
- defm : SymbolicOperandWithRequirements<ExecutionModelOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<ExecutionModelOperand, value, NAME, 0,
+ 0, [], reqCapabilities, []>;
}
defm Vertex : ExecutionModelOperand<0, [Shader]>;
@@ -645,7 +711,8 @@ class MemoryModel<string name, bits<32> value> {
multiclass MemoryModelOperand<bits<32> value, list<Capability> reqCapabilities> {
def : MemoryModel<NAME, value>;
- defm : SymbolicOperandWithRequirements<MemoryModelOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<MemoryModelOperand, value, NAME, 0,
+ 0, [], reqCapabilities, []>;
}
defm Simple : MemoryModelOperand<0, [Shader]>;
@@ -672,7 +739,8 @@ class ExecutionMode<string name, bits<32> value> {
multiclass ExecutionModeOperand<bits<32> value, list<Capability> reqCapabilities> {
def : ExecutionMode<NAME, value>;
- defm : SymbolicOperandWithRequirements<ExecutionModeOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<ExecutionModeOperand, value, NAME, 0,
+ 0, [], reqCapabilities, []>;
}
defm Invocations : ExecutionModeOperand<0, [Geometry]>;
@@ -748,7 +816,8 @@ class StorageClass<string name, bits<32> value> {
multiclass StorageClassOperand<bits<32> value, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def : StorageClass<NAME, value>;
- defm : SymbolicOperandWithRequirements<StorageClassOperand, value, NAME, 0, 0, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<StorageClassOperand, value, NAME, 0, 0,
+ reqExtensions, reqCapabilities, []>;
}
defm UniformConstant : StorageClassOperand<0, [], []>;
@@ -794,7 +863,8 @@ class Dim<string name, bits<32> value> {
multiclass DimOperand<bits<32> value, string mnemonic, list<Capability> reqCapabilities> {
def NAME : Dim<NAME, value>;
- defm : SymbolicOperandWithRequirements<DimOperand, value, mnemonic, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<DimOperand, value, mnemonic, 0, 0, [],
+ reqCapabilities, []>;
}
defm DIM_1D : DimOperand<0, "1D", [Sampled1D, Image1D]>;
@@ -824,7 +894,8 @@ class SamplerAddressingMode<string name, bits<32> value> {
multiclass SamplerAddressingModeOperand<bits<32> value, list<Capability> reqCapabilities> {
def : SamplerAddressingMode<NAME, value>;
- defm : SymbolicOperandWithRequirements<SamplerAddressingModeOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<SamplerAddressingModeOperand, value,
+ NAME, 0, 0, [], reqCapabilities, []>;
}
defm None : SamplerAddressingModeOperand<0, [Kernel]>;
@@ -852,7 +923,8 @@ class SamplerFilterMode<string name, bits<32> value> {
multiclass SamplerFilterModeOperand<bits<32> value, list<Capability> reqCapabilities> {
def : SamplerFilterMode<NAME, value>;
- defm : SymbolicOperandWithRequirements<SamplerFilterModeOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<SamplerFilterModeOperand, value, NAME,
+ 0, 0, [], reqCapabilities, []>;
}
defm Nearest : SamplerFilterModeOperand<0, [Kernel]>;
@@ -877,7 +949,8 @@ class ImageFormat<string name, bits<32> value> {
multiclass ImageFormatOperand<bits<32> value, list<Capability> reqCapabilities> {
def NAME : ImageFormat<NAME, value>;
- defm : SymbolicOperandWithRequirements<ImageFormatOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<ImageFormatOperand, value, NAME, 0,
+ 0, [], reqCapabilities, []>;
}
defm Unknown : ImageFormatOperand<0, []>;
@@ -940,7 +1013,8 @@ class ImageChannelOrder<string name, bits<32> value> {
multiclass ImageChannelOrderOperand<bits<32> value, list<Capability> reqCapabilities> {
def : ImageChannelOrder<NAME, value>;
- defm : SymbolicOperandWithRequirements<ImageChannelOrderOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<ImageChannelOrderOperand, value, NAME,
+ 0, 0, [], reqCapabilities, []>;
}
defm R : ImageChannelOrderOperand<0, [Kernel]>;
@@ -983,7 +1057,8 @@ class ImageChannelDataType<string name, bits<32> value> {
multiclass ImageChannelDataTypeOperand<bits<32> value, list<Capability> reqCapabilities> {
def : ImageChannelDataType<NAME, value>;
- defm : SymbolicOperandWithRequirements<ImageChannelDataTypeOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<ImageChannelDataTypeOperand, value,
+ NAME, 0, 0, [], reqCapabilities, []>;
}
defm SnormInt8 : ImageChannelDataTypeOperand<0, []>;
@@ -1023,7 +1098,8 @@ class ImageOperand<string name, bits<32> value> {
multiclass ImageOperandOperand<bits<32> value, list<Capability> reqCapabilities> {
def : ImageOperand<NAME, value>;
- defm : SymbolicOperandWithRequirements<ImageOperandOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<ImageOperandOperand, value, NAME, 0,
+ 0, [], reqCapabilities, []>;
}
defm None : ImageOperandOperand<0x0, []>;
@@ -1061,7 +1137,8 @@ class FPFastMathMode<string name, bits<32> value> {
multiclass FPFastMathModeOperand<bits<32> value, list<Capability> reqCapabilities> {
def : FPFastMathMode<NAME, value>;
- defm : SymbolicOperandWithRequirements<FPFastMathModeOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<FPFastMathModeOperand, value, NAME, 0,
+ 0, [], reqCapabilities, []>;
}
defm None : FPFastMathModeOperand<0x0, []>;
@@ -1090,7 +1167,8 @@ class FPRoundingMode<string name, bits<32> value> {
multiclass FPRoundingModeOperand<bits<32> value> {
def NAME : FPRoundingMode<NAME, value>;
- defm : SymbolicOperandWithRequirements<FPRoundingModeOperand, value, NAME, 0, 0, [], []>;
+ defm : SymbolicOperandWithRequirements<FPRoundingModeOperand, value, NAME, 0,
+ 0, [], [], []>;
}
defm RTE : FPRoundingModeOperand<0>;
@@ -1117,7 +1195,8 @@ class LinkageType<string name, bits<32> value> {
multiclass LinkageTypeOperand<bits<32> value, list<Capability> reqCapabilities> {
def : LinkageType<NAME, value>;
- defm : SymbolicOperandWithRequirements<LinkageTypeOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<LinkageTypeOperand, value, NAME, 0,
+ 0, [], reqCapabilities, []>;
}
defm Export : LinkageTypeOperand<0, [Linkage]>;
@@ -1143,7 +1222,8 @@ class AccessQualifier<string name, bits<32> value> {
multiclass AccessQualifierOperand<bits<32> value, list<Capability> reqCapabilities> {
def NAME : AccessQualifier<NAME, value>;
- defm : SymbolicOperandWithRequirements<AccessQualifierOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<AccessQualifierOperand, value, NAME, 0,
+ 0, [], reqCapabilities, []>;
}
defm ReadOnly : AccessQualifierOperand<0, [Kernel]>;
@@ -1170,7 +1250,9 @@ class FunctionParameterAttribute<string name, bits<32> value> {
multiclass FunctionParameterAttributeOperand<bits<32> value, list<Capability> reqCapabilities> {
def : FunctionParameterAttribute<NAME, value>;
- defm : SymbolicOperandWithRequirements<FunctionParameterAttributeOperand, value, NAME, 0, 0, [], reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<FunctionParameterAttributeOperand,
+ value, NAME, 0, 0, [],
+ reqCapabilities, []>;
}
defm Zext : FunctionParameterAttributeOperand<0, [Kernel]>;
@@ -1202,7 +1284,9 @@ class Decoration<string name, bits<32> value> {
multiclass DecorationOperand<bits<32> value, bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def : Decoration<NAME, value>;
- defm : SymbolicOperandWithRequirements<DecorationOperand, value, NAME, minVersion, maxVersion, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<DecorationOperand, value, NAME,
+ minVersion, maxVersion, reqExtensions,
+ reqCapabilities, []>;
}
defm RelaxedPrecision : DecorationOperand<0, 0, 0, [], [Shader]>;
@@ -1303,7 +1387,9 @@ class BuiltIn<string name, bits<32> value> {
multiclass BuiltInOperand<bits<32> value, bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def NAME : BuiltIn<NAME, value>;
- defm : SymbolicOperandWithRequirements<BuiltInOperand, value, NAME, minVersion, maxVersion, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<BuiltInOperand, value, NAME,
+ minVersion, maxVersion, reqExtensions,
+ reqCapabilities, []>;
}
defm Position : BuiltInOperand<0, 0, 0, [], [Shader]>;
@@ -1417,7 +1503,8 @@ class SelectionControl<string name, bits<32> value> {
multiclass SelectionControlOperand<bits<32> value> {
def : SelectionControl<NAME, value>;
- defm : SymbolicOperandWithRequirements<SelectionControlOperand, value, NAME, 0, 0, [], []>;
+ defm : SymbolicOperandWithRequirements<SelectionControlOperand, value, NAME,
+ 0, 0, [], [], []>;
}
defm None : SelectionControlOperand<0x0>;
@@ -1443,7 +1530,8 @@ class LoopControl<string name, bits<32> value> {
multiclass LoopControlOperand<bits<32> value> {
def : LoopControl<NAME, value>;
- defm : SymbolicOperandWithRequirements<LoopControlOperand, value, NAME, 0, 0, [], []>;
+ defm : SymbolicOperandWithRequirements<LoopControlOperand, value, NAME, 0,
+ 0, [], [], []>;
}
defm None : LoopControlOperand<0x0>;
@@ -1476,7 +1564,8 @@ class FunctionControl<string name, bits<32> value> {
multiclass FunctionControlOperand<bits<32> value> {
def : FunctionControl<NAME, value>;
- defm : SymbolicOperandWithRequirements<FunctionControlOperand, value, NAME, 0, 0, [], []>;
+ defm : SymbolicOperandWithRequirements<FunctionControlOperand, value, NAME, 0,
+ 0, [], [], []>;
}
defm None : FunctionControlOperand<0x0>;
@@ -1506,7 +1595,9 @@ class MemorySemantics<string name, bits<32> value> {
multiclass MemorySemanticsOperand<bits<32> value, bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def : MemorySemantics<NAME, value>;
- defm : SymbolicOperandWithRequirements<MemorySemanticsOperand, value, NAME, minVersion, maxVersion, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<MemorySemanticsOperand, value, NAME,
+ minVersion, maxVersion, reqExtensions,
+ reqCapabilities, []>;
}
defm None : MemorySemanticsOperand<0x0, 0, 0, [], []>;
@@ -1544,7 +1635,9 @@ class MemoryOperand<string name, bits<32> value> {
multiclass MemoryOperandOperand<bits<32> value, bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def : MemoryOperand<NAME, value>;
- defm : SymbolicOperandWithRequirements<MemoryOperandOperand, value, NAME, minVersion, maxVersion, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<MemoryOperandOperand, value, NAME,
+ minVersion, maxVersion, reqExtensions,
+ reqCapabilities, []>;
}
defm None : MemoryOperandOperand<0x0, 0, 0, [], []>;
@@ -1577,7 +1670,9 @@ class Scope<string name, bits<32> value> {
multiclass ScopeOperand<bits<32> value, bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def : Scope<NAME, value>;
- defm : SymbolicOperandWithRequirements<ScopeOperand, value, NAME, minVersion, maxVersion, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<ScopeOperand, value, NAME, minVersion,
+ maxVersion, reqExtensions,
+ reqCapabilities, []>;
}
defm CrossDevice : ScopeOperand<0, 0, 0, [], []>;
@@ -1607,7 +1702,9 @@ class GroupOperation<string name, bits<32> value> {
multiclass GroupOperationOperand<bits<32> value, bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def NAME : GroupOperation<NAME, value>;
- defm : SymbolicOperandWithRequirements<GroupOperationOperand, value, NAME, minVersion, maxVersion, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<GroupOperationOperand, value, NAME,
+ minVersion, maxVersion, reqExtensions,
+ reqCapabilities, []>;
}
defm Reduce : GroupOperationOperand<0, 0, 0, [], [Kernel, GroupNonUniformArithmetic, GroupNonUniformBallot]>;
@@ -1638,7 +1735,9 @@ class KernelEnqueueFlags<string name, bits<32> value> {
multiclass KernelEnqueueFlagsOperand<bits<32> value, bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def : KernelEnqueueFlags<NAME, value>;
- defm : SymbolicOperandWithRequirements<KernelEnqueueFlagsOperand, value, NAME, minVersion, maxVersion, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<KernelEnqueueFlagsOperand, value, NAME,
+ minVersion, maxVersion, reqExtensions,
+ reqCapabilities, []>;
}
defm NoWait : KernelEnqueueFlagsOperand<0, 0, 0, [], [Kernel]>;
@@ -1665,7 +1764,9 @@ class KernelProfilingInfo<string name, bits<32> value> {
multiclass KernelProfilingInfoOperand<bits<32> value, bits<32> minVersion, bits<32> maxVersion, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def : KernelProfilingInfo<NAME, value>;
- defm : SymbolicOperandWithRequirements<KernelProfilingInfoOperand, value, NAME, minVersion, maxVersion, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<KernelProfilingInfoOperand, value,
+ NAME, minVersion, maxVersion,
+ reqExtensions, reqCapabilities, []>;
}
defm None : KernelProfilingInfoOperand<0x0, 0, 0, [], []>;
@@ -1690,7 +1791,8 @@ class Opcode<string name, bits<32> value> {
multiclass OpcodeOperand<bits<32> value> {
def : Opcode<NAME, value>;
- defm : SymbolicOperandWithRequirements<OpcodeOperand, value, NAME, 0, 0, [], []>;
+ defm : SymbolicOperandWithRequirements<OpcodeOperand, value, NAME, 0,
+ 0, [], [], []>;
}
// TODO: implement other mnemonics.
defm InBoundsAccessChain : OpcodeOperand<66>;
@@ -1720,7 +1822,9 @@ class CooperativeMatrixLayout<string name, bits<32> value> {
multiclass CooperativeMatrixLayoutOperand<bits<32> value, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def : CooperativeMatrixLayout<NAME, value>;
- defm : SymbolicOperandWithRequirements<CooperativeMatrixLayoutOperand, value, NAME, 0, 0, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<CooperativeMatrixLayoutOperand, value,
+ NAME, 0, 0, reqExtensions,
+ reqCapabilities, []>;
}
defm RowMajorKHR : CooperativeMatrixLayoutOperand<0x0, [SPV_KHR_cooperative_matrix], [CooperativeMatrixKHR]>;
@@ -1747,7 +1851,9 @@ class CooperativeMatrixOperands<string name, bits<32> value> {
multiclass CooperativeMatrixOperandsOperand<bits<32> value, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def : CooperativeMatrixOperands<NAME, value>;
- defm : SymbolicOperandWithRequirements<CooperativeMatrixOperandsOperand, value, NAME, 0, 0, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<CooperativeMatrixOperandsOperand,
+ value, NAME, 0, 0, reqExtensions,
+ reqCapabilities, []>;
}
defm NoneKHR : CooperativeMatrixOperandsOperand<0x0, [SPV_KHR_cooperative_matrix], [CooperativeMatrixKHR]>;
@@ -1780,7 +1886,9 @@ class SpecConstantOpOperands<string name, bits<32> value> {
multiclass SpecConstantOpOperandsOperand<bits<32> value, list<Extension> reqExtensions, list<Capability> reqCapabilities> {
def : SpecConstantOpOperands<NAME, value>;
- defm : SymbolicOperandWithRequirements<SpecConstantOpOperandsOperand, value, NAME, 0, 0, reqExtensions, reqCapabilities>;
+ defm : SymbolicOperandWithRequirements<SpecConstantOpOperandsOperand, value,
+ NAME, 0, 0, reqExtensions,
+ reqCapabilities, []>;
}
// Conversion
@@ -1868,7 +1976,9 @@ class MatrixMultiplyAccumulateOperands<string name, bits<32> value> {
multiclass MatrixMultiplyAccumulateOperandsOperand<bits<32> value, list<Extension> reqExtensions> {
def : MatrixMultiplyAccumulateOperands<NAME, value>;
- defm : SymbolicOperandWithRequirements<MatrixMultiplyAccumulateOperandsOperand, value, NAME, 0, 0, reqExtensions, []>;
+ defm : SymbolicOperandWithRequirements<
+ MatrixMultiplyAccumulateOperandsOperand, value, NAME, 0, 0,
+ reqExtensions, [], []>;
}
defm None : MatrixMultiplyAccumulateOperandsOperand<0x0, [SPV_INTEL_subgroup_matrix_multiply_accumulate]>;
diff --git a/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions-avoid-invalid.ll b/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions-avoid-invalid.ll
new file mode 100644
index 0000000000000..2de7fff0bc900
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions-avoid-invalid.ll
@@ -0,0 +1,16 @@
+; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv1.6-vulkan1.3-compute --spirv-ext=all %s -o - | FileCheck %s
+; RUN: %if spirv-tools %{ llc -O0 -mtriple=spirv1.6-vulkan1.3-compute --spirv-ext=all %s -o - -filetype=obj | spirv-val --target-env vulkan1.3 %}
+
+; CHECK-NOT: OpExtension "SPV_KHR_no_integer_wrap_decoration"
+
+define internal void @foo(i32 %i) local_unnamed_addr {
+ %sub.i = sub nsw i32 0, %i
+ ret void
+}
+
+define internal void @main() local_unnamed_addr #0 {
+entry:
+ ret void
+}
+
+attributes #0 = { "hlsl.numthreads"="1,1,1" "hlsl.shader"="compute" }
\ No newline at end of file
>From 8429f7faaa5c5afdece49be04bc5720d5110b6d1 Mon Sep 17 00:00:00 2001
From: Krzysztof Parzyszek <Krzysztof.Parzyszek at amd.com>
Date: Mon, 18 Aug 2025 13:35:02 -0500
Subject: [PATCH 068/112] [flang][OpenMP] Parsing support for DYN_GROUPPRIVATE
(#153615)
This does not perform semantic checks or lowering.
---
flang/include/flang/Lower/OpenMP/Clauses.h | 1 +
flang/include/flang/Parser/dump-parse-tree.h | 4 ++
flang/include/flang/Parser/parse-tree.h | 14 +++-
flang/lib/Lower/OpenMP/Clauses.cpp | 23 ++++++
flang/lib/Parser/openmp-parsers.cpp | 19 ++++-
flang/lib/Parser/unparse.cpp | 6 ++
flang/lib/Semantics/check-omp-structure.cpp | 1 +
.../OpenMP/Todo/dyn-groupprivate-clause.f90 | 10 +++
.../Parser/OpenMP/dyn-groupprivate-clause.f90 | 70 +++++++++++++++++++
llvm/include/llvm/Frontend/OpenMP/ClauseT.h | 22 ++++--
llvm/include/llvm/Frontend/OpenMP/OMP.td | 34 +++++++++
11 files changed, 196 insertions(+), 8 deletions(-)
create mode 100644 flang/test/Lower/OpenMP/Todo/dyn-groupprivate-clause.f90
create mode 100644 flang/test/Parser/OpenMP/dyn-groupprivate-clause.f90
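
Reader's note (not part of the patch): the lowering-side `make` function in this change appears to turn each optional modifier into an optional field of the clause tuple and to pass the size expression through. The following is a minimal, self-contained sketch of that shape only. Every name in it (`ParsedAccessGroup`, `convertIfPresent`, `makeDynGroupprivateSketch`, the `int` size) is a hypothetical stand-in chosen for illustration, not flang's actual API.

```cpp
#include <cassert>
#include <optional>
#include <tuple>

// Hypothetical stand-ins for the parser-side modifier values.
enum class ParsedAccessGroup { Cgroup };
enum class ParsedPrescriptiveness { Strict, Fallback };

// Hypothetical stand-ins for the clause-side values.
enum class AccessGroup { Cgroup };
enum class Prescriptiveness { Strict, Fallback };

inline AccessGroup convertAccessGroup(ParsedAccessGroup) {
  return AccessGroup::Cgroup; // only one enumerator exists
}

inline Prescriptiveness convertPrescriptiveness(ParsedPrescriptiveness v) {
  switch (v) {
  case ParsedPrescriptiveness::Strict:
    return Prescriptiveness::Strict;
  case ParsedPrescriptiveness::Fallback:
    return Prescriptiveness::Fallback;
  }
  return Prescriptiveness::Strict; // not reached for valid enumerators
}

// Apply `convert` when the modifier was written; otherwise yield nullopt.
// (Assumed helper, loosely modeled on the optional-mapping step in the patch.)
template <typename Result, typename Parsed>
std::optional<Result> convertIfPresent(Result (*convert)(Parsed),
                                       const Parsed *modifier) {
  if (modifier)
    return convert(*modifier);
  return std::nullopt;
}

// Mirrors the tuple layout: optional access group, optional
// prescriptiveness, then the mandatory size (an int here for simplicity).
struct DynGroupprivateSketch {
  std::tuple<std::optional<AccessGroup>, std::optional<Prescriptiveness>, int> t;
};

inline DynGroupprivateSketch
makeDynGroupprivateSketch(const ParsedAccessGroup *accessGroup,
                          const ParsedPrescriptiveness *prescriptiveness,
                          int size) {
  return DynGroupprivateSketch{
      {convertIfPresent(convertAccessGroup, accessGroup),
       convertIfPresent(convertPrescriptiveness, prescriptiveness), size}};
}
```

Under these assumptions, a clause written as `dyn_groupprivate(n)` would correspond to passing two null modifier pointers, and `dyn_groupprivate(fallback, cgroup: n)` to passing both modifiers; the order in which modifiers are written does not affect which tuple slot they land in.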
diff --git a/flang/include/flang/Lower/OpenMP/Clauses.h b/flang/include/flang/Lower/OpenMP/Clauses.h
index 7f317f05f67b7..1ab594ffcd209 100644
--- a/flang/include/flang/Lower/OpenMP/Clauses.h
+++ b/flang/include/flang/Lower/OpenMP/Clauses.h
@@ -219,6 +219,7 @@ using DistSchedule = tomp::clause::DistScheduleT<TypeTy, IdTy, ExprTy>;
using Doacross = tomp::clause::DoacrossT<TypeTy, IdTy, ExprTy>;
using DynamicAllocators =
tomp::clause::DynamicAllocatorsT<TypeTy, IdTy, ExprTy>;
+using DynGroupprivate = tomp::clause::DynGroupprivateT<TypeTy, IdTy, ExprTy>;
using Enter = tomp::clause::EnterT<TypeTy, IdTy, ExprTy>;
using Exclusive = tomp::clause::ExclusiveT<TypeTy, IdTy, ExprTy>;
using Fail = tomp::clause::FailT<TypeTy, IdTy, ExprTy>;
diff --git a/flang/include/flang/Parser/dump-parse-tree.h b/flang/include/flang/Parser/dump-parse-tree.h
index 2c666a6d09a7b..a4380e19cdba1 100644
--- a/flang/include/flang/Parser/dump-parse-tree.h
+++ b/flang/include/flang/Parser/dump-parse-tree.h
@@ -525,6 +525,8 @@ class ParseTreeDumper {
NODE(parser, OmpAbsentClause)
NODE(parser, OmpAffinityClause)
NODE(OmpAffinityClause, Modifier)
+ NODE(parser, OmpAccessGroup)
+ NODE_ENUM(OmpAccessGroup, Value)
NODE(parser, OmpAlignment)
NODE(parser, OmpAlignClause)
NODE(parser, OmpAlignedClause)
@@ -569,6 +571,8 @@ class ParseTreeDumper {
NODE_ENUM(OmpDependenceType, Value)
NODE(parser, OmpTaskDependenceType)
NODE_ENUM(OmpTaskDependenceType, Value)
+ NODE(parser, OmpDynGroupprivateClause)
+ NODE(OmpDynGroupprivateClause, Modifier)
NODE(parser, OmpIndirectClause)
NODE(parser, OmpIterationOffset)
NODE(parser, OmpIteration)
diff --git a/flang/include/flang/Parser/parse-tree.h b/flang/include/flang/Parser/parse-tree.h
index e72190f019dd1..e9045b4f772e3 100644
--- a/flang/include/flang/Parser/parse-tree.h
+++ b/flang/include/flang/Parser/parse-tree.h
@@ -3736,6 +3736,11 @@ inline namespace modifier {
// ENUM_CLASS(Value, Keyword1, Keyword2);
// };
+struct OmpAccessGroup {
+ ENUM_CLASS(Value, Cgroup);
+ WRAPPER_CLASS_BOILERPLATE(OmpAccessGroup, Value);
+};
+
// Ref: [4.5:72-81], [5.0:110-119], [5.1:134-143], [5.2:169-170]
//
// alignment ->
@@ -4019,8 +4024,9 @@ struct OmpOrderModifier {
//
// prescriptiveness ->
// STRICT // since 5.1
+// FALLBACK // since 6.1
struct OmpPrescriptiveness {
- ENUM_CLASS(Value, Strict)
+ ENUM_CLASS(Value, Strict, Fallback)
WRAPPER_CLASS_BOILERPLATE(OmpPrescriptiveness, Value);
};
@@ -4375,6 +4381,12 @@ struct OmpDeviceTypeClause {
WRAPPER_CLASS_BOILERPLATE(OmpDeviceTypeClause, DeviceTypeDescription);
};
+struct OmpDynGroupprivateClause {
+ TUPLE_CLASS_BOILERPLATE(OmpDynGroupprivateClause);
+ MODIFIER_BOILERPLATE(OmpAccessGroup, OmpPrescriptiveness);
+ std::tuple<MODIFIERS(), ScalarIntExpr> t;
+};
+
// Ref: [5.2:158-159], [6.0:289-290]
//
// enter-clause ->
diff --git a/flang/lib/Lower/OpenMP/Clauses.cpp b/flang/lib/Lower/OpenMP/Clauses.cpp
index 7f75aae09def1..1a16e1c87e250 100644
--- a/flang/lib/Lower/OpenMP/Clauses.cpp
+++ b/flang/lib/Lower/OpenMP/Clauses.cpp
@@ -396,6 +396,8 @@ makePrescriptiveness(parser::OmpPrescriptiveness::Value v) {
switch (v) {
case parser::OmpPrescriptiveness::Value::Strict:
return clause::Prescriptiveness::Strict;
+ case parser::OmpPrescriptiveness::Value::Fallback:
+ return clause::Prescriptiveness::Fallback;
}
llvm_unreachable("Unexpected prescriptiveness");
}
@@ -770,6 +772,27 @@ Doacross make(const parser::OmpClause::Doacross &inp,
// DynamicAllocators: empty
+DynGroupprivate make(const parser::OmpClause::DynGroupprivate &inp,
+ semantics::SemanticsContext &semaCtx) {
+ // imp.v -> OmpDyngroupprivateClause
+ CLAUSET_ENUM_CONVERT( //
+ convert, parser::OmpAccessGroup::Value, DynGroupprivate::AccessGroup,
+ // clang-format off
+ MS(Cgroup, Cgroup)
+ // clang-format on
+ );
+
+ auto &mods = semantics::OmpGetModifiers(inp.v);
+ auto *m0 = semantics::OmpGetUniqueModifier<parser::OmpAccessGroup>(mods);
+ auto *m1 = semantics::OmpGetUniqueModifier<parser::OmpPrescriptiveness>(mods);
+ auto &size = std::get<parser::ScalarIntExpr>(inp.v.t);
+
+ return DynGroupprivate{
+ {/*AccessGroup=*/maybeApplyToV(convert, m0),
+ /*Prescriptiveness=*/maybeApplyToV(makePrescriptiveness, m1),
+ /*Size=*/makeExpr(size, semaCtx)}};
+}
+
Enter make(const parser::OmpClause::Enter &inp,
semantics::SemanticsContext &semaCtx) {
// inp.v -> parser::OmpEnterClause
diff --git a/flang/lib/Parser/openmp-parsers.cpp b/flang/lib/Parser/openmp-parsers.cpp
index 46b14861096f1..d83635952740f 100644
--- a/flang/lib/Parser/openmp-parsers.cpp
+++ b/flang/lib/Parser/openmp-parsers.cpp
@@ -469,6 +469,9 @@ TYPE_PARSER(sourced(construct<OmpContextSelectorSpecification>(
// --- Parsers for clause modifiers -----------------------------------
+TYPE_PARSER(construct<OmpAccessGroup>( //
+ "CGROUP" >> pure(OmpAccessGroup::Value::Cgroup)))
+
TYPE_PARSER(construct<OmpAlignment>(scalarIntExpr))
TYPE_PARSER(construct<OmpAlignModifier>( //
@@ -573,7 +576,8 @@ TYPE_PARSER(construct<OmpOrderingModifier>(
"SIMD" >> pure(OmpOrderingModifier::Value::Simd)))
TYPE_PARSER(construct<OmpPrescriptiveness>(
- "STRICT" >> pure(OmpPrescriptiveness::Value::Strict)))
+ "STRICT" >> pure(OmpPrescriptiveness::Value::Strict) ||
+ "FALLBACK" >> pure(OmpPrescriptiveness::Value::Fallback)))
TYPE_PARSER(construct<OmpPresentModifier>( //
"PRESENT" >> pure(OmpPresentModifier::Value::Present)))
@@ -636,6 +640,12 @@ TYPE_PARSER(sourced(construct<OmpDependClause::TaskDep::Modifier>(sourced(
construct<OmpDependClause::TaskDep::Modifier>(
Parser<OmpTaskDependenceType>{})))))
+TYPE_PARSER( //
+ sourced(construct<OmpDynGroupprivateClause::Modifier>(
+ Parser<OmpAccessGroup>{})) ||
+ sourced(construct<OmpDynGroupprivateClause::Modifier>(
+ Parser<OmpPrescriptiveness>{})))
+
TYPE_PARSER(
sourced(construct<OmpDeviceClause::Modifier>(Parser<OmpDeviceModifier>{})))
@@ -777,6 +787,10 @@ TYPE_PARSER(construct<OmpDefaultClause>(
Parser<OmpDefaultClause::DataSharingAttribute>{}) ||
construct<OmpDefaultClause>(indirect(Parser<OmpDirectiveSpecification>{}))))
+TYPE_PARSER(construct<OmpDynGroupprivateClause>(
+ maybe(nonemptyList(Parser<OmpDynGroupprivateClause::Modifier>{}) / ":"),
+ scalarIntExpr))
+
TYPE_PARSER(construct<OmpEnterClause>(
maybe(nonemptyList(Parser<OmpEnterClause::Modifier>{}) / ":"),
Parser<OmpObjectList>{}))
@@ -1068,6 +1082,9 @@ TYPE_PARSER( //
construct<OmpClause>(parenthesized(Parser<OmpDoacrossClause>{})) ||
"DYNAMIC_ALLOCATORS" >>
construct<OmpClause>(construct<OmpClause::DynamicAllocators>()) ||
+ "DYN_GROUPPRIVATE" >>
+ construct<OmpClause>(construct<OmpClause::DynGroupprivate>(
+ parenthesized(Parser<OmpDynGroupprivateClause>{}))) ||
"ENTER" >> construct<OmpClause>(construct<OmpClause::Enter>(
parenthesized(Parser<OmpEnterClause>{}))) ||
"EXCLUSIVE" >> construct<OmpClause>(construct<OmpClause::Exclusive>(
diff --git a/flang/lib/Parser/unparse.cpp b/flang/lib/Parser/unparse.cpp
index 4f8d498972807..f3b82975a837a 100644
--- a/flang/lib/Parser/unparse.cpp
+++ b/flang/lib/Parser/unparse.cpp
@@ -2250,6 +2250,11 @@ class UnparseVisitor {
Walk(std::get<OmpObjectList>(x.t));
Walk(": ", std::get<std::optional<std::list<Modifier>>>(x.t));
}
+ void Unparse(const OmpDynGroupprivateClause &x) {
+ using Modifier = OmpDynGroupprivateClause::Modifier;
+ Walk(std::get<std::optional<std::list<Modifier>>>(x.t), ": ");
+ Walk(std::get<ScalarIntExpr>(x.t));
+ }
void Unparse(const OmpEnterClause &x) {
using Modifier = OmpEnterClause::Modifier;
Walk(std::get<std::optional<std::list<Modifier>>>(x.t), ": ");
@@ -2941,6 +2946,7 @@ class UnparseVisitor {
WALK_NESTED_ENUM(OmpTaskDependenceType, Value) // OMP task-dependence-type
WALK_NESTED_ENUM(OmpScheduleClause, Kind) // OMP schedule-kind
WALK_NESTED_ENUM(OmpSeverityClause, Severity) // OMP severity
+ WALK_NESTED_ENUM(OmpAccessGroup, Value)
WALK_NESTED_ENUM(OmpDeviceModifier, Value) // OMP device modifier
WALK_NESTED_ENUM(
OmpDeviceTypeClause, DeviceTypeDescription) // OMP device_type
diff --git a/flang/lib/Semantics/check-omp-structure.cpp b/flang/lib/Semantics/check-omp-structure.cpp
index bf126bbb0d8c1..d9092565449da 100644
--- a/flang/lib/Semantics/check-omp-structure.cpp
+++ b/flang/lib/Semantics/check-omp-structure.cpp
@@ -2581,6 +2581,7 @@ CHECK_SIMPLE_CLAUSE(Default, OMPC_default)
CHECK_SIMPLE_CLAUSE(Depobj, OMPC_depobj)
CHECK_SIMPLE_CLAUSE(DeviceType, OMPC_device_type)
CHECK_SIMPLE_CLAUSE(DistSchedule, OMPC_dist_schedule)
+CHECK_SIMPLE_CLAUSE(DynGroupprivate, OMPC_dyn_groupprivate)
CHECK_SIMPLE_CLAUSE(Exclusive, OMPC_exclusive)
CHECK_SIMPLE_CLAUSE(Final, OMPC_final)
CHECK_SIMPLE_CLAUSE(Flush, OMPC_flush)
diff --git a/flang/test/Lower/OpenMP/Todo/dyn-groupprivate-clause.f90 b/flang/test/Lower/OpenMP/Todo/dyn-groupprivate-clause.f90
new file mode 100644
index 0000000000000..e06470f772bf8
--- /dev/null
+++ b/flang/test/Lower/OpenMP/Todo/dyn-groupprivate-clause.f90
@@ -0,0 +1,10 @@
+!RUN: %not_todo_cmd %flang_fc1 -emit-hlfir -fopenmp -fopenmp-version=61 -o - %s 2>&1 | FileCheck %s
+
+!CHECK: not yet implemented: DYN_GROUPPRIVATE clause is not implemented yet
+subroutine f00(n)
+ implicit none
+ integer :: n
+ !$omp target dyn_groupprivate(n)
+ !$omp end target
+end
+
diff --git a/flang/test/Parser/OpenMP/dyn-groupprivate-clause.f90 b/flang/test/Parser/OpenMP/dyn-groupprivate-clause.f90
new file mode 100644
index 0000000000000..7d41efd348e50
--- /dev/null
+++ b/flang/test/Parser/OpenMP/dyn-groupprivate-clause.f90
@@ -0,0 +1,70 @@
+!RUN: %flang_fc1 -fdebug-unparse -fopenmp -fopenmp-version=61 %s | FileCheck --ignore-case --check-prefix="UNPARSE" %s
+!RUN: %flang_fc1 -fdebug-dump-parse-tree -fopenmp -fopenmp-version=61 %s | FileCheck --check-prefix="PARSE-TREE" %s
+
+subroutine f00(n)
+ implicit none
+ integer :: n
+ !$omp target dyn_groupprivate(n)
+ !$omp end target
+end
+
+!UNPARSE: SUBROUTINE f00 (n)
+!UNPARSE: IMPLICIT NONE
+!UNPARSE: INTEGER n
+!UNPARSE: !$OMP TARGET DYN_GROUPPRIVATE(n)
+!UNPARSE: !$OMP END TARGET
+!UNPARSE: END SUBROUTINE
+
+!PARSE-TREE: OmpBeginDirective
+!PARSE-TREE: | OmpDirectiveName -> llvm::omp::Directive = target
+!PARSE-TREE: | OmpClauseList -> OmpClause -> DynGroupprivate -> OmpDynGroupprivateClause
+!PARSE-TREE: | | Scalar -> Integer -> Expr = 'n'
+!PARSE-TREE: | | | Designator -> DataRef -> Name = 'n'
+!PARSE-TREE: | Flags = None
+
+
+subroutine f01(n)
+ implicit none
+ integer :: n
+ !$omp target dyn_groupprivate(strict: n)
+ !$omp end target
+end
+
+!UNPARSE: SUBROUTINE f01 (n)
+!UNPARSE: IMPLICIT NONE
+!UNPARSE: INTEGER n
+!UNPARSE: !$OMP TARGET DYN_GROUPPRIVATE(STRICT: n)
+!UNPARSE: !$OMP END TARGET
+!UNPARSE: END SUBROUTINE
+
+!PARSE-TREE: OmpBeginDirective
+!PARSE-TREE: | OmpDirectiveName -> llvm::omp::Directive = target
+!PARSE-TREE: | OmpClauseList -> OmpClause -> DynGroupprivate -> OmpDynGroupprivateClause
+!PARSE-TREE: | | Modifier -> OmpPrescriptiveness -> Value = Strict
+!PARSE-TREE: | | Scalar -> Integer -> Expr = 'n'
+!PARSE-TREE: | | | Designator -> DataRef -> Name = 'n'
+!PARSE-TREE: | Flags = None
+
+
+subroutine f02(n)
+ implicit none
+ integer :: n
+ !$omp target dyn_groupprivate(fallback, cgroup: n)
+ !$omp end target
+end
+
+!UNPARSE: SUBROUTINE f02 (n)
+!UNPARSE: IMPLICIT NONE
+!UNPARSE: INTEGER n
+!UNPARSE: !$OMP TARGET DYN_GROUPPRIVATE(FALLBACK, CGROUP: n)
+!UNPARSE: !$OMP END TARGET
+!UNPARSE: END SUBROUTINE
+
+!PARSE-TREE: OmpBeginDirective
+!PARSE-TREE: | OmpDirectiveName -> llvm::omp::Directive = target
+!PARSE-TREE: | OmpClauseList -> OmpClause -> DynGroupprivate -> OmpDynGroupprivateClause
+!PARSE-TREE: | | Modifier -> OmpPrescriptiveness -> Value = Fallback
+!PARSE-TREE: | | Modifier -> OmpAccessGroup -> Value = Cgroup
+!PARSE-TREE: | | Scalar -> Integer -> Expr = 'n'
+!PARSE-TREE: | | | Designator -> DataRef -> Name = 'n'
+!PARSE-TREE: | Flags = None
diff --git a/llvm/include/llvm/Frontend/OpenMP/ClauseT.h b/llvm/include/llvm/Frontend/OpenMP/ClauseT.h
index ce1cedc188fbf..8ea50e7e8d416 100644
--- a/llvm/include/llvm/Frontend/OpenMP/ClauseT.h
+++ b/llvm/include/llvm/Frontend/OpenMP/ClauseT.h
@@ -242,7 +242,7 @@ ENUM(MotionExpectation, Present);
// V5.2: [15.9.1] `task-dependence-type` modifier
ENUM(DependenceType, Depobj, In, Inout, Inoutset, Mutexinoutset, Out, Sink,
Source);
-ENUM(Prescriptiveness, Strict);
+ENUM(Prescriptiveness, Strict, Fallback);
template <typename I, typename E> //
struct LoopIterationT {
@@ -574,6 +574,15 @@ struct DynamicAllocatorsT {
using EmptyTrait = std::true_type;
};
+template <typename T, typename I, typename E> //
+struct DynGroupprivateT {
+ ENUM(AccessGroup, Cgroup);
+ using Prescriptiveness = type::Prescriptiveness;
+ using Size = E;
+ using TupleTrait = std::true_type;
+ std::tuple<OPT(AccessGroup), OPT(Prescriptiveness), Size> t;
+};
+
// V5.2: [5.8.4] `enter` clause
template <typename T, typename I, typename E> //
struct EnterT {
@@ -1263,11 +1272,12 @@ template <typename T, typename I, typename E>
using TupleClausesT =
std::variant<AffinityT<T, I, E>, AlignedT<T, I, E>, AllocateT<T, I, E>,
DefaultmapT<T, I, E>, DeviceT<T, I, E>, DistScheduleT<T, I, E>,
- DoacrossT<T, I, E>, FromT<T, I, E>, GrainsizeT<T, I, E>,
- IfT<T, I, E>, InitT<T, I, E>, InReductionT<T, I, E>,
- LastprivateT<T, I, E>, LinearT<T, I, E>, MapT<T, I, E>,
- NumTasksT<T, I, E>, OrderT<T, I, E>, ReductionT<T, I, E>,
- ScheduleT<T, I, E>, TaskReductionT<T, I, E>, ToT<T, I, E>>;
+ DoacrossT<T, I, E>, DynGroupprivateT<T, I, E>, FromT<T, I, E>,
+ GrainsizeT<T, I, E>, IfT<T, I, E>, InitT<T, I, E>,
+ InReductionT<T, I, E>, LastprivateT<T, I, E>, LinearT<T, I, E>,
+ MapT<T, I, E>, NumTasksT<T, I, E>, OrderT<T, I, E>,
+ ReductionT<T, I, E>, ScheduleT<T, I, E>,
+ TaskReductionT<T, I, E>, ToT<T, I, E>>;
template <typename T, typename I, typename E>
using UnionClausesT = std::variant<DependT<T, I, E>>;
diff --git a/llvm/include/llvm/Frontend/OpenMP/OMP.td b/llvm/include/llvm/Frontend/OpenMP/OMP.td
index 79f25bb05f20e..7140980e63539 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMP.td
+++ b/llvm/include/llvm/Frontend/OpenMP/OMP.td
@@ -178,6 +178,9 @@ def OMPC_Doacross : Clause<[Spelling<"doacross">]> {
def OMPC_DynamicAllocators : Clause<[Spelling<"dynamic_allocators">]> {
let clangClass = "OMPDynamicAllocatorsClause";
}
+def OMPC_DynGroupprivate : Clause<[Spelling<"dyn_groupprivate">]> {
+ let flangClass = "OmpDynGroupprivateClause";
+}
def OMPC_Enter : Clause<[Spelling<"enter">]> {
let flangClass = "OmpEnterClause";
}
@@ -1104,6 +1107,7 @@ def OMP_Target : Directive<[Spelling<"target">]> {
let allowedOnceClauses = [
VersionedClause<OMPC_DefaultMap>,
VersionedClause<OMPC_Device>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_If>,
VersionedClause<OMPC_NoWait>,
VersionedClause<OMPC_OMPX_Bare>,
@@ -1254,6 +1258,7 @@ def OMP_Teams : Directive<[Spelling<"teams">]> {
];
let allowedOnceClauses = [
VersionedClause<OMPC_Default>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_If, 52>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_ThreadLimit>,
@@ -1522,6 +1527,7 @@ def OMP_target_loop : Directive<[Spelling<"target loop">]> {
let allowedOnceClauses = [
VersionedClause<OMPC_Bind, 50>,
VersionedClause<OMPC_Collapse>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_Order>,
VersionedClause<OMPC_ThreadLimit>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
@@ -1983,6 +1989,7 @@ def OMP_TargetParallel : Directive<[Spelling<"target parallel">]> {
let allowedOnceClauses = [
VersionedClause<OMPC_DefaultMap>,
VersionedClause<OMPC_Device>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NumThreads>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
VersionedClause<OMPC_ProcBind>,
@@ -2012,6 +2019,7 @@ def OMP_TargetParallelDo : Directive<[Spelling<"target parallel do">]> {
VersionedClause<OMPC_Collapse>,
VersionedClause<OMPC_DefaultMap>,
VersionedClause<OMPC_Device>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NoWait>,
VersionedClause<OMPC_NumThreads>,
VersionedClause<OMPC_Order, 50>,
@@ -2054,6 +2062,9 @@ def OMP_TargetParallelDoSimd
VersionedClause<OMPC_SimdLen>,
VersionedClause<OMPC_UsesAllocators>,
];
+ let allowedOnceClauses = [
+ VersionedClause<OMPC_DynGroupprivate, 61>,
+ ];
let leafConstructs = [OMP_Target, OMP_Parallel, OMP_Do, OMP_Simd];
let category = CA_Executable;
let languages = [L_Fortran];
@@ -2086,6 +2097,7 @@ def OMP_TargetParallelFor : Directive<[Spelling<"target parallel for">]> {
VersionedClause<OMPC_UsesAllocators, 50>,
];
let allowedOnceClauses = [
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
VersionedClause<OMPC_ThreadLimit, 51>,
];
@@ -2126,6 +2138,7 @@ def OMP_TargetParallelForSimd
VersionedClause<OMPC_UsesAllocators, 50>,
];
let allowedOnceClauses = [
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
VersionedClause<OMPC_ThreadLimit, 51>,
];
@@ -2155,6 +2168,7 @@ def OMP_target_parallel_loop : Directive<[Spelling<"target parallel loop">]> {
VersionedClause<OMPC_Collapse>,
VersionedClause<OMPC_Default>,
VersionedClause<OMPC_DefaultMap>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NoWait>,
VersionedClause<OMPC_NumThreads>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
@@ -2189,6 +2203,7 @@ def OMP_TargetSimd : Directive<[Spelling<"target simd">]> {
VersionedClause<OMPC_Collapse>,
VersionedClause<OMPC_DefaultMap>,
VersionedClause<OMPC_Device>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NumThreads>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
VersionedClause<OMPC_Order, 50>,
@@ -2220,6 +2235,7 @@ def OMP_TargetTeams : Directive<[Spelling<"target teams">]> {
VersionedClause<OMPC_Default>,
VersionedClause<OMPC_DefaultMap>,
VersionedClause<OMPC_Device>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NoWait>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
@@ -2252,6 +2268,7 @@ def OMP_TargetTeamsDistribute
VersionedClause<OMPC_DefaultMap>,
VersionedClause<OMPC_Device>,
VersionedClause<OMPC_DistSchedule>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NoWait>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
@@ -2284,6 +2301,7 @@ def OMP_TargetTeamsDistributeParallelDo
VersionedClause<OMPC_DefaultMap>,
VersionedClause<OMPC_Device>,
VersionedClause<OMPC_DistSchedule>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NoWait>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_NumThreads>,
@@ -2322,6 +2340,7 @@ def OMP_TargetTeamsDistributeParallelDoSimd
VersionedClause<OMPC_DefaultMap>,
VersionedClause<OMPC_Device>,
VersionedClause<OMPC_DistSchedule>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NoWait>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_NumThreads>,
@@ -2367,6 +2386,7 @@ def OMP_TargetTeamsDistributeParallelFor
VersionedClause<OMPC_UsesAllocators, 50>,
];
let allowedOnceClauses = [
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
];
let leafConstructs =
@@ -2409,6 +2429,7 @@ def OMP_TargetTeamsDistributeParallelForSimd
VersionedClause<OMPC_UsesAllocators, 50>,
];
let allowedOnceClauses = [
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
];
let leafConstructs =
@@ -2441,6 +2462,7 @@ def OMP_TargetTeamsDistributeSimd
VersionedClause<OMPC_DefaultMap>,
VersionedClause<OMPC_Device>,
VersionedClause<OMPC_DistSchedule>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NoWait>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
@@ -2474,6 +2496,7 @@ def OMP_target_teams_loop : Directive<[Spelling<"target teams loop">]> {
VersionedClause<OMPC_Bind, 50>,
VersionedClause<OMPC_Collapse>,
VersionedClause<OMPC_Default>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NoWait>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_OMPX_DynCGroupMem>,
@@ -2532,6 +2555,7 @@ def OMP_TeamsDistribute : Directive<[Spelling<"teams distribute">]> {
VersionedClause<OMPC_ThreadLimit>,
];
let allowedOnceClauses = [
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_If>,
VersionedClause<OMPC_Order, 50>,
];
@@ -2555,6 +2579,7 @@ def OMP_TeamsDistributeParallelDo
VersionedClause<OMPC_Collapse>,
VersionedClause<OMPC_Default>,
VersionedClause<OMPC_DistSchedule>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_NumThreads>,
VersionedClause<OMPC_Order, 50>,
@@ -2584,6 +2609,7 @@ def OMP_TeamsDistributeParallelDoSimd
VersionedClause<OMPC_Collapse>,
VersionedClause<OMPC_Default>,
VersionedClause<OMPC_DistSchedule>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_NumThreads>,
VersionedClause<OMPC_Order, 50>,
@@ -2620,6 +2646,9 @@ def OMP_TeamsDistributeParallelFor
VersionedClause<OMPC_Shared>,
VersionedClause<OMPC_ThreadLimit>,
];
+ let allowedOnceClauses = [
+ VersionedClause<OMPC_DynGroupprivate, 61>,
+ ];
let leafConstructs = [OMP_Teams, OMP_Distribute, OMP_Parallel, OMP_For];
let category = CA_Executable;
let languages = [L_C];
@@ -2650,6 +2679,9 @@ def OMP_TeamsDistributeParallelForSimd
VersionedClause<OMPC_SimdLen>,
VersionedClause<OMPC_ThreadLimit>,
];
+ let allowedOnceClauses = [
+ VersionedClause<OMPC_DynGroupprivate, 61>,
+ ];
let leafConstructs =
[OMP_Teams, OMP_Distribute, OMP_Parallel, OMP_For, OMP_Simd];
let category = CA_Executable;
@@ -2673,6 +2705,7 @@ def OMP_TeamsDistributeSimd : Directive<[Spelling<"teams distribute simd">]> {
VersionedClause<OMPC_Collapse>,
VersionedClause<OMPC_Default>,
VersionedClause<OMPC_DistSchedule>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_Order, 50>,
VersionedClause<OMPC_SafeLen>,
@@ -2696,6 +2729,7 @@ def OMP_teams_loop : Directive<[Spelling<"teams loop">]> {
VersionedClause<OMPC_Bind, 50>,
VersionedClause<OMPC_Collapse>,
VersionedClause<OMPC_Default>,
+ VersionedClause<OMPC_DynGroupprivate, 61>,
VersionedClause<OMPC_NumTeams>,
VersionedClause<OMPC_Order>,
VersionedClause<OMPC_ThreadLimit>,
>From 43df97a909fbb0ebc8416b9faa88de21447fc3fe Mon Sep 17 00:00:00 2001
From: Matthias Braun <matze at braunis.de>
Date: Mon, 18 Aug 2025 11:55:23 -0700
Subject: [PATCH 069/112] llvm-profgen: Avoid "using namespace" in headers
(#147631)
Avoid global `using namespace` directives in headers as they are bad
style.
---
llvm/tools/llvm-profgen/CSPreInliner.h | 3 --
llvm/tools/llvm-profgen/ErrorHandling.h | 4 ++-
llvm/tools/llvm-profgen/PerfReader.cpp | 5 ++-
llvm/tools/llvm-profgen/PerfReader.h | 3 --
llvm/tools/llvm-profgen/ProfileGenerator.cpp | 6 ++--
llvm/tools/llvm-profgen/ProfileGenerator.h | 3 --
llvm/tools/llvm-profgen/ProfiledBinary.cpp | 1 +
llvm/tools/llvm-profgen/ProfiledBinary.h | 34 +++++++++-----------
llvm/tools/llvm-profgen/llvm-profgen.cpp | 6 ++--
9 files changed, 27 insertions(+), 38 deletions(-)
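
Reader's note (not part of the patch): the pattern applied in this change can be sketched in one file as follows. Header-style declarations spell out the namespace they need, while a `using namespace` directive is confined to the part that plays the role of a `.cpp` file, so it cannot leak into other includers. All names here (`tool_sketch`, `object_sketch`, `profile_sketch`, `describe`, `describeTextSection`) are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <string>

// ---- What would live in a header ----------------------------------------
namespace tool_sketch {
namespace object_sketch {
struct SectionRef {
  std::string Name;
};
} // namespace object_sketch

namespace profile_sketch {
// The parameter type is qualified explicitly, so the header does not need
// a file-scope `using namespace tool_sketch::object_sketch;`.
inline std::string describe(const object_sketch::SectionRef &Section) {
  return "section:" + Section.Name;
}
} // namespace profile_sketch
} // namespace tool_sketch

// ---- What would live in the matching .cpp file --------------------------
// These directives affect only this translation unit.
using namespace tool_sketch;
using namespace tool_sketch::object_sketch;

std::string describeTextSection() {
  SectionRef Text{".text"}; // unqualified name is fine in the .cpp part
  return profile_sketch::describe(Text);
}
```

The trade-off is a little more typing in headers (as in the `object::ObjectFile` spellings in the diff below) in exchange for includers keeping control over which names are visible unqualified.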
diff --git a/llvm/tools/llvm-profgen/CSPreInliner.h b/llvm/tools/llvm-profgen/CSPreInliner.h
index 8a3f16a4f13cb..022c3f8d0daed 100644
--- a/llvm/tools/llvm-profgen/CSPreInliner.h
+++ b/llvm/tools/llvm-profgen/CSPreInliner.h
@@ -16,9 +16,6 @@
#include "llvm/Transforms/IPO/ProfiledCallGraph.h"
#include "llvm/Transforms/IPO/SampleContextTracker.h"
-using namespace llvm;
-using namespace sampleprof;
-
namespace llvm {
namespace sampleprof {
diff --git a/llvm/tools/llvm-profgen/ErrorHandling.h b/llvm/tools/llvm-profgen/ErrorHandling.h
index b797add8a892f..17084bd785e64 100644
--- a/llvm/tools/llvm-profgen/ErrorHandling.h
+++ b/llvm/tools/llvm-profgen/ErrorHandling.h
@@ -16,7 +16,7 @@
#include "llvm/Support/WithColor.h"
#include <system_error>
-using namespace llvm;
+namespace llvm {
[[noreturn]] inline void exitWithError(const Twine &Message,
StringRef Whence = StringRef(),
@@ -53,4 +53,6 @@ inline void emitWarningSummary(uint64_t Num, uint64_t Total, StringRef Msg) {
<< "%(" << Num << "/" << Total << ") " << Msg << "\n";
}
+} // end namespace llvm
+
#endif
diff --git a/llvm/tools/llvm-profgen/PerfReader.cpp b/llvm/tools/llvm-profgen/PerfReader.cpp
index ad113eda27914..4ab5f2e63fd12 100644
--- a/llvm/tools/llvm-profgen/PerfReader.cpp
+++ b/llvm/tools/llvm-profgen/PerfReader.cpp
@@ -15,6 +15,8 @@
#define DEBUG_TYPE "perf-reader"
+using namespace llvm;
+
cl::opt<bool> SkipSymbolization("skip-symbolization",
cl::desc("Dump the unsymbolized profile to the "
"output file. It will show unwinder "
@@ -47,9 +49,6 @@ static cl::opt<int> CSProfMaxUnsymbolizedCtxDepth(
cl::desc("Keep the last K contexts while merging unsymbolized profile. -1 "
"means no depth limit."));
-extern cl::opt<std::string> PerfTraceFilename;
-extern cl::opt<bool> ShowDisassemblyOnly;
-extern cl::opt<bool> ShowSourceLocations;
extern cl::opt<std::string> OutputFilename;
namespace llvm {
diff --git a/llvm/tools/llvm-profgen/PerfReader.h b/llvm/tools/llvm-profgen/PerfReader.h
index 4b3ac8f569755..19451915812e1 100644
--- a/llvm/tools/llvm-profgen/PerfReader.h
+++ b/llvm/tools/llvm-profgen/PerfReader.h
@@ -17,9 +17,6 @@
#include <fstream>
#include <map>
-using namespace llvm;
-using namespace sampleprof;
-
namespace llvm {
class CleanupInstaller;
diff --git a/llvm/tools/llvm-profgen/ProfileGenerator.cpp b/llvm/tools/llvm-profgen/ProfileGenerator.cpp
index db686c3b597eb..9468228acc427 100644
--- a/llvm/tools/llvm-profgen/ProfileGenerator.cpp
+++ b/llvm/tools/llvm-profgen/ProfileGenerator.cpp
@@ -17,6 +17,9 @@
#include <unordered_set>
#include <utility>
+using namespace llvm;
+using namespace sampleprof;
+
cl::opt<std::string> OutputFilename("output", cl::value_desc("output"),
cl::Required,
cl::desc("Output profile file"));
@@ -104,9 +107,6 @@ cl::opt<bool> InferMissingFrames(
"Infer missing call frames due to compiler tail call elimination."),
llvm::cl::Optional);
-using namespace llvm;
-using namespace sampleprof;
-
namespace llvm {
namespace sampleprof {
diff --git a/llvm/tools/llvm-profgen/ProfileGenerator.h b/llvm/tools/llvm-profgen/ProfileGenerator.h
index 5e36128530cd9..d3e04563a81c2 100644
--- a/llvm/tools/llvm-profgen/ProfileGenerator.h
+++ b/llvm/tools/llvm-profgen/ProfileGenerator.h
@@ -17,9 +17,6 @@
#include <memory>
#include <unordered_set>
-using namespace llvm;
-using namespace sampleprof;
-
namespace llvm {
namespace sampleprof {
diff --git a/llvm/tools/llvm-profgen/ProfiledBinary.cpp b/llvm/tools/llvm-profgen/ProfiledBinary.cpp
index 6847ba1b21b1f..beef4338d5f89 100644
--- a/llvm/tools/llvm-profgen/ProfiledBinary.cpp
+++ b/llvm/tools/llvm-profgen/ProfiledBinary.cpp
@@ -25,6 +25,7 @@
#define DEBUG_TYPE "load-binary"
using namespace llvm;
+using namespace llvm::object;
using namespace sampleprof;
cl::opt<bool> ShowDisassemblyOnly("show-disassembly-only",
diff --git a/llvm/tools/llvm-profgen/ProfiledBinary.h b/llvm/tools/llvm-profgen/ProfiledBinary.h
index 0588cb48b2af6..5b35c040b2c4b 100644
--- a/llvm/tools/llvm-profgen/ProfiledBinary.h
+++ b/llvm/tools/llvm-profgen/ProfiledBinary.h
@@ -42,15 +42,10 @@
#include <vector>
namespace llvm {
+
extern cl::opt<bool> EnableCSPreInliner;
extern cl::opt<bool> UseContextCostForPreInliner;
-} // namespace llvm
-
-using namespace llvm;
-using namespace sampleprof;
-using namespace llvm::object;
-namespace llvm {
namespace sampleprof {
class ProfiledBinary;
@@ -303,34 +298,34 @@ class ProfiledBinary {
bool IsCOFF = false;
- void setPreferredTextSegmentAddresses(const ObjectFile *O);
+ void setPreferredTextSegmentAddresses(const object::ObjectFile *O);
template <class ELFT>
- void setPreferredTextSegmentAddresses(const ELFFile<ELFT> &Obj,
+ void setPreferredTextSegmentAddresses(const object::ELFFile<ELFT> &Obj,
StringRef FileName);
- void setPreferredTextSegmentAddresses(const COFFObjectFile *Obj,
+ void setPreferredTextSegmentAddresses(const object::COFFObjectFile *Obj,
StringRef FileName);
- void checkPseudoProbe(const ELFObjectFileBase *Obj);
+ void checkPseudoProbe(const object::ELFObjectFileBase *Obj);
- void decodePseudoProbe(const ELFObjectFileBase *Obj);
+ void decodePseudoProbe(const object::ELFObjectFileBase *Obj);
- void
- checkUseFSDiscriminator(const ObjectFile *Obj,
- std::map<SectionRef, SectionSymbolsTy> &AllSymbols);
+ void checkUseFSDiscriminator(
+ const object::ObjectFile *Obj,
+ std::map<object::SectionRef, SectionSymbolsTy> &AllSymbols);
// Set up disassembler and related components.
- void setUpDisassembler(const ObjectFile *Obj);
+ void setUpDisassembler(const object::ObjectFile *Obj);
symbolize::LLVMSymbolizer::Options getSymbolizerOpts() const;
// Load debug info of subprograms from DWARF section.
- void loadSymbolsFromDWARF(ObjectFile &Obj);
+ void loadSymbolsFromDWARF(object::ObjectFile &Obj);
// Load debug info from DWARF unit.
void loadSymbolsFromDWARFUnit(DWARFUnit &CompilationUnit);
// Create elf symbol to its start address mapping.
- void populateElfSymbolAddressList(const ELFObjectFileBase *O);
+ void populateElfSymbolAddressList(const object::ELFObjectFileBase *O);
// A function may be spilt into multiple non-continuous address ranges. We use
// this to set whether start a function range is the real entry of the
@@ -341,11 +336,12 @@ class ProfiledBinary {
void warnNoFuncEntry();
/// Dissassemble the text section and build various address maps.
- void disassemble(const ObjectFile *O);
+ void disassemble(const object::ObjectFile *O);
/// Helper function to dissassemble the symbol and extract info for unwinding
bool dissassembleSymbol(std::size_t SI, ArrayRef<uint8_t> Bytes,
- SectionSymbolsTy &Symbols, const SectionRef &Section);
+ SectionSymbolsTy &Symbols,
+ const object::SectionRef &Section);
/// Symbolize a given instruction pointer and return a full call context.
SampleContextFrameVector symbolize(const InstructionPointer &IP,
bool UseCanonicalFnName = false,
diff --git a/llvm/tools/llvm-profgen/llvm-profgen.cpp b/llvm/tools/llvm-profgen/llvm-profgen.cpp
index 3b974e25103ad..5464888e77ad5 100644
--- a/llvm/tools/llvm-profgen/llvm-profgen.cpp
+++ b/llvm/tools/llvm-profgen/llvm-profgen.cpp
@@ -21,6 +21,9 @@
#include "llvm/Support/TargetSelect.h"
#include "llvm/Support/VirtualFileSystem.h"
+using namespace llvm;
+using namespace sampleprof;
+
static cl::OptionCategory ProfGenCategory("ProfGen Options");
static cl::opt<std::string> PerfScriptFilename(
@@ -71,9 +74,6 @@ extern cl::opt<bool> ShowDisassemblyOnly;
extern cl::opt<bool> ShowSourceLocations;
extern cl::opt<bool> SkipSymbolization;
-using namespace llvm;
-using namespace sampleprof;
-
// Validate the command line input.
static void validateCommandLine() {
// Allow the missing perfscript if we only use to show binary disassembly.
>From 549d7c4f35a99598a269004ee13b237d2565b5ec Mon Sep 17 00:00:00 2001
From: Trevor Gross <tmgross at umich.edu>
Date: Mon, 18 Aug 2025 13:56:24 -0500
Subject: [PATCH 070/112] [SPARC] Change `half` to use soft promotion rather
than `PromoteFloat` (#152727)
`half` currently uses the default legalization, which promotes to `f32`;
however, that approach performs math in a way that results in
incorrect rounding. Switch to the soft-promote implementation, which
does not have this problem.
The SPARC ABI does not specify a `_Float16` type, so there is no concern
with keeping interface compatibility.
Fixes the SPARC part of
https://github.com/llvm/llvm-project/issues/97975
Fixes the SPARC part of
https://github.com/llvm/llvm-project/issues/97981
---
llvm/lib/Target/Sparc/SparcISelLowering.h | 2 +
llvm/test/CodeGen/Generic/half.ll | 4 +-
llvm/test/CodeGen/SPARC/fp16-promote.ll | 64 +++--
llvm/test/CodeGen/SPARC/half.ll | 235 ++++-----------
llvm/test/CodeGen/SPARC/llvm.sincos.ll | 335 ++++++++++++----------
5 files changed, 289 insertions(+), 351 deletions(-)
diff --git a/llvm/lib/Target/Sparc/SparcISelLowering.h b/llvm/lib/Target/Sparc/SparcISelLowering.h
index 4017beb88ff31..7fffb7c9823f4 100644
--- a/llvm/lib/Target/Sparc/SparcISelLowering.h
+++ b/llvm/lib/Target/Sparc/SparcISelLowering.h
@@ -28,6 +28,8 @@ namespace llvm {
bool useSoftFloat() const override;
+ bool softPromoteHalfType() const override { return true; }
+
/// computeKnownBitsForTargetNode - Determine which of the bits specified
/// in Mask are known to be either zero or one and return them in the
/// KnownZero/KnownOne bitsets.
diff --git a/llvm/test/CodeGen/Generic/half.ll b/llvm/test/CodeGen/Generic/half.ll
index 9d6c8eb2730d2..ef7bfe2f2d9ce 100644
--- a/llvm/test/CodeGen/Generic/half.ll
+++ b/llvm/test/CodeGen/Generic/half.ll
@@ -34,8 +34,8 @@
; RUN: %if powerpc-registered-target %{ llc %s -o - -mtriple=powerpc64le-unknown-linux-gnu | FileCheck %s --check-prefixes=ALL,BAD %}
; RUN: %if riscv-registered-target %{ llc %s -o - -mtriple=riscv32-unknown-linux-gnu | FileCheck %s --check-prefixes=ALL,CHECK %}
; RUN: %if riscv-registered-target %{ llc %s -o - -mtriple=riscv64-unknown-linux-gnu | FileCheck %s --check-prefixes=ALL,CHECK %}
-; RUN: %if sparc-registered-target %{ llc %s -o - -mtriple=sparc-unknown-linux-gnu | FileCheck %s --check-prefixes=ALL,BAD %}
-; RUN: %if sparc-registered-target %{ llc %s -o - -mtriple=sparc64-unknown-linux-gnu | FileCheck %s --check-prefixes=ALL,BAD %}
+; RUN: %if sparc-registered-target %{ llc %s -o - -mtriple=sparc-unknown-linux-gnu | FileCheck %s --check-prefixes=ALL,CHECK %}
+; RUN: %if sparc-registered-target %{ llc %s -o - -mtriple=sparc64-unknown-linux-gnu | FileCheck %s --check-prefixes=ALL,CHECK %}
; RUN: %if spirv-registered-target %{ llc %s -o - -mtriple=spirv-unknown-unknown | FileCheck %s --check-prefixes=NOCRASH %}
; RUN: %if systemz-registered-target %{ llc %s -o - -mtriple=s390x-unknown-linux-gnu | FileCheck %s --check-prefixes=ALL,CHECK %}
; RUN: %if ve-registered-target %{ llc %s -o - -mtriple=ve-unknown-unknown | FileCheck %s --check-prefixes=ALL,BAD %}
diff --git a/llvm/test/CodeGen/SPARC/fp16-promote.ll b/llvm/test/CodeGen/SPARC/fp16-promote.ll
index efe67b04e8fb3..64873b744de50 100644
--- a/llvm/test/CodeGen/SPARC/fp16-promote.ll
+++ b/llvm/test/CodeGen/SPARC/fp16-promote.ll
@@ -329,13 +329,14 @@ define void @test_fadd(ptr %p, ptr %q) nounwind {
; V8-OPT-LABEL: test_fadd:
; V8-OPT: ! %bb.0:
; V8-OPT-NEXT: save %sp, -104, %sp
+; V8-OPT-NEXT: lduh [%i0], %i2
; V8-OPT-NEXT: call __extendhfsf2
-; V8-OPT-NEXT: lduh [%i0], %o0
+; V8-OPT-NEXT: lduh [%i1], %o0
; V8-OPT-NEXT: st %f0, [%fp+-8] ! 4-byte Folded Spill
; V8-OPT-NEXT: call __extendhfsf2
-; V8-OPT-NEXT: lduh [%i1], %o0
+; V8-OPT-NEXT: mov %i2, %o0
; V8-OPT-NEXT: ld [%fp+-8], %f1 ! 4-byte Folded Reload
-; V8-OPT-NEXT: fadds %f1, %f0, %f0
+; V8-OPT-NEXT: fadds %f0, %f1, %f0
; V8-OPT-NEXT: st %f0, [%fp+-4]
; V8-OPT-NEXT: call __truncsfhf2
; V8-OPT-NEXT: ld [%fp+-4], %o0
@@ -346,13 +347,14 @@ define void @test_fadd(ptr %p, ptr %q) nounwind {
; V8-UNOPT-LABEL: test_fadd:
; V8-UNOPT: ! %bb.0:
; V8-UNOPT-NEXT: save %sp, -104, %sp
-; V8-UNOPT-NEXT: call __extendhfsf2
-; V8-UNOPT-NEXT: lduh [%i0], %o0
-; V8-UNOPT-NEXT: st %f0, [%fp+-8] ! 4-byte Folded Spill
+; V8-UNOPT-NEXT: lduh [%i0], %i2
+; V8-UNOPT-NEXT: st %i2, [%fp+-12] ! 4-byte Folded Spill
; V8-UNOPT-NEXT: call __extendhfsf2
; V8-UNOPT-NEXT: lduh [%i1], %o0
-; V8-UNOPT-NEXT: fmovs %f0, %f1
-; V8-UNOPT-NEXT: ld [%fp+-8], %f0 ! 4-byte Folded Reload
+; V8-UNOPT-NEXT: ld [%fp+-12], %o0 ! 4-byte Folded Reload
+; V8-UNOPT-NEXT: call __extendhfsf2
+; V8-UNOPT-NEXT: st %f0, [%fp+-8]
+; V8-UNOPT-NEXT: ld [%fp+-8], %f1 ! 4-byte Folded Reload
; V8-UNOPT-NEXT: fadds %f0, %f1, %f0
; V8-UNOPT-NEXT: st %f0, [%fp+-4]
; V8-UNOPT-NEXT: call __truncsfhf2
@@ -364,13 +366,14 @@ define void @test_fadd(ptr %p, ptr %q) nounwind {
; V9-LABEL: test_fadd:
; V9: ! %bb.0:
; V9-NEXT: save %sp, -104, %sp
+; V9-NEXT: lduh [%i0], %i2
; V9-NEXT: call __extendhfsf2
-; V9-NEXT: lduh [%i0], %o0
+; V9-NEXT: lduh [%i1], %o0
; V9-NEXT: st %f0, [%fp+-8] ! 4-byte Folded Spill
; V9-NEXT: call __extendhfsf2
-; V9-NEXT: lduh [%i1], %o0
+; V9-NEXT: mov %i2, %o0
; V9-NEXT: ld [%fp+-8], %f1 ! 4-byte Folded Reload
-; V9-NEXT: fadds %f1, %f0, %f0
+; V9-NEXT: fadds %f0, %f1, %f0
; V9-NEXT: st %f0, [%fp+-4]
; V9-NEXT: call __truncsfhf2
; V9-NEXT: ld [%fp+-4], %o0
@@ -381,14 +384,15 @@ define void @test_fadd(ptr %p, ptr %q) nounwind {
; SPARC64-LABEL: test_fadd:
; SPARC64: ! %bb.0:
; SPARC64-NEXT: save %sp, -192, %sp
+; SPARC64-NEXT: lduh [%i0], %i2
; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: lduh [%i0], %o0
+; SPARC64-NEXT: lduh [%i1], %o0
; SPARC64-NEXT: st %f0, [%fp+2043] ! 4-byte Folded Spill
; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: lduh [%i1], %o0
+; SPARC64-NEXT: mov %i2, %o0
; SPARC64-NEXT: ld [%fp+2043], %f1 ! 4-byte Folded Reload
; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: fadds %f1, %f0, %f1
+; SPARC64-NEXT: fadds %f0, %f1, %f1
; SPARC64-NEXT: sth %o0, [%i0]
; SPARC64-NEXT: ret
; SPARC64-NEXT: restore
@@ -403,13 +407,14 @@ define void @test_fmul(ptr %p, ptr %q) nounwind {
; V8-OPT-LABEL: test_fmul:
; V8-OPT: ! %bb.0:
; V8-OPT-NEXT: save %sp, -104, %sp
+; V8-OPT-NEXT: lduh [%i0], %i2
; V8-OPT-NEXT: call __extendhfsf2
-; V8-OPT-NEXT: lduh [%i0], %o0
+; V8-OPT-NEXT: lduh [%i1], %o0
; V8-OPT-NEXT: st %f0, [%fp+-8] ! 4-byte Folded Spill
; V8-OPT-NEXT: call __extendhfsf2
-; V8-OPT-NEXT: lduh [%i1], %o0
+; V8-OPT-NEXT: mov %i2, %o0
; V8-OPT-NEXT: ld [%fp+-8], %f1 ! 4-byte Folded Reload
-; V8-OPT-NEXT: fmuls %f1, %f0, %f0
+; V8-OPT-NEXT: fmuls %f0, %f1, %f0
; V8-OPT-NEXT: st %f0, [%fp+-4]
; V8-OPT-NEXT: call __truncsfhf2
; V8-OPT-NEXT: ld [%fp+-4], %o0
@@ -420,13 +425,14 @@ define void @test_fmul(ptr %p, ptr %q) nounwind {
; V8-UNOPT-LABEL: test_fmul:
; V8-UNOPT: ! %bb.0:
; V8-UNOPT-NEXT: save %sp, -104, %sp
-; V8-UNOPT-NEXT: call __extendhfsf2
-; V8-UNOPT-NEXT: lduh [%i0], %o0
-; V8-UNOPT-NEXT: st %f0, [%fp+-8] ! 4-byte Folded Spill
+; V8-UNOPT-NEXT: lduh [%i0], %i2
+; V8-UNOPT-NEXT: st %i2, [%fp+-12] ! 4-byte Folded Spill
; V8-UNOPT-NEXT: call __extendhfsf2
; V8-UNOPT-NEXT: lduh [%i1], %o0
-; V8-UNOPT-NEXT: fmovs %f0, %f1
-; V8-UNOPT-NEXT: ld [%fp+-8], %f0 ! 4-byte Folded Reload
+; V8-UNOPT-NEXT: ld [%fp+-12], %o0 ! 4-byte Folded Reload
+; V8-UNOPT-NEXT: call __extendhfsf2
+; V8-UNOPT-NEXT: st %f0, [%fp+-8]
+; V8-UNOPT-NEXT: ld [%fp+-8], %f1 ! 4-byte Folded Reload
; V8-UNOPT-NEXT: fmuls %f0, %f1, %f0
; V8-UNOPT-NEXT: st %f0, [%fp+-4]
; V8-UNOPT-NEXT: call __truncsfhf2
@@ -438,13 +444,14 @@ define void @test_fmul(ptr %p, ptr %q) nounwind {
; V9-LABEL: test_fmul:
; V9: ! %bb.0:
; V9-NEXT: save %sp, -104, %sp
+; V9-NEXT: lduh [%i0], %i2
; V9-NEXT: call __extendhfsf2
-; V9-NEXT: lduh [%i0], %o0
+; V9-NEXT: lduh [%i1], %o0
; V9-NEXT: st %f0, [%fp+-8] ! 4-byte Folded Spill
; V9-NEXT: call __extendhfsf2
-; V9-NEXT: lduh [%i1], %o0
+; V9-NEXT: mov %i2, %o0
; V9-NEXT: ld [%fp+-8], %f1 ! 4-byte Folded Reload
-; V9-NEXT: fmuls %f1, %f0, %f0
+; V9-NEXT: fmuls %f0, %f1, %f0
; V9-NEXT: st %f0, [%fp+-4]
; V9-NEXT: call __truncsfhf2
; V9-NEXT: ld [%fp+-4], %o0
@@ -455,14 +462,15 @@ define void @test_fmul(ptr %p, ptr %q) nounwind {
; SPARC64-LABEL: test_fmul:
; SPARC64: ! %bb.0:
; SPARC64-NEXT: save %sp, -192, %sp
+; SPARC64-NEXT: lduh [%i0], %i2
; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: lduh [%i0], %o0
+; SPARC64-NEXT: lduh [%i1], %o0
; SPARC64-NEXT: st %f0, [%fp+2043] ! 4-byte Folded Spill
; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: lduh [%i1], %o0
+; SPARC64-NEXT: mov %i2, %o0
; SPARC64-NEXT: ld [%fp+2043], %f1 ! 4-byte Folded Reload
; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: fmuls %f1, %f0, %f1
+; SPARC64-NEXT: fmuls %f0, %f1, %f1
; SPARC64-NEXT: sth %o0, [%i0]
; SPARC64-NEXT: ret
; SPARC64-NEXT: restore
diff --git a/llvm/test/CodeGen/SPARC/half.ll b/llvm/test/CodeGen/SPARC/half.ll
index 34e2ceee28fc7..565160149e715 100644
--- a/llvm/test/CodeGen/SPARC/half.ll
+++ b/llvm/test/CodeGen/SPARC/half.ll
@@ -9,43 +9,19 @@
; copied from test/CodeGen/X86/half.ll.
define void @store(half %x, ptr %p) nounwind {
-; SPARC32-LABEL: store:
-; SPARC32: ! %bb.0:
-; SPARC32-NEXT: save %sp, -96, %sp
-; SPARC32-NEXT: call __truncsfhf2
-; SPARC32-NEXT: mov %i0, %o0
-; SPARC32-NEXT: sth %o0, [%i1]
-; SPARC32-NEXT: ret
-; SPARC32-NEXT: restore
-;
-; SPARC64-LABEL: store:
-; SPARC64: ! %bb.0:
-; SPARC64-NEXT: save %sp, -176, %sp
-; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: nop
-; SPARC64-NEXT: sth %o0, [%i1]
-; SPARC64-NEXT: ret
-; SPARC64-NEXT: restore
+; CHECK-LABEL: store:
+; CHECK: ! %bb.0:
+; CHECK-NEXT: retl
+; CHECK-NEXT: sth %o0, [%o1]
store half %x, ptr %p
ret void
}
define half @return(ptr %p) nounwind {
-; SPARC32-LABEL: return:
-; SPARC32: ! %bb.0:
-; SPARC32-NEXT: save %sp, -96, %sp
-; SPARC32-NEXT: call __extendhfsf2
-; SPARC32-NEXT: lduh [%i0], %o0
-; SPARC32-NEXT: ret
-; SPARC32-NEXT: restore
-;
-; SPARC64-LABEL: return:
-; SPARC64: ! %bb.0:
-; SPARC64-NEXT: save %sp, -176, %sp
-; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: lduh [%i0], %o0
-; SPARC64-NEXT: ret
-; SPARC64-NEXT: restore
+; CHECK-LABEL: return:
+; CHECK: ! %bb.0:
+; CHECK-NEXT: retl
+; CHECK-NEXT: lduh [%o0], %o0
%r = load half, ptr %p
ret half %r
}
@@ -185,46 +161,19 @@ define void @test_bitcast_to_half(ptr %addr, i16 %in) nounwind {
}
define half @from_bits(i16 %x) nounwind {
-; SPARC32-LABEL: from_bits:
-; SPARC32: ! %bb.0:
-; SPARC32-NEXT: save %sp, -96, %sp
-; SPARC32-NEXT: call __extendhfsf2
-; SPARC32-NEXT: mov %i0, %o0
-; SPARC32-NEXT: ret
-; SPARC32-NEXT: restore
-;
-; SPARC64-LABEL: from_bits:
-; SPARC64: ! %bb.0:
-; SPARC64-NEXT: save %sp, -176, %sp
-; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: srl %i0, 0, %o0
-; SPARC64-NEXT: ret
-; SPARC64-NEXT: restore
+; CHECK-LABEL: from_bits:
+; CHECK: ! %bb.0:
+; CHECK-NEXT: retl
+; CHECK-NEXT: nop
%res = bitcast i16 %x to half
ret half %res
}
define i16 @to_bits(half %x) nounwind {
-; SPARC32-LABEL: to_bits:
-; SPARC32: ! %bb.0:
-; SPARC32-NEXT: save %sp, -96, %sp
-; SPARC32-NEXT: call __truncsfhf2
-; SPARC32-NEXT: mov %i0, %o0
-; SPARC32-NEXT: sethi 4194240, %i0
-; SPARC32-NEXT: andn %o0, %i0, %i0
-; SPARC32-NEXT: ret
-; SPARC32-NEXT: restore
-;
-; SPARC64-LABEL: to_bits:
-; SPARC64: ! %bb.0:
-; SPARC64-NEXT: save %sp, -176, %sp
-; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: nop
-; SPARC64-NEXT: sethi 63, %i0
-; SPARC64-NEXT: or %i0, 1023, %i0
-; SPARC64-NEXT: and %o0, %i0, %i0
-; SPARC64-NEXT: ret
-; SPARC64-NEXT: restore
+; CHECK-LABEL: to_bits:
+; CHECK: ! %bb.0:
+; CHECK-NEXT: retl
+; CHECK-NEXT: nop
%res = bitcast half %x to i16
ret i16 %res
}
@@ -694,37 +643,47 @@ define void @test_trunc64_vec4(<4 x double> %a, ptr %p) nounwind {
define float @test_sitofp_fadd_i32(i32 %a, ptr %b) nounwind {
; SPARC32-LABEL: test_sitofp_fadd_i32:
; SPARC32: ! %bb.0:
-; SPARC32-NEXT: save %sp, -104, %sp
-; SPARC32-NEXT: call __extendhfsf2
-; SPARC32-NEXT: lduh [%i1], %o0
+; SPARC32-NEXT: save %sp, -112, %sp
+; SPARC32-NEXT: lduh [%i1], %i1
; SPARC32-NEXT: st %i0, [%fp+-4]
-; SPARC32-NEXT: ld [%fp+-4], %f1
-; SPARC32-NEXT: st %f0, [%fp+-12] ! 4-byte Folded Spill
-; SPARC32-NEXT: fitos %f1, %f0
+; SPARC32-NEXT: ld [%fp+-4], %f0
+; SPARC32-NEXT: fitos %f0, %f0
; SPARC32-NEXT: st %f0, [%fp+-8]
; SPARC32-NEXT: call __truncsfhf2
; SPARC32-NEXT: ld [%fp+-8], %o0
; SPARC32-NEXT: call __extendhfsf2
; SPARC32-NEXT: nop
-; SPARC32-NEXT: ld [%fp+-12], %f1 ! 4-byte Folded Reload
-; SPARC32-NEXT: fadds %f1, %f0, %f0
+; SPARC32-NEXT: st %f0, [%fp+-16] ! 4-byte Folded Spill
+; SPARC32-NEXT: call __extendhfsf2
+; SPARC32-NEXT: mov %i1, %o0
+; SPARC32-NEXT: ld [%fp+-16], %f1 ! 4-byte Folded Reload
+; SPARC32-NEXT: fadds %f0, %f1, %f0
+; SPARC32-NEXT: st %f0, [%fp+-12]
+; SPARC32-NEXT: call __truncsfhf2
+; SPARC32-NEXT: ld [%fp+-12], %o0
+; SPARC32-NEXT: call __extendhfsf2
+; SPARC32-NEXT: nop
; SPARC32-NEXT: ret
; SPARC32-NEXT: restore
;
; SPARC64-LABEL: test_sitofp_fadd_i32:
; SPARC64: ! %bb.0:
; SPARC64-NEXT: save %sp, -192, %sp
-; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: lduh [%i1], %o0
-; SPARC64-NEXT: st %f0, [%fp+2039] ! 4-byte Folded Spill
+; SPARC64-NEXT: lduh [%i1], %i1
; SPARC64-NEXT: st %i0, [%fp+2043]
; SPARC64-NEXT: ld [%fp+2043], %f0
; SPARC64-NEXT: call __truncsfhf2
; SPARC64-NEXT: fitos %f0, %f1
; SPARC64-NEXT: call __extendhfsf2
; SPARC64-NEXT: nop
+; SPARC64-NEXT: st %f0, [%fp+2039] ! 4-byte Folded Spill
+; SPARC64-NEXT: call __extendhfsf2
+; SPARC64-NEXT: mov %i1, %o0
; SPARC64-NEXT: ld [%fp+2039], %f1 ! 4-byte Folded Reload
-; SPARC64-NEXT: fadds %f1, %f0, %f0
+; SPARC64-NEXT: call __truncsfhf2
+; SPARC64-NEXT: fadds %f0, %f1, %f1
+; SPARC64-NEXT: call __extendhfsf2
+; SPARC64-NEXT: nop
; SPARC64-NEXT: ret
; SPARC64-NEXT: restore
%tmp0 = load half, ptr %b
@@ -738,10 +697,8 @@ define half @PR40273(half) nounwind {
; V8-LABEL: PR40273:
; V8: ! %bb.0:
; V8-NEXT: save %sp, -96, %sp
-; V8-NEXT: call __truncsfhf2
-; V8-NEXT: mov %i0, %o0
; V8-NEXT: call __extendhfsf2
-; V8-NEXT: nop
+; V8-NEXT: mov %i0, %o0
; V8-NEXT: sethi %hi(.LCPI24_0), %i0
; V8-NEXT: ld [%i0+%lo(.LCPI24_0)], %f1
; V8-NEXT: fcmps %f0, %f1
@@ -749,54 +706,40 @@ define half @PR40273(half) nounwind {
; V8-NEXT: fbne .LBB24_2
; V8-NEXT: nop
; V8-NEXT: ! %bb.1:
-; V8-NEXT: ba .LBB24_3
-; V8-NEXT: mov %g0, %i0
+; V8-NEXT: ret
+; V8-NEXT: restore %g0, %g0, %o0
; V8-NEXT: .LBB24_2:
-; V8-NEXT: mov 4, %i0
-; V8-NEXT: .LBB24_3:
-; V8-NEXT: sethi %hi(.LCPI24_1), %i1
-; V8-NEXT: add %i1, %lo(.LCPI24_1), %i1
-; V8-NEXT: ld [%i1+%i0], %f0
+; V8-NEXT: sethi 15, %i0
; V8-NEXT: ret
; V8-NEXT: restore
;
; V9-LABEL: PR40273:
; V9: ! %bb.0:
; V9-NEXT: save %sp, -96, %sp
-; V9-NEXT: call __truncsfhf2
-; V9-NEXT: mov %i0, %o0
; V9-NEXT: call __extendhfsf2
-; V9-NEXT: nop
+; V9-NEXT: mov %i0, %o0
; V9-NEXT: sethi %hi(.LCPI24_0), %i0
; V9-NEXT: ld [%i0+%lo(.LCPI24_0)], %f1
; V9-NEXT: mov %g0, %i0
+; V9-NEXT: sethi 15, %i1
; V9-NEXT: fcmps %fcc0, %f0, %f1
-; V9-NEXT: movne %fcc0, 4, %i0
-; V9-NEXT: sethi %hi(.LCPI24_1), %i1
-; V9-NEXT: add %i1, %lo(.LCPI24_1), %i1
-; V9-NEXT: ld [%i1+%i0], %f0
+; V9-NEXT: movne %fcc0, %i1, %i0
; V9-NEXT: ret
; V9-NEXT: restore
;
; SPARC64-LABEL: PR40273:
; SPARC64: ! %bb.0:
; SPARC64-NEXT: save %sp, -176, %sp
-; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: nop
; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: nop
+; SPARC64-NEXT: srl %i0, 0, %o0
; SPARC64-NEXT: sethi %h44(.LCPI24_0), %i0
; SPARC64-NEXT: add %i0, %m44(.LCPI24_0), %i0
; SPARC64-NEXT: sllx %i0, 12, %i0
; SPARC64-NEXT: ld [%i0+%l44(.LCPI24_0)], %f1
; SPARC64-NEXT: mov %g0, %i0
+; SPARC64-NEXT: sethi 15, %i1
; SPARC64-NEXT: fcmps %fcc0, %f0, %f1
-; SPARC64-NEXT: movne %fcc0, 4, %i0
-; SPARC64-NEXT: sethi %h44(.LCPI24_1), %i1
-; SPARC64-NEXT: add %i1, %m44(.LCPI24_1), %i1
-; SPARC64-NEXT: sllx %i1, 12, %i1
-; SPARC64-NEXT: add %i1, %l44(.LCPI24_1), %i1
-; SPARC64-NEXT: ld [%i1+%i0], %f0
+; SPARC64-NEXT: movne %fcc0, %i1, %i0
; SPARC64-NEXT: ret
; SPARC64-NEXT: restore
%2 = fcmp une half %0, 0xH0000
@@ -807,82 +750,28 @@ define half @PR40273(half) nounwind {
define half @fabs(half %x) nounwind {
; SPARC32-LABEL: fabs:
; SPARC32: ! %bb.0:
-; SPARC32-NEXT: save %sp, -96, %sp
-; SPARC32-NEXT: call __truncsfhf2
-; SPARC32-NEXT: mov %i0, %o0
-; SPARC32-NEXT: call __extendhfsf2
-; SPARC32-NEXT: nop
-; SPARC32-NEXT: fabss %f0, %f0
-; SPARC32-NEXT: ret
-; SPARC32-NEXT: restore
+; SPARC32-NEXT: sethi 4194272, %o1
+; SPARC32-NEXT: retl
+; SPARC32-NEXT: andn %o0, %o1, %o0
;
; SPARC64-LABEL: fabs:
; SPARC64: ! %bb.0:
-; SPARC64-NEXT: save %sp, -176, %sp
-; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: nop
-; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: nop
-; SPARC64-NEXT: fabss %f0, %f0
-; SPARC64-NEXT: ret
-; SPARC64-NEXT: restore
+; SPARC64-NEXT: sethi 31, %o1
+; SPARC64-NEXT: or %o1, 1023, %o1
+; SPARC64-NEXT: retl
+; SPARC64-NEXT: and %o0, %o1, %o0
%a = call half @llvm.fabs.f16(half %x)
ret half %a
}
define half @fcopysign(half %x, half %y) nounwind {
-; V8-LABEL: fcopysign:
-; V8: ! %bb.0:
-; V8-NEXT: save %sp, -96, %sp
-; V8-NEXT: call __truncsfhf2
-; V8-NEXT: mov %i0, %o0
-; V8-NEXT: call __extendhfsf2
-; V8-NEXT: nop
-; V8-NEXT: sethi 2097152, %i0
-; V8-NEXT: and %i1, %i0, %i0
-; V8-NEXT: cmp %i0, 0
-; V8-NEXT: be .LBB26_2
-; V8-NEXT: fabss %f0, %f0
-; V8-NEXT: ! %bb.1:
-; V8-NEXT: fnegs %f0, %f0
-; V8-NEXT: .LBB26_2:
-; V8-NEXT: ret
-; V8-NEXT: restore
-;
-; V9-LABEL: fcopysign:
-; V9: ! %bb.0:
-; V9-NEXT: save %sp, -96, %sp
-; V9-NEXT: call __truncsfhf2
-; V9-NEXT: mov %i0, %o0
-; V9-NEXT: call __extendhfsf2
-; V9-NEXT: nop
-; V9-NEXT: sethi 2097152, %i0
-; V9-NEXT: and %i1, %i0, %i0
-; V9-NEXT: fabss %f0, %f0
-; V9-NEXT: fnegs %f0, %f1
-; V9-NEXT: cmp %i0, 0
-; V9-NEXT: fmovsne %icc, %f1, %f0
-; V9-NEXT: ret
-; V9-NEXT: restore
-;
-; SPARC64-LABEL: fcopysign:
-; SPARC64: ! %bb.0:
-; SPARC64-NEXT: save %sp, -192, %sp
-; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: st %f3, [%fp+2039]
-; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: nop
-; SPARC64-NEXT: ld [%fp+2039], %f1 ! 4-byte Folded Reload
-; SPARC64-NEXT: st %f1, [%fp+2043]
-; SPARC64-NEXT: ld [%fp+2043], %i0
-; SPARC64-NEXT: sethi 2097152, %i1
-; SPARC64-NEXT: and %i0, %i1, %i0
-; SPARC64-NEXT: fabss %f0, %f0
-; SPARC64-NEXT: fnegs %f0, %f1
-; SPARC64-NEXT: cmp %i0, 0
-; SPARC64-NEXT: fmovsne %icc, %f1, %f0
-; SPARC64-NEXT: ret
-; SPARC64-NEXT: restore
+; CHECK-LABEL: fcopysign:
+; CHECK: ! %bb.0:
+; CHECK-NEXT: sethi 4194272, %o2
+; CHECK-NEXT: and %o1, %o2, %o1
+; CHECK-NEXT: andn %o0, %o2, %o0
+; CHECK-NEXT: retl
+; CHECK-NEXT: or %o0, %o1, %o0
%a = call half @llvm.copysign.f16(half %x, half %y)
ret half %a
}
diff --git a/llvm/test/CodeGen/SPARC/llvm.sincos.ll b/llvm/test/CodeGen/SPARC/llvm.sincos.ll
index 87b9c8e7ba47b..8d0d50f67e3f5 100644
--- a/llvm/test/CodeGen/SPARC/llvm.sincos.ll
+++ b/llvm/test/CodeGen/SPARC/llvm.sincos.ll
@@ -10,74 +10,84 @@ define { half, half } @test_sincos_f16(half %a) #0 {
; SPARC32-LABEL: test_sincos_f16:
; SPARC32: ! %bb.0:
; SPARC32-NEXT: save %sp, -104, %sp
-; SPARC32-NEXT: call __truncsfhf2
-; SPARC32-NEXT: mov %i0, %o0
; SPARC32-NEXT: call __extendhfsf2
-; SPARC32-NEXT: nop
-; SPARC32-NEXT: st %f0, [%fp+-4]
-; SPARC32-NEXT: ld [%fp+-4], %i0
+; SPARC32-NEXT: mov %i0, %o0
+; SPARC32-NEXT: st %f0, [%fp+-12]
+; SPARC32-NEXT: ld [%fp+-12], %i0
; SPARC32-NEXT: call sinf
; SPARC32-NEXT: mov %i0, %o0
-; SPARC32-NEXT: st %f0, [%fp+-8] ! 4-byte Folded Spill
+; SPARC32-NEXT: st %f0, [%fp+-8]
; SPARC32-NEXT: call cosf
; SPARC32-NEXT: mov %i0, %o0
-; SPARC32-NEXT: fmovs %f0, %f1
-; SPARC32-NEXT: ld [%fp+-8], %f0 ! 4-byte Folded Reload
+; SPARC32-NEXT: st %f0, [%fp+-4]
+; SPARC32-NEXT: call __truncsfhf2
+; SPARC32-NEXT: ld [%fp+-8], %o0
+; SPARC32-NEXT: mov %o0, %i0
+; SPARC32-NEXT: call __truncsfhf2
+; SPARC32-NEXT: ld [%fp+-4], %o0
; SPARC32-NEXT: ret
-; SPARC32-NEXT: restore
+; SPARC32-NEXT: restore %g0, %o0, %o1
;
; SPARC64-LABEL: test_sincos_f16:
; SPARC64: ! %bb.0:
; SPARC64-NEXT: save %sp, -192, %sp
-; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: nop
; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: nop
-; SPARC64-NEXT: st %f0, [%fp+2039] ! 4-byte Folded Spill
+; SPARC64-NEXT: srl %i0, 0, %o0
+; SPARC64-NEXT: st %f0, [%fp+2043] ! 4-byte Folded Spill
; SPARC64-NEXT: fmovs %f0, %f1
; SPARC64-NEXT: call sinf
; SPARC64-NEXT: nop
-; SPARC64-NEXT: st %f0, [%fp+2043] ! 4-byte Folded Spill
+; SPARC64-NEXT: fmovs %f0, %f1
+; SPARC64-NEXT: call __truncsfhf2
+; SPARC64-NEXT: nop
+; SPARC64-NEXT: mov %o0, %i0
; SPARC64-NEXT: call cosf
-; SPARC64-NEXT: ld [%fp+2039], %f1
+; SPARC64-NEXT: ld [%fp+2043], %f1
; SPARC64-NEXT: fmovs %f0, %f1
-; SPARC64-NEXT: ld [%fp+2043], %f0 ! 4-byte Folded Reload
+; SPARC64-NEXT: call __truncsfhf2
+; SPARC64-NEXT: nop
; SPARC64-NEXT: ret
-; SPARC64-NEXT: restore
+; SPARC64-NEXT: restore %g0, %o0, %o1
;
; GNU32-LABEL: test_sincos_f16:
; GNU32: ! %bb.0:
-; GNU32-NEXT: save %sp, -104, %sp
-; GNU32-NEXT: call __truncsfhf2
-; GNU32-NEXT: mov %i0, %o0
+; GNU32-NEXT: save %sp, -112, %sp
; GNU32-NEXT: call __extendhfsf2
-; GNU32-NEXT: nop
+; GNU32-NEXT: mov %i0, %o0
; GNU32-NEXT: st %f0, [%fp+-12]
; GNU32-NEXT: ld [%fp+-12], %o0
; GNU32-NEXT: add %fp, -4, %o1
; GNU32-NEXT: call sincosf
; GNU32-NEXT: add %fp, -8, %o2
; GNU32-NEXT: ld [%fp+-4], %f0
-; GNU32-NEXT: ld [%fp+-8], %f1
+; GNU32-NEXT: st %f0, [%fp+-20]
+; GNU32-NEXT: ld [%fp+-8], %f0
+; GNU32-NEXT: st %f0, [%fp+-16]
+; GNU32-NEXT: call __truncsfhf2
+; GNU32-NEXT: ld [%fp+-20], %o0
+; GNU32-NEXT: mov %o0, %i0
+; GNU32-NEXT: call __truncsfhf2
+; GNU32-NEXT: ld [%fp+-16], %o0
; GNU32-NEXT: ret
-; GNU32-NEXT: restore
+; GNU32-NEXT: restore %g0, %o0, %o1
;
; GNU64-LABEL: test_sincos_f16:
; GNU64: ! %bb.0:
; GNU64-NEXT: save %sp, -192, %sp
-; GNU64-NEXT: call __truncsfhf2
-; GNU64-NEXT: nop
; GNU64-NEXT: call __extendhfsf2
-; GNU64-NEXT: nop
+; GNU64-NEXT: srl %i0, 0, %o0
; GNU64-NEXT: add %fp, 2043, %o1
; GNU64-NEXT: add %fp, 2039, %o2
; GNU64-NEXT: fmovs %f0, %f1
; GNU64-NEXT: call sincosf
; GNU64-NEXT: nop
-; GNU64-NEXT: ld [%fp+2043], %f0
+; GNU64-NEXT: call __truncsfhf2
+; GNU64-NEXT: ld [%fp+2043], %f1
+; GNU64-NEXT: mov %o0, %i0
+; GNU64-NEXT: call __truncsfhf2
; GNU64-NEXT: ld [%fp+2039], %f1
; GNU64-NEXT: ret
-; GNU64-NEXT: restore
+; GNU64-NEXT: restore %g0, %o0, %o1
%result = call { half, half } @llvm.sincos.f16(half %a)
ret { half, half } %result
}
@@ -85,61 +95,63 @@ define { half, half } @test_sincos_f16(half %a) #0 {
define half @test_sincos_f16_only_use_sin(half %a) #0 {
; SPARC32-LABEL: test_sincos_f16_only_use_sin:
; SPARC32: ! %bb.0:
-; SPARC32-NEXT: save %sp, -96, %sp
-; SPARC32-NEXT: call __truncsfhf2
-; SPARC32-NEXT: mov %i0, %o0
+; SPARC32-NEXT: save %sp, -104, %sp
; SPARC32-NEXT: call __extendhfsf2
-; SPARC32-NEXT: nop
-; SPARC32-NEXT: st %f0, [%fp+-4]
+; SPARC32-NEXT: mov %i0, %o0
+; SPARC32-NEXT: st %f0, [%fp+-8]
; SPARC32-NEXT: call sinf
+; SPARC32-NEXT: ld [%fp+-8], %o0
+; SPARC32-NEXT: st %f0, [%fp+-4]
+; SPARC32-NEXT: call __truncsfhf2
; SPARC32-NEXT: ld [%fp+-4], %o0
; SPARC32-NEXT: ret
-; SPARC32-NEXT: restore
+; SPARC32-NEXT: restore %g0, %o0, %o0
;
; SPARC64-LABEL: test_sincos_f16_only_use_sin:
; SPARC64: ! %bb.0:
; SPARC64-NEXT: save %sp, -176, %sp
-; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: nop
; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: nop
+; SPARC64-NEXT: srl %i0, 0, %o0
; SPARC64-NEXT: fmovs %f0, %f1
; SPARC64-NEXT: call sinf
; SPARC64-NEXT: nop
+; SPARC64-NEXT: fmovs %f0, %f1
+; SPARC64-NEXT: call __truncsfhf2
+; SPARC64-NEXT: nop
; SPARC64-NEXT: ret
-; SPARC64-NEXT: restore
+; SPARC64-NEXT: restore %g0, %o0, %o0
;
; GNU32-LABEL: test_sincos_f16_only_use_sin:
; GNU32: ! %bb.0:
-; GNU32-NEXT: save %sp, -104, %sp
-; GNU32-NEXT: call __truncsfhf2
-; GNU32-NEXT: mov %i0, %o0
+; GNU32-NEXT: save %sp, -112, %sp
; GNU32-NEXT: call __extendhfsf2
-; GNU32-NEXT: nop
+; GNU32-NEXT: mov %i0, %o0
; GNU32-NEXT: st %f0, [%fp+-12]
; GNU32-NEXT: ld [%fp+-12], %o0
; GNU32-NEXT: add %fp, -4, %o1
; GNU32-NEXT: call sincosf
; GNU32-NEXT: add %fp, -8, %o2
; GNU32-NEXT: ld [%fp+-4], %f0
+; GNU32-NEXT: st %f0, [%fp+-16]
+; GNU32-NEXT: call __truncsfhf2
+; GNU32-NEXT: ld [%fp+-16], %o0
; GNU32-NEXT: ret
-; GNU32-NEXT: restore
+; GNU32-NEXT: restore %g0, %o0, %o0
;
; GNU64-LABEL: test_sincos_f16_only_use_sin:
; GNU64: ! %bb.0:
; GNU64-NEXT: save %sp, -192, %sp
-; GNU64-NEXT: call __truncsfhf2
-; GNU64-NEXT: nop
; GNU64-NEXT: call __extendhfsf2
-; GNU64-NEXT: nop
+; GNU64-NEXT: srl %i0, 0, %o0
; GNU64-NEXT: add %fp, 2043, %o1
; GNU64-NEXT: add %fp, 2039, %o2
; GNU64-NEXT: fmovs %f0, %f1
; GNU64-NEXT: call sincosf
; GNU64-NEXT: nop
-; GNU64-NEXT: ld [%fp+2043], %f0
+; GNU64-NEXT: call __truncsfhf2
+; GNU64-NEXT: ld [%fp+2043], %f1
; GNU64-NEXT: ret
-; GNU64-NEXT: restore
+; GNU64-NEXT: restore %g0, %o0, %o0
%result = call { half, half } @llvm.sincos.f16(half %a)
%result.0 = extractvalue { half, half } %result, 0
ret half %result.0
@@ -148,61 +160,63 @@ define half @test_sincos_f16_only_use_sin(half %a) #0 {
define half @test_sincos_f16_only_use_cos(half %a) #0 {
; SPARC32-LABEL: test_sincos_f16_only_use_cos:
; SPARC32: ! %bb.0:
-; SPARC32-NEXT: save %sp, -96, %sp
-; SPARC32-NEXT: call __truncsfhf2
-; SPARC32-NEXT: mov %i0, %o0
+; SPARC32-NEXT: save %sp, -104, %sp
; SPARC32-NEXT: call __extendhfsf2
-; SPARC32-NEXT: nop
-; SPARC32-NEXT: st %f0, [%fp+-4]
+; SPARC32-NEXT: mov %i0, %o0
+; SPARC32-NEXT: st %f0, [%fp+-8]
; SPARC32-NEXT: call cosf
+; SPARC32-NEXT: ld [%fp+-8], %o0
+; SPARC32-NEXT: st %f0, [%fp+-4]
+; SPARC32-NEXT: call __truncsfhf2
; SPARC32-NEXT: ld [%fp+-4], %o0
; SPARC32-NEXT: ret
-; SPARC32-NEXT: restore
+; SPARC32-NEXT: restore %g0, %o0, %o0
;
; SPARC64-LABEL: test_sincos_f16_only_use_cos:
; SPARC64: ! %bb.0:
; SPARC64-NEXT: save %sp, -176, %sp
-; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: nop
; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: nop
+; SPARC64-NEXT: srl %i0, 0, %o0
; SPARC64-NEXT: fmovs %f0, %f1
; SPARC64-NEXT: call cosf
; SPARC64-NEXT: nop
+; SPARC64-NEXT: fmovs %f0, %f1
+; SPARC64-NEXT: call __truncsfhf2
+; SPARC64-NEXT: nop
; SPARC64-NEXT: ret
-; SPARC64-NEXT: restore
+; SPARC64-NEXT: restore %g0, %o0, %o0
;
; GNU32-LABEL: test_sincos_f16_only_use_cos:
; GNU32: ! %bb.0:
-; GNU32-NEXT: save %sp, -104, %sp
-; GNU32-NEXT: call __truncsfhf2
-; GNU32-NEXT: mov %i0, %o0
+; GNU32-NEXT: save %sp, -112, %sp
; GNU32-NEXT: call __extendhfsf2
-; GNU32-NEXT: nop
+; GNU32-NEXT: mov %i0, %o0
; GNU32-NEXT: st %f0, [%fp+-12]
; GNU32-NEXT: ld [%fp+-12], %o0
; GNU32-NEXT: add %fp, -4, %o1
; GNU32-NEXT: call sincosf
; GNU32-NEXT: add %fp, -8, %o2
; GNU32-NEXT: ld [%fp+-8], %f0
+; GNU32-NEXT: st %f0, [%fp+-16]
+; GNU32-NEXT: call __truncsfhf2
+; GNU32-NEXT: ld [%fp+-16], %o0
; GNU32-NEXT: ret
-; GNU32-NEXT: restore
+; GNU32-NEXT: restore %g0, %o0, %o0
;
; GNU64-LABEL: test_sincos_f16_only_use_cos:
; GNU64: ! %bb.0:
; GNU64-NEXT: save %sp, -192, %sp
-; GNU64-NEXT: call __truncsfhf2
-; GNU64-NEXT: nop
; GNU64-NEXT: call __extendhfsf2
-; GNU64-NEXT: nop
+; GNU64-NEXT: srl %i0, 0, %o0
; GNU64-NEXT: add %fp, 2043, %o1
; GNU64-NEXT: add %fp, 2039, %o2
; GNU64-NEXT: fmovs %f0, %f1
; GNU64-NEXT: call sincosf
; GNU64-NEXT: nop
-; GNU64-NEXT: ld [%fp+2039], %f0
+; GNU64-NEXT: call __truncsfhf2
+; GNU64-NEXT: ld [%fp+2039], %f1
; GNU64-NEXT: ret
-; GNU64-NEXT: restore
+; GNU64-NEXT: restore %g0, %o0, %o0
%result = call { half, half } @llvm.sincos.f16(half %a)
%result.1 = extractvalue { half, half } %result, 1
ret half %result.1
@@ -211,132 +225,157 @@ define half @test_sincos_f16_only_use_cos(half %a) #0 {
define { <2 x half>, <2 x half> } @test_sincos_v2f16(<2 x half> %a) #0 {
; SPARC32-LABEL: test_sincos_v2f16:
; SPARC32: ! %bb.0:
-; SPARC32-NEXT: save %sp, -112, %sp
-; SPARC32-NEXT: call __truncsfhf2
-; SPARC32-NEXT: mov %i1, %o0
+; SPARC32-NEXT: save %sp, -128, %sp
; SPARC32-NEXT: call __extendhfsf2
-; SPARC32-NEXT: nop
-; SPARC32-NEXT: st %f0, [%fp+-12] ! 4-byte Folded Spill
-; SPARC32-NEXT: call __truncsfhf2
-; SPARC32-NEXT: mov %i0, %o0
+; SPARC32-NEXT: mov %i1, %o0
+; SPARC32-NEXT: st %f0, [%fp+-28]
; SPARC32-NEXT: call __extendhfsf2
-; SPARC32-NEXT: nop
-; SPARC32-NEXT: st %f0, [%fp+-8]
-; SPARC32-NEXT: ld [%fp+-12], %f0 ! 4-byte Folded Reload
-; SPARC32-NEXT: st %f0, [%fp+-4]
-; SPARC32-NEXT: ld [%fp+-8], %i0
-; SPARC32-NEXT: call sinf
; SPARC32-NEXT: mov %i0, %o0
-; SPARC32-NEXT: st %f0, [%fp+-12] ! 4-byte Folded Spill
-; SPARC32-NEXT: ld [%fp+-4], %i1
-; SPARC32-NEXT: call sinf
-; SPARC32-NEXT: mov %i1, %o0
-; SPARC32-NEXT: st %f0, [%fp+-16] ! 4-byte Folded Spill
+; SPARC32-NEXT: st %f0, [%fp+-32]
+; SPARC32-NEXT: ld [%fp+-28], %i0
; SPARC32-NEXT: call cosf
; SPARC32-NEXT: mov %i0, %o0
-; SPARC32-NEXT: st %f0, [%fp+-20] ! 4-byte Folded Spill
+; SPARC32-NEXT: st %f0, [%fp+-20]
+; SPARC32-NEXT: ld [%fp+-32], %i1
; SPARC32-NEXT: call cosf
; SPARC32-NEXT: mov %i1, %o0
-; SPARC32-NEXT: fmovs %f0, %f3
-; SPARC32-NEXT: ld [%fp+-12], %f0 ! 4-byte Folded Reload
-; SPARC32-NEXT: ld [%fp+-16], %f1 ! 4-byte Folded Reload
-; SPARC32-NEXT: ld [%fp+-20], %f2 ! 4-byte Folded Reload
+; SPARC32-NEXT: st %f0, [%fp+-12]
+; SPARC32-NEXT: call sinf
+; SPARC32-NEXT: mov %i0, %o0
+; SPARC32-NEXT: st %f0, [%fp+-24]
+; SPARC32-NEXT: call sinf
+; SPARC32-NEXT: mov %i1, %o0
+; SPARC32-NEXT: st %f0, [%fp+-16]
+; SPARC32-NEXT: call __truncsfhf2
+; SPARC32-NEXT: ld [%fp+-20], %o0
+; SPARC32-NEXT: sethi 63, %i0
+; SPARC32-NEXT: or %i0, 1023, %i0
+; SPARC32-NEXT: and %o0, %i0, %i4
+; SPARC32-NEXT: call __truncsfhf2
+; SPARC32-NEXT: ld [%fp+-12], %o0
+; SPARC32-NEXT: and %o0, %i0, %i2
+; SPARC32-NEXT: call __truncsfhf2
+; SPARC32-NEXT: ld [%fp+-24], %o0
+; SPARC32-NEXT: and %o0, %i0, %i1
+; SPARC32-NEXT: call __truncsfhf2
+; SPARC32-NEXT: ld [%fp+-16], %o0
+; SPARC32-NEXT: and %o0, %i0, %g2
+; SPARC32-NEXT: mov %g2, %i0
+; SPARC32-NEXT: ! kill: def $i2 killed $i2 killed $i2_i3
; SPARC32-NEXT: ret
-; SPARC32-NEXT: restore
+; SPARC32-NEXT: restore %g0, %i4, %o3
;
; SPARC64-LABEL: test_sincos_v2f16:
; SPARC64: ! %bb.0:
; SPARC64-NEXT: save %sp, -192, %sp
-; SPARC64-NEXT: st %f1, [%fp+2039] ! 4-byte Folded Spill
-; SPARC64-NEXT: fmovs %f3, %f1
-; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: nop
; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: nop
+; SPARC64-NEXT: srl %i0, 0, %o0
; SPARC64-NEXT: st %f0, [%fp+2043] ! 4-byte Folded Spill
-; SPARC64-NEXT: call __truncsfhf2
-; SPARC64-NEXT: ld [%fp+2039], %f1
-; SPARC64-NEXT: call __extendhfsf2
-; SPARC64-NEXT: nop
-; SPARC64-NEXT: st %f0, [%fp+2031] ! 4-byte Folded Spill
; SPARC64-NEXT: fmovs %f0, %f1
; SPARC64-NEXT: call sinf
; SPARC64-NEXT: nop
+; SPARC64-NEXT: fmovs %f0, %f1
+; SPARC64-NEXT: call __truncsfhf2
+; SPARC64-NEXT: nop
+; SPARC64-NEXT: mov %o0, %i0
+; SPARC64-NEXT: call __extendhfsf2
+; SPARC64-NEXT: srl %i1, 0, %o0
; SPARC64-NEXT: st %f0, [%fp+2039] ! 4-byte Folded Spill
+; SPARC64-NEXT: fmovs %f0, %f1
; SPARC64-NEXT: call sinf
-; SPARC64-NEXT: ld [%fp+2043], %f1
-; SPARC64-NEXT: st %f0, [%fp+2035] ! 4-byte Folded Spill
-; SPARC64-NEXT: call cosf
-; SPARC64-NEXT: ld [%fp+2031], %f1
-; SPARC64-NEXT: st %f0, [%fp+2031] ! 4-byte Folded Spill
+; SPARC64-NEXT: nop
+; SPARC64-NEXT: fmovs %f0, %f1
+; SPARC64-NEXT: call __truncsfhf2
+; SPARC64-NEXT: nop
+; SPARC64-NEXT: mov %o0, %i1
; SPARC64-NEXT: call cosf
; SPARC64-NEXT: ld [%fp+2043], %f1
-; SPARC64-NEXT: fmovs %f0, %f3
-; SPARC64-NEXT: ld [%fp+2039], %f0 ! 4-byte Folded Reload
-; SPARC64-NEXT: ld [%fp+2035], %f1 ! 4-byte Folded Reload
-; SPARC64-NEXT: ld [%fp+2031], %f2 ! 4-byte Folded Reload
+; SPARC64-NEXT: fmovs %f0, %f1
+; SPARC64-NEXT: call __truncsfhf2
+; SPARC64-NEXT: nop
+; SPARC64-NEXT: mov %o0, %i2
+; SPARC64-NEXT: call cosf
+; SPARC64-NEXT: ld [%fp+2039], %f1
+; SPARC64-NEXT: fmovs %f0, %f1
+; SPARC64-NEXT: call __truncsfhf2
+; SPARC64-NEXT: nop
; SPARC64-NEXT: ret
-; SPARC64-NEXT: restore
+; SPARC64-NEXT: restore %g0, %o0, %o3
;
; GNU32-LABEL: test_sincos_v2f16:
; GNU32: ! %bb.0:
-; GNU32-NEXT: save %sp, -120, %sp
-; GNU32-NEXT: call __truncsfhf2
-; GNU32-NEXT: mov %i1, %o0
-; GNU32-NEXT: call __extendhfsf2
-; GNU32-NEXT: nop
-; GNU32-NEXT: st %f0, [%fp+-28] ! 4-byte Folded Spill
-; GNU32-NEXT: call __truncsfhf2
-; GNU32-NEXT: mov %i0, %o0
+; GNU32-NEXT: save %sp, -144, %sp
; GNU32-NEXT: call __extendhfsf2
-; GNU32-NEXT: nop
-; GNU32-NEXT: st %f0, [%fp+-20]
-; GNU32-NEXT: ld [%fp+-20], %o0
+; GNU32-NEXT: mov %i1, %o0
+; GNU32-NEXT: st %f0, [%fp+-32]
+; GNU32-NEXT: ld [%fp+-32], %o0
; GNU32-NEXT: add %fp, -12, %o1
; GNU32-NEXT: call sincosf
; GNU32-NEXT: add %fp, -16, %o2
-; GNU32-NEXT: ld [%fp+-28], %f0 ! 4-byte Folded Reload
-; GNU32-NEXT: st %f0, [%fp+-24]
-; GNU32-NEXT: ld [%fp+-24], %o0
-; GNU32-NEXT: add %fp, -4, %o1
+; GNU32-NEXT: call __extendhfsf2
+; GNU32-NEXT: mov %i0, %o0
+; GNU32-NEXT: st %f0, [%fp+-28]
+; GNU32-NEXT: ld [%fp+-28], %o0
+; GNU32-NEXT: add %fp, -20, %o1
; GNU32-NEXT: call sincosf
-; GNU32-NEXT: add %fp, -8, %o2
+; GNU32-NEXT: add %fp, -24, %o2
+; GNU32-NEXT: ld [%fp+-16], %f0
+; GNU32-NEXT: st %f0, [%fp+-44]
+; GNU32-NEXT: ld [%fp+-24], %f0
+; GNU32-NEXT: st %f0, [%fp+-36]
; GNU32-NEXT: ld [%fp+-12], %f0
-; GNU32-NEXT: ld [%fp+-4], %f1
-; GNU32-NEXT: ld [%fp+-16], %f2
-; GNU32-NEXT: ld [%fp+-8], %f3
+; GNU32-NEXT: st %f0, [%fp+-48]
+; GNU32-NEXT: ld [%fp+-20], %f0
+; GNU32-NEXT: st %f0, [%fp+-40]
+; GNU32-NEXT: call __truncsfhf2
+; GNU32-NEXT: ld [%fp+-44], %o0
+; GNU32-NEXT: sethi 63, %i0
+; GNU32-NEXT: or %i0, 1023, %i0
+; GNU32-NEXT: and %o0, %i0, %i4
+; GNU32-NEXT: call __truncsfhf2
+; GNU32-NEXT: ld [%fp+-36], %o0
+; GNU32-NEXT: and %o0, %i0, %i2
+; GNU32-NEXT: call __truncsfhf2
+; GNU32-NEXT: ld [%fp+-48], %o0
+; GNU32-NEXT: and %o0, %i0, %i1
+; GNU32-NEXT: call __truncsfhf2
+; GNU32-NEXT: ld [%fp+-40], %o0
+; GNU32-NEXT: and %o0, %i0, %g2
+; GNU32-NEXT: mov %g2, %i0
+; GNU32-NEXT: ! kill: def $i2 killed $i2 killed $i2_i3
; GNU32-NEXT: ret
-; GNU32-NEXT: restore
+; GNU32-NEXT: restore %g0, %i4, %o3
;
; GNU64-LABEL: test_sincos_v2f16:
; GNU64: ! %bb.0:
-; GNU64-NEXT: save %sp, -208, %sp
-; GNU64-NEXT: st %f1, [%fp+2023] ! 4-byte Folded Spill
-; GNU64-NEXT: fmovs %f3, %f1
-; GNU64-NEXT: call __truncsfhf2
-; GNU64-NEXT: nop
-; GNU64-NEXT: call __extendhfsf2
-; GNU64-NEXT: nop
-; GNU64-NEXT: st %f0, [%fp+2027] ! 4-byte Folded Spill
-; GNU64-NEXT: call __truncsfhf2
-; GNU64-NEXT: ld [%fp+2023], %f1
+; GNU64-NEXT: save %sp, -192, %sp
; GNU64-NEXT: call __extendhfsf2
-; GNU64-NEXT: nop
+; GNU64-NEXT: srl %i0, 0, %o0
; GNU64-NEXT: add %fp, 2035, %o1
; GNU64-NEXT: add %fp, 2031, %o2
; GNU64-NEXT: fmovs %f0, %f1
; GNU64-NEXT: call sincosf
; GNU64-NEXT: nop
+; GNU64-NEXT: call __extendhfsf2
+; GNU64-NEXT: srl %i1, 0, %o0
; GNU64-NEXT: add %fp, 2043, %o1
; GNU64-NEXT: add %fp, 2039, %o2
+; GNU64-NEXT: fmovs %f0, %f1
; GNU64-NEXT: call sincosf
-; GNU64-NEXT: ld [%fp+2027], %f1
-; GNU64-NEXT: ld [%fp+2035], %f0
+; GNU64-NEXT: nop
+; GNU64-NEXT: call __truncsfhf2
+; GNU64-NEXT: ld [%fp+2035], %f1
+; GNU64-NEXT: mov %o0, %i0
+; GNU64-NEXT: call __truncsfhf2
; GNU64-NEXT: ld [%fp+2043], %f1
-; GNU64-NEXT: ld [%fp+2031], %f2
-; GNU64-NEXT: ld [%fp+2039], %f3
+; GNU64-NEXT: mov %o0, %i1
+; GNU64-NEXT: call __truncsfhf2
+; GNU64-NEXT: ld [%fp+2031], %f1
+; GNU64-NEXT: mov %o0, %i2
+; GNU64-NEXT: call __truncsfhf2
+; GNU64-NEXT: ld [%fp+2039], %f1
; GNU64-NEXT: ret
-; GNU64-NEXT: restore
+; GNU64-NEXT: restore %g0, %o0, %o3
%result = call { <2 x half>, <2 x half> } @llvm.sincos.v2f16(<2 x half> %a)
ret { <2 x half>, <2 x half> } %result
}
>From 4b94c08a57b2b026aa434ef69823d579d56cfbda Mon Sep 17 00:00:00 2001
From: Jonas Devlieghere <jonas at devlieghere.com>
Date: Mon, 18 Aug 2025 14:01:41 -0500
Subject: [PATCH 071/112] [lldb] Relax the error message in
TestProcessCrashInfo.py (#153653)
The error message has been updated in macOS 26. Relax the check to match
the more generic "BUG IN CLIENT OF LIBMALLOC" rather than the specific
message that follows it.
---
.../process_crash_info/TestProcessCrashInfo.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lldb/test/API/functionalities/process_crash_info/TestProcessCrashInfo.py b/lldb/test/API/functionalities/process_crash_info/TestProcessCrashInfo.py
index af05c2f3a0f62..4924937b4fe25 100644
--- a/lldb/test/API/functionalities/process_crash_info/TestProcessCrashInfo.py
+++ b/lldb/test/API/functionalities/process_crash_info/TestProcessCrashInfo.py
@@ -38,7 +38,7 @@ def test_cli(self):
patterns=[
"Extended Crash Information",
"Crash-Info Annotations",
- "pointer being freed was not allocated",
+ "BUG IN CLIENT OF LIBMALLOC",
],
)
@@ -67,7 +67,7 @@ def test_api(self):
self.assertTrue(crash_info.IsValid())
- self.assertIn("pointer being freed was not allocated", stream.GetData())
+ self.assertIn("BUG IN CLIENT OF LIBMALLOC", stream.GetData())
# dyld leaves permanent crash_info records when testing on device.
@skipIfDarwinEmbedded
>From d30fd562e8a45c90e8b256890100442b61e0dac8 Mon Sep 17 00:00:00 2001
From: Utkarsh Saxena <usx at google.com>
Date: Mon, 18 Aug 2025 21:07:41 +0200
Subject: [PATCH 072/112] [LifetimeSafety] Enhance benchmark script for new sub
analyses (#149577)
Enhanced the lifetime safety analysis benchmark script with more
detailed performance metrics and a new nested loop test case. The nested
loop case is a worst case for loan expiry analysis.
### What changed?
- Added a new test case `nested_loops` that generates code with N levels
of nested loops to test how analysis performance scales with loop
nesting depth
- Improved the trace file analysis to extract durations for sub-phases
of the lifetime analysis (FactGenerator, LoanPropagation, ExpiredLoans)
- Enhanced the markdown report generation to include:
- Relative timing results as percentages of total Clang time
- More detailed complexity analysis for each analysis phase
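The sub-phase extraction and relative timing described above can be sketched roughly as follows. This is a minimal illustration, not the script from this patch: it assumes the `-ftime-trace` output is a JSON object with a `traceEvents` list whose entries carry `name` and `dur` (microseconds), and the helper name `phase_percentages` and the sample trace are hypothetical.

```python
import json
from collections import defaultdict

# Event names as listed in the commit message; the trace layout is assumed.
PHASES = ("LifetimeSafetyAnalysis", "FactGenerator", "LoanPropagation", "ExpiredLoans")
TOTAL = "ExecuteCompiler"


def phase_percentages(trace_json: str) -> dict:
    """Sum 'dur' per event name and express each phase as % of total Clang time."""
    sums = defaultdict(float)
    for event in json.loads(trace_json).get("traceEvents", []):
        name = event.get("name")
        if name in PHASES or name == TOTAL:
            sums[name] += float(event.get("dur", 0))
    total = sums[TOTAL]
    if total <= 0:
        # No total recorded: report zeros rather than dividing by zero.
        return {phase: 0.0 for phase in PHASES}
    return {phase: 100.0 * sums[phase] / total for phase in PHASES}


# Hypothetical, hand-written trace (not real compiler output).
sample_trace = json.dumps({"traceEvents": [
    {"name": "ExecuteCompiler", "dur": 1000},
    {"name": "LifetimeSafetyAnalysis", "dur": 500},
    {"name": "LoanPropagation", "dur": 400},
    {"name": "ExpiredLoans", "dur": 100},
]})
result = phase_percentages(sample_trace)
print(result)
```

Phases that never appear in the trace simply report 0%, which matches how the report below shows phases too short to register.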
Report
# Lifetime Analysis Performance Report
> Generated on: 2025-08-18 13:29:57
---
## Test Case: Pointer Cycle in Loop
**Timing Results:**
| N (Input Size) | Total Time | Analysis Time (%) | Fact Generator (%) | Loan Propagation (%) | Expired Loans (%) |
|:---------------|-----------:|------------------:|-------------------:|---------------------:|------------------:|
| 10 | 10.75 ms | 24.61% | 0.00% | 24.38% | 0.00% |
| 25 | 64.98 ms | 86.08% | 0.00% | 86.02% | 0.00% |
| 50 | 709.37 ms | 98.53% | 0.00% | 98.51% | 0.00% |
| 75 | 3.13 s | 99.63% | 0.00% | 99.63% | 0.00% |
| 100 | 9.44 s | 99.85% | 0.00% | 99.84% | 0.00% |
| 150 | 45.31 s | 99.96% | 0.00% | 99.96% | 0.00% |
**Complexity Analysis:**
| Analysis Phase | Complexity O(n<sup>k</sup>) |
|:------------------|:--------------------------|
| Total Analysis | O(n<sup>3.87</sup> ± 0.01) |
| FactGenerator | (Negligible) |
| LoanPropagation | O(n<sup>3.87</sup> ± 0.01) |
| ExpiredLoans | (Negligible) |
---
## Test Case: CFG Merges
**Timing Results:**
| N (Input Size) | Total Time | Analysis Time (%) | Fact Generator (%) | Loan Propagation (%) | Expired Loans (%) |
|:---------------|-----------:|------------------:|-------------------:|---------------------:|------------------:|
| 10 | 8.54 ms | 0.00% | 0.00% | 0.00% | 0.00% |
| 50 | 40.85 ms | 65.09% | 0.00% | 64.61% | 0.00% |
| 100 | 207.70 ms | 93.58% | 0.00% | 93.46% | 0.00% |
| 200 | 1.54 s | 98.82% | 0.00% | 98.78% | 0.00% |
| 400 | 12.04 s | 99.72% | 0.00% | 99.71% | 0.01% |
| 800 | 96.73 s | 99.94% | 0.00% | 99.94% | 0.00% |
**Complexity Analysis:**
| Analysis Phase | Complexity O(n<sup>k</sup>) |
|:------------------|:--------------------------|
| Total Analysis | O(n<sup>3.01</sup> ± 0.00) |
| FactGenerator | (Negligible) |
| LoanPropagation | O(n<sup>3.01</sup> ± 0.00) |
| ExpiredLoans | (Negligible) |
---
## Test Case: Deeply Nested Loops
**Timing Results:**
| N (Input Size) | Total Time | Analysis Time (%) | Fact Generator (%) | Loan Propagation (%) | Expired Loans (%) |
|:---------------|-----------:|------------------:|-------------------:|---------------------:|------------------:|
| 10 | 8.25 ms | 0.00% | 0.00% | 0.00% | 0.00% |
| 50 | 27.25 ms | 51.87% | 0.00% | 45.71% | 5.93% |
| 100 | 113.42 ms | 82.48% | 0.00% | 72.74% | 9.62% |
| 200 | 730.05 ms | 95.24% | 0.00% | 83.95% | 11.25% |
| 400 | 5.40 s | 98.74% | 0.01% | 87.05% | 11.68% |
| 800 | 41.86 s | 99.62% | 0.00% | 87.77% | 11.84% |
**Complexity Analysis:**
| Analysis Phase | Complexity O(n<sup>k</sup>) |
|:------------------|:--------------------------|
| Total Analysis | O(n<sup>2.97</sup> ± 0.00) |
| FactGenerator | (Negligible) |
| LoanPropagation | O(n<sup>2.96</sup> ± 0.00) |
| ExpiredLoans | O(n<sup>2.97</sup> ± 0.00) |
---
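For readers who want to sanity-check exponents like the ones reported above, a rough estimate can be obtained from a straight-line fit in log-log space. Note that this is a different, simpler technique than the one in the patch, which fits `y = c * n^k` with SciPy's `curve_fit` on the raw values; a log-log fit weights small inputs more heavily and will generally give somewhat different numbers. The function name `estimate_exponent` is hypothetical and the data below is synthetic.

```python
import math


def estimate_exponent(ns, ys):
    """Least-squares slope of log(y) against log(n), i.e. k in y ~ c * n**k.

    Non-positive samples are skipped because their logarithm is undefined.
    Returns None when fewer than two usable points remain.
    """
    points = [(math.log(n), math.log(y)) for n, y in zip(ns, ys) if n > 0 and y > 0]
    if len(points) < 2:
        return None
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic data that follows y = 2 * n**3 exactly (not measured timings).
ns = [10, 50, 100, 200, 400, 800]
ys = [2.0 * n ** 3 for n in ns]
k = estimate_exponent(ns, ys)
print(f"estimated exponent: {k:.2f}")
```

Phases whose measured time is zero at every input size yield `None` here, which corresponds to the "(Negligible)" entries in the tables.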
---
.../test/Analysis/LifetimeSafety/benchmark.py | 227 +++++++++++++-----
1 file changed, 161 insertions(+), 66 deletions(-)
diff --git a/clang/test/Analysis/LifetimeSafety/benchmark.py b/clang/test/Analysis/LifetimeSafety/benchmark.py
index 9d5f36c51b9ee..4421fe9a81e21 100644
--- a/clang/test/Analysis/LifetimeSafety/benchmark.py
+++ b/clang/test/Analysis/LifetimeSafety/benchmark.py
@@ -99,28 +99,84 @@ def generate_cpp_merge_test(n: int) -> str:
return cpp_code
-def analyze_trace_file(trace_path: str) -> tuple[float, float]:
+def generate_cpp_nested_loop_test(n: int) -> str:
"""
- Parses the -ftime-trace JSON output to find durations.
+ Generates C++ code with N levels of nested loops.
+ This pattern tests how analysis performance scales with loop nesting depth,
+ which is a key factor in the complexity of dataflow analyses on structured
+ control flow.
- Returns:
- A tuple of (lifetime_analysis_duration_us, total_clang_duration_us).
+ Example (n=3):
+ struct MyObj { int id; ~MyObj() {} };
+ void nested_loops_3() {
+ MyObj* p = nullptr;
+ for(int i0=0; i0<2; ++i0) {
+ MyObj s0;
+ p = &s0;
+ for(int i1=0; i1<2; ++i1) {
+ MyObj s1;
+ p = &s1;
+ for(int i2=0; i2<2; ++i2) {
+ MyObj s2;
+ p = &s2;
+ }
+ }
+ }
+ }
+ """
+ if n <= 0:
+ return "// Nesting depth must be positive."
+
+ cpp_code = "struct MyObj { int id; ~MyObj() {} };\n\n"
+ cpp_code += f"void nested_loops_{n}() {{\n"
+ cpp_code += " MyObj* p = nullptr;\n"
+
+ for i in range(n):
+ indent = " " * (i + 1)
+ cpp_code += f"{indent}for(int i{i}=0; i{i}<2; ++i{i}) {{\n"
+ cpp_code += f"{indent} MyObj s{i}; p = &s{i};\n"
+
+ for i in range(n - 1, -1, -1):
+ indent = " " * (i + 1)
+ cpp_code += f"{indent}}}\n"
+
+ cpp_code += "}\n"
+ cpp_code += f"\nint main() {{ nested_loops_{n}(); return 0; }}\n"
+ return cpp_code
+
+
+def analyze_trace_file(trace_path: str) -> dict:
"""
- lifetime_duration = 0.0
- total_duration = 0.0
+ Parses the -ftime-trace JSON output to find durations for the lifetime
+ analysis and its sub-phases.
+ Returns a dictionary of durations in microseconds.
+ """
+ durations = {
+ "lifetime_us": 0.0,
+ "total_us": 0.0,
+ "fact_gen_us": 0.0,
+ "loan_prop_us": 0.0,
+ "expired_loans_us": 0.0,
+ }
+ event_name_map = {
+ "LifetimeSafetyAnalysis": "lifetime_us",
+ "ExecuteCompiler": "total_us",
+ "FactGenerator": "fact_gen_us",
+ "LoanPropagation": "loan_prop_us",
+ "ExpiredLoans": "expired_loans_us",
+ }
try:
with open(trace_path, "r") as f:
trace_data = json.load(f)
for event in trace_data.get("traceEvents", []):
- if event.get("name") == "LifetimeSafetyAnalysis":
- lifetime_duration += float(event.get("dur", 0))
- if event.get("name") == "ExecuteCompiler":
- total_duration += float(event.get("dur", 0))
-
+ event_name = event.get("name")
+ if event_name in event_name_map:
+ key = event_name_map[event_name]
+ durations[key] += float(event.get("dur", 0))
except (IOError, json.JSONDecodeError) as e:
print(f"Error reading or parsing trace file {trace_path}: {e}", file=sys.stderr)
- return 0.0, 0.0
- return lifetime_duration, total_duration
+ return {key: 0.0 for key in durations}
+ return durations
def power_law(n, c, k):
@@ -135,8 +191,29 @@ def human_readable_time(ms: float) -> str:
return f"{ms:.2f} ms"
+def calculate_complexity(n_data, y_data) -> tuple[float | None, float | None]:
+ """
+ Calculates the exponent 'k' for the power law fit y = c * n^k.
+ Returns a tuple of (k, k_standard_error).
+ """
+ try:
+ if len(n_data) < 3 or np.all(y_data < 1e-6) or np.var(y_data) < 1e-6:
+ return None, None
+
+ non_zero_indices = y_data > 0
+ if np.sum(non_zero_indices) < 3:
+ return None, None
+
+ n_fit, y_fit = n_data[non_zero_indices], y_data[non_zero_indices]
+ popt, pcov = curve_fit(power_law, n_fit, y_fit, p0=[0, 1], maxfev=5000)
+ k_stderr = np.sqrt(np.diag(pcov))[1]
+ return popt[1], k_stderr
+ except (RuntimeError, ValueError):
+ return None, None
+
+
def generate_markdown_report(results: dict) -> str:
- """Generates a Markdown-formatted report from the benchmark results."""
+ """Generates a concise, Markdown-formatted report from the benchmark results."""
report = []
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z")
report.append(f"# Lifetime Analysis Performance Report")
@@ -146,54 +223,52 @@ def generate_markdown_report(results: dict) -> str:
for test_name, data in results.items():
title = data["title"]
report.append(f"## Test Case: {title}")
- report.append("")
+ report.append("\n**Timing Results:**\n")
# Table header
- report.append("| N | Analysis Time | Total Clang Time |")
- report.append("|:----|--------------:|-----------------:|")
+ report.append(
+ "| N (Input Size) | Total Time | Analysis Time (%) | Fact Generator (%) | Loan Propagation (%) | Expired Loans (%) |"
+ )
+ report.append(
+ "|:---------------|-----------:|------------------:|-------------------:|---------------------:|------------------:|"
+ )
# Table rows
n_data = np.array(data["n"])
- analysis_data = np.array(data["lifetime_ms"])
- total_data = np.array(data["total_ms"])
+ total_ms_data = np.array(data["total_ms"])
for i in range(len(n_data)):
- analysis_str = human_readable_time(analysis_data[i])
- total_str = human_readable_time(total_data[i])
- report.append(f"| {n_data[i]:<3} | {analysis_str:>13} | {total_str:>16} |")
-
- report.append("")
-
- # Complexity analysis
- report.append(f"**Complexity Analysis:**")
- try:
- # Curve fitting requires at least 3 points
- if len(n_data) < 3:
- raise ValueError("Not enough data points to perform curve fitting.")
-
- popt, pcov = curve_fit(
- power_law, n_data, analysis_data, p0=[0, 2], maxfev=5000
- )
- _, k = popt
-
- # Confidence Interval for k
- alpha = 0.05 # 95% confidence
- dof = max(0, len(n_data) - len(popt)) # degrees of freedom
- t_val = t.ppf(1.0 - alpha / 2.0, dof)
- # Standard error of the parameters
- perr = np.sqrt(np.diag(pcov))
- k_stderr = perr[1]
- k_ci_lower = k - t_val * k_stderr
- k_ci_upper = k + t_val * k_stderr
-
- report.append(
- f"- The performance for this case scales approx. as **O(n<sup>{k:.2f}</sup>)**."
- )
- report.append(
- f"- **95% Confidence interval for exponent:** `[{k_ci_lower:.2f}, {k_ci_upper:.2f}]`."
- )
+ total_t = total_ms_data[i]
+ if total_t < 1e-6:
+ total_t = 1.0 # Avoid division by zero
+
+ row = [
+ f"| {n_data[i]:<14} |",
+ f"{human_readable_time(total_t):>10} |",
+ f"{data['lifetime_ms'][i] / total_t * 100:>17.2f}% |",
+ f"{data['fact_gen_ms'][i] / total_t * 100:>18.2f}% |",
+ f"{data['loan_prop_ms'][i] / total_t * 100:>20.2f}% |",
+ f"{data['expired_loans_ms'][i] / total_t * 100:>17.2f}% |",
+ ]
+ report.append(" ".join(row))
+
+ report.append("\n**Complexity Analysis:**\n")
+ report.append("| Analysis Phase | Complexity O(n<sup>k</sup>) |")
+ report.append("|:------------------|:--------------------------|")
+
+ analysis_phases = {
+ "Total Analysis": data["lifetime_ms"],
+ "FactGenerator": data["fact_gen_ms"],
+ "LoanPropagation": data["loan_prop_ms"],
+ "ExpiredLoans": data["expired_loans_ms"],
+ }
- except (RuntimeError, ValueError) as e:
- report.append(f"- Could not determine a best-fit curve for the data: {e}")
+ for phase_name, y_data in analysis_phases.items():
+ k, delta = calculate_complexity(n_data, np.array(y_data))
+ if k is not None and delta is not None:
+ complexity_str = f"O(n<sup>{k:.2f}</sup> ± {delta:.2f})"
+ else:
+ complexity_str = "(Negligible)"
+ report.append(f"| {phase_name:<17} | {complexity_str:<25} |")
report.append("\n---\n")
@@ -202,7 +277,7 @@ def generate_markdown_report(results: dict) -> str:
def run_single_test(
clang_binary: str, output_dir: str, test_name: str, generator_func, n: int
-) -> tuple[float, float]:
+) -> dict:
"""Generates, compiles, and benchmarks a single test case."""
print(f"--- Running Test: {test_name.capitalize()} with N={n} ---")
@@ -221,7 +296,8 @@ def run_single_test(
"-o",
"/dev/null",
"-ftime-trace=" + trace_file,
- "-Wexperimental-lifetime-safety",
+ "-Xclang",
+ "-fexperimental-lifetime-safety",
"-std=c++17",
source_file,
]
@@ -231,11 +307,12 @@ def run_single_test(
if result.returncode != 0:
print(f"Compilation failed for N={n}!", file=sys.stderr)
print(result.stderr, file=sys.stderr)
- return 0.0, 0.0
+ return {}
- lifetime_us, total_us = analyze_trace_file(trace_file)
-
- return lifetime_us / 1000.0, total_us / 1000.0
+ durations_us = analyze_trace_file(trace_file)
+ return {
+ key.replace("_us", "_ms"): value / 1000.0 for key, value in durations_us.items()
+ }
if __name__ == "__main__":
@@ -270,6 +347,12 @@ def run_single_test(
"generator_func": generate_cpp_merge_test,
"n_values": [10, 50, 100, 200, 400, 800],
},
+ {
+ "name": "nested_loops",
+ "title": "Deeply Nested Loops",
+ "generator_func": generate_cpp_nested_loop_test,
+ "n_values": [10, 50, 100, 200, 400, 800],
+ },
]
results = {}
@@ -282,21 +365,28 @@ def run_single_test(
"n": [],
"lifetime_ms": [],
"total_ms": [],
+ "fact_gen_ms": [],
+ "loan_prop_ms": [],
+ "expired_loans_ms": [],
}
for n in config["n_values"]:
- lifetime_ms, total_ms = run_single_test(
+ durations_ms = run_single_test(
args.clang_binary,
args.output_dir,
test_name,
config["generator_func"],
n,
)
- if total_ms > 0:
+ if durations_ms:
results[test_name]["n"].append(n)
- results[test_name]["lifetime_ms"].append(lifetime_ms)
- results[test_name]["total_ms"].append(total_ms)
+ for key, value in durations_ms.items():
+ results[test_name][key].append(value)
+
print(
- f" Total: {human_readable_time(total_ms)} | Analysis: {human_readable_time(lifetime_ms)}"
+ f" Total Analysis: {human_readable_time(durations_ms['lifetime_ms'])} | "
+ f"FactGen: {human_readable_time(durations_ms['fact_gen_ms'])} | "
+ f"LoanProp: {human_readable_time(durations_ms['loan_prop_ms'])} | "
+ f"ExpiredLoans: {human_readable_time(durations_ms['expired_loans_ms'])}"
)
print("\n\n" + "=" * 80)
@@ -305,3 +395,8 @@ def run_single_test(
markdown_report = generate_markdown_report(results)
print(markdown_report)
+
+ report_filename = os.path.join(args.output_dir, "performance_report.md")
+ with open(report_filename, "w") as f:
+ f.write(markdown_report)
+ print(f"Report saved to: {report_filename}")
>From 1bb72170501b95afd8124c4026bf927385be9b47 Mon Sep 17 00:00:00 2001
From: Usama Hameed <u_hameed at apple.com>
Date: Mon, 18 Aug 2025 12:08:45 -0700
Subject: [PATCH 073/112] [Sanitizers][Darwin][Test] The top few frames are
inaccurate in UBSan. (#153899)
XFailing until further investigation
rdar://158303080
---
.../TestCases/Posix/dedup_token_length_test.cpp | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/compiler-rt/test/sanitizer_common/TestCases/Posix/dedup_token_length_test.cpp b/compiler-rt/test/sanitizer_common/TestCases/Posix/dedup_token_length_test.cpp
index deedbba76cdeb..37bfee4806173 100644
--- a/compiler-rt/test/sanitizer_common/TestCases/Posix/dedup_token_length_test.cpp
+++ b/compiler-rt/test/sanitizer_common/TestCases/Posix/dedup_token_length_test.cpp
@@ -10,6 +10,10 @@
// REQUIRES: stable-runtime
+// rdar://158303080 top few frames are at times inaccurate in ubsan fast stack
+// unwind on darwin
+// XFAIL: (darwin && ubsan && (arm64-target-arch || arm64e-target-arch))
+
// XFAIL: target={{.*netbsd.*}} && !asan
volatile int *null = 0;
>From e7c2c80fa16644b8c4e47c75caffaea8bc20a30d Mon Sep 17 00:00:00 2001
From: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date: Mon, 18 Aug 2025 12:13:16 -0700
Subject: [PATCH 074/112] [AMDGPU] Combine prng(undef) -> undef (#154160)
---
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp | 3 ++-
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.prng.ll | 9 ++++++++-
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 64e68ab7d753c..a28e272367c7a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -4002,7 +4002,8 @@ SDValue AMDGPUTargetLowering::performIntrinsicWOChainCombine(
case Intrinsic::amdgcn_rcp_legacy:
case Intrinsic::amdgcn_rsq_legacy:
case Intrinsic::amdgcn_rsq_clamp:
- case Intrinsic::amdgcn_tanh: {
+ case Intrinsic::amdgcn_tanh:
+ case Intrinsic::amdgcn_prng_b32: {
// FIXME: This is probably wrong. If src is an sNaN, it won't be quieted
SDValue Src = N->getOperand(1);
return Src.isUndef() ? Src : SDValue();
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.prng.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.prng.ll
index 6a5dc8f8dd0a6..2daf9c3b472f1 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.prng.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.prng.ll
@@ -1,6 +1,6 @@
; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx950 < %s | FileCheck -check-prefixes=GCN %s
; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx950 < %s | FileCheck -check-prefix=GCN %s
-; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1250 < %s | FileCheck -check-prefixes=GCN %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1250 < %s | FileCheck -check-prefixes=GCN,SDAG %s
; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx1250 < %s | FileCheck -check-prefix=GCN %s
declare i32 @llvm.amdgcn.prng.b32(i32) #0
@@ -29,6 +29,13 @@ define amdgpu_kernel void @prng_b32_constant_100(ptr addrspace(1) %out) #1 {
ret void
}
+; GCN-LABEL: {{^}}prng_undef_i32:
+; SDAG-NOT: v_prng_b32
+define amdgpu_kernel void @prng_undef_i32(ptr addrspace(1) %out) #1 {
+ %prng = call i32 @llvm.amdgcn.prng.b32(i32 undef)
+ store i32 %prng, ptr addrspace(1) %out, align 4
+ ret void
+}
attributes #0 = { nounwind readnone }
attributes #1 = { nounwind }
>From 3d6177c14b4dca7412d929ef364196a98403ef01 Mon Sep 17 00:00:00 2001
From: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date: Mon, 18 Aug 2025 12:13:31 -0700
Subject: [PATCH 075/112] [AMDGPU] Avoid setting op_sel_hi bits if there is
matrix_b_scale. NFCI. (#154176)
This is NFCI for now, as there is no matrix_b_scale without matrix_b_reuse,
but strictly speaking the condition belongs here.
---
llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUMCCodeEmitter.cpp | 2 ++
1 file changed, 2 insertions(+)
diff --git a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUMCCodeEmitter.cpp b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUMCCodeEmitter.cpp
index f3580842c6ff0..61f673221739a 100644
--- a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUMCCodeEmitter.cpp
+++ b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUMCCodeEmitter.cpp
@@ -389,6 +389,8 @@ void AMDGPUMCCodeEmitter::encodeInstruction(const MCInst &MI,
Opcode == AMDGPU::V_ACCVGPR_WRITE_B32_vi) &&
// Matrix B format operand reuses op_sel_hi.
!AMDGPU::hasNamedOperand(Opcode, AMDGPU::OpName::matrix_b_fmt) &&
+ // Matrix B scale operand reuses op_sel_hi.
+ !AMDGPU::hasNamedOperand(Opcode, AMDGPU::OpName::matrix_b_scale) &&
// Matrix B reuse operand reuses op_sel_hi.
!AMDGPU::hasNamedOperand(Opcode, AMDGPU::OpName::matrix_b_reuse)) {
Encoding |= getImplicitOpSelHiEncoding(Opcode);
>From 986d7aa675e957e0160aeb2f045a6abf1bf2082e Mon Sep 17 00:00:00 2001
From: Daniel Thornburgh <dthorn at google.com>
Date: Mon, 18 Aug 2025 12:19:19 -0700
Subject: [PATCH 076/112] Bump ProtocolServerMCPTest timeout to 200ms (#154182)
This should reduce flakes observed in the Fuchsia AArch64 Linux LLDB CI
builders.
---
lldb/unittests/ProtocolServer/ProtocolMCPServerTest.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lldb/unittests/ProtocolServer/ProtocolMCPServerTest.cpp b/lldb/unittests/ProtocolServer/ProtocolMCPServerTest.cpp
index 2ac40c41dd28e..de2ae2313ecd7 100644
--- a/lldb/unittests/ProtocolServer/ProtocolMCPServerTest.cpp
+++ b/lldb/unittests/ProtocolServer/ProtocolMCPServerTest.cpp
@@ -144,7 +144,7 @@ class ProtocolServerMCPTest : public ::testing::Test {
template <typename P>
void
RunOnce(const std::function<void(llvm::Expected<P>)> &callback,
- std::chrono::milliseconds timeout = std::chrono::milliseconds(100)) {
+ std::chrono::milliseconds timeout = std::chrono::milliseconds(200)) {
auto handle = m_transport_up->RegisterReadObject<P>(
loop, [&](lldb_private::MainLoopBase &loop, llvm::Expected<P> message) {
callback(std::move(message));
>From 9403c2d64d63c16a09739d943eaa22b8e3499b7a Mon Sep 17 00:00:00 2001
From: Naveen Seth Hanig <naveen.hanig at outlook.com>
Date: Tue, 19 Aug 2025 00:51:08 +0530
Subject: [PATCH 077/112] Reland [clang][modules-driver] Add scanner to detect
C++20 module presence (#153497)
This patch is part of a series to support driver managed module builds
for C++ named modules and Clang modules.
This introduces a scanner that detects C++ named module usage early in
the driver with only negligible overhead.
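To give a feel for what such a scan looks for, here is a deliberately naive sketch. It is not how the scanner in this patch is implemented (that lives in `DependencyDirectivesScanner` and works on lexed directives); the regular expression and the name `uses_cxx20_modules` are assumptions made purely for illustration, and the heuristic ignores comments, string literals, macros, and line continuations.

```python
import re

# Hypothetical heuristic: a line that begins with an optional `export`,
# followed by `module` or `import` as a whole word, and ends the
# declaration with a semicolon on the same line.
_MODULE_DIRECTIVE = re.compile(r"^\s*(?:export\s+)?(?:module|import)\b[^;]*;")


def uses_cxx20_modules(source: str) -> bool:
    """Return True if any line looks like a C++20 module or import declaration."""
    return any(_MODULE_DIRECTIVE.match(line) for line in source.splitlines())


# Small hand-written inputs for illustration.
interface_unit = "export module foo;\nint f();\n"
importer = "#include <cstdio>\nimport std;\nint main() { return 0; }\n"
plain = "#include <vector>\nint module_count = 0;\n"

print(uses_cxx20_modules(interface_unit), uses_cxx20_modules(importer), uses_cxx20_modules(plain))
```

A line-based check like this is cheap, which is the property the commit message emphasizes, but only a real lexer can avoid false positives from commented-out or macro-generated text.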
For now, it is enabled only with the `-fmodules-driver` flag and serves
solely diagnostic purposes. In the future, the scanner will be enabled
for any (modules-driver compatible) compilation with two or more inputs,
and will help the driver determine whether to implicitly enable the
modules driver.
Since the scanner adds very little overhead, we are also exploring
enabling it for compilations with only a single input. This approach
could allow us to detect `import std` usage in a single-file
compilation, which would then activate the modules driver. For
performance measurements on this, see
https://github.com/naveen-seth/llvm-dev-cxx-modules-check-benchmark.
RFC for driver managed module builds:
https://discourse.llvm.org/t/rfc-modules-support-simple-c-20-modules-use-from-the-clang-driver-without-a-build-system
This patch relands the reland (2d31fc8) for commit ded1426. The earlier
reland failed due to a missing link dependency on `clangLex`. This
reland fixes the issue by adding the link dependency after discussing it
in the following RFC:
https://discourse.llvm.org/t/rfc-driver-link-the-driver-against-clangdependencyscanning-clangast-clangfrontend-clangserialization-and-clanglex
---
.../clang/Basic/DiagnosticDriverKinds.td | 7 +
clang/include/clang/Basic/DiagnosticGroups.td | 1 +
clang/include/clang/Driver/Driver.h | 32 +++
clang/include/clang/Driver/Options.td | 7 +
.../clang/Lex/DependencyDirectivesScanner.h | 7 +
clang/lib/Driver/CMakeLists.txt | 1 +
clang/lib/Driver/Driver.cpp | 67 ++++++
clang/lib/Lex/DependencyDirectivesScanner.cpp | 50 +++++
...ules-driver-cxx20-module-usage-scanner.cpp | 192 ++++++++++++++++++
9 files changed, 364 insertions(+)
create mode 100644 clang/test/Driver/modules-driver-cxx20-module-usage-scanner.cpp
diff --git a/clang/include/clang/Basic/DiagnosticDriverKinds.td b/clang/include/clang/Basic/DiagnosticDriverKinds.td
index 0f17f4aa761ea..6df8f9932f30f 100644
--- a/clang/include/clang/Basic/DiagnosticDriverKinds.td
+++ b/clang/include/clang/Basic/DiagnosticDriverKinds.td
@@ -581,6 +581,13 @@ def err_drv_reduced_module_output_overrided : Warning<
"please consider use '-fmodule-output=' to specify the output file for reduced BMI explicitly">,
InGroup<DiagGroup<"reduced-bmi-output-overrided">>;
+def remark_found_cxx20_module_usage : Remark<
+ "found C++20 module usage in file '%0'">,
+ InGroup<ModulesDriver>;
+def remark_performing_driver_managed_module_build : Remark<
+ "performing driver managed module build">,
+ InGroup<ModulesDriver>;
+
def warn_drv_delayed_template_parsing_after_cxx20 : Warning<
"-fdelayed-template-parsing is deprecated after C++20">,
InGroup<DiagGroup<"delayed-template-parsing-in-cxx20">>;
diff --git a/clang/include/clang/Basic/DiagnosticGroups.td b/clang/include/clang/Basic/DiagnosticGroups.td
index 2edf4da435366..e29c4694fa5ea 100644
--- a/clang/include/clang/Basic/DiagnosticGroups.td
+++ b/clang/include/clang/Basic/DiagnosticGroups.td
@@ -635,6 +635,7 @@ def ModuleConflict : DiagGroup<"module-conflict">;
def ModuleFileExtension : DiagGroup<"module-file-extension">;
def ModuleIncludeDirectiveTranslation : DiagGroup<"module-include-translation">;
def ModuleMap : DiagGroup<"module-map">;
+def ModulesDriver : DiagGroup<"modules-driver">;
def RoundTripCC1Args : DiagGroup<"round-trip-cc1-args">;
def NewlineEOF : DiagGroup<"newline-eof">;
def Nullability : DiagGroup<"nullability">;
diff --git a/clang/include/clang/Driver/Driver.h b/clang/include/clang/Driver/Driver.h
index 4d32552b7f85f..b9b187ada8add 100644
--- a/clang/include/clang/Driver/Driver.h
+++ b/clang/include/clang/Driver/Driver.h
@@ -512,6 +512,9 @@ class Driver {
/// BuildActions - Construct the list of actions to perform for the
/// given arguments, which are only done for a single architecture.
+ /// If the compilation is an explicit module build, delegates to
+ /// BuildDriverManagedModuleBuildActions. Otherwise, BuildDefaultActions is
+ /// used.
///
/// \param C - The compilation that is being built.
/// \param Args - The input arguments.
@@ -796,6 +799,35 @@ class Driver {
/// compilation based on which -f(no-)?lto(=.*)? option occurs last.
void setLTOMode(const llvm::opt::ArgList &Args);
+ /// BuildDefaultActions - Constructs the list of actions to perform
+ /// for the provided arguments, which are only done for a single architecture.
+ ///
+ /// \param C - The compilation that is being built.
+ /// \param Args - The input arguments.
+ /// \param Actions - The list to store the resulting actions onto.
+ void BuildDefaultActions(Compilation &C, llvm::opt::DerivedArgList &Args,
+ const InputList &Inputs, ActionList &Actions) const;
+
+ /// BuildDriverManagedModuleBuildActions - Performs a dependency
+ /// scan and constructs the list of actions to perform for dependency order
+ /// and the provided arguments. This is only done for a single architecture.
+ ///
+ /// \param C - The compilation that is being built.
+ /// \param Args - The input arguments.
+ /// \param Actions - The list to store the resulting actions onto.
+ void BuildDriverManagedModuleBuildActions(Compilation &C,
+ llvm::opt::DerivedArgList &Args,
+ const InputList &Inputs,
+ ActionList &Actions) const;
+
+ /// Scans the leading lines of the C++ source inputs to detect C++20 module
+ /// usage.
+ ///
+ /// \returns True if module usage is detected, false otherwise, or an error on
+ /// read failure.
+ llvm::ErrorOr<bool>
+ ScanInputsForCXX20ModulesUsage(const InputList &Inputs) const;
+
/// Retrieves a ToolChain for a particular \p Target triple.
///
/// Will cache ToolChains for the life of the driver object, and create them
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 6a2f4575459b2..06bff0bf3b4ff 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -3296,6 +3296,13 @@ defm modules_reduced_bmi : BoolOption<"f", "modules-reduced-bmi",
PosFlag<SetTrue, [], [ClangOption, CC1Option],
"Generate the reduced BMI">>;
+def fmodules_driver : Flag<["-"], "fmodules-driver">,
+ Group<f_Group>, Visibility<[ClangOption]>,
+ HelpText<"Enable support for driver managed module builds (experimental)">;
+def fno_modules_driver : Flag<["-"], "fno-modules-driver">,
+ Group<f_Group>, Visibility<[ClangOption]>,
+ HelpText<"Disable support for driver managed module builds (experimental)">;
+
def experimental_modules_reduced_bmi : Flag<["-"], "fexperimental-modules-reduced-bmi">,
Group<f_Group>, Visibility<[ClangOption, CC1Option]>, Alias<fmodules_reduced_bmi>;
diff --git a/clang/include/clang/Lex/DependencyDirectivesScanner.h b/clang/include/clang/Lex/DependencyDirectivesScanner.h
index f9fec3998ca53..c0b742d652a03 100644
--- a/clang/include/clang/Lex/DependencyDirectivesScanner.h
+++ b/clang/include/clang/Lex/DependencyDirectivesScanner.h
@@ -135,6 +135,13 @@ void printDependencyDirectivesAsSource(
ArrayRef<dependency_directives_scan::Directive> Directives,
llvm::raw_ostream &OS);
+/// Scan an input source buffer for C++20 named module usage.
+///
+/// \param Source The input source buffer.
+///
+/// \returns true if any C++20 named modules related directive was found.
+bool scanInputForCXX20ModulesUsage(StringRef Source);
+
/// Functor that returns the dependency directives for a given file.
class DependencyDirectivesGetter {
public:
diff --git a/clang/lib/Driver/CMakeLists.txt b/clang/lib/Driver/CMakeLists.txt
index 45782cbd9d16d..7c4f70b966c48 100644
--- a/clang/lib/Driver/CMakeLists.txt
+++ b/clang/lib/Driver/CMakeLists.txt
@@ -98,5 +98,6 @@ add_clang_library(clangDriver
LINK_LIBS
clangBasic
+ clangLex
${system_libs}
)
diff --git a/clang/lib/Driver/Driver.cpp b/clang/lib/Driver/Driver.cpp
index 8c0bba938a09b..d682ffc832c83 100644
--- a/clang/lib/Driver/Driver.cpp
+++ b/clang/lib/Driver/Driver.cpp
@@ -66,6 +66,7 @@
#include "clang/Driver/Tool.h"
#include "clang/Driver/ToolChain.h"
#include "clang/Driver/Types.h"
+#include "clang/Lex/DependencyDirectivesScanner.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallSet.h"
@@ -4188,6 +4189,11 @@ void Driver::handleArguments(Compilation &C, DerivedArgList &Args,
YcArg = nullptr;
}
+ if (Args.hasArgNoClaim(options::OPT_fmodules_driver))
+ // TODO: Check against all incompatible -fmodules-driver arguments
+ if (!ModulesModeCXX20 && !Args.hasArgNoClaim(options::OPT_fmodules))
+ Args.eraseArg(options::OPT_fmodules_driver);
+
Arg *FinalPhaseArg;
phases::ID FinalPhase = getFinalPhase(Args, &FinalPhaseArg);
@@ -4314,6 +4320,33 @@ void Driver::handleArguments(Compilation &C, DerivedArgList &Args,
}
}
+static bool hasCXXModuleInputType(const Driver::InputList &Inputs) {
+ const auto IsTypeCXXModule = [](const auto &Input) -> bool {
+ const auto TypeID = Input.first;
+ return (TypeID == types::TY_CXXModule);
+ };
+ return llvm::any_of(Inputs, IsTypeCXXModule);
+}
+
+llvm::ErrorOr<bool>
+Driver::ScanInputsForCXX20ModulesUsage(const InputList &Inputs) const {
+ const auto CXXInputs = llvm::make_filter_range(
+ Inputs, [](const auto &Input) { return types::isCXX(Input.first); });
+ for (const auto &Input : CXXInputs) {
+ StringRef Filename = Input.second->getSpelling();
+ auto ErrOrBuffer = VFS->getBufferForFile(Filename);
+ if (!ErrOrBuffer)
+ return ErrOrBuffer.getError();
+ const auto Buffer = std::move(*ErrOrBuffer);
+
+ if (scanInputForCXX20ModulesUsage(Buffer->getBuffer())) {
+ Diags.Report(diag::remark_found_cxx20_module_usage) << Filename;
+ return true;
+ }
+ }
+ return false;
+}
+
void Driver::BuildActions(Compilation &C, DerivedArgList &Args,
const InputList &Inputs, ActionList &Actions) const {
llvm::PrettyStackTraceString CrashInfo("Building compilation actions");
@@ -4325,6 +4358,33 @@ void Driver::BuildActions(Compilation &C, DerivedArgList &Args,
handleArguments(C, Args, Inputs, Actions);
+ if (Args.hasFlag(options::OPT_fmodules_driver,
+ options::OPT_fno_modules_driver, false)) {
+ // TODO: Move the logic for implicitly enabling explicit-module-builds out
+ // of -fmodules-driver once it is no longer experimental.
+ // Currently, this serves diagnostic purposes only.
+ bool UsesCXXModules = hasCXXModuleInputType(Inputs);
+ if (!UsesCXXModules) {
+ const auto ErrOrScanResult = ScanInputsForCXX20ModulesUsage(Inputs);
+ if (!ErrOrScanResult) {
+ Diags.Report(diag::err_cannot_open_file)
+ << ErrOrScanResult.getError().message();
+ return;
+ }
+ UsesCXXModules = *ErrOrScanResult;
+ }
+ if (UsesCXXModules || Args.hasArg(options::OPT_fmodules))
+ BuildDriverManagedModuleBuildActions(C, Args, Inputs, Actions);
+ return;
+ }
+
+ BuildDefaultActions(C, Args, Inputs, Actions);
+}
+
+void Driver::BuildDefaultActions(Compilation &C, DerivedArgList &Args,
+ const InputList &Inputs,
+ ActionList &Actions) const {
+
bool UseNewOffloadingDriver =
C.isOffloadingHostKind(Action::OFK_OpenMP) ||
C.isOffloadingHostKind(Action::OFK_SYCL) ||
@@ -4608,6 +4668,13 @@ void Driver::BuildActions(Compilation &C, DerivedArgList &Args,
Args.ClaimAllArgs(options::OPT_cl_ignored_Group);
}
+void Driver::BuildDriverManagedModuleBuildActions(
+ Compilation &C, llvm::opt::DerivedArgList &Args, const InputList &Inputs,
+ ActionList &Actions) const {
+ Diags.Report(diag::remark_performing_driver_managed_module_build);
+ return;
+}
+
/// Returns the canonical name for the offloading architecture when using a HIP
/// or CUDA architecture.
static StringRef getCanonicalArchString(Compilation &C,
diff --git a/clang/lib/Lex/DependencyDirectivesScanner.cpp b/clang/lib/Lex/DependencyDirectivesScanner.cpp
index 9ccff5e3342d5..eee57c786442a 100644
--- a/clang/lib/Lex/DependencyDirectivesScanner.cpp
+++ b/clang/lib/Lex/DependencyDirectivesScanner.cpp
@@ -83,6 +83,8 @@ struct Scanner {
/// \returns True on error.
bool scan(SmallVectorImpl<Directive> &Directives);
+ friend bool clang::scanInputForCXX20ModulesUsage(StringRef Source);
+
private:
/// Lexes next token and advances \p First and the \p Lexer.
[[nodiscard]] dependency_directives_scan::Token &
@@ -1075,3 +1077,51 @@ void clang::printDependencyDirectivesAsSource(
}
}
}
+
+static void skipUntilMaybeCXX20ModuleDirective(const char *&First,
+ const char *const End) {
+ assert(First <= End);
+ while (First != End) {
+ if (*First == '#') {
+ ++First;
+ skipToNewlineRaw(First, End);
+ }
+ skipWhitespace(First, End);
+ if (const auto Len = isEOL(First, End)) {
+ First += Len;
+ continue;
+ }
+ break;
+ }
+}
+
+bool clang::scanInputForCXX20ModulesUsage(StringRef Source) {
+ const char *First = Source.begin();
+ const char *const End = Source.end();
+ skipUntilMaybeCXX20ModuleDirective(First, End);
+ if (First == End)
+ return false;
+
+ // Check if the next token can even be a module directive before creating a
+ // full lexer.
+ if (!(*First == 'i' || *First == 'e' || *First == 'm'))
+ return false;
+
+ llvm::SmallVector<dependency_directives_scan::Token> Tokens;
+ Scanner S(StringRef(First, End - First), Tokens, nullptr, SourceLocation());
+ S.TheLexer.setParsingPreprocessorDirective(true);
+ if (S.lexModule(First, End))
+ return false;
+ auto IsCXXNamedModuleDirective = [](const DirectiveWithTokens &D) {
+ switch (D.Kind) {
+ case dependency_directives_scan::cxx_module_decl:
+ case dependency_directives_scan::cxx_import_decl:
+ case dependency_directives_scan::cxx_export_module_decl:
+ case dependency_directives_scan::cxx_export_import_decl:
+ return true;
+ default:
+ return false;
+ }
+ };
+ return llvm::any_of(S.DirsWithToks, IsCXXNamedModuleDirective);
+}
diff --git a/clang/test/Driver/modules-driver-cxx20-module-usage-scanner.cpp b/clang/test/Driver/modules-driver-cxx20-module-usage-scanner.cpp
new file mode 100644
index 0000000000000..a434587a78759
--- /dev/null
+++ b/clang/test/Driver/modules-driver-cxx20-module-usage-scanner.cpp
@@ -0,0 +1,192 @@
+// The driver never checks to implicitly enable the explicit module build
+// support unless at least two input files are provided.
+// To trigger the C++20 module usage check, we always pass a second dummy file
+// as input.
+// TODO: Remove -fmodules everywhere once implicitly enabled explicit module
+// builds are supported.
+
+// RUN: split-file %s %t
+//--- empty.cpp
+// Nothing here
+
+//--- only-global.cpp
+// RUN: %clang -std=c++20 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/only-global.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK1
+// CHECK1: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+module;
+
+//--- only-import.cpp
+// RUN: %clang -std=c++20 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/only-import.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK2
+// CHECK2: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+import A;
+
+//--- only-export.cpp
+// RUN: %clang -std=c++20 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/only-export.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK3
+// CHECK3: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+export module A;
+
+//--- leading-line-comment.cpp
+// RUN: %clang -std=c++20 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/leading-line-comment.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK4
+// CHECK4: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+// My line comment
+import A;
+
+//--- leading-block-comment1.cpp
+// RUN: %clang -std=c++20 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/leading-block-comment1.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK5
+// CHECK5: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+/*My block comment */
+import A;
+
+//--- leading-block-comment2.cpp
+// RUN: %clang -std=c++20 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/leading-block-comment2.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK6
+// CHECK6: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+/*My block comment */ import A;
+
+//--- inline-block-comment1.cpp
+// RUN: %clang -std=c++20 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/inline-block-comment1.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK7
+// CHECK7: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+export/*a comment*/module/*another comment*/A;
+
+//--- inline-block-comment2.cpp
+// RUN: %clang -std=c++20 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/inline-block-comment2.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK8
+// CHECK8: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+module/*a comment*/;
+
+//--- leading-directives.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/leading-directives.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK9
+// CHECK9: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+#define A
+#undef A
+#if A
+#ifdef A
+#elifdef A
+#elifndef A
+#endif
+#ifndef A
+#elif A
+#else
+#endif
+#endif
+#pragma once;
+#include <iostream>
+import m;
+
+//--- multiline-directive.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/multiline-directive.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK10
+// CHECK10: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+#define MACRO(a, \
+ b) \
+ call((a), \
+ (b)
+import a;
+
+//--- leading-line-splice.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/leading-line-splice.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK11
+// CHECK11: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+\
+module;
+
+//--- leading-line-splice-trailing-whitespace.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/leading-line-splice-trailing-whitespace.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK12
+// CHECK12: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+// v This backslash has trailing whitespace.
+ \
+export module A;
+
+//--- comment-line-splice.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/comment-line-splice.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --allow-empty --check-prefix=CHECK13
+// CHECK13-NOT: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+// My comment continues next-line!\
+import A;
+
+//--- comment-line-splice-trailing-whitespace.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/comment-line-splice-trailing-whitespace.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --allow-empty --check-prefix=CHECK14
+// CHECK14-NOT: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+// My comment continues next-line! This backslash has trailing whitespace. -> \
+module;
+
+//--- line-splice-in-directive1.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/line-splice-in-directive1.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK15
+// CHECK15: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+
+module\
+;
+
+//--- line-splice-in-directive2.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/line-splice-in-directive2.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK16
+// CHECK16: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+
+export\
+ module\
+ A;
+
+//--- no-module-usage1.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/no-module-usage1.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --allow-empty --check-prefix=CHECK17
+// CHECK17-NOT: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+auto main() -> int {}
+
+//--- no-module-usage2.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/no-module-usage2.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --allow-empty --check-prefix=CHECK18
+// CHECK18-NOT: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+moduleStruct{};
+
+//--- no-module-usage3.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/no-module-usage3.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --allow-empty --check-prefix=CHECK19
+// CHECK19-NOT: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+export_struct{};
+
+//--- no-module-usage-namespace-import.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/no-module-usage-namespace-import.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --allow-empty --check-prefix=CHECK20
+// CHECK20-NOT: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+import::inner xi = {};
+
+//--- no-module-usage-namespace-module.cpp
+// RUN: %clang -std=c++23 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: %t/no-module-usage-namespace-module.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --allow-empty --check-prefix=CHECK21
+// CHECK21-NOT: remark: found C++20 module usage in file '{{.*}}' [-Rmodules-driver]
+module::inner yi = {};
+
+// RUN: not %clang -std=c++20 -ccc-print-phases -fmodules-driver -Rmodules-driver \
+// RUN: imaginary-file.cpp %t/empty.cpp 2>&1 \
+// RUN: | FileCheck %s --check-prefix=CHECK-NON-EXISTING-FILE-ERR
+// CHECK-NON-EXISTING-FILE-ERR: clang: error: no such file or directory: 'imaginary-file.cpp'
From f5a648f9193a16b1136772096c3024a0d8b3fb34 Mon Sep 17 00:00:00 2001
From: Konrad Kleine <kkleine at redhat.com>
Date: Mon, 18 Aug 2025 21:46:34 +0200
Subject: [PATCH 078/112] [doc] Add documentation for clang-change-namespace
(#148277)
This adds rst documentation for the `clang-change-namespace` program.
Fixes #35519
---
.../docs/clang-change-namespace.rst | 314 ++++++++++++++++++
clang-tools-extra/docs/index.rst | 1 +
2 files changed, 315 insertions(+)
create mode 100644 clang-tools-extra/docs/clang-change-namespace.rst
diff --git a/clang-tools-extra/docs/clang-change-namespace.rst b/clang-tools-extra/docs/clang-change-namespace.rst
new file mode 100644
index 0000000000000..1eab83f5069b6
--- /dev/null
+++ b/clang-tools-extra/docs/clang-change-namespace.rst
@@ -0,0 +1,314 @@
+======================
+Clang-Change-Namespace
+======================
+
+.. contents::
+
+.. toctree::
+ :maxdepth: 1
+
+:program:`clang-change-namespace` can be used to change the surrounding
+namespaces of class/function definitions.
+
+Classes/functions in the moved namespace will have new namespaces while
+references to symbols (e.g. types, functions) which are not defined in the
+changed namespace will be correctly qualified by prepending namespace specifiers
+before them. This will try to add the shortest namespace specifiers possible.
+
+When a symbol reference needs to be fully-qualified, this adds a `::` prefix to
+the namespace specifiers unless the new namespace is the global namespace. For
+classes, only classes that are declared/defined in the given namespace in
+specified files will be moved: forward declarations will remain in the old
+namespace. This is demonstrated in the next example.
+
+Example usage
+-------------
+
+Consider this `test.cc` file with the forward-declared class `FWD` and the
+defined class `A`, both in the namespace `a`.
+
+.. code-block:: c++
+
+ namespace a {
+ class FWD;
+ class A {
+ FWD *fwd;
+ };
+ } // namespace a
+
+And now let's change the namespace `a` to `x`.
+
+.. code-block:: console
+
+ clang-change-namespace \
+ --old_namespace "a" \
+ --new_namespace "x" \
+ --file_pattern "test.cc" \
+ -i \
+ test.cc
+
+Note that in the code below there's still the forward declared class `FWD` that
+stayed in the namespace `a`. It wasn't moved to the new namespace because it
+wasn't defined/declared here in `a` but only forward declared.
+
+.. code-block:: c++
+
+ namespace a {
+ class FWD;
+ } // namespace a
+ namespace x {
+
+ class A {
+ a::FWD *fwd;
+ };
+ } // namespace x
+
+
+Another example
+---------------
+
+Consider this `test.cc` file:
+
+.. code-block:: c++
+
+ namespace na {
+ class X {};
+ namespace nb {
+ class Y {
+ X x;
+ };
+ } // namespace nb
+ } // namespace na
+
+To move the definition of class `Y` from namespace `na::nb` to `x::y`, run:
+
+.. code-block:: console
+
+ clang-change-namespace \
+ --old_namespace "na::nb" \
+ --new_namespace "x::y" \
+ --file_pattern "test.cc" \
+ -i \
+ test.cc
+
+This will overwrite `test.cc` to look like this:
+
+.. code-block:: c++
+
+ namespace na {
+ class X {};
+
+ } // namespace na
+ namespace x {
+ namespace y {
+ class Y {
+ na::X x;
+ };
+ } // namespace y
+ } // namespace x
+
+Note that we've successfully moved the class `Y` from namespace `na::nb` to
+namespace `x::y`.
+
+Caveats
+=======
+
+Content already exists in new namespace
+---------------------------------------
+
+Consider this `test.cc` example that defines two classes named `A`, one inside
+namespace `a` and one inside namespace `b`:
+
+.. code-block:: c++
+
+ namespace a {
+ class A {
+ int classAFromWithinNamespace_a;
+ };
+ } // namespace a
+
+ namespace b {
+ class A {
+ int classAFromWithinNamespace_b;
+ };
+ } //namespace b
+
+Let's move everything from namespace `a` to namespace `b`:
+
+.. code-block:: console
+
+ clang-change-namespace \
+ --old_namespace "a" \
+ --new_namespace "b" \
+ --file_pattern test.cc \
+ test.cc
+
+As expected, we now have two definitions of `class A` inside the namespace `b`:
+
+.. code-block:: c++
+
+ namespace b {
+ class A {
+ int classAFromWithinNamespace_a;
+ };
+ } // namespace b
+
+ namespace b {
+ class A {
+ int classAFromWithinNamespace_b;
+ };
+ } //namespace b
+
+The refactoring looks correct, but the code will not compile because of the
+duplicate class name. The tool does not check that the result still compiles,
+so watch out for such collisions.
+
+Inline namespace doesn't work
+-----------------------------
+
+Consider this usage of two versions of implementations for a `greet` function:
+
+.. code-block:: c++
+
+ #include <cstdio>
+
+ namespace Greeter {
+ inline namespace Version1 {
+ const char* greet() { return "Hello from version 1!"; }
+ } // namespace Version1
+ namespace Version2 {
+ const char* greet() { return "Hello from version 2!"; }
+ } // namespace Version2
+ } // namespace Greeter
+
+ int main(int argc, char* argv[]) {
+ printf("%s\n", Greeter::greet());
+ return 0;
+ }
+
+Note that currently `Greeter::greet()` will result in a call to
+`Greeter::Version1::greet()` because that's the inlined namespace.
+
+Let's say you want to move on and make `Version2` the default now and remove
+the `inline` from `Version1`. First let's try to turn `namespace Version2`
+into `inline namespace Version2`:
+
+.. code-block:: console
+
+ clang-change-namespace \
+ --old_namespace "Greeter::Version2" \
+ --new_namespace "inline Version2" \
+ --file_pattern main.cc main.cc
+
+But this will put the `inline` keyword in the wrong place resulting in:
+
+.. code-block:: c++
+
+ #include <cstdio>
+
+ namespace Greeter {
+ inline namespace Version1 {
+ const char* greet() { return "Hello from version 1!"; }
+ } // namespace Version1
+
+ } // namespace Greeter
+ namespace inline Greeter {
+ namespace Version2 {
+ const char *greet() { return "Hello from version 2!"; }
+ } // namespace Version2
+ } // namespace inline Greeter
+
+ int main(int argc, char* argv[]) {
+ printf("%s\n", Greeter::greet());
+ return 0;
+ }
+
+One cannot use :program:`clang-change-namespace` to inline a namespace.
+
+Symbol references not updated
+-----------------------------
+
+Consider this `test.cc` file:
+
+.. code-block:: c++
+
+ namespace old {
+ struct foo {};
+ } // namespace old
+
+ namespace b {
+ old::foo g_foo;
+ } // namespace b
+
+Notice that namespace `b` defines a global variable of type `old::foo`. If we
+now change the name of the `old` namespace to `modern`, the reference will not
+be updated:
+
+.. code-block:: console
+
+ clang-change-namespace \
+ --old_namespace "old" \
+ --new_namespace "modern" \
+ --file_pattern test.cc \
+ test.cc
+
+.. code-block:: c++
+
+ namespace modern {
+ struct foo {};
+ } // namespace modern
+
+ namespace b {
+ old::foo g_foo;
+ } // namespace b
+
+`g_foo` is still of the no longer existing type `old::foo` while instead it
+should use `modern::foo`.
+
+Only symbol references in the moved namespace are updated, not outside of it.
+
+
+:program:`clang-change-namespace` Command Line Options
+======================================================
+
+.. option:: --allowed_file=<string>
+
+ A file containing regexes of symbol names that are not expected to be updated
+ when changing namespaces around them.
+
+.. option:: --dump_result
+
+ Dump new file contents in YAML, if specified.
+
+.. option:: --extra-arg=<string>
+
+ Additional argument to append to the compiler command line
+
+.. option:: --extra-arg-before=<string>
+
+ Additional argument to prepend to the compiler command line
+
+.. option:: --file_pattern=<string>
+
+ Only rename namespaces in files that match the given regular expression
+ pattern.
+
+.. option:: -i
+
+ Inplace edit <file>s, if specified.
+
+.. option:: --new_namespace=<string>
+
+ New namespace. Use `""` when you target the global namespace.
+
+.. option:: --old_namespace=<string>
+
+ Old namespace.
+
+.. option:: -p <string>
+
+ Build path
+
+.. option:: --style=<string>
+
+ The style name used for reformatting.
diff --git a/clang-tools-extra/docs/index.rst b/clang-tools-extra/docs/index.rst
index 9f7324fcf7419..3f3a99d1b70c6 100644
--- a/clang-tools-extra/docs/index.rst
+++ b/clang-tools-extra/docs/index.rst
@@ -17,6 +17,7 @@ Contents
clang-tidy/index
clang-include-fixer
+ clang-change-namespace
modularize
pp-trace
clangd <https://clangd.llvm.org/>
From 7e9989390d95cbb382cb2dc9eb44b37717e23738 Mon Sep 17 00:00:00 2001
From: Florian Hahn <flo at fhahn.com>
Date: Mon, 18 Aug 2025 20:49:42 +0100
Subject: [PATCH 079/112] [VPlan] Materialize Build(Struct)Vectors for
VPReplicateRecipes. (NFCI) (#151487)
Materialize Build(Struct)Vectors explicitly for VPReplicateRecipes, to
serve their users requiring a vector, instead of doing so when unrolling
by VF.
Now we only need to implicitly build vectors in VPTransformState::get
for VPInstructions. Once they are also unrolled by VF we can remove the
code path altogether.
PR: https://github.com/llvm/llvm-project/pull/151487
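As a rough illustration of the two-step flow described above, here is a toy model. It does not use the real VPlan classes: `Node`, `makeBuildVector` and `expandLanes` are hypothetical names invented for this sketch. The assumption is that the first step wraps a replicated definition in a build-vector node with a single operand, and the later replicate-by-VF step swaps that operand for one scalar clone per lane.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy IR node, used only for illustration (NOT a VPlan class).
struct Node {
  std::string Name;
  std::vector<Node *> Operands;
};

// Step 1 (loosely models materializeBuildVectors): wrap the replicated
// definition in a build-vector node that has a single operand.
Node makeBuildVector(Node &Rep) { return Node{"build-vector", {&Rep}}; }

// Step 2 (loosely models replicateByVF): replace the single operand with
// one scalar clone per lane, in lane order.
void expandLanes(Node &BuildVector, std::vector<Node> &LaneDefs) {
  assert(BuildVector.Operands.size() == 1 &&
         "expected a single operand before expansion");
  BuildVector.Operands.clear();
  for (Node &Lane : LaneDefs)
    BuildVector.Operands.push_back(&Lane);
}
```

In this model, vector users keep pointing at the build-vector node throughout, so only its operand list changes when the lanes are created.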
---
.../Transforms/Vectorize/LoopVectorize.cpp | 3 +-
llvm/lib/Transforms/Vectorize/VPlan.cpp | 3 +
.../lib/Transforms/Vectorize/VPlanRecipes.cpp | 2 +
.../Transforms/Vectorize/VPlanTransforms.cpp | 46 ++++++++++++++
.../Transforms/Vectorize/VPlanTransforms.h | 4 ++
llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp | 61 +++++++++++++------
6 files changed, 101 insertions(+), 18 deletions(-)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index e009b81afd0ed..9c00e51ff5213 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7254,8 +7254,9 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
// TODO: Move to VPlan transform stage once the transition to the VPlan-based
// cost model is complete for better cost estimates.
VPlanTransforms::runPass(VPlanTransforms::unrollByUF, BestVPlan, BestUF);
- VPlanTransforms::runPass(VPlanTransforms::replicateByVF, BestVPlan, BestVF);
+ VPlanTransforms::runPass(VPlanTransforms::materializeBuildVectors, BestVPlan);
VPlanTransforms::runPass(VPlanTransforms::materializeBroadcasts, BestVPlan);
+ VPlanTransforms::runPass(VPlanTransforms::replicateByVF, BestVPlan, BestVF);
bool HasBranchWeights =
hasBranchWeightMD(*OrigLoop->getLoopLatch()->getTerminator());
if (HasBranchWeights) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 724a38e565304..f972efa07eb7e 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -355,6 +355,9 @@ Value *VPTransformState::get(const VPValue *Def, bool NeedsScalar) {
set(Def, VectorValue);
} else {
assert(!VF.isScalable() && "VF is assumed to be non scalable.");
+ assert(isa<VPInstruction>(Def) &&
+ "Explicit BuildVector recipes must have "
+ "handled packing for non-VPInstructions.");
// Initialize packing with insertelements to start from poison.
VectorValue = PoisonValue::get(toVectorizedTy(LastInst->getType(), VF));
for (unsigned Lane = 0; Lane < VF.getFixedValue(); ++Lane)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 0609510ac8212..96ef6e7cf8243 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -460,6 +460,8 @@ unsigned VPInstruction::getNumOperandsForOpcode(unsigned Opcode) {
case Instruction::Load:
case VPInstruction::AnyOf:
case VPInstruction::BranchOnCond:
+ case VPInstruction::BuildStructVector:
+ case VPInstruction::BuildVector:
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::ExplicitVectorLength:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 14532244d5748..81088c9a81392 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -3282,6 +3282,52 @@ void VPlanTransforms::materializeBackedgeTakenCount(VPlan &Plan,
BTC->replaceAllUsesWith(TCMO);
}
+void VPlanTransforms::materializeBuildVectors(VPlan &Plan) {
+ if (Plan.hasScalarVFOnly())
+ return;
+
+ VPTypeAnalysis TypeInfo(Plan);
+ VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
+ auto VPBBsOutsideLoopRegion = VPBlockUtils::blocksOnly<VPBasicBlock>(
+ vp_depth_first_shallow(Plan.getEntry()));
+ auto VPBBsInsideLoopRegion = VPBlockUtils::blocksOnly<VPBasicBlock>(
+ vp_depth_first_shallow(LoopRegion->getEntry()));
+ // Materialize Build(Struct)Vector for all replicating VPReplicateRecipes,
+ // excluding ones in replicate regions. Those are not materialized explicitly
+ // yet. Those vector users are still handled in VPReplicateRegion::execute(),
+ // via shouldPack().
+ // TODO: materialize build vectors for replicating recipes in replicating
+ // regions.
+ // TODO: materialize build vectors for VPInstructions.
+ for (VPBasicBlock *VPBB :
+ concat<VPBasicBlock *>(VPBBsOutsideLoopRegion, VPBBsInsideLoopRegion)) {
+ for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
+ auto *RepR = dyn_cast<VPReplicateRecipe>(&R);
+ auto UsesVectorOrInsideReplicateRegion = [RepR, LoopRegion](VPUser *U) {
+ VPRegionBlock *ParentRegion =
+ cast<VPRecipeBase>(U)->getParent()->getParent();
+ return !U->usesScalars(RepR) || ParentRegion != LoopRegion;
+ };
+ if (!RepR || RepR->isSingleScalar() ||
+ none_of(RepR->users(), UsesVectorOrInsideReplicateRegion))
+ continue;
+
+ Type *ScalarTy = TypeInfo.inferScalarType(RepR);
+ unsigned Opcode = ScalarTy->isStructTy()
+ ? VPInstruction::BuildStructVector
+ : VPInstruction::BuildVector;
+ auto *BuildVector = new VPInstruction(Opcode, {RepR});
+ BuildVector->insertAfter(RepR);
+
+ RepR->replaceUsesWithIf(
+ BuildVector, [BuildVector, &UsesVectorOrInsideReplicateRegion](
+ VPUser &U, unsigned) {
+ return &U != BuildVector && UsesVectorOrInsideReplicateRegion(&U);
+ });
+ }
+ }
+}
+
void VPlanTransforms::materializeVectorTripCount(VPlan &Plan,
VPBasicBlock *VectorPHVPBB,
bool TailByMasking,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 35fa45ced53e0..5b3d18b237efb 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -274,6 +274,10 @@ struct VPlanTransforms {
static void materializeBackedgeTakenCount(VPlan &Plan,
VPBasicBlock *VectorPH);
+ /// Add explicit Build[Struct]Vector recipes that combine multiple scalar
+ /// values into single vectors.
+ static void materializeBuildVectors(VPlan &Plan);
+
/// Materialize VF and VFxUF to be computed explicitly using VPInstructions.
static void materializeVFAndVFxUF(VPlan &Plan, VPBasicBlock *VectorPH,
ElementCount VF);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
index 9a6b7b70cc9f9..62fd83a5e092a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
@@ -464,10 +464,12 @@ void VPlanTransforms::unrollByUF(VPlan &Plan, unsigned UF) {
VPlanTransforms::removeDeadRecipes(Plan);
}
-/// Create a single-scalar clone of \p RepR for lane \p Lane.
-static VPReplicateRecipe *cloneForLane(VPlan &Plan, VPBuilder &Builder,
- Type *IdxTy, VPReplicateRecipe *RepR,
- VPLane Lane) {
+/// Create a single-scalar clone of \p RepR for lane \p Lane. Use \p
+/// Def2LaneDefs to look up scalar definitions for operands of \p RepR.
+static VPReplicateRecipe *
+cloneForLane(VPlan &Plan, VPBuilder &Builder, Type *IdxTy,
+ VPReplicateRecipe *RepR, VPLane Lane,
+ const DenseMap<VPValue *, SmallVector<VPValue *>> &Def2LaneDefs) {
// Collect the operands at Lane, creating extracts as needed.
SmallVector<VPValue *> NewOps;
for (VPValue *Op : RepR->operands()) {
@@ -480,6 +482,14 @@ static VPReplicateRecipe *cloneForLane(VPlan &Plan, VPBuilder &Builder,
Builder.createNaryOp(VPInstruction::ExtractLastElement, {Op}));
continue;
}
+ // If Op is a definition that has been unrolled, directly use the clone for
+ // the corresponding lane.
+ auto LaneDefs = Def2LaneDefs.find(Op);
+ if (LaneDefs != Def2LaneDefs.end()) {
+ NewOps.push_back(LaneDefs->second[Lane.getKnownLane()]);
+ continue;
+ }
+
// Look through buildvector to avoid unnecessary extracts.
if (match(Op, m_BuildVector())) {
NewOps.push_back(
@@ -512,6 +522,13 @@ void VPlanTransforms::replicateByVF(VPlan &Plan, ElementCount VF) {
vp_depth_first_shallow(Plan.getVectorLoopRegion()->getEntry()));
auto VPBBsToUnroll =
concat<VPBasicBlock *>(VPBBsOutsideLoopRegion, VPBBsInsideLoopRegion);
+ // A mapping of current VPValue definitions to collections of new VPValues
+ // defined per lane. Serves to hook-up potential users of current VPValue
+ // definition that are replicated-per-VF later.
+ DenseMap<VPValue *, SmallVector<VPValue *>> Def2LaneDefs;
+ // The removal of current recipes being replaced by new ones needs to be
+ // delayed until Def2LaneDefs is no longer in use.
+ SmallVector<VPRecipeBase *> ToRemove;
for (VPBasicBlock *VPBB : VPBBsToUnroll) {
for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
auto *RepR = dyn_cast<VPReplicateRecipe>(&R);
@@ -523,12 +540,12 @@ void VPlanTransforms::replicateByVF(VPlan &Plan, ElementCount VF) {
if (isa<StoreInst>(RepR->getUnderlyingInstr()) &&
vputils::isSingleScalar(RepR->getOperand(1))) {
// Stores to invariant addresses need to store the last lane only.
- cloneForLane(Plan, Builder, IdxTy, RepR,
- VPLane::getLastLaneForVF(VF));
+ cloneForLane(Plan, Builder, IdxTy, RepR, VPLane::getLastLaneForVF(VF),
+ Def2LaneDefs);
} else {
// Create single-scalar version of RepR for all lanes.
for (unsigned I = 0; I != VF.getKnownMinValue(); ++I)
- cloneForLane(Plan, Builder, IdxTy, RepR, VPLane(I));
+ cloneForLane(Plan, Builder, IdxTy, RepR, VPLane(I), Def2LaneDefs);
}
RepR->eraseFromParent();
continue;
@@ -536,23 +553,33 @@ void VPlanTransforms::replicateByVF(VPlan &Plan, ElementCount VF) {
/// Create single-scalar version of RepR for all lanes.
SmallVector<VPValue *> LaneDefs;
for (unsigned I = 0; I != VF.getKnownMinValue(); ++I)
- LaneDefs.push_back(cloneForLane(Plan, Builder, IdxTy, RepR, VPLane(I)));
+ LaneDefs.push_back(
+ cloneForLane(Plan, Builder, IdxTy, RepR, VPLane(I), Def2LaneDefs));
+ Def2LaneDefs[RepR] = LaneDefs;
/// Users that only demand the first lane can use the definition for lane
/// 0.
RepR->replaceUsesWithIf(LaneDefs[0], [RepR](VPUser &U, unsigned) {
return U.onlyFirstLaneUsed(RepR);
});
- // If needed, create a Build(Struct)Vector recipe to insert the scalar
- // lane values into a vector.
- Type *ResTy = RepR->getUnderlyingInstr()->getType();
- VPValue *VecRes = Builder.createNaryOp(
- ResTy->isStructTy() ? VPInstruction::BuildStructVector
- : VPInstruction::BuildVector,
- LaneDefs);
- RepR->replaceAllUsesWith(VecRes);
- RepR->eraseFromParent();
+ // Update each build vector user that currently has RepR as its only
+ // operand, to have all LaneDefs as its operands.
+ for (VPUser *U : to_vector(RepR->users())) {
+ auto *VPI = dyn_cast<VPInstruction>(U);
+ if (!VPI || (VPI->getOpcode() != VPInstruction::BuildVector &&
+ VPI->getOpcode() != VPInstruction::BuildStructVector))
+ continue;
+ assert(VPI->getNumOperands() == 1 &&
+ "Build(Struct)Vector must have a single operand before "
+ "replicating by VF");
+ VPI->setOperand(0, LaneDefs[0]);
+ for (VPValue *LaneDef : drop_begin(LaneDefs))
+ VPI->addOperand(LaneDef);
+ }
+ ToRemove.push_back(RepR);
}
}
+ for (auto *R : reverse(ToRemove))
+ R->eraseFromParent();
}
>From 378d2401251f53a8abb8a9757536bae2d000bc77 Mon Sep 17 00:00:00 2001
From: Baranov Victor <bar.victor.2002 at gmail.com>
Date: Mon, 18 Aug 2025 22:49:54 +0300
Subject: [PATCH 080/112] [clang-tidy] Remove addition of emacs tag in checks
headers (#153942)
After https://github.com/llvm/llvm-project/pull/118553, the emacs tag is no
longer needed in LLVM files:
https://llvm.org/docs/CodingStandards.html#file-headers.
This patch removes it from `add_new_check.py`, reducing the complexity we
need to maintain.
---
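For readers comparing the two header styles, here is a minimal sketch that rebuilds the first line the script used to write and the fixed banner that replaces it. The helper names (`old_header_banner`, `old_impl_banner`, `NEW_BANNER`) are hypothetical and not part of `add_new_check.py`; the string pieces are copied from the lines removed in the diff below, and the banner is assumed to be the usual 80-column LLVM form.

```python
import os

# Assumed 80-column LLVM banner: "//===" + 70 dashes + "===//".
NEW_BANNER = "//===" + "-" * 70 + "===//"


def old_header_banner(filename: str) -> str:
    """Rebuild the first line formerly written for a check's .h file."""
    base = os.path.basename(filename)
    return (
        "//===--- "
        + base
        + " - clang-tidy "
        + "-" * max(0, 42 - len(base))  # padding depended on the file name
        + "*- C++ -*-===//"  # the emacs tag this patch drops
    )


def old_impl_banner(filename: str) -> str:
    """Rebuild the first line formerly written for a check's .cpp file."""
    base = os.path.basename(filename)
    return (
        "//===--- "
        + base
        + " - clang-tidy "
        + "-" * max(0, 51 - len(base))
        + "-===//"
    )


if __name__ == "__main__":
    print(old_header_banner("misc/FooCheck.h"))
    print(old_impl_banner("misc/FooCheck.cpp"))
    print(NEW_BANNER)
```

Both old forms needed the file name and a computed pad width to reach 80 columns; the new banner is a constant, which is why the five `f.write` calls per file can go away.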
clang-tools-extra/clang-tidy/add_new_check.py | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/clang-tools-extra/clang-tidy/add_new_check.py b/clang-tools-extra/clang-tidy/add_new_check.py
index e366f10053535..2b51a1dc40ebc 100755
--- a/clang-tools-extra/clang-tidy/add_new_check.py
+++ b/clang-tools-extra/clang-tidy/add_new_check.py
@@ -89,13 +89,9 @@ def write_header(
+ check_name_camel.upper()
+ "_H"
)
- f.write("//===--- ")
- f.write(os.path.basename(filename))
- f.write(" - clang-tidy ")
- f.write("-" * max(0, 42 - len(os.path.basename(filename))))
- f.write("*- C++ -*-===//")
f.write(
"""
+//===----------------------------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
@@ -145,13 +141,9 @@ def write_implementation(
filename = os.path.join(module_path, check_name_camel) + ".cpp"
print("Creating %s..." % filename)
with io.open(filename, "w", encoding="utf8", newline="\n") as f:
- f.write("//===--- ")
- f.write(os.path.basename(filename))
- f.write(" - clang-tidy ")
- f.write("-" * max(0, 51 - len(os.path.basename(filename))))
- f.write("-===//")
f.write(
"""
+//===----------------------------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
>From b20bbd48e8b1966731a284b4208e048e060e97c2 Mon Sep 17 00:00:00 2001
From: Sergei Barannikov <barannikov88 at gmail.com>
Date: Mon, 18 Aug 2025 22:53:09 +0300
Subject: [PATCH 081/112] [TableGen][DecoderEmitter] Store HW mode ID instead
of name (NFC) (#154052)
This simplifies the code a bit.
---
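As a rough illustration of the refactoring, the sketch below models the idea in Python rather than C++: encodings carry a numeric mode ID, and the mode name is looked up only when the decoder-table namespace suffix is formed. `DEFAULT_MODE = 0`, the `mode_names` dictionary, and both function names are assumptions made for this example; they are not TableGen APIs.

```python
DEFAULT_MODE = 0  # assumed numeric value standing in for DefaultMode


def collect_mode_ids(referenced):
    """Return the referenced mode IDs in ascending order, like iterating set bits.

    Falls back to the default mode so there is always at least one entry.
    """
    ids = sorted(set(referenced))
    return ids or [DEFAULT_MODE]


def decoder_table_namespace(namespace, hw_mode_id, mode_names):
    """Append "_<ModeName>" only for non-default modes."""
    if hw_mode_id == DEFAULT_MODE:
        # Tables for the default mode keep the plain namespace.
        return namespace
    return namespace + "_" + mode_names[hw_mode_id]


if __name__ == "__main__":
    names = {1: "ModeA", 2: "ModeB"}  # hypothetical mode names
    for mode in collect_mode_ids([2, 1, 2]):
        print(decoder_table_namespace("Foo", mode, names))
```

Keeping the ID until the last step removes the repeated "default mode maps to the empty string" special cases that the name-based version needed at every insertion site.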
llvm/utils/TableGen/DecoderEmitter.cpp | 71 +++++++++++---------------
1 file changed, 29 insertions(+), 42 deletions(-)
diff --git a/llvm/utils/TableGen/DecoderEmitter.cpp b/llvm/utils/TableGen/DecoderEmitter.cpp
index 238c87a196ea9..2b44577253982 100644
--- a/llvm/utils/TableGen/DecoderEmitter.cpp
+++ b/llvm/utils/TableGen/DecoderEmitter.cpp
@@ -208,14 +208,14 @@ struct DecoderTableInfo {
struct EncodingAndInst {
const Record *EncodingDef;
const CodeGenInstruction *Inst;
- StringRef HwModeName;
+ unsigned HwModeID;
EncodingAndInst(const Record *EncodingDef, const CodeGenInstruction *Inst,
- StringRef HwModeName = "")
- : EncodingDef(EncodingDef), Inst(Inst), HwModeName(HwModeName) {}
+ unsigned HwModeID = DefaultMode)
+ : EncodingDef(EncodingDef), Inst(Inst), HwModeID(HwModeID) {}
};
-using NamespacesHwModesMap = std::map<std::string, std::set<StringRef>>;
+using NamespacesHwModesMap = std::map<std::string, std::set<unsigned>>;
class DecoderEmitter {
const RecordKeeper &RK;
@@ -2386,10 +2386,9 @@ static bool Check(DecodeStatus &Out, DecodeStatus In) {
)";
}
-// Collect all HwModes referenced by the target for encoding purposes,
-// returning a vector of corresponding names.
+// Collect all HwModes referenced by the target for encoding purposes.
static void collectHwModesReferencedForEncodings(
- const CodeGenHwModes &HWM, std::vector<StringRef> &Names,
+ const CodeGenHwModes &HWM, std::vector<unsigned> &HwModeIDs,
NamespacesHwModesMap &NamespacesWithHwModes) {
SmallBitVector BV(HWM.getNumModeIds());
for (const auto &MS : HWM.getHwModeSelects()) {
@@ -2397,34 +2396,25 @@ static void collectHwModesReferencedForEncodings(
if (EncodingDef->isSubClassOf("InstructionEncoding")) {
std::string DecoderNamespace =
EncodingDef->getValueAsString("DecoderNamespace").str();
- if (HwModeID == DefaultMode) {
- NamespacesWithHwModes[DecoderNamespace].insert("");
- } else {
- NamespacesWithHwModes[DecoderNamespace].insert(
- HWM.getMode(HwModeID).Name);
- }
+ NamespacesWithHwModes[DecoderNamespace].insert(HwModeID);
BV.set(HwModeID);
}
}
}
- transform(BV.set_bits(), std::back_inserter(Names), [&HWM](const int &M) {
- if (M == DefaultMode)
- return StringRef("");
- return HWM.getModeName(M, /*IncludeDefault=*/true);
- });
+ HwModeIDs.assign(BV.set_bits_begin(), BV.set_bits_end());
}
static void
handleHwModesUnrelatedEncodings(const CodeGenInstruction *Instr,
- ArrayRef<StringRef> HwModeNames,
+ ArrayRef<unsigned> HwModeIDs,
NamespacesHwModesMap &NamespacesWithHwModes,
std::vector<EncodingAndInst> &GlobalEncodings) {
const Record *InstDef = Instr->TheDef;
switch (DecoderEmitterSuppressDuplicates) {
case SUPPRESSION_DISABLE: {
- for (StringRef HwModeName : HwModeNames)
- GlobalEncodings.emplace_back(InstDef, Instr, HwModeName);
+ for (unsigned HwModeID : HwModeIDs)
+ GlobalEncodings.emplace_back(InstDef, Instr, HwModeID);
break;
}
case SUPPRESSION_LEVEL1: {
@@ -2432,17 +2422,17 @@ handleHwModesUnrelatedEncodings(const CodeGenInstruction *Instr,
InstDef->getValueAsString("DecoderNamespace").str();
auto It = NamespacesWithHwModes.find(DecoderNamespace);
if (It != NamespacesWithHwModes.end()) {
- for (StringRef HwModeName : It->second)
- GlobalEncodings.emplace_back(InstDef, Instr, HwModeName);
+ for (unsigned HwModeID : It->second)
+ GlobalEncodings.emplace_back(InstDef, Instr, HwModeID);
} else {
// Only emit the encoding once, as it's DecoderNamespace doesn't
// contain any HwModes.
- GlobalEncodings.emplace_back(InstDef, Instr, "");
+ GlobalEncodings.emplace_back(InstDef, Instr, DefaultMode);
}
break;
}
case SUPPRESSION_LEVEL2:
- GlobalEncodings.emplace_back(InstDef, Instr, "");
+ GlobalEncodings.emplace_back(InstDef, Instr, DefaultMode);
break;
}
}
@@ -2473,13 +2463,13 @@ namespace {
// First, collect all encoding-related HwModes referenced by the target.
// And establish a mapping table between DecoderNamespace and HwMode.
- // If HwModeNames is empty, add the empty string so we always have one HwMode.
+ // If HwModeIDs is empty, add the default mode so we always have one HwMode.
const CodeGenHwModes &HWM = Target.getHwModes();
- std::vector<StringRef> HwModeNames;
+ std::vector<unsigned> HwModeIDs;
NamespacesHwModesMap NamespacesWithHwModes;
- collectHwModesReferencedForEncodings(HWM, HwModeNames, NamespacesWithHwModes);
- if (HwModeNames.empty())
- HwModeNames.push_back("");
+ collectHwModesReferencedForEncodings(HWM, HwModeIDs, NamespacesWithHwModes);
+ if (HwModeIDs.empty())
+ HwModeIDs.push_back(DefaultMode);
const auto &NumberedInstructions = Target.getInstructions();
NumberedEncodings.reserve(NumberedInstructions.size());
@@ -2487,20 +2477,14 @@ namespace {
const Record *InstDef = NumberedInstruction->TheDef;
if (const Record *RV = InstDef->getValueAsOptionalDef("EncodingInfos")) {
EncodingInfoByHwMode EBM(RV, HWM);
- for (auto [HwModeID, EncodingDef] : EBM) {
- // DecoderTables with DefaultMode should not have any suffix.
- if (HwModeID == DefaultMode) {
- NumberedEncodings.emplace_back(EncodingDef, NumberedInstruction, "");
- } else {
- NumberedEncodings.emplace_back(EncodingDef, NumberedInstruction,
- HWM.getMode(HwModeID).Name);
- }
- }
+ for (auto [HwModeID, EncodingDef] : EBM)
+ NumberedEncodings.emplace_back(EncodingDef, NumberedInstruction,
+ HwModeID);
continue;
}
// This instruction is encoded the same on all HwModes.
// According to user needs, provide varying degrees of suppression.
- handleHwModesUnrelatedEncodings(NumberedInstruction, HwModeNames,
+ handleHwModesUnrelatedEncodings(NumberedInstruction, HwModeIDs,
NamespacesWithHwModes, NumberedEncodings);
}
for (const Record *NumberedAlias :
@@ -2547,8 +2531,11 @@ namespace {
}
std::string DecoderNamespace =
EncodingDef->getValueAsString("DecoderNamespace").str();
- if (!NumberedEncoding.HwModeName.empty())
- DecoderNamespace += "_" + NumberedEncoding.HwModeName.str();
+ // DecoderTables with DefaultMode should not have any suffix.
+ if (NumberedEncoding.HwModeID != DefaultMode) {
+ StringRef HwModeName = HWM.getModeName(NumberedEncoding.HwModeID);
+ DecoderNamespace += ("_" + HwModeName).str();
+ }
EncMap[{DecoderNamespace, Size}].push_back(NEI);
} else {
NumEncodingsOmitted++;
>From c328c5d9117c19555793c548ebccfedc0b972398 Mon Sep 17 00:00:00 2001
From: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date: Mon, 18 Aug 2025 13:07:20 -0700
Subject: [PATCH 082/112] [AMDGPU] Combine to bf16 reciprocal square root.
(#154185)
Co-authored-by: Ivan Kosarev <Ivan.Kosarev at amd.com>
---
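In the updated checks for the negated case (`fdiv -1.0, sqrt(x)`), the expected output is `v_rsq_bf16` followed by an XOR with `0x8000` instead of a source modifier on `v_rcp_bf16`. The sketch below only illustrates why that XOR is a negation: bit 15 of a bf16 value is its sign bit. The helper names are hypothetical, and the float-to-bf16 conversion truncates rather than rounds, which is enough for values that are exactly representable.

```python
import struct


def float_to_bf16_bits(value: float) -> int:
    """Keep the upper 16 bits of the float32 encoding (truncation, no rounding)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits >> 16


def bf16_bits_to_float(bits: int) -> float:
    """Widen a bf16 bit pattern back to a Python float via float32."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))
    return value


def negate_bf16(bits: int) -> int:
    """Flip the sign bit, mirroring the XOR with 0x8000 in the test output."""
    return bits ^ 0x8000


if __name__ == "__main__":
    rsq_of_four = float_to_bf16_bits(0.5)  # 1/sqrt(4.0)
    print(hex(rsq_of_four), bf16_bits_to_float(negate_bf16(rsq_of_four)))
```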
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | 2 +-
llvm/test/CodeGen/AMDGPU/fdiv.bf16.ll | 73 +++++++++--------------
2 files changed, 28 insertions(+), 47 deletions(-)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index f58fde421f77d..072fb9cc547b0 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -15729,7 +15729,7 @@ SDValue SITargetLowering::performFDivCombine(SDNode *N,
SelectionDAG &DAG = DCI.DAG;
SDLoc SL(N);
EVT VT = N->getValueType(0);
- if (VT != MVT::f16 || !Subtarget->has16BitInsts())
+ if ((VT != MVT::f16 && VT != MVT::bf16) || !Subtarget->has16BitInsts())
return SDValue();
SDValue LHS = N->getOperand(0);
diff --git a/llvm/test/CodeGen/AMDGPU/fdiv.bf16.ll b/llvm/test/CodeGen/AMDGPU/fdiv.bf16.ll
index 01ebe7d71428b..91831a8d4fecb 100644
--- a/llvm/test/CodeGen/AMDGPU/fdiv.bf16.ll
+++ b/llvm/test/CodeGen/AMDGPU/fdiv.bf16.ll
@@ -82,67 +82,59 @@ define bfloat @v_rcp_bf16_neg(bfloat %x) {
ret bfloat %fdiv
}
-; TODO: Support lowering to v_rsq_bf16.
define bfloat @v_rsq_bf16(bfloat %x) {
; GFX1250-TRUE16-LABEL: v_rsq_bf16:
; GFX1250-TRUE16: ; %bb.0:
; GFX1250-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-TRUE16-NEXT: s_wait_kmcnt 0x0
-; GFX1250-TRUE16-NEXT: v_sqrt_bf16_e32 v0.l, v0.l
-; GFX1250-TRUE16-NEXT: s_delay_alu instid0(TRANS32_DEP_1)
-; GFX1250-TRUE16-NEXT: v_rcp_bf16_e32 v0.l, v0.l
+; GFX1250-TRUE16-NEXT: v_rsq_bf16_e32 v0.l, v0.l
; GFX1250-TRUE16-NEXT: s_set_pc_i64 s[30:31]
;
; GFX1250-FAKE16-LABEL: v_rsq_bf16:
; GFX1250-FAKE16: ; %bb.0:
; GFX1250-FAKE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-FAKE16-NEXT: s_wait_kmcnt 0x0
-; GFX1250-FAKE16-NEXT: v_sqrt_bf16_e32 v0, v0
-; GFX1250-FAKE16-NEXT: s_delay_alu instid0(TRANS32_DEP_1)
-; GFX1250-FAKE16-NEXT: v_rcp_bf16_e32 v0, v0
+; GFX1250-FAKE16-NEXT: v_rsq_bf16_e32 v0, v0
; GFX1250-FAKE16-NEXT: s_set_pc_i64 s[30:31]
%sqrt = call contract bfloat @llvm.sqrt.bf16(bfloat %x)
%fdiv = fdiv contract bfloat 1.0, %sqrt
ret bfloat %fdiv
}
-; TODO: Support lowering to v_rsq_bf16.
define bfloat @v_rsq_bf16_neg(bfloat %x) {
; GFX1250-TRUE16-LABEL: v_rsq_bf16_neg:
; GFX1250-TRUE16: ; %bb.0:
; GFX1250-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-TRUE16-NEXT: s_wait_kmcnt 0x0
-; GFX1250-TRUE16-NEXT: v_sqrt_bf16_e32 v0.l, v0.l
+; GFX1250-TRUE16-NEXT: v_rsq_bf16_e32 v0.l, v0.l
+; GFX1250-TRUE16-NEXT: v_nop
; GFX1250-TRUE16-NEXT: s_delay_alu instid0(TRANS32_DEP_1)
-; GFX1250-TRUE16-NEXT: v_rcp_bf16_e64 v0.l, -v0.l
+; GFX1250-TRUE16-NEXT: v_xor_b16 v0.l, 0x8000, v0.l
; GFX1250-TRUE16-NEXT: s_set_pc_i64 s[30:31]
;
; GFX1250-FAKE16-LABEL: v_rsq_bf16_neg:
; GFX1250-FAKE16: ; %bb.0:
; GFX1250-FAKE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-FAKE16-NEXT: s_wait_kmcnt 0x0
-; GFX1250-FAKE16-NEXT: v_sqrt_bf16_e32 v0, v0
+; GFX1250-FAKE16-NEXT: v_rsq_bf16_e32 v0, v0
+; GFX1250-FAKE16-NEXT: v_nop
; GFX1250-FAKE16-NEXT: s_delay_alu instid0(TRANS32_DEP_1)
-; GFX1250-FAKE16-NEXT: v_rcp_bf16_e64 v0, -v0
+; GFX1250-FAKE16-NEXT: v_xor_b32_e32 v0, 0x8000, v0
; GFX1250-FAKE16-NEXT: s_set_pc_i64 s[30:31]
%sqrt = call contract bfloat @llvm.sqrt.bf16(bfloat %x)
%fdiv = fdiv contract bfloat -1.0, %sqrt
ret bfloat %fdiv
}
-; TODO: Support lowering to v_rsq_bf16.
define <2 x bfloat> @v_rsq_bf16_multi_use(bfloat %x) {
; GFX1250-TRUE16-LABEL: v_rsq_bf16_multi_use:
; GFX1250-TRUE16: ; %bb.0:
; GFX1250-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-TRUE16-NEXT: s_wait_kmcnt 0x0
; GFX1250-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
-; GFX1250-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(TRANS32_DEP_1)
-; GFX1250-TRUE16-NEXT: v_sqrt_bf16_e32 v1.l, v1.l
-; GFX1250-TRUE16-NEXT: v_rcp_bf16_e32 v1.h, v1.l
+; GFX1250-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(TRANS32_DEP_1)
+; GFX1250-TRUE16-NEXT: v_rsq_bf16_e32 v1.h, v1.l
; GFX1250-TRUE16-NEXT: v_nop
-; GFX1250-TRUE16-NEXT: v_mov_b16_e32 v1.l, v0.l
-; GFX1250-TRUE16-NEXT: s_delay_alu instid0(TRANS32_DEP_1) | instid1(VALU_DEP_1)
; GFX1250-TRUE16-NEXT: v_mov_b32_e32 v0, v1
; GFX1250-TRUE16-NEXT: s_set_pc_i64 s[30:31]
;
@@ -150,10 +142,9 @@ define <2 x bfloat> @v_rsq_bf16_multi_use(bfloat %x) {
; GFX1250-FAKE16: ; %bb.0:
; GFX1250-FAKE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-FAKE16-NEXT: s_wait_kmcnt 0x0
-; GFX1250-FAKE16-NEXT: v_sqrt_bf16_e32 v1, v0
-; GFX1250-FAKE16-NEXT: s_delay_alu instid0(TRANS32_DEP_1) | instskip(SKIP_1) | instid1(TRANS32_DEP_1)
-; GFX1250-FAKE16-NEXT: v_rcp_bf16_e32 v1, v1
+; GFX1250-FAKE16-NEXT: v_rsq_bf16_e32 v1, v0
; GFX1250-FAKE16-NEXT: v_nop
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(TRANS32_DEP_1)
; GFX1250-FAKE16-NEXT: v_perm_b32 v0, v1, v0, 0x5040100
; GFX1250-FAKE16-NEXT: s_set_pc_i64 s[30:31]
%sqrt = call contract bfloat @llvm.sqrt.bf16(bfloat %x)
@@ -163,7 +154,6 @@ define <2 x bfloat> @v_rsq_bf16_multi_use(bfloat %x) {
ret <2 x bfloat> %r2
}
-; TODO: Support lowering to v_rsq_bf16.
define bfloat @v_rsq_bf16_missing_contract0(bfloat %x) {
; GFX1250-TRUE16-LABEL: v_rsq_bf16_missing_contract0:
; GFX1250-TRUE16: ; %bb.0:
@@ -187,7 +177,6 @@ define bfloat @v_rsq_bf16_missing_contract0(bfloat %x) {
ret bfloat %fdiv
}
-; TODO: Support lowering to v_rsq_bf16.
define bfloat @v_rsq_bf16_missing_contract1(bfloat %x) {
; GFX1250-TRUE16-LABEL: v_rsq_bf16_missing_contract1:
; GFX1250-TRUE16: ; %bb.0:
@@ -211,7 +200,6 @@ define bfloat @v_rsq_bf16_missing_contract1(bfloat %x) {
ret bfloat %fdiv
}
-; TODO: Support lowering to v_rsq_bf16.
define bfloat @v_neg_rsq_bf16_missing_contract1(bfloat %x) {
; GFX1250-TRUE16-LABEL: v_neg_rsq_bf16_missing_contract1:
; GFX1250-TRUE16: ; %bb.0:
@@ -240,11 +228,8 @@ define <2 x bfloat> @v_rsq_v2bf16(<2 x bfloat> %a) {
; GFX1250-TRUE16: ; %bb.0:
; GFX1250-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-TRUE16-NEXT: s_wait_kmcnt 0x0
-; GFX1250-TRUE16-NEXT: v_sqrt_bf16_e32 v0.h, v0.h
-; GFX1250-TRUE16-NEXT: v_sqrt_bf16_e32 v0.l, v0.l
-; GFX1250-TRUE16-NEXT: s_delay_alu instid0(TRANS32_DEP_2) | instskip(NEXT) | instid1(TRANS32_DEP_2)
-; GFX1250-TRUE16-NEXT: v_rcp_bf16_e32 v0.h, v0.h
-; GFX1250-TRUE16-NEXT: v_rcp_bf16_e32 v0.l, v0.l
+; GFX1250-TRUE16-NEXT: v_rsq_bf16_e32 v0.h, v0.h
+; GFX1250-TRUE16-NEXT: v_rsq_bf16_e32 v0.l, v0.l
; GFX1250-TRUE16-NEXT: s_set_pc_i64 s[30:31]
;
; GFX1250-FAKE16-LABEL: v_rsq_v2bf16:
@@ -252,12 +237,9 @@ define <2 x bfloat> @v_rsq_v2bf16(<2 x bfloat> %a) {
; GFX1250-FAKE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-FAKE16-NEXT: s_wait_kmcnt 0x0
; GFX1250-FAKE16-NEXT: v_lshrrev_b32_e32 v1, 16, v0
-; GFX1250-FAKE16-NEXT: v_sqrt_bf16_e32 v0, v0
-; GFX1250-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(TRANS32_DEP_2)
-; GFX1250-FAKE16-NEXT: v_sqrt_bf16_e32 v1, v1
-; GFX1250-FAKE16-NEXT: v_rcp_bf16_e32 v0, v0
-; GFX1250-FAKE16-NEXT: s_delay_alu instid0(TRANS32_DEP_2) | instskip(SKIP_1) | instid1(TRANS32_DEP_1)
-; GFX1250-FAKE16-NEXT: v_rcp_bf16_e32 v1, v1
+; GFX1250-FAKE16-NEXT: v_rsq_bf16_e32 v0, v0
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(TRANS32_DEP_1)
+; GFX1250-FAKE16-NEXT: v_rsq_bf16_e32 v1, v1
; GFX1250-FAKE16-NEXT: v_nop
; GFX1250-FAKE16-NEXT: v_perm_b32 v0, v1, v0, 0x5040100
; GFX1250-FAKE16-NEXT: s_set_pc_i64 s[30:31]
@@ -271,11 +253,11 @@ define <2 x bfloat> @v_neg_rsq_v2bf16(<2 x bfloat> %a) {
; GFX1250-TRUE16: ; %bb.0:
; GFX1250-TRUE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-TRUE16-NEXT: s_wait_kmcnt 0x0
-; GFX1250-TRUE16-NEXT: v_sqrt_bf16_e32 v0.h, v0.h
-; GFX1250-TRUE16-NEXT: v_sqrt_bf16_e32 v0.l, v0.l
-; GFX1250-TRUE16-NEXT: s_delay_alu instid0(TRANS32_DEP_2) | instskip(NEXT) | instid1(TRANS32_DEP_2)
-; GFX1250-TRUE16-NEXT: v_rcp_bf16_e64 v0.h, -v0.h
-; GFX1250-TRUE16-NEXT: v_rcp_bf16_e64 v0.l, -v0.l
+; GFX1250-TRUE16-NEXT: v_rsq_bf16_e32 v0.h, v0.h
+; GFX1250-TRUE16-NEXT: v_rsq_bf16_e32 v0.l, v0.l
+; GFX1250-TRUE16-NEXT: s_delay_alu instid0(TRANS32_DEP_2) | instskip(NEXT) | instid1(TRANS32_DEP_1)
+; GFX1250-TRUE16-NEXT: v_xor_b16 v0.h, 0x8000, v0.h
+; GFX1250-TRUE16-NEXT: v_xor_b16 v0.l, 0x8000, v0.l
; GFX1250-TRUE16-NEXT: s_set_pc_i64 s[30:31]
;
; GFX1250-FAKE16-LABEL: v_neg_rsq_v2bf16:
@@ -283,13 +265,12 @@ define <2 x bfloat> @v_neg_rsq_v2bf16(<2 x bfloat> %a) {
; GFX1250-FAKE16-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-FAKE16-NEXT: s_wait_kmcnt 0x0
; GFX1250-FAKE16-NEXT: v_lshrrev_b32_e32 v1, 16, v0
-; GFX1250-FAKE16-NEXT: v_sqrt_bf16_e32 v0, v0
+; GFX1250-FAKE16-NEXT: v_rsq_bf16_e32 v0, v0
; GFX1250-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(TRANS32_DEP_2)
-; GFX1250-FAKE16-NEXT: v_sqrt_bf16_e32 v1, v1
-; GFX1250-FAKE16-NEXT: v_rcp_bf16_e64 v0, -v0
-; GFX1250-FAKE16-NEXT: s_delay_alu instid0(TRANS32_DEP_2) | instskip(SKIP_1) | instid1(TRANS32_DEP_1)
-; GFX1250-FAKE16-NEXT: v_rcp_bf16_e64 v1, -v1
-; GFX1250-FAKE16-NEXT: v_nop
+; GFX1250-FAKE16-NEXT: v_rsq_bf16_e32 v1, v1
+; GFX1250-FAKE16-NEXT: v_xor_b32_e32 v0, 0x8000, v0
+; GFX1250-FAKE16-NEXT: s_delay_alu instid0(TRANS32_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-FAKE16-NEXT: v_xor_b32_e32 v1, 0x8000, v1
; GFX1250-FAKE16-NEXT: v_perm_b32 v0, v1, v0, 0x5040100
; GFX1250-FAKE16-NEXT: s_set_pc_i64 s[30:31]
%sqrt = call contract <2 x bfloat> @llvm.sqrt.v2bf16(<2 x bfloat> %a)
>From 3395676a18ab580f21ebcd4324feaf1294a8b6d9 Mon Sep 17 00:00:00 2001
From: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date: Mon, 18 Aug 2025 13:07:36 -0700
Subject: [PATCH 083/112] [AMDGPU] Fold copies of constant physical registers
into their uses (#154183)
With current codegen this only affects src_flat_scratch_base_lo/hi.
Co-authored-by: Jay Foad <Jay.Foad at amd.com>
---
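The change to the early-exit test in `tryFoldFoldableCopy` can be read as a small truth table. The sketch below is a Python model, not the C++ code: the `Reg` class and its three flags are hypothetical stand-ins for `Register::isVirtual()`, `Register::isPhysical()`, and `TRI->isConstantPhysReg()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    is_virtual: bool = False
    is_physical: bool = False
    is_constant: bool = False  # stands in for TRI->isConstantPhysReg(Reg)


def old_rejects(reg: Reg) -> bool:
    """Before the patch: anything that is not a virtual register is not folded."""
    return not reg.is_virtual


def new_rejects(reg: Reg) -> bool:
    """After the patch: only non-constant physical registers are not folded."""
    return reg.is_physical and not reg.is_constant


if __name__ == "__main__":
    cases = {
        "virtual": Reg(is_virtual=True),
        "constant physical": Reg(is_physical=True, is_constant=True),
        "other physical": Reg(is_physical=True),
    }
    for name, reg in cases.items():
        print(f"{name}: old rejects={old_rejects(reg)}, new rejects={new_rejects(reg)}")
```

Only the constant physical register row changes, which matches the commit message: with current codegen the affected operands are `src_flat_scratch_base_lo`/`hi`. The model does not cover register values that are neither virtual nor physical.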
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp | 9 +-
llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll | 52 +-
.../CodeGen/AMDGPU/atomics-system-scope.ll | 410 +++++----
.../test/CodeGen/AMDGPU/flat-saddr-atomics.ll | 849 +++++++-----------
.../CodeGen/AMDGPU/llvm.amdgcn.is.private.ll | 49 +
llvm/test/CodeGen/AMDGPU/scale-offset-flat.ll | 15 +-
6 files changed, 597 insertions(+), 787 deletions(-)
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 962c276bc2123..66d1126eb4151 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -709,7 +709,10 @@ bool SIFoldOperandsImpl::updateOperand(FoldCandidate &Fold) const {
// 16-bit SGPRs instead of 32-bit ones.
if (Old.getSubReg() == AMDGPU::lo16 && TRI->isSGPRReg(*MRI, New->getReg()))
Old.setSubReg(AMDGPU::NoSubRegister);
- Old.substVirtReg(New->getReg(), New->getSubReg(), *TRI);
+ if (New->getReg().isPhysical())
+ Old.substPhysReg(New->getReg(), *TRI);
+ else
+ Old.substVirtReg(New->getReg(), New->getSubReg(), *TRI);
Old.setIsUndef(New->isUndef());
return true;
}
@@ -1986,7 +1989,9 @@ bool SIFoldOperandsImpl::tryFoldFoldableCopy(
if (!FoldingImm && !OpToFold.isReg())
return false;
- if (OpToFold.isReg() && !OpToFold.getReg().isVirtual())
+ // Fold virtual registers and constant physical registers.
+ if (OpToFold.isReg() && OpToFold.getReg().isPhysical() &&
+ !TRI->isConstantPhysReg(OpToFold.getReg()))
return false;
// Prevent folding operands backwards in the function. For example,
diff --git a/llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll b/llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll
index 4b6375cc60800..b4b49e90dca02 100644
--- a/llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll
+++ b/llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll
@@ -9,15 +9,14 @@ target triple = "amdgcn-amd-amdhsa"
define amdgpu_kernel void @use_private_to_flat_addrspacecast(ptr addrspace(5) %ptr) {
; GFX1250-SDAG-LABEL: use_private_to_flat_addrspacecast:
; GFX1250-SDAG: ; %bb.0:
-; GFX1250-SDAG-NEXT: s_load_b32 s2, s[4:5], 0x24
+; GFX1250-SDAG-NEXT: s_load_b32 s0, s[4:5], 0x24
; GFX1250-SDAG-NEXT: v_mbcnt_lo_u32_b32 v0, -1, 0
-; GFX1250-SDAG-NEXT: s_mov_b64 s[0:1], src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: s_wait_kmcnt 0x0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1250-SDAG-NEXT: v_dual_mov_b32 v0, s2 :: v_dual_lshlrev_b32 v1, 20, v0
-; GFX1250-SDAG-NEXT: s_cmp_lg_u32 s2, -1
+; GFX1250-SDAG-NEXT: v_dual_mov_b32 v0, s0 :: v_dual_lshlrev_b32 v1, 20, v0
+; GFX1250-SDAG-NEXT: s_cmp_lg_u32 s0, -1
; GFX1250-SDAG-NEXT: s_cselect_b32 vcc_lo, -1, 0
-; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
+; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], src_flat_scratch_base_lo, v[0:1]
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v2, 0 :: v_dual_cndmask_b32 v1, 0, v1
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v0, 0, v0, vcc_lo
@@ -27,20 +26,20 @@ define amdgpu_kernel void @use_private_to_flat_addrspacecast(ptr addrspace(5) %p
;
; GFX1250-GISEL-LABEL: use_private_to_flat_addrspacecast:
; GFX1250-GISEL: ; %bb.0:
-; GFX1250-GISEL-NEXT: s_load_b32 s2, s[4:5], 0x24
-; GFX1250-GISEL-NEXT: s_mov_b64 s[0:1], src_flat_scratch_base_lo
+; GFX1250-GISEL-NEXT: s_load_b32 s0, s[4:5], 0x24
+; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_mbcnt_lo_u32_b32 v2, -1, 0
-; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[0:1]
; GFX1250-GISEL-NEXT: s_wait_kmcnt 0x0
-; GFX1250-GISEL-NEXT: s_cmp_lg_u32 s2, -1
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(SALU_CYCLE_1)
-; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, s2, v0
+; GFX1250-GISEL-NEXT: s_cmp_lg_u32 s0, -1
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, s0, v0
; GFX1250-GISEL-NEXT: v_lshlrev_b32_e32 v2, 20, v2
-; GFX1250-GISEL-NEXT: s_cselect_b32 s0, 1, 0
-; GFX1250-GISEL-NEXT: s_and_b32 s0, 1, s0
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX1250-GISEL-NEXT: s_cselect_b32 s1, 1, 0
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: s_and_b32 s1, 1, s1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, v2, v1, vcc_lo
-; GFX1250-GISEL-NEXT: v_cmp_ne_u32_e64 vcc_lo, 0, s0
+; GFX1250-GISEL-NEXT: v_cmp_ne_u32_e64 vcc_lo, 0, s1
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v2, 0 :: v_dual_cndmask_b32 v1, 0, v1
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v0, 0, v0, vcc_lo
; GFX1250-GISEL-NEXT: flat_store_b32 v[0:1], v2 scope:SCOPE_SYS
@@ -56,27 +55,24 @@ define amdgpu_kernel void @use_private_to_flat_addrspacecast_nonnull(ptr addrspa
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: s_load_b32 s0, s[4:5], 0x24
; GFX1250-SDAG-NEXT: v_mbcnt_lo_u32_b32 v0, -1, 0
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v2, 0 :: v_dual_lshlrev_b32 v1, 20, v0
; GFX1250-SDAG-NEXT: s_wait_kmcnt 0x0
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v0, s0
-; GFX1250-SDAG-NEXT: s_mov_b64 s[0:1], src_flat_scratch_base_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
+; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], src_flat_scratch_base_lo, v[0:1]
; GFX1250-SDAG-NEXT: flat_store_b32 v[0:1], v2 scope:SCOPE_SYS
; GFX1250-SDAG-NEXT: s_wait_storecnt 0x0
; GFX1250-SDAG-NEXT: s_endpgm
;
; GFX1250-GISEL-LABEL: use_private_to_flat_addrspacecast_nonnull:
; GFX1250-GISEL: ; %bb.0:
-; GFX1250-GISEL-NEXT: s_load_b32 s2, s[4:5], 0x24
-; GFX1250-GISEL-NEXT: s_mov_b64 s[0:1], src_flat_scratch_base_lo
+; GFX1250-GISEL-NEXT: s_load_b32 s0, s[4:5], 0x24
; GFX1250-GISEL-NEXT: v_mbcnt_lo_u32_b32 v2, -1, 0
-; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[0:1]
+; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, 0 :: v_dual_lshlrev_b32 v2, 20, v2
; GFX1250-GISEL-NEXT: s_wait_kmcnt 0x0
-; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, s2, v0
+; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, s0, v0
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, v2, v1, vcc_lo
; GFX1250-GISEL-NEXT: flat_store_b32 v[0:1], v3 scope:SCOPE_SYS
@@ -91,10 +87,9 @@ define amdgpu_kernel void @use_flat_to_private_addrspacecast(ptr %ptr) {
; GFX1250-LABEL: use_flat_to_private_addrspacecast:
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1250-NEXT: s_mov_b32 s2, src_flat_scratch_base_lo
; GFX1250-NEXT: v_mov_b32_e32 v0, 0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_sub_co_i32 s2, s0, s2
+; GFX1250-NEXT: s_sub_co_i32 s2, s0, src_flat_scratch_base_lo
; GFX1250-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX1250-NEXT: s_cselect_b32 s0, s2, -1
; GFX1250-NEXT: scratch_store_b32 off, v0, s0 scope:SCOPE_SYS
@@ -110,9 +105,8 @@ define amdgpu_kernel void @use_flat_to_private_addrspacecast_nonnull(ptr %ptr) {
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: s_load_b32 s0, s[4:5], 0x24
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v0, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: s_wait_kmcnt 0x0
-; GFX1250-SDAG-NEXT: s_sub_co_i32 s0, s0, s1
+; GFX1250-SDAG-NEXT: s_sub_co_i32 s0, s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: scratch_store_b32 off, v0, s0 scope:SCOPE_SYS
; GFX1250-SDAG-NEXT: s_wait_storecnt 0x0
; GFX1250-SDAG-NEXT: s_endpgm
@@ -122,9 +116,7 @@ define amdgpu_kernel void @use_flat_to_private_addrspacecast_nonnull(ptr %ptr) {
; GFX1250-GISEL-NEXT: s_load_b64 s[0:1], s[4:5], 0x24
; GFX1250-GISEL-NEXT: v_mov_b32_e32 v0, 0
; GFX1250-GISEL-NEXT: s_wait_kmcnt 0x0
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
-; GFX1250-GISEL-NEXT: s_sub_co_i32 s0, s0, s1
+; GFX1250-GISEL-NEXT: s_sub_co_i32 s0, s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: scratch_store_b32 off, v0, s0 scope:SCOPE_SYS
; GFX1250-GISEL-NEXT: s_wait_storecnt 0x0
; GFX1250-GISEL-NEXT: s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll b/llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll
index 5fc9f4a0f8038..817e3f01c8cdd 100644
--- a/llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll
@@ -534,58 +534,61 @@ define double @flat_system_atomic_fadd_f64(ptr %ptr, double %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
+; GFX1250-NEXT: v_dual_mov_b32 v5, v1 :: v_dual_mov_b32 v4, v0
; GFX1250-NEXT: s_mov_b64 s[0:1], src_shared_base
; GFX1250-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
-; GFX1250-NEXT: v_cmpx_ne_u32_e64 s1, v1
+; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-NEXT: v_cmpx_ne_u32_e64 s1, v5
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB34_6
-; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.check.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s1, v1
-; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
-; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX1250-NEXT: s_cbranch_execnz .LBB34_3
+; GFX1250-NEXT: ; %bb.1: ; %Flow2
+; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
+; GFX1250-NEXT: s_cbranch_execnz .LBB34_8
+; GFX1250-NEXT: .LBB34_2: ; %atomicrmw.phi
+; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; GFX1250-NEXT: s_wait_loadcnt 0x0
+; GFX1250-NEXT: s_set_pc_i64 s[30:31]
+; GFX1250-NEXT: .LBB34_3: ; %atomicrmw.check.private
+; GFX1250-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
+; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-NEXT: s_and_saveexec_b32 s1, vcc_lo
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s1, exec_lo, s1
-; GFX1250-NEXT: s_cbranch_execz .LBB34_3
-; GFX1250-NEXT: ; %bb.2: ; %atomicrmw.global
-; GFX1250-NEXT: global_atomic_add_f64 v[4:5], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX1250-NEXT: s_cbranch_execz .LBB34_5
+; GFX1250-NEXT: ; %bb.4: ; %atomicrmw.global
+; GFX1250-NEXT: global_atomic_add_f64 v[0:1], v[4:5], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
+; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: ; implicit-def: $vgpr2_vgpr3
-; GFX1250-NEXT: .LBB34_3: ; %Flow
+; GFX1250-NEXT: .LBB34_5: ; %Flow
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_and_not1_saveexec_b32 s1, s1
-; GFX1250-NEXT: s_cbranch_execz .LBB34_5
-; GFX1250-NEXT: ; %bb.4: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s2, src_flat_scratch_base_lo
-; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
+; GFX1250-NEXT: s_cbranch_execz .LBB34_7
+; GFX1250-NEXT: ; %bb.6: ; %atomicrmw.private
+; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s2, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX1250-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
-; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
+; GFX1250-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
+; GFX1250-NEXT: scratch_load_b64 v[0:1], v4, off
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: v_add_f64_e32 v[0:1], v[4:5], v[2:3]
-; GFX1250-NEXT: scratch_store_b64 v6, v[0:1], off scope:SCOPE_SE
-; GFX1250-NEXT: .LBB34_5: ; %Flow1
+; GFX1250-NEXT: v_add_f64_e32 v[2:3], v[0:1], v[2:3]
+; GFX1250-NEXT: scratch_store_b64 v4, v[2:3], off scope:SCOPE_SE
+; GFX1250-NEXT: .LBB34_7: ; %Flow1
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s1
-; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: ; implicit-def: $vgpr2_vgpr3
-; GFX1250-NEXT: .LBB34_6: ; %Flow2
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB34_8
-; GFX1250-NEXT: ; %bb.7: ; %atomicrmw.shared
-; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc_lo
+; GFX1250-NEXT: s_cbranch_execz .LBB34_2
+; GFX1250-NEXT: .LBB34_8: ; %atomicrmw.shared
+; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: ds_add_rtn_f64 v[4:5], v0, v[2:3]
-; GFX1250-NEXT: .LBB34_8: ; %atomicrmw.phi
+; GFX1250-NEXT: v_cndmask_b32_e32 v0, -1, v4, vcc_lo
+; GFX1250-NEXT: ds_add_rtn_f64 v[0:1], v0, v[2:3]
; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_dual_mov_b32 v0, v4 :: v_dual_mov_b32 v1, v5
+; GFX1250-NEXT: s_wait_dscnt 0x0
; GFX1250-NEXT: s_set_pc_i64 s[30:31]
%result = atomicrmw fadd ptr %ptr, double %val monotonic
ret double %result
@@ -596,58 +599,61 @@ define double @flat_one_as_atomic_fadd_f64(ptr %ptr, double %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
+; GFX1250-NEXT: v_dual_mov_b32 v5, v1 :: v_dual_mov_b32 v4, v0
; GFX1250-NEXT: s_mov_b64 s[0:1], src_shared_base
; GFX1250-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
-; GFX1250-NEXT: v_cmpx_ne_u32_e64 s1, v1
+; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-NEXT: v_cmpx_ne_u32_e64 s1, v5
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB35_6
-; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.check.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s1, v1
-; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
-; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX1250-NEXT: s_cbranch_execnz .LBB35_3
+; GFX1250-NEXT: ; %bb.1: ; %Flow2
+; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
+; GFX1250-NEXT: s_cbranch_execnz .LBB35_8
+; GFX1250-NEXT: .LBB35_2: ; %atomicrmw.phi
+; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; GFX1250-NEXT: s_wait_loadcnt 0x0
+; GFX1250-NEXT: s_set_pc_i64 s[30:31]
+; GFX1250-NEXT: .LBB35_3: ; %atomicrmw.check.private
+; GFX1250-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
+; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-NEXT: s_and_saveexec_b32 s1, vcc_lo
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s1, exec_lo, s1
-; GFX1250-NEXT: s_cbranch_execz .LBB35_3
-; GFX1250-NEXT: ; %bb.2: ; %atomicrmw.global
-; GFX1250-NEXT: global_atomic_add_f64 v[4:5], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX1250-NEXT: s_cbranch_execz .LBB35_5
+; GFX1250-NEXT: ; %bb.4: ; %atomicrmw.global
+; GFX1250-NEXT: global_atomic_add_f64 v[0:1], v[4:5], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
+; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: ; implicit-def: $vgpr2_vgpr3
-; GFX1250-NEXT: .LBB35_3: ; %Flow
+; GFX1250-NEXT: .LBB35_5: ; %Flow
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_and_not1_saveexec_b32 s1, s1
-; GFX1250-NEXT: s_cbranch_execz .LBB35_5
-; GFX1250-NEXT: ; %bb.4: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s2, src_flat_scratch_base_lo
-; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
+; GFX1250-NEXT: s_cbranch_execz .LBB35_7
+; GFX1250-NEXT: ; %bb.6: ; %atomicrmw.private
+; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s2, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX1250-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
-; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
+; GFX1250-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
+; GFX1250-NEXT: scratch_load_b64 v[0:1], v4, off
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: v_add_f64_e32 v[0:1], v[4:5], v[2:3]
-; GFX1250-NEXT: scratch_store_b64 v6, v[0:1], off scope:SCOPE_SE
-; GFX1250-NEXT: .LBB35_5: ; %Flow1
+; GFX1250-NEXT: v_add_f64_e32 v[2:3], v[0:1], v[2:3]
+; GFX1250-NEXT: scratch_store_b64 v4, v[2:3], off scope:SCOPE_SE
+; GFX1250-NEXT: .LBB35_7: ; %Flow1
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s1
-; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: ; implicit-def: $vgpr2_vgpr3
-; GFX1250-NEXT: .LBB35_6: ; %Flow2
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB35_8
-; GFX1250-NEXT: ; %bb.7: ; %atomicrmw.shared
-; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc_lo
+; GFX1250-NEXT: s_cbranch_execz .LBB35_2
+; GFX1250-NEXT: .LBB35_8: ; %atomicrmw.shared
+; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: ds_add_rtn_f64 v[4:5], v0, v[2:3]
-; GFX1250-NEXT: .LBB35_8: ; %atomicrmw.phi
+; GFX1250-NEXT: v_cndmask_b32_e32 v0, -1, v4, vcc_lo
+; GFX1250-NEXT: ds_add_rtn_f64 v[0:1], v0, v[2:3]
; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_dual_mov_b32 v0, v4 :: v_dual_mov_b32 v1, v5
+; GFX1250-NEXT: s_wait_dscnt 0x0
; GFX1250-NEXT: s_set_pc_i64 s[30:31]
%result = atomicrmw fadd ptr %ptr, double %val syncscope("one-as") monotonic
ret double %result
@@ -682,40 +688,42 @@ define double @flat_system_atomic_fmin_f64(ptr %ptr, double %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
-; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX1250-NEXT: v_dual_mov_b32 v5, v1 :: v_dual_mov_b32 v4, v0
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
+; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
+; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB38_2
-; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
-; GFX1250-NEXT: flat_atomic_min_num_f64 v[4:5], v[0:1], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX1250-NEXT: s_cbranch_execnz .LBB38_3
+; GFX1250-NEXT: ; %bb.1: ; %Flow
+; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
+; GFX1250-NEXT: s_cbranch_execnz .LBB38_4
+; GFX1250-NEXT: .LBB38_2: ; %atomicrmw.phi
+; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX1250-NEXT: s_set_pc_i64 s[30:31]
+; GFX1250-NEXT: .LBB38_3: ; %atomicrmw.global
+; GFX1250-NEXT: flat_atomic_min_num_f64 v[0:1], v[4:5], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
+; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: ; implicit-def: $vgpr2_vgpr3
-; GFX1250-NEXT: .LBB38_2: ; %Flow
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB38_4
-; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
-; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
+; GFX1250-NEXT: s_cbranch_execz .LBB38_2
+; GFX1250-NEXT: .LBB38_4: ; %atomicrmw.private
+; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_dual_max_num_f64 v[2:3], v[2:3], v[2:3] :: v_dual_cndmask_b32 v6, -1, v4, vcc_lo
-; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
+; GFX1250-NEXT: v_dual_max_num_f64 v[2:3], v[2:3], v[2:3] :: v_dual_cndmask_b32 v6, -1, v0, vcc_lo
+; GFX1250-NEXT: scratch_load_b64 v[0:1], v6, off
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: v_max_num_f64_e32 v[0:1], v[4:5], v[4:5]
-; GFX1250-NEXT: v_min_num_f64_e32 v[0:1], v[0:1], v[2:3]
-; GFX1250-NEXT: scratch_store_b64 v6, v[0:1], off scope:SCOPE_SE
-; GFX1250-NEXT: .LBB38_4: ; %atomicrmw.phi
+; GFX1250-NEXT: v_max_num_f64_e32 v[4:5], v[0:1], v[0:1]
+; GFX1250-NEXT: v_min_num_f64_e32 v[2:3], v[4:5], v[2:3]
+; GFX1250-NEXT: scratch_store_b64 v6, v[2:3], off scope:SCOPE_SE
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_dual_mov_b32 v0, v4 :: v_dual_mov_b32 v1, v5
; GFX1250-NEXT: s_set_pc_i64 s[30:31]
%result = atomicrmw fmin ptr %ptr, double %val monotonic
ret double %result
@@ -726,40 +734,42 @@ define double @flat_one_as_atomic_fmin_f64(ptr %ptr, double %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
-; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX1250-NEXT: v_dual_mov_b32 v5, v1 :: v_dual_mov_b32 v4, v0
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
+; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
+; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB39_2
-; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
-; GFX1250-NEXT: flat_atomic_min_num_f64 v[4:5], v[0:1], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX1250-NEXT: s_cbranch_execnz .LBB39_3
+; GFX1250-NEXT: ; %bb.1: ; %Flow
+; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
+; GFX1250-NEXT: s_cbranch_execnz .LBB39_4
+; GFX1250-NEXT: .LBB39_2: ; %atomicrmw.phi
+; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX1250-NEXT: s_set_pc_i64 s[30:31]
+; GFX1250-NEXT: .LBB39_3: ; %atomicrmw.global
+; GFX1250-NEXT: flat_atomic_min_num_f64 v[0:1], v[4:5], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
+; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: ; implicit-def: $vgpr2_vgpr3
-; GFX1250-NEXT: .LBB39_2: ; %Flow
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB39_4
-; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
-; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
+; GFX1250-NEXT: s_cbranch_execz .LBB39_2
+; GFX1250-NEXT: .LBB39_4: ; %atomicrmw.private
+; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_dual_max_num_f64 v[2:3], v[2:3], v[2:3] :: v_dual_cndmask_b32 v6, -1, v4, vcc_lo
-; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
+; GFX1250-NEXT: v_dual_max_num_f64 v[2:3], v[2:3], v[2:3] :: v_dual_cndmask_b32 v6, -1, v0, vcc_lo
+; GFX1250-NEXT: scratch_load_b64 v[0:1], v6, off
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: v_max_num_f64_e32 v[0:1], v[4:5], v[4:5]
-; GFX1250-NEXT: v_min_num_f64_e32 v[0:1], v[0:1], v[2:3]
-; GFX1250-NEXT: scratch_store_b64 v6, v[0:1], off scope:SCOPE_SE
-; GFX1250-NEXT: .LBB39_4: ; %atomicrmw.phi
+; GFX1250-NEXT: v_max_num_f64_e32 v[4:5], v[0:1], v[0:1]
+; GFX1250-NEXT: v_min_num_f64_e32 v[2:3], v[4:5], v[2:3]
+; GFX1250-NEXT: scratch_store_b64 v6, v[2:3], off scope:SCOPE_SE
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_dual_mov_b32 v0, v4 :: v_dual_mov_b32 v1, v5
; GFX1250-NEXT: s_set_pc_i64 s[30:31]
%result = atomicrmw fmin ptr %ptr, double %val syncscope("one-as") monotonic
ret double %result
@@ -794,40 +804,42 @@ define double @flat_system_atomic_fmax_f64(ptr %ptr, double %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
-; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX1250-NEXT: v_dual_mov_b32 v5, v1 :: v_dual_mov_b32 v4, v0
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
+; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
+; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB42_2
-; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
-; GFX1250-NEXT: flat_atomic_max_num_f64 v[4:5], v[0:1], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX1250-NEXT: s_cbranch_execnz .LBB42_3
+; GFX1250-NEXT: ; %bb.1: ; %Flow
+; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
+; GFX1250-NEXT: s_cbranch_execnz .LBB42_4
+; GFX1250-NEXT: .LBB42_2: ; %atomicrmw.phi
+; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX1250-NEXT: s_set_pc_i64 s[30:31]
+; GFX1250-NEXT: .LBB42_3: ; %atomicrmw.global
+; GFX1250-NEXT: flat_atomic_max_num_f64 v[0:1], v[4:5], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
+; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: ; implicit-def: $vgpr2_vgpr3
-; GFX1250-NEXT: .LBB42_2: ; %Flow
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB42_4
-; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
-; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
+; GFX1250-NEXT: s_cbranch_execz .LBB42_2
+; GFX1250-NEXT: .LBB42_4: ; %atomicrmw.private
+; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_dual_max_num_f64 v[2:3], v[2:3], v[2:3] :: v_dual_cndmask_b32 v6, -1, v4, vcc_lo
-; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
+; GFX1250-NEXT: v_dual_max_num_f64 v[2:3], v[2:3], v[2:3] :: v_dual_cndmask_b32 v6, -1, v0, vcc_lo
+; GFX1250-NEXT: scratch_load_b64 v[0:1], v6, off
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: v_max_num_f64_e32 v[0:1], v[4:5], v[4:5]
-; GFX1250-NEXT: v_max_num_f64_e32 v[0:1], v[0:1], v[2:3]
-; GFX1250-NEXT: scratch_store_b64 v6, v[0:1], off scope:SCOPE_SE
-; GFX1250-NEXT: .LBB42_4: ; %atomicrmw.phi
+; GFX1250-NEXT: v_max_num_f64_e32 v[4:5], v[0:1], v[0:1]
+; GFX1250-NEXT: v_max_num_f64_e32 v[2:3], v[4:5], v[2:3]
+; GFX1250-NEXT: scratch_store_b64 v6, v[2:3], off scope:SCOPE_SE
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_dual_mov_b32 v0, v4 :: v_dual_mov_b32 v1, v5
; GFX1250-NEXT: s_set_pc_i64 s[30:31]
%result = atomicrmw fmax ptr %ptr, double %val monotonic
ret double %result
@@ -838,40 +850,42 @@ define double @flat_one_as_atomic_fmax_f64(ptr %ptr, double %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
-; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX1250-NEXT: v_dual_mov_b32 v5, v1 :: v_dual_mov_b32 v4, v0
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
+; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
+; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB43_2
-; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
-; GFX1250-NEXT: flat_atomic_max_num_f64 v[4:5], v[0:1], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX1250-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX1250-NEXT: s_cbranch_execnz .LBB43_3
+; GFX1250-NEXT: ; %bb.1: ; %Flow
+; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
+; GFX1250-NEXT: s_cbranch_execnz .LBB43_4
+; GFX1250-NEXT: .LBB43_2: ; %atomicrmw.phi
+; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
+; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX1250-NEXT: s_set_pc_i64 s[30:31]
+; GFX1250-NEXT: .LBB43_3: ; %atomicrmw.global
+; GFX1250-NEXT: flat_atomic_max_num_f64 v[0:1], v[4:5], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
+; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: ; implicit-def: $vgpr2_vgpr3
-; GFX1250-NEXT: .LBB43_2: ; %Flow
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
-; GFX1250-NEXT: s_cbranch_execz .LBB43_4
-; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
-; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
+; GFX1250-NEXT: s_cbranch_execz .LBB43_2
+; GFX1250-NEXT: .LBB43_4: ; %atomicrmw.private
+; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_dual_max_num_f64 v[2:3], v[2:3], v[2:3] :: v_dual_cndmask_b32 v6, -1, v4, vcc_lo
-; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
+; GFX1250-NEXT: v_dual_max_num_f64 v[2:3], v[2:3], v[2:3] :: v_dual_cndmask_b32 v6, -1, v0, vcc_lo
+; GFX1250-NEXT: scratch_load_b64 v[0:1], v6, off
; GFX1250-NEXT: s_wait_loadcnt 0x0
-; GFX1250-NEXT: v_max_num_f64_e32 v[0:1], v[4:5], v[4:5]
-; GFX1250-NEXT: v_max_num_f64_e32 v[0:1], v[0:1], v[2:3]
-; GFX1250-NEXT: scratch_store_b64 v6, v[0:1], off scope:SCOPE_SE
-; GFX1250-NEXT: .LBB43_4: ; %atomicrmw.phi
+; GFX1250-NEXT: v_max_num_f64_e32 v[4:5], v[0:1], v[0:1]
+; GFX1250-NEXT: v_max_num_f64_e32 v[2:3], v[4:5], v[2:3]
+; GFX1250-NEXT: scratch_store_b64 v6, v[2:3], off scope:SCOPE_SE
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_or_b32 exec_lo, exec_lo, s0
-; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_dual_mov_b32 v0, v4 :: v_dual_mov_b32 v1, v5
; GFX1250-NEXT: s_set_pc_i64 s[30:31]
%result = atomicrmw fmax ptr %ptr, double %val syncscope("one-as") monotonic
ret double %result
@@ -978,13 +992,11 @@ define i64 @flat_one_as_atomic_min_i64(ptr %ptr, i64 %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
+; GFX1250-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-NEXT: s_cbranch_execz .LBB52_2
; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
@@ -996,10 +1008,9 @@ define i64 @flat_one_as_atomic_min_i64(ptr %ptr, i64 %val) {
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-NEXT: s_cbranch_execz .LBB52_4
; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
@@ -1021,13 +1032,11 @@ define i64 @flat_system_atomic_min_i64(ptr %ptr, i64 %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
+; GFX1250-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-NEXT: s_cbranch_execz .LBB53_2
; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
@@ -1039,10 +1048,9 @@ define i64 @flat_system_atomic_min_i64(ptr %ptr, i64 %val) {
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-NEXT: s_cbranch_execz .LBB53_4
; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
@@ -1064,13 +1072,11 @@ define i64 @flat_one_as_atomic_max_i64(ptr %ptr, i64 %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
+; GFX1250-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-NEXT: s_cbranch_execz .LBB54_2
; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
@@ -1082,10 +1088,9 @@ define i64 @flat_one_as_atomic_max_i64(ptr %ptr, i64 %val) {
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-NEXT: s_cbranch_execz .LBB54_4
; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
@@ -1107,13 +1112,11 @@ define i64 @flat_system_atomic_max_i64(ptr %ptr, i64 %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
+; GFX1250-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-NEXT: s_cbranch_execz .LBB55_2
; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
@@ -1125,10 +1128,9 @@ define i64 @flat_system_atomic_max_i64(ptr %ptr, i64 %val) {
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-NEXT: s_cbranch_execz .LBB55_4
; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
@@ -1150,13 +1152,11 @@ define i64 @flat_one_as_atomic_umin_i64(ptr %ptr, i64 %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
+; GFX1250-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-NEXT: s_cbranch_execz .LBB56_2
; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
@@ -1168,10 +1168,9 @@ define i64 @flat_one_as_atomic_umin_i64(ptr %ptr, i64 %val) {
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-NEXT: s_cbranch_execz .LBB56_4
; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
@@ -1193,13 +1192,11 @@ define i64 @flat_system_atomic_umin_i64(ptr %ptr, i64 %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
+; GFX1250-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-NEXT: s_cbranch_execz .LBB57_2
; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
@@ -1211,10 +1208,9 @@ define i64 @flat_system_atomic_umin_i64(ptr %ptr, i64 %val) {
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-NEXT: s_cbranch_execz .LBB57_4
; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
@@ -1236,13 +1232,11 @@ define i64 @flat_one_as_atomic_umax_i64(ptr %ptr, i64 %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
+; GFX1250-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-NEXT: s_cbranch_execz .LBB58_2
; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
@@ -1254,10 +1248,9 @@ define i64 @flat_one_as_atomic_umax_i64(ptr %ptr, i64 %val) {
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-NEXT: s_cbranch_execz .LBB58_4
; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
@@ -1279,13 +1272,11 @@ define i64 @flat_system_atomic_umax_i64(ptr %ptr, i64 %val) {
; GFX1250: ; %bb.0:
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-NEXT: v_xor_b32_e32 v4, s0, v1
+; GFX1250-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v4
; GFX1250-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX1250-NEXT: s_and_saveexec_b32 s0, vcc_lo
-; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-NEXT: s_cbranch_execz .LBB59_2
; GFX1250-NEXT: ; %bb.1: ; %atomicrmw.global
@@ -1297,10 +1288,9 @@ define i64 @flat_system_atomic_umax_i64(ptr %ptr, i64 %val) {
; GFX1250-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-NEXT: s_cbranch_execz .LBB59_4
; GFX1250-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX1250-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, s1, v0
+; GFX1250-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
; GFX1250-NEXT: scratch_load_b64 v[4:5], v6, off
diff --git a/llvm/test/CodeGen/AMDGPU/flat-saddr-atomics.ll b/llvm/test/CodeGen/AMDGPU/flat-saddr-atomics.ll
index 004d3c0c1cf53..265848b441f69 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-saddr-atomics.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-saddr-atomics.ll
@@ -252,11 +252,10 @@ define amdgpu_ps <2 x float> @flat_xchg_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-SDAG-LABEL: flat_xchg_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -279,9 +278,8 @@ define amdgpu_ps <2 x float> @flat_xchg_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB10_2
; GFX1250-SDAG-NEXT: .LBB10_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: s_clause 0x1
@@ -297,12 +295,11 @@ define amdgpu_ps <2 x float> @flat_xchg_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -325,9 +322,8 @@ define amdgpu_ps <2 x float> @flat_xchg_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB10_2
; GFX1250-GISEL-NEXT: .LBB10_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: s_clause 0x1
@@ -354,13 +350,12 @@ define amdgpu_ps <2 x float> @flat_xchg_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB11_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -379,9 +374,8 @@ define amdgpu_ps <2 x float> @flat_xchg_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB11_2
; GFX1250-SDAG-NEXT: .LBB11_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: s_clause 0x1
@@ -397,7 +391,6 @@ define amdgpu_ps <2 x float> @flat_xchg_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -405,7 +398,7 @@ define amdgpu_ps <2 x float> @flat_xchg_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -428,9 +421,8 @@ define amdgpu_ps <2 x float> @flat_xchg_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB11_2
; GFX1250-GISEL-NEXT: .LBB11_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: s_clause 0x1
@@ -454,11 +446,10 @@ define amdgpu_ps void @flat_xchg_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -478,9 +469,8 @@ define amdgpu_ps void @flat_xchg_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB12_2
; GFX1250-SDAG-NEXT: .LBB12_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v0, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_store_b64 v0, v[2:3], off scope:SCOPE_SE
@@ -490,13 +480,12 @@ define amdgpu_ps void @flat_xchg_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB12_3
@@ -515,9 +504,8 @@ define amdgpu_ps void @flat_xchg_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB12_2
; GFX1250-GISEL-NEXT: .LBB12_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_store_b64 v0, v[4:5], off scope:SCOPE_SE
@@ -537,11 +525,9 @@ define amdgpu_ps void @flat_xchg_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB13_3
@@ -560,9 +546,8 @@ define amdgpu_ps void @flat_xchg_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB13_2
; GFX1250-SDAG-NEXT: .LBB13_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v0, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_store_b64 v0, v[2:3], off scope:SCOPE_SE
@@ -572,16 +557,15 @@ define amdgpu_ps void @flat_xchg_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB13_3
@@ -600,9 +584,8 @@ define amdgpu_ps void @flat_xchg_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB13_2
; GFX1250-GISEL-NEXT: .LBB13_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_store_b64 v0, v[4:5], off scope:SCOPE_SE
@@ -680,11 +663,10 @@ define amdgpu_ps <2 x float> @flat_add_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-LABEL: flat_add_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -707,9 +689,8 @@ define amdgpu_ps <2 x float> @flat_add_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB18_2
; GFX1250-SDAG-NEXT: .LBB18_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -725,12 +706,11 @@ define amdgpu_ps <2 x float> @flat_add_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -753,9 +733,8 @@ define amdgpu_ps <2 x float> @flat_add_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB18_2
; GFX1250-GISEL-NEXT: .LBB18_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -782,13 +761,12 @@ define amdgpu_ps <2 x float> @flat_add_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB19_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -807,9 +785,8 @@ define amdgpu_ps <2 x float> @flat_add_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB19_2
; GFX1250-SDAG-NEXT: .LBB19_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -825,7 +802,6 @@ define amdgpu_ps <2 x float> @flat_add_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -833,7 +809,7 @@ define amdgpu_ps <2 x float> @flat_add_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -856,9 +832,8 @@ define amdgpu_ps <2 x float> @flat_add_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB19_2
; GFX1250-GISEL-NEXT: .LBB19_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -882,11 +857,10 @@ define amdgpu_ps void @flat_add_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -906,9 +880,8 @@ define amdgpu_ps void @flat_add_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB20_2
; GFX1250-SDAG-NEXT: .LBB20_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -921,13 +894,12 @@ define amdgpu_ps void @flat_add_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB20_3
@@ -946,9 +918,8 @@ define amdgpu_ps void @flat_add_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB20_2
; GFX1250-GISEL-NEXT: .LBB20_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -971,11 +942,9 @@ define amdgpu_ps void @flat_add_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB21_3
@@ -994,9 +963,8 @@ define amdgpu_ps void @flat_add_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB21_2
; GFX1250-SDAG-NEXT: .LBB21_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -1009,16 +977,15 @@ define amdgpu_ps void @flat_add_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB21_3
@@ -1037,9 +1004,8 @@ define amdgpu_ps void @flat_add_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB21_2
; GFX1250-GISEL-NEXT: .LBB21_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -1120,11 +1086,10 @@ define amdgpu_ps <2 x float> @flat_sub_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-LABEL: flat_sub_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -1147,9 +1112,8 @@ define amdgpu_ps <2 x float> @flat_sub_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB26_2
; GFX1250-SDAG-NEXT: .LBB26_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -1165,12 +1129,11 @@ define amdgpu_ps <2 x float> @flat_sub_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -1193,9 +1156,8 @@ define amdgpu_ps <2 x float> @flat_sub_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB26_2
; GFX1250-GISEL-NEXT: .LBB26_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -1222,13 +1184,12 @@ define amdgpu_ps <2 x float> @flat_sub_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB27_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -1247,9 +1208,8 @@ define amdgpu_ps <2 x float> @flat_sub_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB27_2
; GFX1250-SDAG-NEXT: .LBB27_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -1265,7 +1225,6 @@ define amdgpu_ps <2 x float> @flat_sub_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -1273,7 +1232,7 @@ define amdgpu_ps <2 x float> @flat_sub_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -1296,9 +1255,8 @@ define amdgpu_ps <2 x float> @flat_sub_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB27_2
; GFX1250-GISEL-NEXT: .LBB27_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -1322,11 +1280,10 @@ define amdgpu_ps void @flat_sub_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -1346,9 +1303,8 @@ define amdgpu_ps void @flat_sub_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB28_2
; GFX1250-SDAG-NEXT: .LBB28_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -1361,13 +1317,12 @@ define amdgpu_ps void @flat_sub_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB28_3
@@ -1386,9 +1341,8 @@ define amdgpu_ps void @flat_sub_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB28_2
; GFX1250-GISEL-NEXT: .LBB28_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -1411,11 +1365,9 @@ define amdgpu_ps void @flat_sub_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB29_3
@@ -1434,9 +1386,8 @@ define amdgpu_ps void @flat_sub_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB29_2
; GFX1250-SDAG-NEXT: .LBB29_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -1449,16 +1400,15 @@ define amdgpu_ps void @flat_sub_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB29_3
@@ -1477,9 +1427,8 @@ define amdgpu_ps void @flat_sub_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB29_2
; GFX1250-GISEL-NEXT: .LBB29_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -1560,11 +1509,10 @@ define amdgpu_ps <2 x float> @flat_and_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-LABEL: flat_and_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -1587,9 +1535,8 @@ define amdgpu_ps <2 x float> @flat_and_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB34_2
; GFX1250-SDAG-NEXT: .LBB34_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -1606,12 +1553,11 @@ define amdgpu_ps <2 x float> @flat_and_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -1634,9 +1580,8 @@ define amdgpu_ps <2 x float> @flat_and_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB34_2
; GFX1250-GISEL-NEXT: .LBB34_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -1664,13 +1609,12 @@ define amdgpu_ps <2 x float> @flat_and_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB35_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -1689,9 +1633,8 @@ define amdgpu_ps <2 x float> @flat_and_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB35_2
; GFX1250-SDAG-NEXT: .LBB35_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -1708,7 +1651,6 @@ define amdgpu_ps <2 x float> @flat_and_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -1716,7 +1658,7 @@ define amdgpu_ps <2 x float> @flat_and_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -1739,9 +1681,8 @@ define amdgpu_ps <2 x float> @flat_and_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB35_2
; GFX1250-GISEL-NEXT: .LBB35_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -1766,11 +1707,10 @@ define amdgpu_ps void @flat_and_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -1790,9 +1730,8 @@ define amdgpu_ps void @flat_and_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB36_2
; GFX1250-SDAG-NEXT: .LBB36_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -1806,13 +1745,12 @@ define amdgpu_ps void @flat_and_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB36_3
@@ -1831,9 +1769,8 @@ define amdgpu_ps void @flat_and_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB36_2
; GFX1250-GISEL-NEXT: .LBB36_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -1857,11 +1794,9 @@ define amdgpu_ps void @flat_and_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB37_3
@@ -1880,9 +1815,8 @@ define amdgpu_ps void @flat_and_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB37_2
; GFX1250-SDAG-NEXT: .LBB37_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -1896,16 +1830,15 @@ define amdgpu_ps void @flat_and_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB37_3
@@ -1924,9 +1857,8 @@ define amdgpu_ps void @flat_and_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB37_2
; GFX1250-GISEL-NEXT: .LBB37_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -2008,11 +1940,10 @@ define amdgpu_ps <2 x float> @flat_or_saddr_i64_rtn(ptr inreg %sbase, i32 %voffs
; GFX1250-SDAG-LABEL: flat_or_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -2035,9 +1966,8 @@ define amdgpu_ps <2 x float> @flat_or_saddr_i64_rtn(ptr inreg %sbase, i32 %voffs
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB42_2
; GFX1250-SDAG-NEXT: .LBB42_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -2054,12 +1984,11 @@ define amdgpu_ps <2 x float> @flat_or_saddr_i64_rtn(ptr inreg %sbase, i32 %voffs
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -2082,9 +2011,8 @@ define amdgpu_ps <2 x float> @flat_or_saddr_i64_rtn(ptr inreg %sbase, i32 %voffs
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB42_2
; GFX1250-GISEL-NEXT: .LBB42_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -2112,13 +2040,12 @@ define amdgpu_ps <2 x float> @flat_or_saddr_i64_rtn_neg128(ptr inreg %sbase, i32
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB43_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -2137,9 +2064,8 @@ define amdgpu_ps <2 x float> @flat_or_saddr_i64_rtn_neg128(ptr inreg %sbase, i32
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB43_2
; GFX1250-SDAG-NEXT: .LBB43_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -2156,7 +2082,6 @@ define amdgpu_ps <2 x float> @flat_or_saddr_i64_rtn_neg128(ptr inreg %sbase, i32
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -2164,7 +2089,7 @@ define amdgpu_ps <2 x float> @flat_or_saddr_i64_rtn_neg128(ptr inreg %sbase, i32
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -2187,9 +2112,8 @@ define amdgpu_ps <2 x float> @flat_or_saddr_i64_rtn_neg128(ptr inreg %sbase, i32
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB43_2
; GFX1250-GISEL-NEXT: .LBB43_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -2214,11 +2138,10 @@ define amdgpu_ps void @flat_or_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset, i
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -2238,9 +2161,8 @@ define amdgpu_ps void @flat_or_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset, i
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB44_2
; GFX1250-SDAG-NEXT: .LBB44_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -2254,13 +2176,12 @@ define amdgpu_ps void @flat_or_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset, i
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB44_3
@@ -2279,9 +2200,8 @@ define amdgpu_ps void @flat_or_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset, i
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB44_2
; GFX1250-GISEL-NEXT: .LBB44_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -2305,11 +2225,9 @@ define amdgpu_ps void @flat_or_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vof
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB45_3
@@ -2328,9 +2246,8 @@ define amdgpu_ps void @flat_or_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vof
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB45_2
; GFX1250-SDAG-NEXT: .LBB45_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -2344,16 +2261,15 @@ define amdgpu_ps void @flat_or_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vof
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB45_3
@@ -2372,9 +2288,8 @@ define amdgpu_ps void @flat_or_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vof
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB45_2
; GFX1250-GISEL-NEXT: .LBB45_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -2456,11 +2371,10 @@ define amdgpu_ps <2 x float> @flat_xor_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-LABEL: flat_xor_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -2483,9 +2397,8 @@ define amdgpu_ps <2 x float> @flat_xor_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB50_2
; GFX1250-SDAG-NEXT: .LBB50_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -2502,12 +2415,11 @@ define amdgpu_ps <2 x float> @flat_xor_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -2530,9 +2442,8 @@ define amdgpu_ps <2 x float> @flat_xor_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB50_2
; GFX1250-GISEL-NEXT: .LBB50_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -2560,13 +2471,12 @@ define amdgpu_ps <2 x float> @flat_xor_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB51_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -2585,9 +2495,8 @@ define amdgpu_ps <2 x float> @flat_xor_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB51_2
; GFX1250-SDAG-NEXT: .LBB51_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -2604,7 +2513,6 @@ define amdgpu_ps <2 x float> @flat_xor_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -2612,7 +2520,7 @@ define amdgpu_ps <2 x float> @flat_xor_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -2635,9 +2543,8 @@ define amdgpu_ps <2 x float> @flat_xor_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB51_2
; GFX1250-GISEL-NEXT: .LBB51_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -2662,11 +2569,10 @@ define amdgpu_ps void @flat_xor_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -2686,9 +2592,8 @@ define amdgpu_ps void @flat_xor_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB52_2
; GFX1250-SDAG-NEXT: .LBB52_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -2702,13 +2607,12 @@ define amdgpu_ps void @flat_xor_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB52_3
@@ -2727,9 +2631,8 @@ define amdgpu_ps void @flat_xor_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB52_2
; GFX1250-GISEL-NEXT: .LBB52_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -2753,11 +2656,9 @@ define amdgpu_ps void @flat_xor_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB53_3
@@ -2776,9 +2677,8 @@ define amdgpu_ps void @flat_xor_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB53_2
; GFX1250-SDAG-NEXT: .LBB53_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -2792,16 +2692,15 @@ define amdgpu_ps void @flat_xor_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB53_3
@@ -2820,9 +2719,8 @@ define amdgpu_ps void @flat_xor_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB53_2
; GFX1250-GISEL-NEXT: .LBB53_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -2898,11 +2796,10 @@ define amdgpu_ps <2 x float> @flat_max_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-LABEL: flat_max_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -2925,10 +2822,9 @@ define amdgpu_ps <2 x float> @flat_max_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB58_2
; GFX1250-SDAG-NEXT: .LBB58_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -2944,12 +2840,11 @@ define amdgpu_ps <2 x float> @flat_max_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -2972,10 +2867,9 @@ define amdgpu_ps <2 x float> @flat_max_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB58_2
; GFX1250-GISEL-NEXT: .LBB58_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -3002,13 +2896,12 @@ define amdgpu_ps <2 x float> @flat_max_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB59_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -3027,10 +2920,9 @@ define amdgpu_ps <2 x float> @flat_max_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB59_2
; GFX1250-SDAG-NEXT: .LBB59_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -3046,7 +2938,6 @@ define amdgpu_ps <2 x float> @flat_max_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -3054,7 +2945,7 @@ define amdgpu_ps <2 x float> @flat_max_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -3077,10 +2968,9 @@ define amdgpu_ps <2 x float> @flat_max_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB59_2
; GFX1250-GISEL-NEXT: .LBB59_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -3104,11 +2994,10 @@ define amdgpu_ps void @flat_max_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -3127,9 +3016,8 @@ define amdgpu_ps void @flat_max_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB60_2
; GFX1250-SDAG-NEXT: .LBB60_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -3142,13 +3030,12 @@ define amdgpu_ps void @flat_max_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB60_3
@@ -3166,9 +3053,8 @@ define amdgpu_ps void @flat_max_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB60_2
; GFX1250-GISEL-NEXT: .LBB60_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -3191,11 +3077,9 @@ define amdgpu_ps void @flat_max_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB61_3
@@ -3213,9 +3097,8 @@ define amdgpu_ps void @flat_max_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB61_2
; GFX1250-SDAG-NEXT: .LBB61_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -3228,16 +3111,15 @@ define amdgpu_ps void @flat_max_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB61_3
@@ -3255,9 +3137,8 @@ define amdgpu_ps void @flat_max_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB61_2
; GFX1250-GISEL-NEXT: .LBB61_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -3332,11 +3213,10 @@ define amdgpu_ps <2 x float> @flat_min_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-LABEL: flat_min_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -3359,10 +3239,9 @@ define amdgpu_ps <2 x float> @flat_min_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB66_2
; GFX1250-SDAG-NEXT: .LBB66_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -3378,12 +3257,11 @@ define amdgpu_ps <2 x float> @flat_min_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -3406,10 +3284,9 @@ define amdgpu_ps <2 x float> @flat_min_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB66_2
; GFX1250-GISEL-NEXT: .LBB66_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -3436,13 +3313,12 @@ define amdgpu_ps <2 x float> @flat_min_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB67_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -3461,10 +3337,9 @@ define amdgpu_ps <2 x float> @flat_min_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB67_2
; GFX1250-SDAG-NEXT: .LBB67_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -3480,7 +3355,6 @@ define amdgpu_ps <2 x float> @flat_min_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -3488,7 +3362,7 @@ define amdgpu_ps <2 x float> @flat_min_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -3511,10 +3385,9 @@ define amdgpu_ps <2 x float> @flat_min_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB67_2
; GFX1250-GISEL-NEXT: .LBB67_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -3538,11 +3411,10 @@ define amdgpu_ps void @flat_min_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -3561,9 +3433,8 @@ define amdgpu_ps void @flat_min_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB68_2
; GFX1250-SDAG-NEXT: .LBB68_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -3576,13 +3447,12 @@ define amdgpu_ps void @flat_min_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB68_3
@@ -3600,9 +3470,8 @@ define amdgpu_ps void @flat_min_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB68_2
; GFX1250-GISEL-NEXT: .LBB68_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -3625,11 +3494,9 @@ define amdgpu_ps void @flat_min_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB69_3
@@ -3647,9 +3514,8 @@ define amdgpu_ps void @flat_min_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB69_2
; GFX1250-SDAG-NEXT: .LBB69_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -3662,16 +3528,15 @@ define amdgpu_ps void @flat_min_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB69_3
@@ -3689,9 +3554,8 @@ define amdgpu_ps void @flat_min_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB69_2
; GFX1250-GISEL-NEXT: .LBB69_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -3766,11 +3630,10 @@ define amdgpu_ps <2 x float> @flat_umax_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-SDAG-LABEL: flat_umax_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -3793,10 +3656,9 @@ define amdgpu_ps <2 x float> @flat_umax_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB74_2
; GFX1250-SDAG-NEXT: .LBB74_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -3812,12 +3674,11 @@ define amdgpu_ps <2 x float> @flat_umax_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -3840,10 +3701,9 @@ define amdgpu_ps <2 x float> @flat_umax_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB74_2
; GFX1250-GISEL-NEXT: .LBB74_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -3870,13 +3730,12 @@ define amdgpu_ps <2 x float> @flat_umax_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB75_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -3895,10 +3754,9 @@ define amdgpu_ps <2 x float> @flat_umax_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB75_2
; GFX1250-SDAG-NEXT: .LBB75_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -3914,7 +3772,6 @@ define amdgpu_ps <2 x float> @flat_umax_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -3922,7 +3779,7 @@ define amdgpu_ps <2 x float> @flat_umax_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -3945,10 +3802,9 @@ define amdgpu_ps <2 x float> @flat_umax_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB75_2
; GFX1250-GISEL-NEXT: .LBB75_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -3972,11 +3828,10 @@ define amdgpu_ps void @flat_umax_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -3995,9 +3850,8 @@ define amdgpu_ps void @flat_umax_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB76_2
; GFX1250-SDAG-NEXT: .LBB76_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -4010,13 +3864,12 @@ define amdgpu_ps void @flat_umax_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB76_3
@@ -4034,9 +3887,8 @@ define amdgpu_ps void @flat_umax_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB76_2
; GFX1250-GISEL-NEXT: .LBB76_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -4059,11 +3911,9 @@ define amdgpu_ps void @flat_umax_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB77_3
@@ -4081,9 +3931,8 @@ define amdgpu_ps void @flat_umax_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB77_2
; GFX1250-SDAG-NEXT: .LBB77_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -4096,16 +3945,15 @@ define amdgpu_ps void @flat_umax_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB77_3
@@ -4123,9 +3971,8 @@ define amdgpu_ps void @flat_umax_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB77_2
; GFX1250-GISEL-NEXT: .LBB77_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -4200,11 +4047,10 @@ define amdgpu_ps <2 x float> @flat_umin_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-SDAG-LABEL: flat_umin_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -4227,10 +4073,9 @@ define amdgpu_ps <2 x float> @flat_umin_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB82_2
; GFX1250-SDAG-NEXT: .LBB82_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -4246,12 +4091,11 @@ define amdgpu_ps <2 x float> @flat_umin_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -4274,10 +4118,9 @@ define amdgpu_ps <2 x float> @flat_umin_saddr_i64_rtn(ptr inreg %sbase, i32 %vof
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB82_2
; GFX1250-GISEL-NEXT: .LBB82_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -4304,13 +4147,12 @@ define amdgpu_ps <2 x float> @flat_umin_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB83_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -4329,10 +4171,9 @@ define amdgpu_ps <2 x float> @flat_umin_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB83_2
; GFX1250-SDAG-NEXT: .LBB83_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -4348,7 +4189,6 @@ define amdgpu_ps <2 x float> @flat_umin_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -4356,7 +4196,7 @@ define amdgpu_ps <2 x float> @flat_umin_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -4379,10 +4219,9 @@ define amdgpu_ps <2 x float> @flat_umin_saddr_i64_rtn_neg128(ptr inreg %sbase, i
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB83_2
; GFX1250-GISEL-NEXT: .LBB83_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -4406,11 +4245,10 @@ define amdgpu_ps void @flat_umin_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -4429,9 +4267,8 @@ define amdgpu_ps void @flat_umin_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB84_2
; GFX1250-SDAG-NEXT: .LBB84_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -4444,13 +4281,12 @@ define amdgpu_ps void @flat_umin_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB84_3
@@ -4468,9 +4304,8 @@ define amdgpu_ps void @flat_umin_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB84_2
; GFX1250-GISEL-NEXT: .LBB84_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -4493,11 +4328,9 @@ define amdgpu_ps void @flat_umin_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB85_3
@@ -4515,9 +4348,8 @@ define amdgpu_ps void @flat_umin_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB85_2
; GFX1250-SDAG-NEXT: .LBB85_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -4530,16 +4362,15 @@ define amdgpu_ps void @flat_umin_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB85_3
@@ -4557,9 +4388,8 @@ define amdgpu_ps void @flat_umin_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %v
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB85_2
; GFX1250-GISEL-NEXT: .LBB85_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -4654,12 +4484,11 @@ define amdgpu_ps <2 x float> @flat_cmpxchg_saddr_i64_rtn(ptr inreg %sbase, i32 %
; GFX1250-SDAG-LABEL: flat_cmpxchg_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v7, v2 :: v_dual_mov_b32 v6, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v1, 0 :: v_dual_mov_b32 v5, v4
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v4, v3
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[2:3], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v3
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v3
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -4684,9 +4513,8 @@ define amdgpu_ps <2 x float> @flat_cmpxchg_saddr_i64_rtn(ptr inreg %sbase, i32 %
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB90_2
; GFX1250-SDAG-NEXT: .LBB90_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v2
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v8, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v8, off
@@ -4704,12 +4532,11 @@ define amdgpu_ps <2 x float> @flat_cmpxchg_saddr_i64_rtn(ptr inreg %sbase, i32 %
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v0 :: v_dual_mov_b32 v8, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v9, v2 :: v_dual_mov_b32 v6, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v0, v5
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v7, v4 :: v_dual_bitop2_b32 v0, s0, v3 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v7, v4 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v3 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -4734,9 +4561,8 @@ define amdgpu_ps <2 x float> @flat_cmpxchg_saddr_i64_rtn(ptr inreg %sbase, i32 %
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB90_2
; GFX1250-GISEL-NEXT: .LBB90_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -4766,13 +4592,12 @@ define amdgpu_ps <2 x float> @flat_cmpxchg_saddr_i64_rtn_neg128(ptr inreg %sbase
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[2:3], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v3
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v3
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB91_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -4793,9 +4618,8 @@ define amdgpu_ps <2 x float> @flat_cmpxchg_saddr_i64_rtn_neg128(ptr inreg %sbase
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB91_2
; GFX1250-SDAG-NEXT: .LBB91_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v2
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v8, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v8, off
@@ -4813,7 +4637,6 @@ define amdgpu_ps <2 x float> @flat_cmpxchg_saddr_i64_rtn_neg128(ptr inreg %sbase
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v0 :: v_dual_mov_b32 v8, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v9, v2 :: v_dual_mov_b32 v6, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v5
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -4821,7 +4644,7 @@ define amdgpu_ps <2 x float> @flat_cmpxchg_saddr_i64_rtn_neg128(ptr inreg %sbase
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v7, v4 :: v_dual_bitop2_b32 v0, s0, v3 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v7, v4 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v3 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -4846,9 +4669,8 @@ define amdgpu_ps <2 x float> @flat_cmpxchg_saddr_i64_rtn_neg128(ptr inreg %sbase
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB91_2
; GFX1250-GISEL-NEXT: .LBB91_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -4873,13 +4695,12 @@ define amdgpu_ps void @flat_cmpxchg_saddr_i64_nortn(ptr inreg %sbase, i32 %voffs
; GFX1250-SDAG-LABEL: flat_cmpxchg_saddr_i64_nortn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v7, v2 :: v_dual_mov_b32 v6, v1
-; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: v_dual_mov_b32 v5, v4 :: v_dual_mov_b32 v4, v3
+; GFX1250-SDAG-NEXT: v_dual_mov_b32 v1, 0 :: v_dual_mov_b32 v5, v4
+; GFX1250-SDAG-NEXT: v_mov_b32_e32 v4, v3
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v2, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v2, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v2
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -4901,9 +4722,8 @@ define amdgpu_ps void @flat_cmpxchg_saddr_i64_nortn(ptr inreg %sbase, i32 %voffs
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB92_2
; GFX1250-SDAG-NEXT: .LBB92_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v2, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v2, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v2, -1, v2, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -4918,13 +4738,12 @@ define amdgpu_ps void @flat_cmpxchg_saddr_i64_nortn(ptr inreg %sbase, i32 %voffs
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v8, v1 :: v_dual_mov_b32 v9, v2
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v6, v3 :: v_dual_mov_b32 v7, v4
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB92_3
@@ -4945,9 +4764,8 @@ define amdgpu_ps void @flat_cmpxchg_saddr_i64_nortn(ptr inreg %sbase, i32 %voffs
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB92_2
; GFX1250-GISEL-NEXT: .LBB92_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -4972,11 +4790,9 @@ define amdgpu_ps void @flat_cmpxchg_saddr_i64_nortn_neg128(ptr inreg %sbase, i32
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v2, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v2, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v2
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB93_3
@@ -4997,9 +4813,8 @@ define amdgpu_ps void @flat_cmpxchg_saddr_i64_nortn_neg128(ptr inreg %sbase, i32
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB93_2
; GFX1250-SDAG-NEXT: .LBB93_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v2, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v2, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v2, -1, v2, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -5014,16 +4829,15 @@ define amdgpu_ps void @flat_cmpxchg_saddr_i64_nortn_neg128(ptr inreg %sbase, i32
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v8, v1 :: v_dual_mov_b32 v9, v2
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v6, v3 :: v_dual_mov_b32 v7, v4
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB93_3
@@ -5044,9 +4858,8 @@ define amdgpu_ps void @flat_cmpxchg_saddr_i64_nortn_neg128(ptr inreg %sbase, i32
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB93_2
; GFX1250-GISEL-NEXT: .LBB93_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -5120,11 +4933,10 @@ define amdgpu_ps <2 x float> @flat_inc_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-LABEL: flat_inc_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -5146,10 +4958,9 @@ define amdgpu_ps <2 x float> @flat_inc_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB98_2
; GFX1250-SDAG-NEXT: .LBB98_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_2)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5167,12 +4978,11 @@ define amdgpu_ps <2 x float> @flat_inc_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -5194,10 +5004,9 @@ define amdgpu_ps <2 x float> @flat_inc_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB98_2
; GFX1250-GISEL-NEXT: .LBB98_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_2)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5228,13 +5037,12 @@ define amdgpu_ps <2 x float> @flat_inc_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB99_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -5252,10 +5060,9 @@ define amdgpu_ps <2 x float> @flat_inc_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB99_2
; GFX1250-SDAG-NEXT: .LBB99_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_2)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5273,7 +5080,6 @@ define amdgpu_ps <2 x float> @flat_inc_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -5281,7 +5087,7 @@ define amdgpu_ps <2 x float> @flat_inc_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -5303,10 +5109,9 @@ define amdgpu_ps <2 x float> @flat_inc_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB99_2
; GFX1250-GISEL-NEXT: .LBB99_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_2)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5334,11 +5139,10 @@ define amdgpu_ps void @flat_inc_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -5356,9 +5160,8 @@ define amdgpu_ps void @flat_inc_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB100_2
; GFX1250-SDAG-NEXT: .LBB100_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_2)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5373,13 +5176,12 @@ define amdgpu_ps void @flat_inc_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB100_3
@@ -5396,9 +5198,8 @@ define amdgpu_ps void @flat_inc_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB100_2
; GFX1250-GISEL-NEXT: .LBB100_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_2)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5425,11 +5226,9 @@ define amdgpu_ps void @flat_inc_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB101_3
@@ -5446,9 +5245,8 @@ define amdgpu_ps void @flat_inc_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB101_2
; GFX1250-SDAG-NEXT: .LBB101_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_2)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v6, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5463,16 +5261,15 @@ define amdgpu_ps void @flat_inc_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB101_3
@@ -5489,9 +5286,8 @@ define amdgpu_ps void @flat_inc_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB101_2
; GFX1250-GISEL-NEXT: .LBB101_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_2)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5569,11 +5365,10 @@ define amdgpu_ps <2 x float> @flat_dec_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-LABEL: flat_dec_saddr_i64_rtn:
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -5595,10 +5390,9 @@ define amdgpu_ps <2 x float> @flat_dec_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s1, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB106_2
; GFX1250-SDAG-NEXT: .LBB106_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s0, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5619,12 +5413,11 @@ define amdgpu_ps <2 x float> @flat_dec_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, 0, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -5646,10 +5439,9 @@ define amdgpu_ps <2 x float> @flat_dec_saddr_i64_rtn(ptr inreg %sbase, i32 %voff
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s1, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB106_2
; GFX1250-GISEL-NEXT: .LBB106_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5681,13 +5473,12 @@ define amdgpu_ps <2 x float> @flat_dec_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[4:5], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, s0, v5
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GFX1250-SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; GFX1250-SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-SDAG-NEXT: s_and_saveexec_b32 s0, vcc_lo
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB107_3
; GFX1250-SDAG-NEXT: ; %bb.1: ; %Flow
@@ -5705,10 +5496,9 @@ define amdgpu_ps <2 x float> @flat_dec_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s1, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB107_2
; GFX1250-SDAG-NEXT: .LBB107_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GFX1250-SDAG-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, s0, v4
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5729,7 +5519,6 @@ define amdgpu_ps <2 x float> @flat_dec_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v3, v0 :: v_dual_mov_b32 v4, v1
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[0:1], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v3
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
@@ -5737,7 +5526,7 @@ define amdgpu_ps <2 x float> @flat_dec_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: v_add_co_u32 v6, vcc_lo, 0xffffff80, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v7, null, -1, v1, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, s0, v7 bitop3:0x14
+; GFX1250-GISEL-NEXT: v_dual_mov_b32 v5, v2 :: v_dual_bitop2_b32 v0, src_flat_scratch_base_hi, v7 bitop3:0x14
; GFX1250-GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GFX1250-GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX1250-GISEL-NEXT: s_and_saveexec_b32 s0, vcc_lo
@@ -5759,10 +5548,9 @@ define amdgpu_ps <2 x float> @flat_dec_saddr_i64_rtn_neg128(ptr inreg %sbase, i3
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s1, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB107_2
; GFX1250-GISEL-NEXT: .LBB107_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[6:7]
; GFX1250-GISEL-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v6
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v6
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v6, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v6, off
@@ -5791,11 +5579,10 @@ define amdgpu_ps void @flat_dec_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG: ; %bb.0:
; GFX1250-SDAG-NEXT: v_dual_mov_b32 v3, v2 :: v_dual_mov_b32 v2, v1
; GFX1250-SDAG-NEXT: v_mov_b32_e32 v1, 0
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
@@ -5813,9 +5600,8 @@ define amdgpu_ps void @flat_dec_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB108_2
; GFX1250-SDAG-NEXT: .LBB108_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -5833,13 +5619,12 @@ define amdgpu_ps void @flat_dec_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB108_3
@@ -5856,9 +5641,8 @@ define amdgpu_ps void @flat_dec_saddr_i64_nortn(ptr inreg %sbase, i32 %voffset,
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB108_2
; GFX1250-GISEL-NEXT: .LBB108_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
@@ -5886,11 +5670,9 @@ define amdgpu_ps void @flat_dec_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[2:3], v[0:1]
; GFX1250-SDAG-NEXT: v_add_nc_u64_e32 v[0:1], s[0:1], v[0:1]
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, s0, v1
; GFX1250-SDAG-NEXT: s_mov_b32 s0, exec_lo
-; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-SDAG-NEXT: v_xor_b32_e32 v4, src_flat_scratch_base_hi, v1
; GFX1250-SDAG-NEXT: v_cmpx_lt_u32_e32 0x3ffffff, v4
; GFX1250-SDAG-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-SDAG-NEXT: s_cbranch_execnz .LBB109_3
@@ -5907,9 +5689,8 @@ define amdgpu_ps void @flat_dec_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-SDAG-NEXT: s_cbranch_execz .LBB109_2
; GFX1250-SDAG-NEXT: .LBB109_4: ; %atomicrmw.private
-; GFX1250-SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[0:1]
-; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, s0, v0
+; GFX1250-SDAG-NEXT: v_subrev_nc_u32_e32 v4, src_flat_scratch_base_lo, v0
; GFX1250-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc_lo
; GFX1250-SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -5927,16 +5708,15 @@ define amdgpu_ps void @flat_dec_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL: ; %bb.0:
; GFX1250-GISEL-NEXT: v_dual_mov_b32 v4, v1 :: v_dual_mov_b32 v5, v2
; GFX1250-GISEL-NEXT: v_mov_b64_e32 v[2:3], s[2:3]
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v1, vcc_lo, v2, v0
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, 0, v3, vcc_lo
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_add_co_u32 v2, vcc_lo, 0xffffff80, v1
; GFX1250-GISEL-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
-; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, s0, v3
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, exec_lo
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1250-GISEL-NEXT: v_xor_b32_e32 v1, src_flat_scratch_base_hi, v3
; GFX1250-GISEL-NEXT: v_cmpx_le_u32_e32 0x4000000, v1
; GFX1250-GISEL-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX1250-GISEL-NEXT: s_cbranch_execnz .LBB109_3
@@ -5953,9 +5733,8 @@ define amdgpu_ps void @flat_dec_saddr_i64_nortn_neg128(ptr inreg %sbase, i32 %vo
; GFX1250-GISEL-NEXT: s_and_not1_saveexec_b32 s0, s0
; GFX1250-GISEL-NEXT: s_cbranch_execz .LBB109_2
; GFX1250-GISEL-NEXT: .LBB109_4: ; %atomicrmw.private
-; GFX1250-GISEL-NEXT: s_mov_b32 s0, src_flat_scratch_base_lo
; GFX1250-GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
-; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, s0, v2
+; GFX1250-GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; GFX1250-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX1250-GISEL-NEXT: v_cndmask_b32_e32 v2, -1, v0, vcc_lo
; GFX1250-GISEL-NEXT: scratch_load_b64 v[0:1], v2, off
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.is.private.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.is.private.ll
index 9e1815b48abfd..f4e88b4b564eb 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.is.private.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.is.private.ll
@@ -2,10 +2,12 @@
; RUN: llc -global-isel=0 -mtriple=amdgcn-amd-amdhsa -mcpu=tahiti < %s | FileCheck -check-prefixes=SI,SI-SDAG %s
; RUN: llc -global-isel=0 -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii < %s | FileCheck -check-prefixes=CI,CI-SDAG %s
; RUN: llc -global-isel=0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck -check-prefixes=GFX9,GFX9-SDAG %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 < %s | FileCheck -check-prefixes=GFX1250,GFX1250-SDAG %s
; RUN: llc -global-isel=1 -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii < %s | FileCheck -check-prefixes=CI,CI-GISEL %s
; RUN: llc -global-isel=1 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck -check-prefixes=GFX9,GFX9-GISEL %s
; RUN: llc -global-isel=1 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10,GFX10-GISEL %s
; RUN: llc -global-isel=1 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 < %s | FileCheck -check-prefixes=GFX11,GFX11-GISEL %s
+; RUN: llc -global-isel=1 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 < %s | FileCheck -check-prefixes=GFX1250,GFX1250-GISEL %s
define amdgpu_kernel void @is_private_vgpr(ptr addrspace(1) %ptr.ptr) {
; SI-LABEL: is_private_vgpr:
@@ -57,6 +59,21 @@ define amdgpu_kernel void @is_private_vgpr(ptr addrspace(1) %ptr.ptr) {
; GFX9-NEXT: global_store_dword v[0:1], v0, off
; GFX9-NEXT: s_endpgm
;
+; GFX1250-LABEL: is_private_vgpr:
+; GFX1250: ; %bb.0:
+; GFX1250-NEXT: s_load_b64 s[0:1], s[4:5], 0x0
+; GFX1250-NEXT: v_and_b32_e32 v0, 0x3ff, v0
+; GFX1250-NEXT: s_wait_kmcnt 0x0
+; GFX1250-NEXT: global_load_b64 v[0:1], v0, s[0:1] scale_offset scope:SCOPE_SYS
+; GFX1250-NEXT: s_wait_loadcnt 0x0
+; GFX1250-NEXT: s_wait_xcnt 0x0
+; GFX1250-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v1
+; GFX1250-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1250-NEXT: v_cmp_gt_u32_e32 vcc_lo, 0x4000000, v0
+; GFX1250-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc_lo
+; GFX1250-NEXT: global_store_b32 v[0:1], v0, off
+; GFX1250-NEXT: s_endpgm
+;
; CI-GISEL-LABEL: is_private_vgpr:
; CI-GISEL: ; %bb.0:
; CI-GISEL-NEXT: s_load_dwordx2 s[0:1], s[8:9], 0x0
@@ -170,6 +187,23 @@ define amdgpu_kernel void @is_private_sgpr(ptr %ptr) {
; GFX9-SDAG-NEXT: .LBB1_2: ; %bb1
; GFX9-SDAG-NEXT: s_endpgm
;
+; GFX1250-SDAG-LABEL: is_private_sgpr:
+; GFX1250-SDAG: ; %bb.0:
+; GFX1250-SDAG-NEXT: s_load_b32 s0, s[4:5], 0x4
+; GFX1250-SDAG-NEXT: s_wait_kmcnt 0x0
+; GFX1250-SDAG-NEXT: s_xor_b32 s0, s0, src_flat_scratch_base_hi
+; GFX1250-SDAG-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1250-SDAG-NEXT: s_cmp_lt_u32 s0, 0x4000000
+; GFX1250-SDAG-NEXT: s_cselect_b32 s0, -1, 0
+; GFX1250-SDAG-NEXT: s_and_not1_b32 vcc_lo, exec_lo, s0
+; GFX1250-SDAG-NEXT: s_cbranch_vccnz .LBB1_2
+; GFX1250-SDAG-NEXT: ; %bb.1: ; %bb0
+; GFX1250-SDAG-NEXT: v_mov_b32_e32 v0, 0
+; GFX1250-SDAG-NEXT: global_store_b32 v[0:1], v0, off scope:SCOPE_SYS
+; GFX1250-SDAG-NEXT: s_wait_storecnt 0x0
+; GFX1250-SDAG-NEXT: .LBB1_2: ; %bb1
+; GFX1250-SDAG-NEXT: s_endpgm
+;
; CI-GISEL-LABEL: is_private_sgpr:
; CI-GISEL: ; %bb.0:
; CI-GISEL-NEXT: s_load_dwordx2 s[0:1], s[8:9], 0x0
@@ -229,6 +263,21 @@ define amdgpu_kernel void @is_private_sgpr(ptr %ptr) {
; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-NEXT: .LBB1_2: ; %bb1
; GFX11-NEXT: s_endpgm
+;
+; GFX1250-GISEL-LABEL: is_private_sgpr:
+; GFX1250-GISEL: ; %bb.0:
+; GFX1250-GISEL-NEXT: s_load_b64 s[0:1], s[4:5], 0x0
+; GFX1250-GISEL-NEXT: s_wait_kmcnt 0x0
+; GFX1250-GISEL-NEXT: s_xor_b32 s0, s1, src_flat_scratch_base_hi
+; GFX1250-GISEL-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX1250-GISEL-NEXT: s_cmp_ge_u32 s0, 0x4000000
+; GFX1250-GISEL-NEXT: s_cbranch_scc1 .LBB1_2
+; GFX1250-GISEL-NEXT: ; %bb.1: ; %bb0
+; GFX1250-GISEL-NEXT: v_mov_b32_e32 v0, 0
+; GFX1250-GISEL-NEXT: global_store_b32 v[0:1], v0, off scope:SCOPE_SYS
+; GFX1250-GISEL-NEXT: s_wait_storecnt 0x0
+; GFX1250-GISEL-NEXT: .LBB1_2: ; %bb1
+; GFX1250-GISEL-NEXT: s_endpgm
%val = call i1 @llvm.amdgcn.is.private(ptr %ptr)
br i1 %val, label %bb0, label %bb1
diff --git a/llvm/test/CodeGen/AMDGPU/scale-offset-flat.ll b/llvm/test/CodeGen/AMDGPU/scale-offset-flat.ll
index 725d57d852966..788cdd1c89051 100644
--- a/llvm/test/CodeGen/AMDGPU/scale-offset-flat.ll
+++ b/llvm/test/CodeGen/AMDGPU/scale-offset-flat.ll
@@ -337,11 +337,9 @@ define amdgpu_ps <2 x float> @flat_atomicrmw_b64_rtn_idxprom(ptr align 8 inreg %
; SDAG-LABEL: flat_atomicrmw_b64_rtn_idxprom:
; SDAG: ; %bb.0: ; %entry
; SDAG-NEXT: v_ashrrev_i32_e32 v1, 31, v0
-; SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; SDAG-NEXT: v_lshl_add_u64 v[2:3], v[0:1], 3, s[0:1]
-; SDAG-NEXT: s_mov_b32 s0, src_flat_scratch_base_hi
-; SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; SDAG-NEXT: v_xor_b32_e32 v0, s0, v3
+; SDAG-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v3
; SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; SDAG-NEXT: v_cmp_lt_u32_e32 vcc_lo, 0x3ffffff, v0
; SDAG-NEXT: ; implicit-def: $vgpr0_vgpr1
@@ -363,10 +361,9 @@ define amdgpu_ps <2 x float> @flat_atomicrmw_b64_rtn_idxprom(ptr align 8 inreg %
; SDAG-NEXT: s_and_not1_saveexec_b32 s0, s0
; SDAG-NEXT: s_cbranch_execz .LBB21_2
; SDAG-NEXT: .LBB21_4: ; %atomicrmw.private
-; SDAG-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; SDAG-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[2:3]
; SDAG-NEXT: s_wait_loadcnt_dscnt 0x0
-; SDAG-NEXT: v_subrev_nc_u32_e32 v0, s1, v2
+; SDAG-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v2
; SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
; SDAG-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; SDAG-NEXT: scratch_load_b64 v[0:1], v4, off
@@ -380,7 +377,6 @@ define amdgpu_ps <2 x float> @flat_atomicrmw_b64_rtn_idxprom(ptr align 8 inreg %
;
; GISEL-LABEL: flat_atomicrmw_b64_rtn_idxprom:
; GISEL: ; %bb.0: ; %entry
-; GISEL-NEXT: s_mov_b32 s2, src_flat_scratch_base_hi
; GISEL-NEXT: v_mov_b32_e32 v2, v0
; GISEL-NEXT: v_mov_b64_e32 v[4:5], s[0:1]
; GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
@@ -390,7 +386,7 @@ define amdgpu_ps <2 x float> @flat_atomicrmw_b64_rtn_idxprom(ptr align 8 inreg %
; GISEL-NEXT: v_add_co_u32 v4, vcc_lo, v4, v0
; GISEL-NEXT: v_add_co_ci_u32_e64 v5, null, v5, v1, vcc_lo
; GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GISEL-NEXT: v_xor_b32_e32 v0, s2, v5
+; GISEL-NEXT: v_xor_b32_e32 v0, src_flat_scratch_base_hi, v5
; GISEL-NEXT: v_cmp_le_u32_e32 vcc_lo, 0x4000000, v0
; GISEL-NEXT: ; implicit-def: $vgpr0_vgpr1
; GISEL-NEXT: s_and_saveexec_b32 s2, vcc_lo
@@ -412,10 +408,9 @@ define amdgpu_ps <2 x float> @flat_atomicrmw_b64_rtn_idxprom(ptr align 8 inreg %
; GISEL-NEXT: s_and_not1_saveexec_b32 s0, s2
; GISEL-NEXT: s_cbranch_execz .LBB21_2
; GISEL-NEXT: .LBB21_4: ; %atomicrmw.private
-; GISEL-NEXT: s_mov_b32 s1, src_flat_scratch_base_lo
; GISEL-NEXT: v_cmp_ne_u64_e32 vcc_lo, 0, v[4:5]
; GISEL-NEXT: s_wait_loadcnt_dscnt 0x0
-; GISEL-NEXT: v_subrev_nc_u32_e32 v0, s1, v4
+; GISEL-NEXT: v_subrev_nc_u32_e32 v0, src_flat_scratch_base_lo, v4
; GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GISEL-NEXT: v_cndmask_b32_e32 v4, -1, v0, vcc_lo
; GISEL-NEXT: scratch_load_b64 v[0:1], v4, off
From 462929183cafc8d1229dc167972195f4b088e339 Mon Sep 17 00:00:00 2001
From: Jordan Rupprecht <rupprecht at google.com>
Date: Mon, 18 Aug 2025 15:15:12 -0500
Subject: [PATCH 084/112] [bazel] Port #153497: reland clang modules scanner
change (#154192)
---
utils/bazel/llvm-project-overlay/clang/BUILD.bazel | 1 +
1 file changed, 1 insertion(+)
diff --git a/utils/bazel/llvm-project-overlay/clang/BUILD.bazel b/utils/bazel/llvm-project-overlay/clang/BUILD.bazel
index 64f86c4b15083..7076c104860da 100644
--- a/utils/bazel/llvm-project-overlay/clang/BUILD.bazel
+++ b/utils/bazel/llvm-project-overlay/clang/BUILD.bazel
@@ -1529,6 +1529,7 @@ cc_library(
":basic",
":config",
":driver_options_inc_gen",
+ ":lex",
":parse",
":static_analyzer_checkers_gen",
"//llvm:BinaryFormat",
From 4220538e257939fca7472ea9d5dfedee1fae7bd7 Mon Sep 17 00:00:00 2001
From: Thurston Dang <thurston at google.com>
Date: Mon, 18 Aug 2025 13:18:27 -0700
Subject: [PATCH 085/112] [msan] Handle multiply-add-accumulate; apply to AVX
Vector Neural Network Instructions (VNNI) (#153927)
This extends the pmadd handler (recently improved in https://github.com/llvm/llvm-project/pull/153353) to three-operand intrinsics (multiply-add-accumulate), and applies it to the AVX Vector Neural Network Instructions.
Updates the tests from https://github.com/llvm/llvm-project/pull/153135
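
For readers unfamiliar with the shadow computation, here is a rough standalone sketch of how the three-operand case could be modeled outside of LLVM. It is not the code in MemorySanitizer.cpp: the function name maddAccumulateShadow and the use of plain std::vector are assumptions made for illustration, and the per-product step is deliberately simplified (the real handler's multiplication step may be more precise than "any poisoned input poisons the product"). The parts that follow the patch are the all-ones/zero result of the horizontal reduction and the final OR with the accumulator's shadow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model (hypothetical helper, not an LLVM API).
// A shadow value of 0 means "fully initialized"; any nonzero bit means
// "possibly uninitialized".
//
// sa, sb: one shadow entry per narrow input element of the two multiplied
//         operands (e.g., 16 entries for the bytes of a 128-bit vpdpbusd).
// sacc:   one shadow entry per accumulator/output element (e.g., 4 x i32).
// reductionFactor: how many adjacent products are summed into one output
//                  element (4 for the byte variants, 2 for the word variants).
std::vector<uint32_t> maddAccumulateShadow(const std::vector<uint32_t> &sa,
                                           const std::vector<uint32_t> &sb,
                                           const std::vector<uint32_t> &sacc,
                                           std::size_t reductionFactor) {
  assert(sa.size() == sb.size());
  assert(reductionFactor != 0 && sa.size() == sacc.size() * reductionFactor);

  std::vector<uint32_t> out(sacc.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    // Steps 1 and 2 (simplified): treat the horizontal sum as poisoned if
    // any contributing product has a poisoned input.
    bool poisoned = false;
    for (std::size_t j = 0; j < reductionFactor; ++j) {
      std::size_t k = i * reductionFactor + j;
      poisoned = poisoned || sa[k] != 0 || sb[k] != 0;
    }
    // Mirrors sext(icmp ne horizontal, 0): all-ones if poisoned, else zero.
    uint32_t maddShadow = poisoned ? 0xFFFFFFFFu : 0u;
    // Step 3: OR in the accumulator's shadow for this element.
    out[i] = maddShadow | sacc[i];
  }
  return out;
}
```

In this model a single poisoned byte in %a or %b poisons the whole i32 lane it feeds, while a poisoned accumulator lane contributes only its own shadow bits, which matches the OR added as "Step 3" in the handler.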
---
.../Instrumentation/MemorySanitizer.cpp | 187 +++++-
.../X86/avx10_2_512ni-intrinsics.ll | 86 ++-
.../X86/avx10_2ni-intrinsics.ll | 122 ++--
.../X86/avx512vl_vnni-intrinsics-upgrade.ll | 546 +++++++++++++++---
.../X86/avx512vl_vnni-intrinsics.ll | 546 +++++++++++++++---
.../X86/avx512vnni-intrinsics-upgrade.ll | 274 +++++++--
.../X86/avx512vnni-intrinsics.ll | 274 +++++++--
.../X86/avx_vnni-intrinsics.ll | 194 +++++--
.../X86/avxvnniint8-intrinsics.ll | 198 +++++--
9 files changed, 2069 insertions(+), 358 deletions(-)
diff --git a/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp b/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
index 7865a90707400..948e2c6e06843 100644
--- a/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
+++ b/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
@@ -3846,7 +3846,7 @@ struct MemorySanitizerVisitor : public InstVisitor<MemorySanitizerVisitor> {
setOriginForNaryOp(I);
}
- // Instrument multiply-add intrinsics.
+ // Instrument multiply-add(-accumulate)? intrinsics.
//
// e.g., Two operands:
// <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %a, <8 x i16> %b)
@@ -3854,10 +3854,13 @@ struct MemorySanitizerVisitor : public InstVisitor<MemorySanitizerVisitor> {
// Two operands which require an EltSizeInBits override:
// <1 x i64> @llvm.x86.mmx.pmadd.wd(<1 x i64> %a, <1 x i64> %b)
//
- // Three operands are not implemented yet:
+ // Three operands:
// <4 x i32> @llvm.x86.avx512.vpdpbusd.128
// (<4 x i32> %s, <4 x i32> %a, <4 x i32> %b)
- // (the result of multiply-add'ing %a and %b is accumulated with %s)
+ // (this is equivalent to multiply-add on %a and %b, followed by
+ // adding/"accumulating" %s. "Accumulation" stores the result in one
+ // of the source registers, but this accumulate vs. add distinction
+ // is lost when dealing with LLVM intrinsics.)
void handleVectorPmaddIntrinsic(IntrinsicInst &I, unsigned ReductionFactor,
unsigned EltSizeInBits = 0) {
IRBuilder<> IRB(&I);
@@ -3866,22 +3869,39 @@ struct MemorySanitizerVisitor : public InstVisitor<MemorySanitizerVisitor> {
cast<FixedVectorType>(I.getType());
assert(isa<FixedVectorType>(ReturnType));
- assert(I.arg_size() == 2);
-
// Vectors A and B, and shadows
- Value *Va = I.getOperand(0);
- Value *Vb = I.getOperand(1);
+ Value *Va = nullptr;
+ Value *Vb = nullptr;
+ Value *Sa = nullptr;
+ Value *Sb = nullptr;
- Value *Sa = getShadow(&I, 0);
- Value *Sb = getShadow(&I, 1);
+ assert(I.arg_size() == 2 || I.arg_size() == 3);
+ if (I.arg_size() == 2) {
+ Va = I.getOperand(0);
+ Vb = I.getOperand(1);
- FixedVectorType *ParamType =
- cast<FixedVectorType>(I.getArgOperand(0)->getType());
- assert(ParamType == I.getArgOperand(1)->getType());
+ Sa = getShadow(&I, 0);
+ Sb = getShadow(&I, 1);
+ } else if (I.arg_size() == 3) {
+ // Operand 0 is the accumulator. We will deal with that below.
+ Va = I.getOperand(1);
+ Vb = I.getOperand(2);
+
+ Sa = getShadow(&I, 1);
+ Sb = getShadow(&I, 2);
+ }
+
+ FixedVectorType *ParamType = cast<FixedVectorType>(Va->getType());
+ assert(ParamType == Vb->getType());
assert(ParamType->getPrimitiveSizeInBits() ==
ReturnType->getPrimitiveSizeInBits());
+ if (I.arg_size() == 3) {
+ assert(ParamType == ReturnType);
+ assert(ParamType == I.getArgOperand(0)->getType());
+ }
+
FixedVectorType *ImplicitReturnType = ReturnType;
// Step 1: instrument multiplication of corresponding vector elements
if (EltSizeInBits) {
@@ -3944,10 +3964,14 @@ struct MemorySanitizerVisitor : public InstVisitor<MemorySanitizerVisitor> {
Constant::getNullValue(Horizontal->getType())),
ImplicitReturnType);
- // For MMX, cast it back to the required fake return type (<1 x i64>).
+ // Cast it back to the required fake return type (<1 x i64>).
if (EltSizeInBits)
OutShadow = CreateShadowCast(IRB, OutShadow, getShadowTy(&I));
+ // Step 3 (if applicable): instrument accumulator
+ if (I.arg_size() == 3)
+ OutShadow = IRB.CreateOr(OutShadow, getShadow(&I, 0));
+
setShadow(&I, OutShadow);
setOriginForNaryOp(I);
}
@@ -5525,6 +5549,143 @@ struct MemorySanitizerVisitor : public InstVisitor<MemorySanitizerVisitor> {
handleVectorPmaddIntrinsic(I, /*ReductionFactor=*/2, /*EltSize=*/16);
break;
+ // AVX Vector Neural Network Instructions: bytes
+ //
+ // Multiply and Add Packed Signed and Unsigned Bytes
+ // < 4 x i32> @llvm.x86.avx512.vpdpbusd.128
+ // (< 4 x i32>, < 4 x i32>, < 4 x i32>)
+ // < 8 x i32> @llvm.x86.avx512.vpdpbusd.256
+ // (< 8 x i32>, < 8 x i32>, < 8 x i32>)
+ // <16 x i32> @llvm.x86.avx512.vpdpbusd.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>)
+ //
+ // Multiply and Add Unsigned and Signed Bytes With Saturation
+ // < 4 x i32> @llvm.x86.avx512.vpdpbusds.128
+ // (< 4 x i32>, < 4 x i32>, < 4 x i32>)
+ // < 8 x i32> @llvm.x86.avx512.vpdpbusds.256
+ // (< 8 x i32>, < 8 x i32>, < 8 x i32>)
+ // <16 x i32> @llvm.x86.avx512.vpdpbusds.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>)
+ //
+ // < 4 x i32> @llvm.x86.avx2.vpdpbssd.128
+ // (< 4 x i32>, < 4 x i32>, < 4 x i32>)
+ // < 8 x i32> @llvm.x86.avx2.vpdpbssd.256
+ // (< 8 x i32>, < 8 x i32>, < 8 x i32>)
+ //
+ // < 4 x i32> @llvm.x86.avx2.vpdpbssds.128
+ // (< 4 x i32>, < 4 x i32>, < 4 x i32>)
+ // < 8 x i32> @llvm.x86.avx2.vpdpbssds.256
+ // (< 8 x i32>, < 8 x i32>, < 8 x i32>)
+ //
+ // <16 x i32> @llvm.x86.avx10.vpdpbssd.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>)
+ // <16 x i32> @llvm.x86.avx10.vpdpbssds.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>)
+ //
+ // These intrinsics are auto-upgraded into non-masked forms:
+ // <4 x i32> @llvm.x86.avx512.mask.vpdpbusd.128
+ // (<4 x i32>, <4 x i32>, <4 x i32>, i8)
+ // <4 x i32> @llvm.x86.avx512.maskz.vpdpbusd.128
+ // (<4 x i32>, <4 x i32>, <4 x i32>, i8)
+ // <8 x i32> @llvm.x86.avx512.mask.vpdpbusd.256
+ // (<8 x i32>, <8 x i32>, <8 x i32>, i8)
+ // <8 x i32> @llvm.x86.avx512.maskz.vpdpbusd.256
+ // (<8 x i32>, <8 x i32>, <8 x i32>, i8)
+ // <16 x i32> @llvm.x86.avx512.mask.vpdpbusd.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>, i16)
+ // <16 x i32> @llvm.x86.avx512.maskz.vpdpbusd.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>, i16)
+ //
+ // <4 x i32> @llvm.x86.avx512.mask.vpdpbusds.128
+ // (<4 x i32>, <4 x i32>, <4 x i32>, i8)
+ // <4 x i32> @llvm.x86.avx512.maskz.vpdpbusds.128
+ // (<4 x i32>, <4 x i32>, <4 x i32>, i8)
+ // <8 x i32> @llvm.x86.avx512.mask.vpdpbusds.256
+ // (<8 x i32>, <8 x i32>, <8 x i32>, i8)
+ // <8 x i32> @llvm.x86.avx512.maskz.vpdpbusds.256
+ // (<8 x i32>, <8 x i32>, <8 x i32>, i8)
+ // <16 x i32> @llvm.x86.avx512.mask.vpdpbusds.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>, i16)
+ // <16 x i32> @llvm.x86.avx512.maskz.vpdpbusds.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>, i16)
+ case Intrinsic::x86_avx512_vpdpbusd_128:
+ case Intrinsic::x86_avx512_vpdpbusd_256:
+ case Intrinsic::x86_avx512_vpdpbusd_512:
+ case Intrinsic::x86_avx512_vpdpbusds_128:
+ case Intrinsic::x86_avx512_vpdpbusds_256:
+ case Intrinsic::x86_avx512_vpdpbusds_512:
+ case Intrinsic::x86_avx2_vpdpbssd_128:
+ case Intrinsic::x86_avx2_vpdpbssd_256:
+ case Intrinsic::x86_avx2_vpdpbssds_128:
+ case Intrinsic::x86_avx2_vpdpbssds_256:
+ case Intrinsic::x86_avx10_vpdpbssd_512:
+ case Intrinsic::x86_avx10_vpdpbssds_512:
+ handleVectorPmaddIntrinsic(I, /*ReductionFactor=*/4, /*EltSize=*/8);
+ break;
+
+ // AVX Vector Neural Network Instructions: words
+ //
+ // Multiply and Add Signed Word Integers
+ // < 4 x i32> @llvm.x86.avx512.vpdpwssd.128
+ // (< 4 x i32>, < 4 x i32>, < 4 x i32>)
+ // < 8 x i32> @llvm.x86.avx512.vpdpwssd.256
+ // (< 8 x i32>, < 8 x i32>, < 8 x i32>)
+ // <16 x i32> @llvm.x86.avx512.vpdpwssd.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>)
+ //
+ // Multiply and Add Signed Word Integers With Saturation
+ // < 4 x i32> @llvm.x86.avx512.vpdpwssds.128
+ // (< 4 x i32>, < 4 x i32>, < 4 x i32>)
+ // < 8 x i32> @llvm.x86.avx512.vpdpwssds.256
+ // (< 8 x i32>, < 8 x i32>, < 8 x i32>)
+ // <16 x i32> @llvm.x86.avx512.vpdpwssds.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>)
+ //
+ // These intrinsics are auto-upgraded into non-masked forms:
+ // <4 x i32> @llvm.x86.avx512.mask.vpdpwssd.128
+ // (<4 x i32>, <4 x i32>, <4 x i32>, i8)
+ // <4 x i32> @llvm.x86.avx512.maskz.vpdpwssd.128
+ // (<4 x i32>, <4 x i32>, <4 x i32>, i8)
+ // <8 x i32> @llvm.x86.avx512.mask.vpdpwssd.256
+ // (<8 x i32>, <8 x i32>, <8 x i32>, i8)
+ // <8 x i32> @llvm.x86.avx512.maskz.vpdpwssd.256
+ // (<8 x i32>, <8 x i32>, <8 x i32>, i8)
+ // <16 x i32> @llvm.x86.avx512.mask.vpdpwssd.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>, i16)
+ // <16 x i32> @llvm.x86.avx512.maskz.vpdpwssd.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>, i16)
+ //
+ // <4 x i32> @llvm.x86.avx512.mask.vpdpwssds.128
+ // (<4 x i32>, <4 x i32>, <4 x i32>, i8)
+ // <4 x i32> @llvm.x86.avx512.maskz.vpdpwssds.128
+ // (<4 x i32>, <4 x i32>, <4 x i32>, i8)
+ // <8 x i32> @llvm.x86.avx512.mask.vpdpwssds.256
+ // (<8 x i32>, <8 x i32>, <8 x i32>, i8)
+ // <8 x i32> @llvm.x86.avx512.maskz.vpdpwssds.256
+ // (<8 x i32>, <8 x i32>, <8 x i32>, i8)
+ // <16 x i32> @llvm.x86.avx512.mask.vpdpwssds.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>, i16)
+ // <16 x i32> @llvm.x86.avx512.maskz.vpdpwssds.512
+ // (<16 x i32>, <16 x i32>, <16 x i32>, i16)
+ case Intrinsic::x86_avx512_vpdpwssd_128:
+ case Intrinsic::x86_avx512_vpdpwssd_256:
+ case Intrinsic::x86_avx512_vpdpwssd_512:
+ case Intrinsic::x86_avx512_vpdpwssds_128:
+ case Intrinsic::x86_avx512_vpdpwssds_256:
+ case Intrinsic::x86_avx512_vpdpwssds_512:
+ handleVectorPmaddIntrinsic(I, /*ReductionFactor=*/2, /*EltSize=*/16);
+ break;
+
+ // TODO: Dot Product of BF16 Pairs Accumulated Into Packed Single
+ // Precision
+ // <4 x float> @llvm.x86.avx512bf16.dpbf16ps.128
+ // (<4 x float>, <8 x bfloat>, <8 x bfloat>)
+ // <8 x float> @llvm.x86.avx512bf16.dpbf16ps.256
+ // (<8 x float>, <16 x bfloat>, <16 x bfloat>)
+ // <16 x float> @llvm.x86.avx512bf16.dpbf16ps.512
+ // (<16 x float>, <32 x bfloat>, <32 x bfloat>)
+ // handleVectorPmaddIntrinsic() currently only handles integer types.
+
case Intrinsic::x86_sse_cmp_ss:
case Intrinsic::x86_sse2_cmp_sd:
case Intrinsic::x86_sse_comieq_ss:
diff --git a/llvm/test/Instrumentation/MemorySanitizer/X86/avx10_2_512ni-intrinsics.ll b/llvm/test/Instrumentation/MemorySanitizer/X86/avx10_2_512ni-intrinsics.ll
index 7af8f34d403a0..298dc4b2c853a 100644
--- a/llvm/test/Instrumentation/MemorySanitizer/X86/avx10_2_512ni-intrinsics.ll
+++ b/llvm/test/Instrumentation/MemorySanitizer/X86/avx10_2_512ni-intrinsics.ll
@@ -7,19 +7,7 @@
; - llvm.x86.avx10.vdpphps.512
; - llvm.x86.avx10.vmpsadbw.512
;
-; Handled heuristically:
-; - llvm.x86.avx10.vpdpbssd.512
-; - llvm.x86.avx10.vpdpbssds.512
-; - llvm.x86.avx10.vpdpbsud.512
-; - llvm.x86.avx10.vpdpbsuds.512
-; - llvm.x86.avx10.vpdpbuud.512
-; - llvm.x86.avx10.vpdpbuuds.512
-; - llvm.x86.avx10.vpdpwsud.512
-; - llvm.x86.avx10.vpdpwsuds.512
-; - llvm.x86.avx10.vpdpwusd.512
-; - llvm.x86.avx10.vpdpwusds.512
-; - llvm.x86.avx10.vpdpwuud.512
-; - llvm.x86.avx10.vpdpwuuds.512
+; Handled heuristically: (none)
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@@ -140,8 +128,8 @@ define <16 x i32> @test_mm512_dpbssd_epi32(<16 x i32> %__W, <16 x i32> %__A, ptr
; CHECK-LABEL: define <16 x i32> @test_mm512_dpbssd_epi32(
; CHECK-SAME: <16 x i32> [[__W:%.*]], <16 x i32> [[__A:%.*]], ptr [[PB:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP4:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i64 [[TMP1]], 0
; CHECK-NEXT: br i1 [[_MSCMP]], label %[[BB4:.*]], label %[[BB5:.*]], !prof [[PROF1]]
@@ -154,8 +142,26 @@ define <16 x i32> @test_mm512_dpbssd_epi32(<16 x i32> %__W, <16 x i32> %__A, ptr
; CHECK-NEXT: [[TMP7:%.*]] = xor i64 [[TMP6]], 87960930222080
; CHECK-NEXT: [[TMP8:%.*]] = inttoptr i64 [[TMP7]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <16 x i32>, ptr [[TMP8]], align 64
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP9:%.*]] = bitcast <16 x i32> [[__A]] to <64 x i8>
+; CHECK-NEXT: [[TMP10:%.*]] = bitcast <16 x i32> [[__B]] to <64 x i8>
+; CHECK-NEXT: [[TMP11:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP12:%.*]] = bitcast <16 x i32> [[_MSLD]] to <64 x i8>
+; CHECK-NEXT: [[TMP13:%.*]] = icmp ne <64 x i8> [[TMP11]], zeroinitializer
+; CHECK-NEXT: [[TMP14:%.*]] = icmp ne <64 x i8> [[TMP12]], zeroinitializer
+; CHECK-NEXT: [[TMP15:%.*]] = icmp ne <64 x i8> [[TMP9]], zeroinitializer
+; CHECK-NEXT: [[TMP16:%.*]] = icmp ne <64 x i8> [[TMP10]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = and <64 x i1> [[TMP13]], [[TMP14]]
+; CHECK-NEXT: [[TMP18:%.*]] = and <64 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP19:%.*]] = and <64 x i1> [[TMP13]], [[TMP16]]
+; CHECK-NEXT: [[TMP20:%.*]] = or <64 x i1> [[TMP17]], [[TMP18]]
+; CHECK-NEXT: [[TMP21:%.*]] = or <64 x i1> [[TMP20]], [[TMP19]]
+; CHECK-NEXT: [[TMP22:%.*]] = sext <64 x i1> [[TMP21]] to <64 x i8>
+; CHECK-NEXT: [[TMP23:%.*]] = bitcast <64 x i8> [[TMP22]] to <32 x i16>
+; CHECK-NEXT: [[TMP24:%.*]] = icmp ne <32 x i16> [[TMP23]], zeroinitializer
+; CHECK-NEXT: [[TMP25:%.*]] = sext <32 x i1> [[TMP24]] to <32 x i16>
+; CHECK-NEXT: [[TMP26:%.*]] = bitcast <32 x i16> [[TMP25]] to i512
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast i512 [[TMP26]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP27]], [[TMP4]]
; CHECK-NEXT: [[RES:%.*]] = tail call <16 x i32> @llvm.x86.avx10.vpdpbssd.512(<16 x i32> [[__W]], <16 x i32> [[__A]], <16 x i32> [[__B]])
; CHECK-NEXT: store <16 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[RES]]
@@ -168,13 +174,31 @@ define <16 x i32> @test_mm512_dpbssd_epi32(<16 x i32> %__W, <16 x i32> %__A, ptr
define <16 x i32> @test_mm512_mask_dpbssds_epi32(<16 x i32> %__W, i16 zeroext %__U, <16 x i32> %__A, <16 x i32> %__B) sanitize_memory {
; CHECK-LABEL: define <16 x i32> @test_mm512_mask_dpbssds_epi32(
; CHECK-SAME: <16 x i32> [[__W:%.*]], i16 zeroext [[__U:%.*]], <16 x i32> [[__A:%.*]], <16 x i32> [[__B:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 136) to ptr), align 8
+; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <16 x i32> [[__A]] to <64 x i8>
+; CHECK-NEXT: [[TMP25:%.*]] = bitcast <16 x i32> [[__B]] to <64 x i8>
+; CHECK-NEXT: [[TMP26:%.*]] = bitcast <16 x i32> [[TMP2]] to <64 x i8>
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP28:%.*]] = icmp ne <64 x i8> [[TMP26]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <64 x i8> [[TMP27]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <64 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = icmp ne <64 x i8> [[TMP25]], zeroinitializer
+; CHECK-NEXT: [[TMP13:%.*]] = and <64 x i1> [[TMP28]], [[TMP10]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <64 x i1> [[TMP11]], [[TMP10]]
+; CHECK-NEXT: [[TMP15:%.*]] = and <64 x i1> [[TMP28]], [[TMP12]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <64 x i1> [[TMP13]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = or <64 x i1> [[TMP16]], [[TMP15]]
+; CHECK-NEXT: [[TMP18:%.*]] = sext <64 x i1> [[TMP17]] to <64 x i8>
+; CHECK-NEXT: [[TMP19:%.*]] = bitcast <64 x i8> [[TMP18]] to <32 x i16>
+; CHECK-NEXT: [[TMP20:%.*]] = icmp ne <32 x i16> [[TMP19]], zeroinitializer
+; CHECK-NEXT: [[TMP21:%.*]] = sext <32 x i1> [[TMP20]] to <32 x i16>
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <32 x i16> [[TMP21]] to i512
+; CHECK-NEXT: [[TMP23:%.*]] = bitcast i512 [[TMP22]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP23]], [[TMP1]]
; CHECK-NEXT: [[DPI:%.*]] = tail call <16 x i32> @llvm.x86.avx10.vpdpbssds.512(<16 x i32> [[__W]], <16 x i32> [[__A]], <16 x i32> [[__B]])
; CHECK-NEXT: [[TMP5:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[BST:%.*]] = bitcast i16 [[__U]] to <16 x i1>
@@ -196,13 +220,31 @@ define <16 x i32> @test_mm512_mask_dpbssds_epi32(<16 x i32> %__W, i16 zeroext %_
define <16 x i32> @test_mm512_maskz_dpbssd_epi32(i16 zeroext %__U, <16 x i32> %__W, <16 x i32> %__A, <16 x i32> %__B) sanitize_memory {
; CHECK-LABEL: define <16 x i32> @test_mm512_maskz_dpbssd_epi32(
; CHECK-SAME: i16 zeroext [[__U:%.*]], <16 x i32> [[__W:%.*]], <16 x i32> [[__A:%.*]], <16 x i32> [[__B:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 8) to ptr), align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 136) to ptr), align 8
+; CHECK-NEXT: [[TMP24:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 8) to ptr), align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i16, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP25:%.*]] = bitcast <16 x i32> [[__A]] to <64 x i8>
+; CHECK-NEXT: [[TMP26:%.*]] = bitcast <16 x i32> [[__B]] to <64 x i8>
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast <16 x i32> [[TMP2]] to <64 x i8>
+; CHECK-NEXT: [[TMP28:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP29:%.*]] = icmp ne <64 x i8> [[TMP27]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <64 x i8> [[TMP28]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <64 x i8> [[TMP25]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = icmp ne <64 x i8> [[TMP26]], zeroinitializer
+; CHECK-NEXT: [[TMP13:%.*]] = and <64 x i1> [[TMP29]], [[TMP10]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <64 x i1> [[TMP11]], [[TMP10]]
+; CHECK-NEXT: [[TMP15:%.*]] = and <64 x i1> [[TMP29]], [[TMP12]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <64 x i1> [[TMP13]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = or <64 x i1> [[TMP16]], [[TMP15]]
+; CHECK-NEXT: [[TMP18:%.*]] = sext <64 x i1> [[TMP17]] to <64 x i8>
+; CHECK-NEXT: [[TMP19:%.*]] = bitcast <64 x i8> [[TMP18]] to <32 x i16>
+; CHECK-NEXT: [[TMP20:%.*]] = icmp ne <32 x i16> [[TMP19]], zeroinitializer
+; CHECK-NEXT: [[TMP21:%.*]] = sext <32 x i1> [[TMP20]] to <32 x i16>
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <32 x i16> [[TMP21]] to i512
+; CHECK-NEXT: [[TMP23:%.*]] = bitcast i512 [[TMP22]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP23]], [[TMP24]]
; CHECK-NEXT: [[DPI:%.*]] = tail call <16 x i32> @llvm.x86.avx10.vpdpbssd.512(<16 x i32> [[__W]], <16 x i32> [[__A]], <16 x i32> [[__B]])
; CHECK-NEXT: [[TMP5:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[BST:%.*]] = bitcast i16 [[__U]] to <16 x i1>
diff --git a/llvm/test/Instrumentation/MemorySanitizer/X86/avx10_2ni-intrinsics.ll b/llvm/test/Instrumentation/MemorySanitizer/X86/avx10_2ni-intrinsics.ll
index 5f0b0b39da4d9..e3a26ae07ac1b 100644
--- a/llvm/test/Instrumentation/MemorySanitizer/X86/avx10_2ni-intrinsics.ll
+++ b/llvm/test/Instrumentation/MemorySanitizer/X86/avx10_2ni-intrinsics.ll
@@ -10,31 +10,7 @@
; - llvm.x86.avx2.mpsadbw
; - llvm.x86.sse41.mpsadbw
;
-; Handled heuristically:
-; - llvm.x86.avx2.vpdpbssd.128
-; - llvm.x86.avx2.vpdpbssd.256
-; - llvm.x86.avx2.vpdpbssds.128
-; - llvm.x86.avx2.vpdpbssds.256
-; - llvm.x86.avx2.vpdpbsud.128
-; - llvm.x86.avx2.vpdpbsud.256
-; - llvm.x86.avx2.vpdpbsuds.128
-; - llvm.x86.avx2.vpdpbsuds.256
-; - llvm.x86.avx2.vpdpbuud.128
-; - llvm.x86.avx2.vpdpbuud.256
-; - llvm.x86.avx2.vpdpbuuds.128
-; - llvm.x86.avx2.vpdpbuuds.256
-; - llvm.x86.avx2.vpdpwsud.128
-; - llvm.x86.avx2.vpdpwsud.256
-; - llvm.x86.avx2.vpdpwsuds.128
-; - llvm.x86.avx2.vpdpwsuds.256
-; - llvm.x86.avx2.vpdpwusd.128
-; - llvm.x86.avx2.vpdpwusd.256
-; - llvm.x86.avx2.vpdpwusds.128
-; - llvm.x86.avx2.vpdpwusds.256
-; - llvm.x86.avx2.vpdpwuud.128
-; - llvm.x86.avx2.vpdpwuud.256
-; - llvm.x86.avx2.vpdpwuuds.128
-; - llvm.x86.avx2.vpdpwuuds.256
+; Handled heuristically: (none)
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@@ -270,13 +246,31 @@ declare <8 x float> @llvm.x86.avx10.vdpphps.256(<8 x float>, <16 x half>, <16 x
define <4 x i32> @test_mm_mask_dpbssd_epi32(<4 x i32> %__W, i4 zeroext %__U, <4 x i32> %__A, <4 x i32> %__B) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_mm_mask_dpbssd_epi32(
; CHECK-SAME: <4 x i32> [[__W:%.*]], i4 zeroext [[__U:%.*]], <4 x i32> [[__A:%.*]], <4 x i32> [[__B:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 24) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
+; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i4, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <4 x i32> [[__A]] to <16 x i8>
+; CHECK-NEXT: [[TMP25:%.*]] = bitcast <4 x i32> [[__B]] to <16 x i8>
+; CHECK-NEXT: [[TMP26:%.*]] = bitcast <4 x i32> [[TMP2]] to <16 x i8>
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP28:%.*]] = icmp ne <16 x i8> [[TMP26]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i8> [[TMP27]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = icmp ne <16 x i8> [[TMP25]], zeroinitializer
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP28]], [[TMP10]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP11]], [[TMP10]]
+; CHECK-NEXT: [[TMP15:%.*]] = and <16 x i1> [[TMP28]], [[TMP12]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP13]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = or <16 x i1> [[TMP16]], [[TMP15]]
+; CHECK-NEXT: [[TMP18:%.*]] = sext <16 x i1> [[TMP17]] to <16 x i8>
+; CHECK-NEXT: [[TMP19:%.*]] = bitcast <16 x i8> [[TMP18]] to <8 x i16>
+; CHECK-NEXT: [[TMP20:%.*]] = icmp ne <8 x i16> [[TMP19]], zeroinitializer
+; CHECK-NEXT: [[TMP21:%.*]] = sext <8 x i1> [[TMP20]] to <8 x i16>
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <8 x i16> [[TMP21]] to i128
+; CHECK-NEXT: [[TMP23:%.*]] = bitcast i128 [[TMP22]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP23]], [[TMP1]]
; CHECK-NEXT: [[DPI:%.*]] = tail call <4 x i32> @llvm.x86.avx2.vpdpbssd.128(<4 x i32> [[__W]], <4 x i32> [[__A]], <4 x i32> [[__B]])
; CHECK-NEXT: [[TMP5:%.*]] = bitcast i4 [[TMP4]] to <4 x i1>
; CHECK-NEXT: [[BST:%.*]] = bitcast i4 [[__U]] to <4 x i1>
@@ -298,13 +292,31 @@ define <4 x i32> @test_mm_mask_dpbssd_epi32(<4 x i32> %__W, i4 zeroext %__U, <4
define <4 x i32> @test_mm_maskz_dpbssds_epi32(i4 zeroext %__U, <4 x i32> %__W, <4 x i32> %__A, <4 x i32> %__B) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_mm_maskz_dpbssds_epi32(
; CHECK-SAME: i4 zeroext [[__U:%.*]], <4 x i32> [[__W:%.*]], <4 x i32> [[__A:%.*]], <4 x i32> [[__B:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 8) to ptr), align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 24) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
+; CHECK-NEXT: [[TMP24:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 8) to ptr), align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i4, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP25:%.*]] = bitcast <4 x i32> [[__A]] to <16 x i8>
+; CHECK-NEXT: [[TMP26:%.*]] = bitcast <4 x i32> [[__B]] to <16 x i8>
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast <4 x i32> [[TMP2]] to <16 x i8>
+; CHECK-NEXT: [[TMP28:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP29:%.*]] = icmp ne <16 x i8> [[TMP27]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i8> [[TMP28]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i8> [[TMP25]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = icmp ne <16 x i8> [[TMP26]], zeroinitializer
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP29]], [[TMP10]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP11]], [[TMP10]]
+; CHECK-NEXT: [[TMP15:%.*]] = and <16 x i1> [[TMP29]], [[TMP12]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP13]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = or <16 x i1> [[TMP16]], [[TMP15]]
+; CHECK-NEXT: [[TMP18:%.*]] = sext <16 x i1> [[TMP17]] to <16 x i8>
+; CHECK-NEXT: [[TMP19:%.*]] = bitcast <16 x i8> [[TMP18]] to <8 x i16>
+; CHECK-NEXT: [[TMP20:%.*]] = icmp ne <8 x i16> [[TMP19]], zeroinitializer
+; CHECK-NEXT: [[TMP21:%.*]] = sext <8 x i1> [[TMP20]] to <8 x i16>
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <8 x i16> [[TMP21]] to i128
+; CHECK-NEXT: [[TMP23:%.*]] = bitcast i128 [[TMP22]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP23]], [[TMP24]]
; CHECK-NEXT: [[DPI:%.*]] = tail call <4 x i32> @llvm.x86.avx2.vpdpbssds.128(<4 x i32> [[__W]], <4 x i32> [[__A]], <4 x i32> [[__B]])
; CHECK-NEXT: [[TMP5:%.*]] = bitcast i4 [[TMP4]] to <4 x i1>
; CHECK-NEXT: [[BST:%.*]] = bitcast i4 [[__U]] to <4 x i1>
@@ -326,13 +338,31 @@ define <4 x i32> @test_mm_maskz_dpbssds_epi32(i4 zeroext %__U, <4 x i32> %__W, <
define <8 x i32> @test_mm256_maskz_dpbssds_epi32(<8 x i32> %__W, i8 zeroext %__U, <8 x i32> %__A, <8 x i32> %__B) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_mm256_maskz_dpbssds_epi32(
; CHECK-SAME: <8 x i32> [[__W:%.*]], i8 zeroext [[__U:%.*]], <8 x i32> [[__A:%.*]], <8 x i32> [[__B:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
+; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <8 x i32> [[__A]] to <32 x i8>
+; CHECK-NEXT: [[TMP25:%.*]] = bitcast <8 x i32> [[__B]] to <32 x i8>
+; CHECK-NEXT: [[TMP26:%.*]] = bitcast <8 x i32> [[TMP2]] to <32 x i8>
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP28:%.*]] = icmp ne <32 x i8> [[TMP26]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i8> [[TMP27]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = icmp ne <32 x i8> [[TMP25]], zeroinitializer
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP28]], [[TMP10]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP11]], [[TMP10]]
+; CHECK-NEXT: [[TMP15:%.*]] = and <32 x i1> [[TMP28]], [[TMP12]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP13]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = or <32 x i1> [[TMP16]], [[TMP15]]
+; CHECK-NEXT: [[TMP18:%.*]] = sext <32 x i1> [[TMP17]] to <32 x i8>
+; CHECK-NEXT: [[TMP19:%.*]] = bitcast <32 x i8> [[TMP18]] to <16 x i16>
+; CHECK-NEXT: [[TMP20:%.*]] = icmp ne <16 x i16> [[TMP19]], zeroinitializer
+; CHECK-NEXT: [[TMP21:%.*]] = sext <16 x i1> [[TMP20]] to <16 x i16>
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <16 x i16> [[TMP21]] to i256
+; CHECK-NEXT: [[TMP23:%.*]] = bitcast i256 [[TMP22]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP23]], [[TMP1]]
; CHECK-NEXT: [[DPI:%.*]] = tail call <8 x i32> @llvm.x86.avx2.vpdpbssds.256(<8 x i32> [[__W]], <8 x i32> [[__A]], <8 x i32> [[__B]])
; CHECK-NEXT: [[TMP5:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[BST:%.*]] = bitcast i8 [[__U]] to <8 x i1>
@@ -354,13 +384,31 @@ define <8 x i32> @test_mm256_maskz_dpbssds_epi32(<8 x i32> %__W, i8 zeroext %__U
define <8 x i32> @test_mm256_mask_dpbssd_epi32(i8 zeroext %__U, <8 x i32> %__W, <8 x i32> %__A, <8 x i32> %__B) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_mm256_mask_dpbssd_epi32(
; CHECK-SAME: i8 zeroext [[__U:%.*]], <8 x i32> [[__W:%.*]], <8 x i32> [[__A:%.*]], <8 x i32> [[__B:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 8) to ptr), align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
+; CHECK-NEXT: [[TMP24:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 8) to ptr), align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP25:%.*]] = bitcast <8 x i32> [[__A]] to <32 x i8>
+; CHECK-NEXT: [[TMP26:%.*]] = bitcast <8 x i32> [[__B]] to <32 x i8>
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast <8 x i32> [[TMP2]] to <32 x i8>
+; CHECK-NEXT: [[TMP28:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP29:%.*]] = icmp ne <32 x i8> [[TMP27]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i8> [[TMP28]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i8> [[TMP25]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = icmp ne <32 x i8> [[TMP26]], zeroinitializer
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP29]], [[TMP10]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP11]], [[TMP10]]
+; CHECK-NEXT: [[TMP15:%.*]] = and <32 x i1> [[TMP29]], [[TMP12]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP13]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = or <32 x i1> [[TMP16]], [[TMP15]]
+; CHECK-NEXT: [[TMP18:%.*]] = sext <32 x i1> [[TMP17]] to <32 x i8>
+; CHECK-NEXT: [[TMP19:%.*]] = bitcast <32 x i8> [[TMP18]] to <16 x i16>
+; CHECK-NEXT: [[TMP20:%.*]] = icmp ne <16 x i16> [[TMP19]], zeroinitializer
+; CHECK-NEXT: [[TMP21:%.*]] = sext <16 x i1> [[TMP20]] to <16 x i16>
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <16 x i16> [[TMP21]] to i256
+; CHECK-NEXT: [[TMP23:%.*]] = bitcast i256 [[TMP22]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP23]], [[TMP24]]
; CHECK-NEXT: [[DPI:%.*]] = tail call <8 x i32> @llvm.x86.avx2.vpdpbssd.256(<8 x i32> [[__W]], <8 x i32> [[__A]], <8 x i32> [[__B]])
; CHECK-NEXT: [[TMP5:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[BST:%.*]] = bitcast i8 [[__U]] to <8 x i1>
diff --git a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vl_vnni-intrinsics-upgrade.ll b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vl_vnni-intrinsics-upgrade.ll
index 983d5aaada652..822e546c84bca 100644
--- a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vl_vnni-intrinsics-upgrade.ll
+++ b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vl_vnni-intrinsics-upgrade.ll
@@ -5,15 +5,7 @@
;
; Handled strictly: (none)
;
-; Handled heuristically:
-; - llvm.x86.avx512.vpdpbusd.128
-; - llvm.x86.avx512.vpdpbusd.256
-; - llvm.x86.avx512.vpdpbusds.128
-; - llvm.x86.avx512.vpdpbusds.256
-; - llvm.x86.avx512.vpdpwssd.128
-; - llvm.x86.avx512.vpdpwssd.256
-; - llvm.x86.avx512.vpdpwssds.128
-; - llvm.x86.avx512.vpdpwssds.256
+; Handled heuristically: (none)
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@@ -24,12 +16,30 @@ declare <8 x i32> @llvm.x86.avx512.maskz.vpdpbusd.256(<8 x i32>, <8 x i32>, <8 x
define <8 x i32>@test_int_x86_avx512_vpdpbusd_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx512_vpdpbusd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR0:[0-9]+]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <32 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <32 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <32 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <32 x i1> [[TMP16]] to <32 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <32 x i8> [[TMP17]] to <16 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <16 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <16 x i1> [[TMP19]] to <16 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <16 x i16> [[TMP20]] to i256
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i256 [[TMP21]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[TMP4]]
@@ -42,8 +52,8 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusd_256(<8 x i32>
; CHECK-LABEL: define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <8 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 104) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -58,8 +68,26 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusd_256(<8 x i32>
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <8 x i32>, ptr [[TMP10]], align 32
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <8 x i32> [[_MSLD]] to <32 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <32 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <32 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <32 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <32 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <32 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <32 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <32 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <32 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <32 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <32 x i1> [[TMP61]] to <32 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <32 x i8> [[TMP62]] to <16 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <16 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <16 x i1> [[TMP64]] to <16 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <16 x i16> [[TMP65]] to i256
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i256 [[TMP66]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -69,8 +97,26 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusd_256(<8 x i32>
; CHECK-NEXT: [[TMP17:%.*]] = or <8 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <8 x i1> [[TMP12]], <8 x i32> [[TMP17]], <8 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <8 x i1> [[TMP13]], <8 x i32> [[TMP11]], <8 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <8 x i32> [[X4]] to <32 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <8 x i32> [[TMP5]] to <32 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <32 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <32 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <32 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <32 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <32 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <32 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <32 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <32 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <32 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <32 x i1> [[TMP51]] to <32 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <32 x i8> [[TMP52]] to <16 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <16 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <16 x i1> [[TMP54]] to <16 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <16 x i16> [[TMP55]] to i256
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i256 [[TMP56]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -101,12 +147,30 @@ declare <4 x i32> @llvm.x86.avx512.maskz.vpdpbusd.128(<4 x i32>, <4 x i32>, <4 x
define <4 x i32>@test_int_x86_avx512_vpdpbusd_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx512_vpdpbusd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <16 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i8> [[TMP17]] to <8 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <8 x i16> [[TMP20]] to i128
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i128 [[TMP21]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[TMP4]]
@@ -119,8 +183,8 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusd_128(<4 x i32>
; CHECK-LABEL: define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <4 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 56) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -135,8 +199,26 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusd_128(<4 x i32>
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP10]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <4 x i32> [[_MSLD]] to <16 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <16 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <16 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <16 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <16 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <16 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <16 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <16 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <16 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <16 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <16 x i1> [[TMP61]] to <16 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <16 x i8> [[TMP62]] to <8 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <8 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <8 x i1> [[TMP64]] to <8 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <8 x i16> [[TMP65]] to i128
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i128 [[TMP66]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP2:%.*]] = or <4 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -148,8 +230,26 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusd_128(<4 x i32>
; CHECK-NEXT: [[TMP17:%.*]] = or <4 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <4 x i1> [[_MSPROP3]], <4 x i32> [[TMP17]], <4 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <4 x i1> [[EXTRACT]], <4 x i32> [[TMP11]], <4 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP5:%.*]] = or <4 x i32> [[_MSPROP4]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <4 x i32> [[X4]] to <16 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <4 x i32> [[TMP5]] to <16 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <16 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <16 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <16 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <16 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <16 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <16 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <16 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <16 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <16 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <16 x i1> [[TMP51]] to <16 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <16 x i8> [[TMP52]] to <8 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <8 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <8 x i1> [[TMP54]] to <8 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <8 x i16> [[TMP55]] to i128
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i128 [[TMP56]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP5:%.*]] = or <4 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -182,12 +282,30 @@ declare <8 x i32> @llvm.x86.avx512.maskz.vpdpbusds.256(<8 x i32>, <8 x i32>, <8
define <8 x i32>@test_int_x86_avx512_vpdpbusds_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx512_vpdpbusds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <32 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <32 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <32 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <32 x i1> [[TMP16]] to <32 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <32 x i8> [[TMP17]] to <16 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <16 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <16 x i1> [[TMP19]] to <16 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <16 x i16> [[TMP20]] to i256
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i256 [[TMP21]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[TMP4]]
@@ -200,8 +318,8 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusds_256(<8 x i32
; CHECK-LABEL: define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <8 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 104) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -216,8 +334,26 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusds_256(<8 x i32
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <8 x i32>, ptr [[TMP10]], align 32
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <8 x i32> [[_MSLD]] to <32 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <32 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <32 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <32 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <32 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <32 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <32 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <32 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <32 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <32 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <32 x i1> [[TMP61]] to <32 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <32 x i8> [[TMP62]] to <16 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <16 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <16 x i1> [[TMP64]] to <16 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <16 x i16> [[TMP65]] to i256
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i256 [[TMP66]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -227,8 +363,26 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusds_256(<8 x i32
; CHECK-NEXT: [[TMP17:%.*]] = or <8 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <8 x i1> [[TMP12]], <8 x i32> [[TMP17]], <8 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <8 x i1> [[TMP13]], <8 x i32> [[TMP11]], <8 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <8 x i32> [[X4]] to <32 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <8 x i32> [[TMP5]] to <32 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <32 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <32 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <32 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <32 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <32 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <32 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <32 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <32 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <32 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <32 x i1> [[TMP51]] to <32 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <32 x i8> [[TMP52]] to <16 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <16 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <16 x i1> [[TMP54]] to <16 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <16 x i16> [[TMP55]] to i256
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i256 [[TMP56]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -259,12 +413,30 @@ declare <4 x i32> @llvm.x86.avx512.maskz.vpdpbusds.128(<4 x i32>, <4 x i32>, <4
define <4 x i32>@test_int_x86_avx512_vpdpbusds_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx512_vpdpbusds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <16 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i8> [[TMP17]] to <8 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <8 x i16> [[TMP20]] to i128
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i128 [[TMP21]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[TMP4]]
@@ -277,8 +449,8 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusds_128(<4 x i32
; CHECK-LABEL: define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <4 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 56) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -293,8 +465,26 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusds_128(<4 x i32
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP10]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <4 x i32> [[_MSLD]] to <16 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <16 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <16 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <16 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <16 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <16 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <16 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <16 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <16 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <16 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <16 x i1> [[TMP61]] to <16 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <16 x i8> [[TMP62]] to <8 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <8 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <8 x i1> [[TMP64]] to <8 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <8 x i16> [[TMP65]] to i128
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i128 [[TMP66]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP2:%.*]] = or <4 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -306,8 +496,26 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusds_128(<4 x i32
; CHECK-NEXT: [[TMP17:%.*]] = or <4 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <4 x i1> [[_MSPROP3]], <4 x i32> [[TMP17]], <4 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <4 x i1> [[EXTRACT]], <4 x i32> [[TMP11]], <4 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP5:%.*]] = or <4 x i32> [[_MSPROP4]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <4 x i32> [[X4]] to <16 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <4 x i32> [[TMP5]] to <16 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <16 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <16 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <16 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <16 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <16 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <16 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <16 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <16 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <16 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <16 x i1> [[TMP51]] to <16 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <16 x i8> [[TMP52]] to <8 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <8 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <8 x i1> [[TMP54]] to <8 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <8 x i16> [[TMP55]] to i128
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i128 [[TMP56]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP5:%.*]] = or <4 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -340,12 +548,28 @@ declare <8 x i32> @llvm.x86.avx512.maskz.vpdpwssd.256(<8 x i32>, <8 x i32>, <8 x
define <8 x i32>@test_int_x86_avx512_vpdpwssd_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx512_vpdpwssd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <16 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <16 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i16> [[TMP17]] to <8 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[TMP4]]
@@ -358,8 +582,8 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssd_256(<8 x i32>
; CHECK-LABEL: define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <8 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 104) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -374,8 +598,24 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssd_256(<8 x i32>
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <8 x i32>, ptr [[TMP10]], align 32
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <8 x i32> [[X2]] to <16 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <8 x i32> [[_MSLD]] to <16 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <16 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <16 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <16 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <16 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <16 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <16 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <16 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <16 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <16 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <16 x i1> [[TMP58]] to <16 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <16 x i16> [[TMP59]] to <8 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <8 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <8 x i1> [[TMP61]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -385,8 +625,24 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssd_256(<8 x i32>
; CHECK-NEXT: [[TMP17:%.*]] = or <8 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <8 x i1> [[TMP12]], <8 x i32> [[TMP17]], <8 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <8 x i1> [[TMP13]], <8 x i32> [[TMP11]], <8 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <8 x i32> [[X4]] to <16 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <8 x i32> [[TMP5]] to <16 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <16 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <16 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <16 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <16 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <16 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <16 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <16 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <16 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <16 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <16 x i1> [[TMP49]] to <16 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <16 x i16> [[TMP50]] to <8 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <8 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <8 x i1> [[TMP52]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -417,12 +673,28 @@ declare <4 x i32> @llvm.x86.avx512.maskz.vpdpwssd.128(<4 x i32>, <4 x i32>, <4 x
define <4 x i32>@test_int_x86_avx512_vpdpwssd_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx512_vpdpwssd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <8 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <8 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <8 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <8 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <8 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <8 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <8 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <8 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <8 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <8 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <8 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <8 x i1> [[TMP16]] to <8 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <8 x i16> [[TMP17]] to <4 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <4 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <4 x i1> [[TMP19]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[TMP4]]
@@ -435,8 +707,8 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssd_128(<4 x i32>
; CHECK-LABEL: define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <4 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 56) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -451,8 +723,24 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssd_128(<4 x i32>
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP10]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> [[X2]] to <8 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> [[_MSLD]] to <8 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <8 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <8 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <8 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <8 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <8 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <8 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <8 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <8 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <8 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <8 x i1> [[TMP58]] to <8 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <8 x i16> [[TMP59]] to <4 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <4 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <4 x i1> [[TMP61]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP2:%.*]] = or <4 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -464,8 +752,24 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssd_128(<4 x i32>
; CHECK-NEXT: [[TMP17:%.*]] = or <4 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <4 x i1> [[_MSPROP3]], <4 x i32> [[TMP17]], <4 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <4 x i1> [[EXTRACT]], <4 x i32> [[TMP11]], <4 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP5:%.*]] = or <4 x i32> [[_MSPROP4]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <4 x i32> [[X4]] to <8 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <4 x i32> [[TMP5]] to <8 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <8 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <8 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <8 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <8 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <8 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <8 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <8 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <8 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <8 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <8 x i1> [[TMP49]] to <8 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <8 x i16> [[TMP50]] to <4 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <4 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <4 x i1> [[TMP52]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP5:%.*]] = or <4 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -499,12 +803,28 @@ declare <8 x i32> @llvm.x86.avx512.maskz.vpdpwssds.256(<8 x i32>, <8 x i32>, <8
define <8 x i32>@test_int_x86_avx512_vpdpwssds_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx512_vpdpwssds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <16 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <16 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i16> [[TMP17]] to <8 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[TMP4]]
@@ -517,8 +837,8 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssds_256(<8 x i32
; CHECK-LABEL: define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <8 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 104) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -533,8 +853,24 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssds_256(<8 x i32
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <8 x i32>, ptr [[TMP10]], align 32
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <8 x i32> [[X2]] to <16 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <8 x i32> [[_MSLD]] to <16 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <16 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <16 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <16 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <16 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <16 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <16 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <16 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <16 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <16 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <16 x i1> [[TMP58]] to <16 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <16 x i16> [[TMP59]] to <8 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <8 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <8 x i1> [[TMP61]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -544,8 +880,24 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssds_256(<8 x i32
; CHECK-NEXT: [[TMP17:%.*]] = or <8 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <8 x i1> [[TMP12]], <8 x i32> [[TMP17]], <8 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <8 x i1> [[TMP13]], <8 x i32> [[TMP11]], <8 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <8 x i32> [[X4]] to <16 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <8 x i32> [[TMP5]] to <16 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <16 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <16 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <16 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <16 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <16 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <16 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <16 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <16 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <16 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <16 x i1> [[TMP49]] to <16 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <16 x i16> [[TMP50]] to <8 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <8 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <8 x i1> [[TMP52]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -576,12 +928,28 @@ declare <4 x i32> @llvm.x86.avx512.maskz.vpdpwssds.128(<4 x i32>, <4 x i32>, <4
define <4 x i32>@test_int_x86_avx512_vpdpwssds_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx512_vpdpwssds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <8 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <8 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <8 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <8 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <8 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <8 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <8 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <8 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <8 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <8 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <8 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <8 x i1> [[TMP16]] to <8 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <8 x i16> [[TMP17]] to <4 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <4 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <4 x i1> [[TMP19]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[TMP4]]
@@ -594,8 +962,8 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssds_128(<4 x i32
; CHECK-LABEL: define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <4 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 56) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -610,8 +978,24 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssds_128(<4 x i32
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP10]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> [[X2]] to <8 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> [[_MSLD]] to <8 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <8 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <8 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <8 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <8 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <8 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <8 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <8 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <8 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <8 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <8 x i1> [[TMP58]] to <8 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <8 x i16> [[TMP59]] to <4 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <4 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <4 x i1> [[TMP61]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP2:%.*]] = or <4 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -623,8 +1007,24 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssds_128(<4 x i32
; CHECK-NEXT: [[TMP17:%.*]] = or <4 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <4 x i1> [[_MSPROP3]], <4 x i32> [[TMP17]], <4 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <4 x i1> [[EXTRACT]], <4 x i32> [[TMP11]], <4 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP5:%.*]] = or <4 x i32> [[_MSPROP4]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <4 x i32> [[X4]] to <8 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <4 x i32> [[TMP5]] to <8 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <8 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <8 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <8 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <8 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <8 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <8 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <8 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <8 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <8 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <8 x i1> [[TMP49]] to <8 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <8 x i16> [[TMP50]] to <4 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <4 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <4 x i1> [[TMP52]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP5:%.*]] = or <4 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
diff --git a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vl_vnni-intrinsics.ll b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vl_vnni-intrinsics.ll
index 234d68f1aaf56..38f4272ef106f 100644
--- a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vl_vnni-intrinsics.ll
+++ b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vl_vnni-intrinsics.ll
@@ -5,15 +5,7 @@
;
; Handled strictly: (none)
;
-; Handled heuristically:
-; - llvm.x86.avx512.vpdpbusd.128
-; - llvm.x86.avx512.vpdpbusd.256
-; - llvm.x86.avx512.vpdpbusds.128
-; - llvm.x86.avx512.vpdpbusds.256
-; - llvm.x86.avx512.vpdpwssd.128
-; - llvm.x86.avx512.vpdpwssd.256
-; - llvm.x86.avx512.vpdpwssds.128
-; - llvm.x86.avx512.vpdpwssds.256
+; Handled heuristically: (none)
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@@ -23,12 +15,30 @@ declare <8 x i32> @llvm.x86.avx512.vpdpbusd.256(<8 x i32>, <8 x i32>, <8 x i32>)
define <8 x i32>@test_int_x86_avx512_vpdpbusd_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx512_vpdpbusd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR1:[0-9]+]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <32 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <32 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <32 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <32 x i1> [[TMP16]] to <32 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <32 x i8> [[TMP17]] to <16 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <16 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <16 x i1> [[TMP19]] to <16 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <16 x i16> [[TMP20]] to i256
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i256 [[TMP21]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[TMP4]]
@@ -41,8 +51,8 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusd_256(<8 x i32>
; CHECK-LABEL: define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <8 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 104) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -57,8 +67,26 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusd_256(<8 x i32>
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <8 x i32>, ptr [[TMP10]], align 32
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <8 x i32> [[_MSLD]] to <32 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <32 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <32 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <32 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <32 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <32 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <32 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <32 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <32 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <32 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <32 x i1> [[TMP61]] to <32 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <32 x i8> [[TMP62]] to <16 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <16 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <16 x i1> [[TMP64]] to <16 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <16 x i16> [[TMP65]] to i256
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i256 [[TMP66]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -68,8 +96,26 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusd_256(<8 x i32>
; CHECK-NEXT: [[TMP17:%.*]] = or <8 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <8 x i1> [[TMP12]], <8 x i32> [[TMP17]], <8 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <8 x i1> [[TMP13]], <8 x i32> [[TMP11]], <8 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <8 x i32> [[X4]] to <32 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <8 x i32> [[TMP5]] to <32 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <32 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <32 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <32 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <32 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <32 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <32 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <32 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <32 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <32 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <32 x i1> [[TMP51]] to <32 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <32 x i8> [[TMP52]] to <16 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <16 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <16 x i1> [[TMP54]] to <16 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <16 x i16> [[TMP55]] to i256
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i256 [[TMP56]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -103,12 +149,30 @@ declare <4 x i32> @llvm.x86.avx512.vpdpbusd.128(<4 x i32>, <4 x i32>, <4 x i32>)
define <4 x i32>@test_int_x86_avx512_vpdpbusd_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx512_vpdpbusd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <16 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i8> [[TMP17]] to <8 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <8 x i16> [[TMP20]] to i128
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i128 [[TMP21]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[TMP4]]
@@ -121,8 +185,8 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusd_128(<4 x i32>
; CHECK-LABEL: define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <4 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 56) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -137,8 +201,26 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusd_128(<4 x i32>
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP10]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <4 x i32> [[_MSLD]] to <16 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <16 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <16 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <16 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <16 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <16 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <16 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <16 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <16 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <16 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <16 x i1> [[TMP61]] to <16 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <16 x i8> [[TMP62]] to <8 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <8 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <8 x i1> [[TMP64]] to <8 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <8 x i16> [[TMP65]] to i128
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i128 [[TMP66]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -150,8 +232,26 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusd_128(<4 x i32>
; CHECK-NEXT: [[TMP17:%.*]] = or <4 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <4 x i1> [[_MSPROP2]], <4 x i32> [[TMP17]], <4 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <4 x i1> [[EXTRACT]], <4 x i32> [[TMP11]], <4 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[_MSPROP3]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <4 x i32> [[X4]] to <16 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <4 x i32> [[TMP5]] to <16 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <16 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <16 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <16 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <16 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <16 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <16 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <16 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <16 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <16 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <16 x i1> [[TMP51]] to <16 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <16 x i8> [[TMP52]] to <8 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <8 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <8 x i1> [[TMP54]] to <8 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <8 x i16> [[TMP55]] to i128
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i128 [[TMP56]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -189,12 +289,30 @@ declare <8 x i32> @llvm.x86.avx512.vpdpbusds.256(<8 x i32>, <8 x i32>, <8 x i32>
define <8 x i32>@test_int_x86_avx512_vpdpbusds_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx512_vpdpbusds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <32 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <32 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <32 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <32 x i1> [[TMP16]] to <32 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <32 x i8> [[TMP17]] to <16 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <16 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <16 x i1> [[TMP19]] to <16 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <16 x i16> [[TMP20]] to i256
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i256 [[TMP21]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[TMP4]]
@@ -207,8 +325,8 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusds_256(<8 x i32
; CHECK-LABEL: define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <8 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 104) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -223,8 +341,26 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusds_256(<8 x i32
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <8 x i32>, ptr [[TMP10]], align 32
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <8 x i32> [[_MSLD]] to <32 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <32 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <32 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <32 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <32 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <32 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <32 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <32 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <32 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <32 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <32 x i1> [[TMP61]] to <32 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <32 x i8> [[TMP62]] to <16 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <16 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <16 x i1> [[TMP64]] to <16 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <16 x i16> [[TMP65]] to i256
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i256 [[TMP66]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -234,8 +370,26 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpbusds_256(<8 x i32
; CHECK-NEXT: [[TMP17:%.*]] = or <8 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <8 x i1> [[TMP12]], <8 x i32> [[TMP17]], <8 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <8 x i1> [[TMP13]], <8 x i32> [[TMP11]], <8 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <8 x i32> [[X4]] to <32 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <8 x i32> [[TMP5]] to <32 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <32 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <32 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <32 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <32 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <32 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <32 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <32 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <32 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <32 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <32 x i1> [[TMP51]] to <32 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <32 x i8> [[TMP52]] to <16 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <16 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <16 x i1> [[TMP54]] to <16 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <16 x i16> [[TMP55]] to i256
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i256 [[TMP56]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -269,12 +423,30 @@ declare <4 x i32> @llvm.x86.avx512.vpdpbusds.128(<4 x i32>, <4 x i32>, <4 x i32>
define <4 x i32>@test_int_x86_avx512_vpdpbusds_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx512_vpdpbusds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <16 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i8> [[TMP17]] to <8 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <8 x i16> [[TMP20]] to i128
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i128 [[TMP21]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[TMP4]]
@@ -287,8 +459,8 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusds_128(<4 x i32
; CHECK-LABEL: define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <4 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 56) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -303,8 +475,26 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusds_128(<4 x i32
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP10]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <4 x i32> [[_MSLD]] to <16 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <16 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <16 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <16 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <16 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <16 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <16 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <16 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <16 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <16 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <16 x i1> [[TMP61]] to <16 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <16 x i8> [[TMP62]] to <8 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <8 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <8 x i1> [[TMP64]] to <8 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <8 x i16> [[TMP65]] to i128
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i128 [[TMP66]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -316,8 +506,26 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpbusds_128(<4 x i32
; CHECK-NEXT: [[TMP17:%.*]] = or <4 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <4 x i1> [[_MSPROP2]], <4 x i32> [[TMP17]], <4 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <4 x i1> [[EXTRACT]], <4 x i32> [[TMP11]], <4 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[_MSPROP3]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <4 x i32> [[X4]] to <16 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <4 x i32> [[TMP5]] to <16 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <16 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <16 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <16 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <16 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <16 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <16 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <16 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <16 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <16 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <16 x i1> [[TMP51]] to <16 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <16 x i8> [[TMP52]] to <8 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <8 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <8 x i1> [[TMP54]] to <8 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <8 x i16> [[TMP55]] to i128
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i128 [[TMP56]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -355,12 +563,28 @@ declare <8 x i32> @llvm.x86.avx512.vpdpwssd.256(<8 x i32>, <8 x i32>, <8 x i32>)
define <8 x i32>@test_int_x86_avx512_vpdpwssd_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx512_vpdpwssd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <16 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <16 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i16> [[TMP17]] to <8 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[TMP4]]
@@ -373,8 +597,8 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssd_256(<8 x i32>
; CHECK-LABEL: define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <8 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 104) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -389,8 +613,24 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssd_256(<8 x i32>
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <8 x i32>, ptr [[TMP10]], align 32
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <8 x i32> [[X2]] to <16 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <8 x i32> [[_MSLD]] to <16 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <16 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <16 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <16 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <16 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <16 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <16 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <16 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <16 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <16 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <16 x i1> [[TMP58]] to <16 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <16 x i16> [[TMP59]] to <8 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <8 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <8 x i1> [[TMP61]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -400,8 +640,24 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssd_256(<8 x i32>
; CHECK-NEXT: [[TMP17:%.*]] = or <8 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <8 x i1> [[TMP12]], <8 x i32> [[TMP17]], <8 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <8 x i1> [[TMP13]], <8 x i32> [[TMP11]], <8 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <8 x i32> [[X4]] to <16 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <8 x i32> [[TMP5]] to <16 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <16 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <16 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <16 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <16 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <16 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <16 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <16 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <16 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <16 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <16 x i1> [[TMP49]] to <16 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <16 x i16> [[TMP50]] to <8 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <8 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <8 x i1> [[TMP52]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -435,12 +691,28 @@ declare <4 x i32> @llvm.x86.avx512.vpdpwssd.128(<4 x i32>, <4 x i32>, <4 x i32>)
define <4 x i32>@test_int_x86_avx512_vpdpwssd_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx512_vpdpwssd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <8 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <8 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <8 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <8 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <8 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <8 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <8 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <8 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <8 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <8 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <8 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <8 x i1> [[TMP16]] to <8 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <8 x i16> [[TMP17]] to <4 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <4 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <4 x i1> [[TMP19]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[TMP4]]
@@ -453,8 +725,8 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssd_128(<4 x i32>
; CHECK-LABEL: define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <4 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 56) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -469,8 +741,24 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssd_128(<4 x i32>
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP10]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> [[X2]] to <8 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> [[_MSLD]] to <8 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <8 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <8 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <8 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <8 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <8 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <8 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <8 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <8 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <8 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <8 x i1> [[TMP58]] to <8 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <8 x i16> [[TMP59]] to <4 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <4 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <4 x i1> [[TMP61]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -482,8 +770,24 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssd_128(<4 x i32>
; CHECK-NEXT: [[TMP17:%.*]] = or <4 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <4 x i1> [[_MSPROP2]], <4 x i32> [[TMP17]], <4 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <4 x i1> [[EXTRACT]], <4 x i32> [[TMP11]], <4 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[_MSPROP3]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <4 x i32> [[X4]] to <8 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <4 x i32> [[TMP5]] to <8 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <8 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <8 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <8 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <8 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <8 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <8 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <8 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <8 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <8 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <8 x i1> [[TMP49]] to <8 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <8 x i16> [[TMP50]] to <4 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <4 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <4 x i1> [[TMP52]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -521,12 +825,28 @@ declare <8 x i32> @llvm.x86.avx512.vpdpwssds.256(<8 x i32>, <8 x i32>, <8 x i32>
define <8 x i32>@test_int_x86_avx512_vpdpwssds_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx512_vpdpwssds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <16 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <16 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i16> [[TMP17]] to <8 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[TMP4]]
@@ -539,8 +859,8 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssds_256(<8 x i32
; CHECK-LABEL: define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <8 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 104) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -555,8 +875,24 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssds_256(<8 x i32
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <8 x i32>, ptr [[TMP10]], align 32
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <8 x i32> [[X2]] to <16 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <8 x i32> [[_MSLD]] to <16 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <16 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <16 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <16 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <16 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <16 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <16 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <16 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <16 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <16 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <16 x i1> [[TMP58]] to <16 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <16 x i16> [[TMP59]] to <8 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <8 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <8 x i1> [[TMP61]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -566,8 +902,24 @@ define { <8 x i32>, <8 x i32> } @test_int_x86_avx512_mask_vpdpwssds_256(<8 x i32
; CHECK-NEXT: [[TMP17:%.*]] = or <8 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <8 x i1> [[TMP12]], <8 x i32> [[TMP17]], <8 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <8 x i1> [[TMP13]], <8 x i32> [[TMP11]], <8 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <8 x i32> [[X4]] to <16 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <8 x i32> [[TMP5]] to <16 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <16 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <16 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <16 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <16 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <16 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <16 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <16 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <16 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <16 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <16 x i1> [[TMP49]] to <16 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <16 x i16> [[TMP50]] to <8 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <8 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <8 x i1> [[TMP52]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -602,8 +954,8 @@ define <4 x i32>@test_int_x86_avx512_vpdpwssds_128(<4 x i32> %x0, <4 x i32> %x1,
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx512_vpdpwssds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i64 [[TMP1]], 0
; CHECK-NEXT: br i1 [[_MSCMP]], label %[[BB4:.*]], label %[[BB5:.*]], !prof [[PROF1]]
@@ -616,8 +968,24 @@ define <4 x i32>@test_int_x86_avx512_vpdpwssds_128(<4 x i32> %x0, <4 x i32> %x1,
; CHECK-NEXT: [[TMP7:%.*]] = xor i64 [[TMP6]], 87960930222080
; CHECK-NEXT: [[TMP8:%.*]] = inttoptr i64 [[TMP7]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP8]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP26:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP10:%.*]] = bitcast <4 x i32> [[X2]] to <8 x i16>
+; CHECK-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> [[_MSLD]] to <8 x i16>
+; CHECK-NEXT: [[TMP13:%.*]] = icmp ne <8 x i16> [[TMP11]], zeroinitializer
+; CHECK-NEXT: [[TMP14:%.*]] = icmp ne <8 x i16> [[TMP12]], zeroinitializer
+; CHECK-NEXT: [[TMP15:%.*]] = icmp ne <8 x i16> [[TMP26]], zeroinitializer
+; CHECK-NEXT: [[TMP16:%.*]] = icmp ne <8 x i16> [[TMP10]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = and <8 x i1> [[TMP13]], [[TMP14]]
+; CHECK-NEXT: [[TMP18:%.*]] = and <8 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP19:%.*]] = and <8 x i1> [[TMP13]], [[TMP16]]
+; CHECK-NEXT: [[TMP20:%.*]] = or <8 x i1> [[TMP17]], [[TMP18]]
+; CHECK-NEXT: [[TMP21:%.*]] = or <8 x i1> [[TMP20]], [[TMP19]]
+; CHECK-NEXT: [[TMP22:%.*]] = sext <8 x i1> [[TMP21]] to <8 x i16>
+; CHECK-NEXT: [[TMP23:%.*]] = bitcast <8 x i16> [[TMP22]] to <4 x i32>
+; CHECK-NEXT: [[TMP24:%.*]] = icmp ne <4 x i32> [[TMP23]], zeroinitializer
+; CHECK-NEXT: [[TMP25:%.*]] = sext <4 x i1> [[TMP24]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP25]], [[TMP4]]
; CHECK-NEXT: [[TMP9:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[TMP9]]
@@ -631,8 +999,8 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssds_128(<4 x i32
; CHECK-LABEL: define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <4 x i32> [[X4:%.*]], i8 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i8, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 56) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -647,8 +1015,24 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssds_128(<4 x i32
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP10]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> [[X2]] to <8 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> [[_MSLD]] to <8 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <8 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <8 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <8 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <8 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <8 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <8 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <8 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <8 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <8 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <8 x i1> [[TMP58]] to <8 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <8 x i16> [[TMP59]] to <4 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <4 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <4 x i1> [[TMP61]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8 [[X3]] to <8 x i1>
@@ -660,8 +1044,24 @@ define { <4 x i32>, <4 x i32> } @test_int_x86_avx512_mask_vpdpwssds_128(<4 x i32
; CHECK-NEXT: [[TMP17:%.*]] = or <4 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <4 x i1> [[_MSPROP2]], <4 x i32> [[TMP17]], <4 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <4 x i1> [[EXTRACT]], <4 x i32> [[TMP11]], <4 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[_MSPROP3]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <4 x i32> [[X4]] to <8 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <4 x i32> [[TMP5]] to <8 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <8 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <8 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <8 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <8 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <8 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <8 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <8 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <8 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <8 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <8 x i1> [[TMP49]] to <8 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <8 x i16> [[TMP50]] to <4 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <4 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <4 x i1> [[TMP52]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8 [[TMP4]] to <8 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i8 [[X3]] to <8 x i1>
diff --git a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vnni-intrinsics-upgrade.ll b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vnni-intrinsics-upgrade.ll
index 77306202dc4fe..f146823b90e03 100644
--- a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vnni-intrinsics-upgrade.ll
+++ b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vnni-intrinsics-upgrade.ll
@@ -5,11 +5,7 @@
;
; Handled strictly: (none)
;
-; Handled heuristically:
-; - llvm.x86.avx512.vpdpbusd.512
-; - llvm.x86.avx512.vpdpbusds.512
-; - llvm.x86.avx512.vpdpwssd.512
-; - llvm.x86.avx512.vpdpwssds.512
+; Handled heuristically: (none)
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@@ -20,12 +16,30 @@ declare <16 x i32> @llvm.x86.avx512.maskz.vpdpbusd.512(<16 x i32>, <16 x i32>, <
define <16 x i32>@test_int_x86_avx512_vpdpbusd_512(<16 x i32> %x0, <16 x i32> %x1, <16 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <16 x i32> @test_int_x86_avx512_vpdpbusd_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], <16 x i32> [[X2:%.*]]) #[[ATTR0:[0-9]+]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <16 x i32> [[X2]] to <64 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <16 x i32> [[TMP2]] to <64 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <64 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <64 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <64 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <64 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <64 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <64 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <64 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <64 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <64 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <64 x i1> [[TMP16]] to <64 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <64 x i8> [[TMP17]] to <32 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <32 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <32 x i1> [[TMP19]] to <32 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <32 x i16> [[TMP20]] to i512
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i512 [[TMP21]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: store <16 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[TMP4]]
@@ -38,8 +52,8 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusd_512(<16 x i
; CHECK-LABEL: define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusd_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <16 x i32> [[X4:%.*]], i16 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 200) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 136) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -54,8 +68,26 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusd_512(<16 x i
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <16 x i32>, ptr [[TMP10]], align 64
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <16 x i32> [[X2]] to <64 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <16 x i32> [[_MSLD]] to <64 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <64 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <64 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <64 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <64 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <64 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <64 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <64 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <64 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <64 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <64 x i1> [[TMP61]] to <64 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <64 x i8> [[TMP62]] to <32 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <32 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <32 x i1> [[TMP64]] to <32 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <32 x i16> [[TMP65]] to i512
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i512 [[TMP66]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -65,8 +97,26 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusd_512(<16 x i
; CHECK-NEXT: [[TMP17:%.*]] = or <16 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <16 x i1> [[TMP12]], <16 x i32> [[TMP17]], <16 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <16 x i1> [[TMP13]], <16 x i32> [[TMP11]], <16 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <16 x i32> [[X4]] to <64 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <16 x i32> [[TMP5]] to <64 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <64 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <64 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <64 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <64 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <64 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <64 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <64 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <64 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <64 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <64 x i1> [[TMP51]] to <64 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <64 x i8> [[TMP52]] to <32 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <32 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <32 x i1> [[TMP54]] to <32 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <32 x i16> [[TMP55]] to i512
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i512 [[TMP56]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -97,12 +147,30 @@ declare <16 x i32> @llvm.x86.avx512.maskz.vpdpbusds.512(<16 x i32>, <16 x i32>,
define <16 x i32>@test_int_x86_avx512_vpdpbusds_512(<16 x i32> %x0, <16 x i32> %x1, <16 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <16 x i32> @test_int_x86_avx512_vpdpbusds_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], <16 x i32> [[X2:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <16 x i32> [[X2]] to <64 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <16 x i32> [[TMP2]] to <64 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <64 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <64 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <64 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <64 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <64 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <64 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <64 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <64 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <64 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <64 x i1> [[TMP16]] to <64 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <64 x i8> [[TMP17]] to <32 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <32 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <32 x i1> [[TMP19]] to <32 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <32 x i16> [[TMP20]] to i512
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i512 [[TMP21]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: store <16 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[TMP4]]
@@ -115,8 +183,8 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusds_512(<16 x
; CHECK-LABEL: define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusds_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <16 x i32> [[X4:%.*]], i16 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 200) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 136) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -131,8 +199,26 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusds_512(<16 x
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <16 x i32>, ptr [[TMP10]], align 64
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <16 x i32> [[X2]] to <64 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <16 x i32> [[_MSLD]] to <64 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <64 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <64 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <64 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <64 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <64 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <64 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <64 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <64 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <64 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <64 x i1> [[TMP61]] to <64 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <64 x i8> [[TMP62]] to <32 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <32 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <32 x i1> [[TMP64]] to <32 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <32 x i16> [[TMP65]] to i512
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i512 [[TMP66]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -142,8 +228,26 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusds_512(<16 x
; CHECK-NEXT: [[TMP17:%.*]] = or <16 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <16 x i1> [[TMP12]], <16 x i32> [[TMP17]], <16 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <16 x i1> [[TMP13]], <16 x i32> [[TMP11]], <16 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <16 x i32> [[X4]] to <64 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <16 x i32> [[TMP5]] to <64 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <64 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <64 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <64 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <64 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <64 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <64 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <64 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <64 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <64 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <64 x i1> [[TMP51]] to <64 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <64 x i8> [[TMP52]] to <32 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <32 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <32 x i1> [[TMP54]] to <32 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <32 x i16> [[TMP55]] to i512
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i512 [[TMP56]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -174,12 +278,28 @@ declare <16 x i32> @llvm.x86.avx512.maskz.vpdpwssd.512(<16 x i32>, <16 x i32>, <
define <16 x i32>@test_int_x86_avx512_vpdpwssd_512(<16 x i32> %x0, <16 x i32> %x1, <16 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <16 x i32> @test_int_x86_avx512_vpdpwssd_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], <16 x i32> [[X2:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <16 x i32> [[X2]] to <32 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <16 x i32> [[TMP2]] to <32 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <32 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <32 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <32 x i1> [[TMP16]] to <32 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <32 x i16> [[TMP17]] to <16 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <16 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <16 x i1> [[TMP19]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: store <16 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[TMP4]]
@@ -192,8 +312,8 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssd_512(<16 x i
; CHECK-LABEL: define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssd_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <16 x i32> [[X4:%.*]], i16 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 200) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 136) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -208,8 +328,24 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssd_512(<16 x i
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <16 x i32>, ptr [[TMP10]], align 64
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <16 x i32> [[X2]] to <32 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <16 x i32> [[_MSLD]] to <32 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <32 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <32 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <32 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <32 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <32 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <32 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <32 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <32 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <32 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <32 x i1> [[TMP58]] to <32 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <32 x i16> [[TMP59]] to <16 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <16 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <16 x i1> [[TMP61]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -219,8 +355,24 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssd_512(<16 x i
; CHECK-NEXT: [[TMP17:%.*]] = or <16 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <16 x i1> [[TMP12]], <16 x i32> [[TMP17]], <16 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <16 x i1> [[TMP13]], <16 x i32> [[TMP11]], <16 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <16 x i32> [[X4]] to <32 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <16 x i32> [[TMP5]] to <32 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <32 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <32 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <32 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <32 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <32 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <32 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <32 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <32 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <32 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <32 x i1> [[TMP49]] to <32 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <32 x i16> [[TMP50]] to <16 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <16 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <16 x i1> [[TMP52]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -251,12 +403,28 @@ declare <16 x i32> @llvm.x86.avx512.maskz.vpdpwssds.512(<16 x i32>, <16 x i32>,
define <16 x i32>@test_int_x86_avx512_vpdpwssds_512(<16 x i32> %x0, <16 x i32> %x1, <16 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <16 x i32> @test_int_x86_avx512_vpdpwssds_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], <16 x i32> [[X2:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <16 x i32> [[X2]] to <32 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <16 x i32> [[TMP2]] to <32 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <32 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <32 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <32 x i1> [[TMP16]] to <32 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <32 x i16> [[TMP17]] to <16 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <16 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <16 x i1> [[TMP19]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: store <16 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[TMP4]]
@@ -269,8 +437,8 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssds_512(<16 x
; CHECK-LABEL: define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssds_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <16 x i32> [[X4:%.*]], i16 [[X3:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 200) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 136) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -285,8 +453,24 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssds_512(<16 x
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <16 x i32>, ptr [[TMP10]], align 64
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <16 x i32> [[X2]] to <32 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <16 x i32> [[_MSLD]] to <32 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <32 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <32 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <32 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <32 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <32 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <32 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <32 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <32 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <32 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <32 x i1> [[TMP58]] to <32 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <32 x i16> [[TMP59]] to <16 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <16 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <16 x i1> [[TMP61]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -296,8 +480,24 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssds_512(<16 x
; CHECK-NEXT: [[TMP17:%.*]] = or <16 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <16 x i1> [[TMP12]], <16 x i32> [[TMP17]], <16 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <16 x i1> [[TMP13]], <16 x i32> [[TMP11]], <16 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <16 x i32> [[X4]] to <32 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <16 x i32> [[TMP5]] to <32 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <32 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <32 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <32 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <32 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <32 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <32 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <32 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <32 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <32 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <32 x i1> [[TMP49]] to <32 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <32 x i16> [[TMP50]] to <16 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <16 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <16 x i1> [[TMP52]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i16 [[X3]] to <16 x i1>
diff --git a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vnni-intrinsics.ll b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vnni-intrinsics.ll
index ca07d5905c8af..7c39ff6bb2be1 100644
--- a/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vnni-intrinsics.ll
+++ b/llvm/test/Instrumentation/MemorySanitizer/X86/avx512vnni-intrinsics.ll
@@ -5,11 +5,7 @@
;
; Handled strictly: (none)
;
-; Handled heuristically:
-; - llvm.x86.avx512.vpdpbusd.512
-; - llvm.x86.avx512.vpdpbusds.512
-; - llvm.x86.avx512.vpdpwssd.512
-; - llvm.x86.avx512.vpdpwssds.512
+; Handled heuristically: (none)
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@@ -19,12 +15,30 @@ declare <16 x i32> @llvm.x86.avx512.vpdpbusd.512(<16 x i32>, <16 x i32>, <16 x i
define <16 x i32> @test_int_x86_avx512_ask_vpdpbusd_512(<16 x i32> %x0, <16 x i32> %x1, <16 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <16 x i32> @test_int_x86_avx512_ask_vpdpbusd_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], <16 x i32> [[X2:%.*]]) #[[ATTR1:[0-9]+]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <16 x i32> [[X2]] to <64 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <16 x i32> [[TMP2]] to <64 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <64 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <64 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <64 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <64 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <64 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <64 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <64 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <64 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <64 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <64 x i1> [[TMP16]] to <64 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <64 x i8> [[TMP17]] to <32 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <32 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <32 x i1> [[TMP19]] to <32 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <32 x i16> [[TMP20]] to i512
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i512 [[TMP21]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: store <16 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[TMP4]]
@@ -37,8 +51,8 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusd_512(<16 x i
; CHECK-LABEL: define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusd_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <16 x i32> [[X4:%.*]], i16 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 200) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 136) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -53,8 +67,26 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusd_512(<16 x i
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <16 x i32>, ptr [[TMP10]], align 64
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <16 x i32> [[X2]] to <64 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <16 x i32> [[_MSLD]] to <64 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <64 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <64 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <64 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <64 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <64 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <64 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <64 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <64 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <64 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <64 x i1> [[TMP61]] to <64 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <64 x i8> [[TMP62]] to <32 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <32 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <32 x i1> [[TMP64]] to <32 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <32 x i16> [[TMP65]] to i512
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i512 [[TMP66]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -64,8 +96,26 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusd_512(<16 x i
; CHECK-NEXT: [[TMP17:%.*]] = or <16 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <16 x i1> [[TMP12]], <16 x i32> [[TMP17]], <16 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <16 x i1> [[TMP13]], <16 x i32> [[TMP11]], <16 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <16 x i32> [[X4]] to <64 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <16 x i32> [[TMP5]] to <64 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <64 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <64 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <64 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <64 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <64 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <64 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <64 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <64 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <64 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <64 x i1> [[TMP51]] to <64 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <64 x i8> [[TMP52]] to <32 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <32 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <32 x i1> [[TMP54]] to <32 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <32 x i16> [[TMP55]] to i512
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i512 [[TMP56]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -99,12 +149,30 @@ declare <16 x i32> @llvm.x86.avx512.vpdpbusds.512(<16 x i32>, <16 x i32>, <16 x
define <16 x i32>@test_int_x86_avx512_vpdpbusds_512(<16 x i32> %x0, <16 x i32> %x1, <16 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <16 x i32> @test_int_x86_avx512_vpdpbusds_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], <16 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <16 x i32> [[X2]] to <64 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <16 x i32> [[TMP2]] to <64 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <64 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <64 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <64 x i8> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <64 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <64 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <64 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <64 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <64 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <64 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <64 x i1> [[TMP16]] to <64 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <64 x i8> [[TMP17]] to <32 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <32 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <32 x i1> [[TMP19]] to <32 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <32 x i16> [[TMP20]] to i512
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i512 [[TMP21]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[TMP4:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: store <16 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[TMP4]]
@@ -117,8 +185,8 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusds_512(<16 x
; CHECK-LABEL: define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusds_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <16 x i32> [[X4:%.*]], i16 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 200) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 136) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -133,8 +201,26 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusds_512(<16 x
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <16 x i32>, ptr [[TMP10]], align 64
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <16 x i32> [[X2]] to <64 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <16 x i32> [[_MSLD]] to <64 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <64 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <64 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <64 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <64 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = and <64 x i1> [[TMP34]], [[TMP35]]
+; CHECK-NEXT: [[TMP58:%.*]] = and <64 x i1> [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP59:%.*]] = and <64 x i1> [[TMP34]], [[TMP37]]
+; CHECK-NEXT: [[TMP60:%.*]] = or <64 x i1> [[TMP38]], [[TMP58]]
+; CHECK-NEXT: [[TMP61:%.*]] = or <64 x i1> [[TMP60]], [[TMP59]]
+; CHECK-NEXT: [[TMP62:%.*]] = sext <64 x i1> [[TMP61]] to <64 x i8>
+; CHECK-NEXT: [[TMP63:%.*]] = bitcast <64 x i8> [[TMP62]] to <32 x i16>
+; CHECK-NEXT: [[TMP64:%.*]] = icmp ne <32 x i16> [[TMP63]], zeroinitializer
+; CHECK-NEXT: [[TMP65:%.*]] = sext <32 x i1> [[TMP64]] to <32 x i16>
+; CHECK-NEXT: [[TMP66:%.*]] = bitcast <32 x i16> [[TMP65]] to i512
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i512 [[TMP66]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP29]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -144,8 +230,26 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpbusds_512(<16 x
; CHECK-NEXT: [[TMP17:%.*]] = or <16 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <16 x i1> [[TMP12]], <16 x i32> [[TMP17]], <16 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <16 x i1> [[TMP13]], <16 x i32> [[TMP11]], <16 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <16 x i32> [[X1]] to <64 x i8>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <16 x i32> [[X4]] to <64 x i8>
+; CHECK-NEXT: [[TMP41:%.*]] = bitcast <16 x i32> [[TMP3]] to <64 x i8>
+; CHECK-NEXT: [[TMP42:%.*]] = bitcast <16 x i32> [[TMP5]] to <64 x i8>
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <64 x i8> [[TMP41]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <64 x i8> [[TMP42]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <64 x i8> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <64 x i8> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = and <64 x i1> [[TMP43]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = and <64 x i1> [[TMP45]], [[TMP44]]
+; CHECK-NEXT: [[TMP49:%.*]] = and <64 x i1> [[TMP43]], [[TMP46]]
+; CHECK-NEXT: [[TMP50:%.*]] = or <64 x i1> [[TMP47]], [[TMP48]]
+; CHECK-NEXT: [[TMP51:%.*]] = or <64 x i1> [[TMP50]], [[TMP49]]
+; CHECK-NEXT: [[TMP52:%.*]] = sext <64 x i1> [[TMP51]] to <64 x i8>
+; CHECK-NEXT: [[TMP53:%.*]] = bitcast <64 x i8> [[TMP52]] to <32 x i16>
+; CHECK-NEXT: [[TMP54:%.*]] = icmp ne <32 x i16> [[TMP53]], zeroinitializer
+; CHECK-NEXT: [[TMP55:%.*]] = sext <32 x i1> [[TMP54]] to <32 x i16>
+; CHECK-NEXT: [[TMP56:%.*]] = bitcast <32 x i16> [[TMP55]] to i512
+; CHECK-NEXT: [[TMP57:%.*]] = bitcast i512 [[TMP56]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[TMP57]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpbusds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -179,12 +283,28 @@ declare <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32>, <16 x i32>, <16 x i
define <16 x i32>@test_int_x86_avx512_vpdpwssd_512(<16 x i32> %x0, <16 x i32> %x1, <16 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <16 x i32> @test_int_x86_avx512_vpdpwssd_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], <16 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <16 x i32> [[X2]] to <32 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <16 x i32> [[TMP2]] to <32 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <32 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <32 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <32 x i1> [[TMP16]] to <32 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <32 x i16> [[TMP17]] to <16 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <16 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <16 x i1> [[TMP19]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: store <16 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[TMP4]]
@@ -197,8 +317,8 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssd_512(<16 x i
; CHECK-LABEL: define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssd_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <16 x i32> [[X4:%.*]], i16 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 200) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 136) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -213,8 +333,24 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssd_512(<16 x i
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <16 x i32>, ptr [[TMP10]], align 64
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <16 x i32> [[X2]] to <32 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <16 x i32> [[_MSLD]] to <32 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <32 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <32 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <32 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <32 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <32 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <32 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <32 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <32 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <32 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <32 x i1> [[TMP58]] to <32 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <32 x i16> [[TMP59]] to <16 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <16 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <16 x i1> [[TMP61]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -224,8 +360,24 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssd_512(<16 x i
; CHECK-NEXT: [[TMP17:%.*]] = or <16 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <16 x i1> [[TMP12]], <16 x i32> [[TMP17]], <16 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <16 x i1> [[TMP13]], <16 x i32> [[TMP11]], <16 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <16 x i32> [[X4]] to <32 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <16 x i32> [[TMP5]] to <32 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <32 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <32 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <32 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <32 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <32 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <32 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <32 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <32 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <32 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <32 x i1> [[TMP49]] to <32 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <32 x i16> [[TMP50]] to <16 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <16 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <16 x i1> [[TMP52]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -259,12 +411,28 @@ declare <16 x i32> @llvm.x86.avx512.vpdpwssds.512(<16 x i32>, <16 x i32>, <16 x
define <16 x i32>@test_int_x86_avx512_ask_vpdpwssds_512(<16 x i32> %x0, <16 x i32> %x1, <16 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <16 x i32> @test_int_x86_avx512_ask_vpdpwssds_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], <16 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <16 x i32> [[X2]] to <32 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <16 x i32> [[TMP2]] to <32 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <32 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i16> [[TMP22]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <32 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <32 x i1> [[TMP16]] to <32 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <32 x i16> [[TMP17]] to <16 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <16 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <16 x i1> [[TMP19]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[TMP4:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: store <16 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <16 x i32> [[TMP4]]
@@ -277,8 +445,8 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssds_512(<16 x
; CHECK-LABEL: define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssds_512(
; CHECK-SAME: <16 x i32> [[X0:%.*]], <16 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <16 x i32> [[X4:%.*]], i16 [[X3:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 128) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP2:%.*]] = load <16 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load i16, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 200) to ptr), align 8
; CHECK-NEXT: [[TMP5:%.*]] = load <16 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 136) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
@@ -293,8 +461,24 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssds_512(<16 x
; CHECK-NEXT: [[TMP9:%.*]] = xor i64 [[TMP8]], 87960930222080
; CHECK-NEXT: [[TMP10:%.*]] = inttoptr i64 [[TMP9]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <16 x i32>, ptr [[TMP10]], align 64
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <16 x i32> [[X2]] to <32 x i16>
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <16 x i32> [[_MSLD]] to <32 x i16>
+; CHECK-NEXT: [[TMP33:%.*]] = icmp ne <32 x i16> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP34:%.*]] = icmp ne <32 x i16> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <32 x i16> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <32 x i16> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP54:%.*]] = and <32 x i1> [[TMP33]], [[TMP34]]
+; CHECK-NEXT: [[TMP55:%.*]] = and <32 x i1> [[TMP35]], [[TMP34]]
+; CHECK-NEXT: [[TMP56:%.*]] = and <32 x i1> [[TMP33]], [[TMP36]]
+; CHECK-NEXT: [[TMP57:%.*]] = or <32 x i1> [[TMP54]], [[TMP55]]
+; CHECK-NEXT: [[TMP58:%.*]] = or <32 x i1> [[TMP57]], [[TMP56]]
+; CHECK-NEXT: [[TMP59:%.*]] = sext <32 x i1> [[TMP58]] to <32 x i16>
+; CHECK-NEXT: [[TMP60:%.*]] = bitcast <32 x i16> [[TMP59]] to <16 x i32>
+; CHECK-NEXT: [[TMP61:%.*]] = icmp ne <16 x i32> [[TMP60]], zeroinitializer
+; CHECK-NEXT: [[TMP62:%.*]] = sext <16 x i1> [[TMP61]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <16 x i32> [[TMP62]], [[TMP2]]
; CHECK-NEXT: [[TMP11:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X2]])
; CHECK-NEXT: [[TMP12:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP13:%.*]] = bitcast i16 [[X3]] to <16 x i1>
@@ -304,8 +488,24 @@ define { <16 x i32>, <16 x i32> } @test_int_x86_avx512_mask_vpdpwssds_512(<16 x
; CHECK-NEXT: [[TMP17:%.*]] = or <16 x i32> [[TMP16]], [[TMP2]]
; CHECK-NEXT: [[_MSPROP_SELECT:%.*]] = select <16 x i1> [[TMP12]], <16 x i32> [[TMP17]], <16 x i32> [[TMP14]]
; CHECK-NEXT: [[TMP18:%.*]] = select <16 x i1> [[TMP13]], <16 x i32> [[TMP11]], <16 x i32> [[X0]]
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <16 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[_MSPROP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP37:%.*]] = bitcast <16 x i32> [[X1]] to <32 x i16>
+; CHECK-NEXT: [[TMP38:%.*]] = bitcast <16 x i32> [[X4]] to <32 x i16>
+; CHECK-NEXT: [[TMP39:%.*]] = bitcast <16 x i32> [[TMP3]] to <32 x i16>
+; CHECK-NEXT: [[TMP40:%.*]] = bitcast <16 x i32> [[TMP5]] to <32 x i16>
+; CHECK-NEXT: [[TMP41:%.*]] = icmp ne <32 x i16> [[TMP39]], zeroinitializer
+; CHECK-NEXT: [[TMP42:%.*]] = icmp ne <32 x i16> [[TMP40]], zeroinitializer
+; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <32 x i16> [[TMP37]], zeroinitializer
+; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <32 x i16> [[TMP38]], zeroinitializer
+; CHECK-NEXT: [[TMP45:%.*]] = and <32 x i1> [[TMP41]], [[TMP42]]
+; CHECK-NEXT: [[TMP46:%.*]] = and <32 x i1> [[TMP43]], [[TMP42]]
+; CHECK-NEXT: [[TMP47:%.*]] = and <32 x i1> [[TMP41]], [[TMP44]]
+; CHECK-NEXT: [[TMP48:%.*]] = or <32 x i1> [[TMP45]], [[TMP46]]
+; CHECK-NEXT: [[TMP49:%.*]] = or <32 x i1> [[TMP48]], [[TMP47]]
+; CHECK-NEXT: [[TMP50:%.*]] = sext <32 x i1> [[TMP49]] to <32 x i16>
+; CHECK-NEXT: [[TMP51:%.*]] = bitcast <32 x i16> [[TMP50]] to <16 x i32>
+; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <16 x i32> [[TMP51]], zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = sext <16 x i1> [[TMP52]] to <16 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <16 x i32> [[TMP53]], [[TMP2]]
; CHECK-NEXT: [[TMP19:%.*]] = call <16 x i32> @llvm.x86.avx512.vpdpwssds.512(<16 x i32> [[X0]], <16 x i32> [[X1]], <16 x i32> [[X4]])
; CHECK-NEXT: [[TMP20:%.*]] = bitcast i16 [[TMP4]] to <16 x i1>
; CHECK-NEXT: [[TMP21:%.*]] = bitcast i16 [[X3]] to <16 x i1>
diff --git a/llvm/test/Instrumentation/MemorySanitizer/X86/avx_vnni-intrinsics.ll b/llvm/test/Instrumentation/MemorySanitizer/X86/avx_vnni-intrinsics.ll
index 0af0a89f177ee..678faef203324 100644
--- a/llvm/test/Instrumentation/MemorySanitizer/X86/avx_vnni-intrinsics.ll
+++ b/llvm/test/Instrumentation/MemorySanitizer/X86/avx_vnni-intrinsics.ll
@@ -5,15 +5,7 @@
;
; Handled strictly: (none)
;
-; Handled heuristically:
-; - llvm.x86.avx512.vpdpbusd.128
-; - llvm.x86.avx512.vpdpbusd.256
-; - llvm.x86.avx512.vpdpbusds.128
-; - llvm.x86.avx512.vpdpbusds.256
-; - llvm.x86.avx512.vpdpwssd.128
-; - llvm.x86.avx512.vpdpwssd.256
-; - llvm.x86.avx512.vpdpwssds.128
-; - llvm.x86.avx512.vpdpwssds.256
+; Handled heuristically: (none)
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@@ -23,12 +15,30 @@ declare <8 x i32> @llvm.x86.avx512.vpdpbusd.256(<8 x i32>, <8 x i32>, <8 x i32>)
define <8 x i32>@test_int_x86_avx_vpdpbusd_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx_vpdpbusd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR1:[0-9]+]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <32 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <32 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i8> [[TMP4]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <32 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <32 x i1> [[TMP16]] to <32 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <32 x i8> [[TMP17]] to <16 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <16 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <16 x i1> [[TMP19]] to <16 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <16 x i16> [[TMP20]] to i256
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i256 [[TMP21]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[RES:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[RES]]
@@ -42,12 +52,30 @@ declare <4 x i32> @llvm.x86.avx512.vpdpbusd.128(<4 x i32>, <4 x i32>, <4 x i32>)
define <4 x i32>@test_int_x86_avx_vpdpbusd_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx_vpdpbusd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <16 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i8> [[TMP4]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i8> [[TMP17]] to <8 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <8 x i16> [[TMP20]] to i128
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i128 [[TMP21]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[RES:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[RES]]
@@ -61,12 +89,30 @@ declare <8 x i32> @llvm.x86.avx512.vpdpbusds.256(<8 x i32>, <8 x i32>, <8 x i32>
define <8 x i32>@test_int_x86_avx_vpdpbusds_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx_vpdpbusds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <32 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <32 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <32 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <32 x i8> [[TMP4]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <32 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <32 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <32 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <32 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <32 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <32 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <32 x i1> [[TMP16]] to <32 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <32 x i8> [[TMP17]] to <16 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <16 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <16 x i1> [[TMP19]] to <16 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <16 x i16> [[TMP20]] to i256
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i256 [[TMP21]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[RES:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpbusds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[RES]]
@@ -80,12 +126,30 @@ declare <4 x i32> @llvm.x86.avx512.vpdpbusds.128(<4 x i32>, <4 x i32>, <4 x i32>
define <4 x i32>@test_int_x86_avx_vpdpbusds_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx_vpdpbusds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP23:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <16 x i8>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i8> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i8> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i8> [[TMP4]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i8> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i8>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i8> [[TMP17]] to <8 x i16>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i16> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i16>
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast <8 x i16> [[TMP20]] to i128
+; CHECK-NEXT: [[TMP22:%.*]] = bitcast i128 [[TMP21]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP22]], [[TMP23]]
; CHECK-NEXT: [[RES:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpbusds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[RES]]
@@ -99,12 +163,28 @@ declare <8 x i32> @llvm.x86.avx512.vpdpwssd.256(<8 x i32>, <8 x i32>, <8 x i32>)
define <8 x i32>@test_int_x86_avx_vpdpwssd_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx_vpdpwssd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <16 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <16 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i16> [[TMP4]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i16> [[TMP17]] to <8 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[RES:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[RES]]
@@ -118,12 +198,28 @@ declare <4 x i32> @llvm.x86.avx512.vpdpwssd.128(<4 x i32>, <4 x i32>, <4 x i32>)
define <4 x i32>@test_int_x86_avx_vpdpwssd_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx_vpdpwssd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <8 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <8 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <8 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <8 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <8 x i16> [[TMP4]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <8 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <8 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <8 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <8 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <8 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <8 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <8 x i1> [[TMP16]] to <8 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <8 x i16> [[TMP17]] to <4 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <4 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <4 x i1> [[TMP19]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[RES:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[RES]]
@@ -137,12 +233,28 @@ declare <8 x i32> @llvm.x86.avx512.vpdpwssds.256(<8 x i32>, <8 x i32>, <8 x i32>
define <8 x i32>@test_int_x86_avx_vpdpwssds_256(<8 x i32> %x0, <8 x i32> %x1, <8 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx_vpdpwssds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], <8 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <8 x i32> [[X1]] to <16 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <8 x i32> [[X2]] to <16 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <8 x i32> [[TMP2]] to <16 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <8 x i32> [[TMP3]] to <16 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <16 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <16 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <16 x i16> [[TMP4]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <16 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <16 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <16 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <16 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <16 x i1> [[TMP16]] to <16 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <16 x i16> [[TMP17]] to <8 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <8 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <8 x i1> [[TMP19]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[RES:%.*]] = call <8 x i32> @llvm.x86.avx512.vpdpwssds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
; CHECK-NEXT: store <8 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <8 x i32> [[RES]]
@@ -156,12 +268,28 @@ declare <4 x i32> @llvm.x86.avx512.vpdpwssds.128(<4 x i32>, <4 x i32>, <4 x i32>
define <4 x i32>@test_int_x86_avx_vpdpwssds_128(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2) sanitize_memory {
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx_vpdpwssds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], <4 x i32> [[X2:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP21:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: call void @llvm.donothing()
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP1]], [[TMP2]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[TMP3]]
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <4 x i32> [[X1]] to <8 x i16>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i32> [[X2]] to <8 x i16>
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast <4 x i32> [[TMP2]] to <8 x i16>
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast <4 x i32> [[TMP3]] to <8 x i16>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ne <8 x i16> [[TMP6]], zeroinitializer
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ne <8 x i16> [[TMP7]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = icmp ne <8 x i16> [[TMP4]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <8 x i16> [[TMP5]], zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = and <8 x i1> [[TMP8]], [[TMP9]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <8 x i1> [[TMP10]], [[TMP9]]
+; CHECK-NEXT: [[TMP14:%.*]] = and <8 x i1> [[TMP8]], [[TMP11]]
+; CHECK-NEXT: [[TMP15:%.*]] = or <8 x i1> [[TMP12]], [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = or <8 x i1> [[TMP15]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = sext <8 x i1> [[TMP16]] to <8 x i16>
+; CHECK-NEXT: [[TMP18:%.*]] = bitcast <8 x i16> [[TMP17]] to <4 x i32>
+; CHECK-NEXT: [[TMP19:%.*]] = icmp ne <4 x i32> [[TMP18]], zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = sext <4 x i1> [[TMP19]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP20]], [[TMP21]]
; CHECK-NEXT: [[RES:%.*]] = call <4 x i32> @llvm.x86.avx512.vpdpwssds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
; CHECK-NEXT: store <4 x i32> [[_MSPROP1]], ptr @__msan_retval_tls, align 8
; CHECK-NEXT: ret <4 x i32> [[RES]]
diff --git a/llvm/test/Instrumentation/MemorySanitizer/X86/avxvnniint8-intrinsics.ll b/llvm/test/Instrumentation/MemorySanitizer/X86/avxvnniint8-intrinsics.ll
index d586c314ed28c..b36d09bfb5944 100644
--- a/llvm/test/Instrumentation/MemorySanitizer/X86/avxvnniint8-intrinsics.ll
+++ b/llvm/test/Instrumentation/MemorySanitizer/X86/avxvnniint8-intrinsics.ll
@@ -5,19 +5,7 @@
;
; Handled strictly: (none)
;
-; Handled heuristically:
-; - llvm.x86.avx2.vpdpbssd.128
-; - llvm.x86.avx2.vpdpbssd.256
-; - llvm.x86.avx2.vpdpbssds.128
-; - llvm.x86.avx2.vpdpbssds.256
-; - llvm.x86.avx2.vpdpbsud.128
-; - llvm.x86.avx2.vpdpbsud.256
-; - llvm.x86.avx2.vpdpbsuds.128
-; - llvm.x86.avx2.vpdpbsuds.256
-; - llvm.x86.avx2.vpdpbuud.128
-; - llvm.x86.avx2.vpdpbuud.256
-; - llvm.x86.avx2.vpdpbuuds.128
-; - llvm.x86.avx2.vpdpbuuds.256
+; Handled heuristically: (none)
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@@ -28,8 +16,8 @@ define <4 x i32>@test_int_x86_avx2_vpdpbssd_128(<4 x i32> %x0, <4 x i32> %x1, pt
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx2_vpdpbssd_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <4 x i32> [[X4:%.*]]) #[[ATTR1:[0-9]+]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i64 [[TMP1]], 0
@@ -43,11 +31,47 @@ define <4 x i32>@test_int_x86_avx2_vpdpbssd_128(<4 x i32> %x0, <4 x i32> %x1, pt
; CHECK-NEXT: [[TMP8:%.*]] = xor i64 [[TMP7]], 87960930222080
; CHECK-NEXT: [[TMP9:%.*]] = inttoptr i64 [[TMP8]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP9]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> [[_MSLD]] to <16 x i8>
+; CHECK-NEXT: [[TMP14:%.*]] = icmp ne <16 x i8> [[TMP12]], zeroinitializer
+; CHECK-NEXT: [[TMP15:%.*]] = icmp ne <16 x i8> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP16:%.*]] = icmp ne <16 x i8> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = icmp ne <16 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = and <16 x i1> [[TMP14]], [[TMP15]]
+; CHECK-NEXT: [[TMP19:%.*]] = and <16 x i1> [[TMP16]], [[TMP15]]
+; CHECK-NEXT: [[TMP20:%.*]] = and <16 x i1> [[TMP14]], [[TMP17]]
+; CHECK-NEXT: [[TMP21:%.*]] = or <16 x i1> [[TMP18]], [[TMP19]]
+; CHECK-NEXT: [[TMP22:%.*]] = or <16 x i1> [[TMP21]], [[TMP20]]
+; CHECK-NEXT: [[TMP23:%.*]] = sext <16 x i1> [[TMP22]] to <16 x i8>
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <16 x i8> [[TMP23]] to <8 x i16>
+; CHECK-NEXT: [[TMP25:%.*]] = icmp ne <8 x i16> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP26:%.*]] = sext <8 x i1> [[TMP25]] to <8 x i16>
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast <8 x i16> [[TMP26]] to i128
+; CHECK-NEXT: [[TMP28:%.*]] = bitcast i128 [[TMP27]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP28]], [[TMP5]]
; CHECK-NEXT: [[TMP10:%.*]] = call <4 x i32> @llvm.x86.avx2.vpdpbssd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <4 x i32> [[_MSPROP2]], [[TMP4]]
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> [[X4]] to <16 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = bitcast <4 x i32> [[TMP4]] to <16 x i8>
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <16 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <16 x i8> [[TMP34]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <16 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = icmp ne <16 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP39:%.*]] = and <16 x i1> [[TMP35]], [[TMP36]]
+; CHECK-NEXT: [[TMP40:%.*]] = and <16 x i1> [[TMP37]], [[TMP36]]
+; CHECK-NEXT: [[TMP41:%.*]] = and <16 x i1> [[TMP35]], [[TMP38]]
+; CHECK-NEXT: [[TMP42:%.*]] = or <16 x i1> [[TMP39]], [[TMP40]]
+; CHECK-NEXT: [[TMP43:%.*]] = or <16 x i1> [[TMP42]], [[TMP41]]
+; CHECK-NEXT: [[TMP44:%.*]] = sext <16 x i1> [[TMP43]] to <16 x i8>
+; CHECK-NEXT: [[TMP45:%.*]] = bitcast <16 x i8> [[TMP44]] to <8 x i16>
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <8 x i16> [[TMP45]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = sext <8 x i1> [[TMP46]] to <8 x i16>
+; CHECK-NEXT: [[TMP48:%.*]] = bitcast <8 x i16> [[TMP47]] to i128
+; CHECK-NEXT: [[TMP49:%.*]] = bitcast i128 [[TMP48]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <4 x i32> [[TMP49]], [[TMP5]]
; CHECK-NEXT: [[TMP11:%.*]] = call <4 x i32> @llvm.x86.avx2.vpdpbssd.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X4]])
; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[_MSPROP1]], [[_MSPROP3]]
; CHECK-NEXT: [[RES:%.*]] = add <4 x i32> [[TMP10]], [[TMP11]]
@@ -67,8 +91,8 @@ define <4 x i32>@test_int_x86_avx2_vpdpbssds_128(<4 x i32> %x0, <4 x i32> %x1, p
; CHECK-LABEL: define <4 x i32> @test_int_x86_avx2_vpdpbssds_128(
; CHECK-SAME: <4 x i32> [[X0:%.*]], <4 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <4 x i32> [[X4:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 16) to ptr), align 8
+; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 40) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i64 [[TMP1]], 0
@@ -82,11 +106,47 @@ define <4 x i32>@test_int_x86_avx2_vpdpbssds_128(<4 x i32> %x0, <4 x i32> %x1, p
; CHECK-NEXT: [[TMP8:%.*]] = xor i64 [[TMP7]], 87960930222080
; CHECK-NEXT: [[TMP9:%.*]] = inttoptr i64 [[TMP8]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <4 x i32>, ptr [[TMP9]], align 16
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> [[X2]] to <16 x i8>
+; CHECK-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> [[_MSLD]] to <16 x i8>
+; CHECK-NEXT: [[TMP14:%.*]] = icmp ne <16 x i8> [[TMP12]], zeroinitializer
+; CHECK-NEXT: [[TMP15:%.*]] = icmp ne <16 x i8> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP16:%.*]] = icmp ne <16 x i8> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = icmp ne <16 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = and <16 x i1> [[TMP14]], [[TMP15]]
+; CHECK-NEXT: [[TMP19:%.*]] = and <16 x i1> [[TMP16]], [[TMP15]]
+; CHECK-NEXT: [[TMP20:%.*]] = and <16 x i1> [[TMP14]], [[TMP17]]
+; CHECK-NEXT: [[TMP21:%.*]] = or <16 x i1> [[TMP18]], [[TMP19]]
+; CHECK-NEXT: [[TMP22:%.*]] = or <16 x i1> [[TMP21]], [[TMP20]]
+; CHECK-NEXT: [[TMP23:%.*]] = sext <16 x i1> [[TMP22]] to <16 x i8>
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <16 x i8> [[TMP23]] to <8 x i16>
+; CHECK-NEXT: [[TMP25:%.*]] = icmp ne <8 x i16> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP26:%.*]] = sext <8 x i1> [[TMP25]] to <8 x i16>
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast <8 x i16> [[TMP26]] to i128
+; CHECK-NEXT: [[TMP28:%.*]] = bitcast i128 [[TMP27]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <4 x i32> [[TMP28]], [[TMP5]]
; CHECK-NEXT: [[TMP10:%.*]] = call <4 x i32> @llvm.x86.avx2.vpdpbssds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X2]])
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <4 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <4 x i32> [[_MSPROP2]], [[TMP4]]
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> [[X1]] to <16 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> [[X4]] to <16 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <4 x i32> [[TMP3]] to <16 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = bitcast <4 x i32> [[TMP4]] to <16 x i8>
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <16 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <16 x i8> [[TMP34]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <16 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = icmp ne <16 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP39:%.*]] = and <16 x i1> [[TMP35]], [[TMP36]]
+; CHECK-NEXT: [[TMP40:%.*]] = and <16 x i1> [[TMP37]], [[TMP36]]
+; CHECK-NEXT: [[TMP41:%.*]] = and <16 x i1> [[TMP35]], [[TMP38]]
+; CHECK-NEXT: [[TMP42:%.*]] = or <16 x i1> [[TMP39]], [[TMP40]]
+; CHECK-NEXT: [[TMP43:%.*]] = or <16 x i1> [[TMP42]], [[TMP41]]
+; CHECK-NEXT: [[TMP44:%.*]] = sext <16 x i1> [[TMP43]] to <16 x i8>
+; CHECK-NEXT: [[TMP45:%.*]] = bitcast <16 x i8> [[TMP44]] to <8 x i16>
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <8 x i16> [[TMP45]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = sext <8 x i1> [[TMP46]] to <8 x i16>
+; CHECK-NEXT: [[TMP48:%.*]] = bitcast <8 x i16> [[TMP47]] to i128
+; CHECK-NEXT: [[TMP49:%.*]] = bitcast i128 [[TMP48]] to <4 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <4 x i32> [[TMP49]], [[TMP5]]
; CHECK-NEXT: [[TMP11:%.*]] = call <4 x i32> @llvm.x86.avx2.vpdpbssds.128(<4 x i32> [[X0]], <4 x i32> [[X1]], <4 x i32> [[X4]])
; CHECK-NEXT: [[_MSPROP4:%.*]] = or <4 x i32> [[_MSPROP1]], [[_MSPROP3]]
; CHECK-NEXT: [[RES:%.*]] = add <4 x i32> [[TMP10]], [[TMP11]]
@@ -106,8 +166,8 @@ define <8 x i32>@test_int_x86_avx2_vpdpbssd_256(<8 x i32> %x0, <8 x i32> %x1, pt
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx2_vpdpbssd_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <8 x i32> [[X4:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP5:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i64 [[TMP1]], 0
@@ -121,11 +181,47 @@ define <8 x i32>@test_int_x86_avx2_vpdpbssd_256(<8 x i32> %x0, <8 x i32> %x1, pt
; CHECK-NEXT: [[TMP8:%.*]] = xor i64 [[TMP7]], 87960930222080
; CHECK-NEXT: [[TMP9:%.*]] = inttoptr i64 [[TMP8]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <8 x i32>, ptr [[TMP9]], align 32
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP12:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP13:%.*]] = bitcast <8 x i32> [[_MSLD]] to <32 x i8>
+; CHECK-NEXT: [[TMP14:%.*]] = icmp ne <32 x i8> [[TMP12]], zeroinitializer
+; CHECK-NEXT: [[TMP15:%.*]] = icmp ne <32 x i8> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP16:%.*]] = icmp ne <32 x i8> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = icmp ne <32 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = and <32 x i1> [[TMP14]], [[TMP15]]
+; CHECK-NEXT: [[TMP19:%.*]] = and <32 x i1> [[TMP16]], [[TMP15]]
+; CHECK-NEXT: [[TMP20:%.*]] = and <32 x i1> [[TMP14]], [[TMP17]]
+; CHECK-NEXT: [[TMP21:%.*]] = or <32 x i1> [[TMP18]], [[TMP19]]
+; CHECK-NEXT: [[TMP22:%.*]] = or <32 x i1> [[TMP21]], [[TMP20]]
+; CHECK-NEXT: [[TMP23:%.*]] = sext <32 x i1> [[TMP22]] to <32 x i8>
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <32 x i8> [[TMP23]] to <16 x i16>
+; CHECK-NEXT: [[TMP25:%.*]] = icmp ne <16 x i16> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP26:%.*]] = sext <16 x i1> [[TMP25]] to <16 x i16>
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast <16 x i16> [[TMP26]] to i256
+; CHECK-NEXT: [[TMP28:%.*]] = bitcast i256 [[TMP27]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP28]], [[TMP5]]
; CHECK-NEXT: [[TMP10:%.*]] = call <8 x i32> @llvm.x86.avx2.vpdpbssd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[_MSPROP2]], [[TMP4]]
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <8 x i32> [[X4]] to <32 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = bitcast <8 x i32> [[TMP4]] to <32 x i8>
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <32 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <32 x i8> [[TMP34]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <32 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = icmp ne <32 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP39:%.*]] = and <32 x i1> [[TMP35]], [[TMP36]]
+; CHECK-NEXT: [[TMP40:%.*]] = and <32 x i1> [[TMP37]], [[TMP36]]
+; CHECK-NEXT: [[TMP41:%.*]] = and <32 x i1> [[TMP35]], [[TMP38]]
+; CHECK-NEXT: [[TMP42:%.*]] = or <32 x i1> [[TMP39]], [[TMP40]]
+; CHECK-NEXT: [[TMP43:%.*]] = or <32 x i1> [[TMP42]], [[TMP41]]
+; CHECK-NEXT: [[TMP44:%.*]] = sext <32 x i1> [[TMP43]] to <32 x i8>
+; CHECK-NEXT: [[TMP45:%.*]] = bitcast <32 x i8> [[TMP44]] to <16 x i16>
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <16 x i16> [[TMP45]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = sext <16 x i1> [[TMP46]] to <16 x i16>
+; CHECK-NEXT: [[TMP48:%.*]] = bitcast <16 x i16> [[TMP47]] to i256
+; CHECK-NEXT: [[TMP49:%.*]] = bitcast i256 [[TMP48]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[TMP49]], [[TMP5]]
; CHECK-NEXT: [[TMP11:%.*]] = call <8 x i32> @llvm.x86.avx2.vpdpbssd.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X4]])
; CHECK-NEXT: [[_MSPROP4:%.*]] = or <8 x i32> [[_MSPROP1]], [[_MSPROP3]]
; CHECK-NEXT: [[RES:%.*]] = add <8 x i32> [[TMP10]], [[TMP11]]
@@ -145,8 +241,8 @@ define <8 x i32>@test_int_x86_avx2_vpdpbssds_256(<8 x i32> %x0, <8 x i32> %x1, p
; CHECK-LABEL: define <8 x i32> @test_int_x86_avx2_vpdpbssds_256(
; CHECK-SAME: <8 x i32> [[X0:%.*]], <8 x i32> [[X1:%.*]], ptr [[X2P:%.*]], <8 x i32> [[X4:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[TMP1:%.*]] = load i64, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 64) to ptr), align 8
-; CHECK-NEXT: [[TMP2:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 32) to ptr), align 8
+; CHECK-NEXT: [[TMP5:%.*]] = load <8 x i32>, ptr @__msan_param_tls, align 8
; CHECK-NEXT: [[TMP4:%.*]] = load <8 x i32>, ptr inttoptr (i64 add (i64 ptrtoint (ptr @__msan_param_tls to i64), i64 72) to ptr), align 8
; CHECK-NEXT: call void @llvm.donothing()
; CHECK-NEXT: [[_MSCMP:%.*]] = icmp ne i64 [[TMP1]], 0
@@ -160,11 +256,47 @@ define <8 x i32>@test_int_x86_avx2_vpdpbssds_256(<8 x i32> %x0, <8 x i32> %x1, p
; CHECK-NEXT: [[TMP8:%.*]] = xor i64 [[TMP7]], 87960930222080
; CHECK-NEXT: [[TMP9:%.*]] = inttoptr i64 [[TMP8]] to ptr
; CHECK-NEXT: [[_MSLD:%.*]] = load <8 x i32>, ptr [[TMP9]], align 32
-; CHECK-NEXT: [[_MSPROP:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[_MSPROP]], [[_MSLD]]
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP30:%.*]] = bitcast <8 x i32> [[X2]] to <32 x i8>
+; CHECK-NEXT: [[TMP12:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP13:%.*]] = bitcast <8 x i32> [[_MSLD]] to <32 x i8>
+; CHECK-NEXT: [[TMP14:%.*]] = icmp ne <32 x i8> [[TMP12]], zeroinitializer
+; CHECK-NEXT: [[TMP15:%.*]] = icmp ne <32 x i8> [[TMP13]], zeroinitializer
+; CHECK-NEXT: [[TMP16:%.*]] = icmp ne <32 x i8> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = icmp ne <32 x i8> [[TMP30]], zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = and <32 x i1> [[TMP14]], [[TMP15]]
+; CHECK-NEXT: [[TMP19:%.*]] = and <32 x i1> [[TMP16]], [[TMP15]]
+; CHECK-NEXT: [[TMP20:%.*]] = and <32 x i1> [[TMP14]], [[TMP17]]
+; CHECK-NEXT: [[TMP21:%.*]] = or <32 x i1> [[TMP18]], [[TMP19]]
+; CHECK-NEXT: [[TMP22:%.*]] = or <32 x i1> [[TMP21]], [[TMP20]]
+; CHECK-NEXT: [[TMP23:%.*]] = sext <32 x i1> [[TMP22]] to <32 x i8>
+; CHECK-NEXT: [[TMP24:%.*]] = bitcast <32 x i8> [[TMP23]] to <16 x i16>
+; CHECK-NEXT: [[TMP25:%.*]] = icmp ne <16 x i16> [[TMP24]], zeroinitializer
+; CHECK-NEXT: [[TMP26:%.*]] = sext <16 x i1> [[TMP25]] to <16 x i16>
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast <16 x i16> [[TMP26]] to i256
+; CHECK-NEXT: [[TMP28:%.*]] = bitcast i256 [[TMP27]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP1:%.*]] = or <8 x i32> [[TMP28]], [[TMP5]]
; CHECK-NEXT: [[TMP10:%.*]] = call <8 x i32> @llvm.x86.avx2.vpdpbssds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X2]])
-; CHECK-NEXT: [[_MSPROP2:%.*]] = or <8 x i32> [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[_MSPROP2]], [[TMP4]]
+; CHECK-NEXT: [[TMP31:%.*]] = bitcast <8 x i32> [[X1]] to <32 x i8>
+; CHECK-NEXT: [[TMP32:%.*]] = bitcast <8 x i32> [[X4]] to <32 x i8>
+; CHECK-NEXT: [[TMP33:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP34:%.*]] = bitcast <8 x i32> [[TMP4]] to <32 x i8>
+; CHECK-NEXT: [[TMP35:%.*]] = icmp ne <32 x i8> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP36:%.*]] = icmp ne <32 x i8> [[TMP34]], zeroinitializer
+; CHECK-NEXT: [[TMP37:%.*]] = icmp ne <32 x i8> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP38:%.*]] = icmp ne <32 x i8> [[TMP32]], zeroinitializer
+; CHECK-NEXT: [[TMP39:%.*]] = and <32 x i1> [[TMP35]], [[TMP36]]
+; CHECK-NEXT: [[TMP40:%.*]] = and <32 x i1> [[TMP37]], [[TMP36]]
+; CHECK-NEXT: [[TMP41:%.*]] = and <32 x i1> [[TMP35]], [[TMP38]]
+; CHECK-NEXT: [[TMP42:%.*]] = or <32 x i1> [[TMP39]], [[TMP40]]
+; CHECK-NEXT: [[TMP43:%.*]] = or <32 x i1> [[TMP42]], [[TMP41]]
+; CHECK-NEXT: [[TMP44:%.*]] = sext <32 x i1> [[TMP43]] to <32 x i8>
+; CHECK-NEXT: [[TMP45:%.*]] = bitcast <32 x i8> [[TMP44]] to <16 x i16>
+; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <16 x i16> [[TMP45]], zeroinitializer
+; CHECK-NEXT: [[TMP47:%.*]] = sext <16 x i1> [[TMP46]] to <16 x i16>
+; CHECK-NEXT: [[TMP48:%.*]] = bitcast <16 x i16> [[TMP47]] to i256
+; CHECK-NEXT: [[TMP49:%.*]] = bitcast i256 [[TMP48]] to <8 x i32>
+; CHECK-NEXT: [[_MSPROP3:%.*]] = or <8 x i32> [[TMP49]], [[TMP5]]
; CHECK-NEXT: [[TMP11:%.*]] = call <8 x i32> @llvm.x86.avx2.vpdpbssds.256(<8 x i32> [[X0]], <8 x i32> [[X1]], <8 x i32> [[X4]])
; CHECK-NEXT: [[_MSPROP4:%.*]] = or <8 x i32> [[_MSPROP1]], [[_MSPROP3]]
; CHECK-NEXT: [[RES:%.*]] = add <8 x i32> [[TMP10]], [[TMP11]]
From 13716843eb0c36e9020b369365e4f5a73b84999b Mon Sep 17 00:00:00 2001
From: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date: Mon, 18 Aug 2025 13:20:38 -0700
Subject: [PATCH 086/112] [AMDGPU] Make s_setprio_inc_wg a scheduling boundary
(#154188)
---
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | 1 +
1 file changed, 1 insertion(+)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 41885e45b4101..1f3943f6e1b27 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -4241,6 +4241,7 @@ bool SIInstrInfo::isSchedulingBoundary(const MachineInstr &MI,
MI.getOpcode() == AMDGPU::S_SETREG_IMM32_B32 ||
MI.getOpcode() == AMDGPU::S_SETREG_B32 ||
MI.getOpcode() == AMDGPU::S_SETPRIO ||
+ MI.getOpcode() == AMDGPU::S_SETPRIO_INC_WG ||
changesVGPRIndexingMode(MI);
}
From 9617ce4862cf0ed5257699348f369eb179f119e9 Mon Sep 17 00:00:00 2001
From: Charitha Saumya <136391709+charithaintc at users.noreply.github.com>
Date: Mon, 18 Aug 2025 13:26:08 -0700
Subject: [PATCH 087/112] [vector][distribution] Bug fix in
`moveRegionToNewWarpOpAndAppendReturns` (#153656)
---
.../Dialect/GPU/Utils/DistributionUtils.cpp | 32 +++++++++++--------
.../Vector/vector-warp-distribute.mlir | 21 ++++++++++++
2 files changed, 39 insertions(+), 14 deletions(-)
diff --git a/mlir/lib/Dialect/GPU/Utils/DistributionUtils.cpp b/mlir/lib/Dialect/GPU/Utils/DistributionUtils.cpp
index 384d1a0ddccd2..be71bd02fc43b 100644
--- a/mlir/lib/Dialect/GPU/Utils/DistributionUtils.cpp
+++ b/mlir/lib/Dialect/GPU/Utils/DistributionUtils.cpp
@@ -14,6 +14,7 @@
#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/Value.h"
+#include "llvm/ADT/DenseMap.h"
#include <numeric>
@@ -57,26 +58,29 @@ WarpDistributionPattern::moveRegionToNewWarpOpAndAppendReturns(
warpOp.getResultTypes().end());
auto yield = cast<gpu::YieldOp>(
warpOp.getBodyRegion().getBlocks().begin()->getTerminator());
- llvm::SmallSetVector<Value, 32> yieldValues(yield.getOperands().begin(),
- yield.getOperands().end());
+ SmallVector<Value> yieldValues(yield.getOperands().begin(),
+ yield.getOperands().end());
+ llvm::SmallDenseMap<Value, unsigned> indexLookup;
+ // Record the value -> first index mapping for faster lookup.
+ for (auto [i, v] : llvm::enumerate(yieldValues)) {
+ if (!indexLookup.count(v))
+ indexLookup[v] = i;
+ }
+
for (auto [value, type] : llvm::zip_equal(newYieldedValues, newReturnTypes)) {
- if (yieldValues.insert(value)) {
+ // If the value already exists in the yield, don't create a new output.
+ if (indexLookup.count(value)) {
+ indices.push_back(indexLookup[value]);
+ } else {
+ // If the value is new, add it to the yield and to the types.
+ yieldValues.push_back(value);
types.push_back(type);
indices.push_back(yieldValues.size() - 1);
- } else {
- // If the value already exit the region don't create a new output.
- for (auto [idx, yieldOperand] :
- llvm::enumerate(yieldValues.getArrayRef())) {
- if (yieldOperand == value) {
- indices.push_back(idx);
- break;
- }
- }
}
}
- yieldValues.insert_range(newYieldedValues);
+
WarpExecuteOnLane0Op newWarpOp = moveRegionToNewWarpOpAndReplaceReturns(
- rewriter, warpOp, yieldValues.getArrayRef(), types);
+ rewriter, warpOp, yieldValues, types);
rewriter.replaceOp(warpOp,
newWarpOp.getResults().take_front(warpOp.getNumResults()));
return newWarpOp;
diff --git a/mlir/test/Dialect/Vector/vector-warp-distribute.mlir b/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
index ae8fce786ee57..c3ce7e9ca7fda 100644
--- a/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
+++ b/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
@@ -1803,3 +1803,24 @@ func.func @warp_propagate_nd_write(%laneid: index, %dest: memref<4x1024xf32>) {
// CHECK-DIST-AND-PROP: %[[IDS:.+]]:2 = affine.delinearize_index %{{.*}} into (4, 8) : index, index
// CHECK-DIST-AND-PROP: %[[INNER_ID:.+]] = affine.apply #map()[%[[IDS]]#1]
// CHECK-DIST-AND-PROP: vector.transfer_write %[[W]], %{{.*}}[%[[IDS]]#0, %[[INNER_ID]]] {{.*}} : vector<1x128xf32>
+
+// -----
+func.func @warp_propagate_duplicated_operands_in_yield(%laneid: index) {
+ %r:3 = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>, vector<1xf32>, vector<1xf32>) {
+ %0 = "some_def"() : () -> (vector<32xf32>)
+ %1 = "some_other_def"() : () -> (vector<32xf32>)
+ %2 = math.exp %1 : vector<32xf32>
+ gpu.yield %2, %0, %0 : vector<32xf32>, vector<32xf32>, vector<32xf32>
+ }
+ "some_use"(%r#0) : (vector<1xf32>) -> ()
+ return
+}
+
+// CHECK-PROP-LABEL : func.func @warp_propagate_duplicated_operands_in_yield(
+// CHECK-PROP : %[[W:.*]] = gpu.warp_execute_on_lane_0(%{{.*}})[32] -> (vector<1xf32>) {
+// CHECK-PROP : %{{.*}} = "some_def"() : () -> vector<32xf32>
+// CHECK-PROP : %[[T3:.*]] = "some_other_def"() : () -> vector<32xf32>
+// CHECK-PROP : gpu.yield %[[T3]] : vector<32xf32>
+// CHECK-PROP : }
+// CHECK-PROP : %[T1:.*] = math.exp %[[W]] : vector<1xf32>
+// CHECK-PROP : "some_use"(%[[T1]]) : (vector<1xf32>) -> ()
From 624b724ca6df5d2d3ea16b9ed232851e5d061be4 Mon Sep 17 00:00:00 2001
From: Oliver Hunt <oliver at apple.com>
Date: Mon, 18 Aug 2025 13:29:26 -0700
Subject: [PATCH 088/112] [clang][PAC] ptrauth_qualifier and ptrauth_intrinsic
should only be available on Darwin (#153912)
For backwards compatibility reasons the `ptrauth_qualifier` and
`ptrauth_intrinsics` features need to be testable with `__has_feature()`
on Apple platforms, but for other platforms this backwards compatibility
issue does not exist.
This PR resolves these issues by making the `ptrauth_qualifier` and
`ptrauth_intrinsics` tests conditional upon a Darwin target. This also
allows us to revert the ptrauth_qualifier check from an extension to a
feature test again, as is required on these platforms.
At the same time we introduce a new predefined macro `__PTRAUTH__` that
answers the same question as `__has_feature(ptrauth_qualifier)` and
`__has_feature(ptrauth_intrinsics)` as those tests are synonymous and
only exist separately for compatibility reasons.
The requirement to test for the `__PTRAUTH__` macro also resolves the
hazard presented by mixing the `ptrauth_qualifier` flag (that impacts
ABI and security policies) with `-pedantic-errors`, which makes
`__has_extension` return false for all extensions.
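
As a hedged illustration of the detection scheme described above, a translation unit might test for pointer authentication roughly as follows. This is a sketch, not code from this PR: `HAS_PTRAUTH` and `ptrauthAvailable` are hypothetical names, and the `__has_feature` fallback is only assumed to be useful for compilers that predate `__PTRAUTH__`.

```cpp
// Sketch only: HAS_PTRAUTH and ptrauthAvailable are hypothetical names.
// Prefer the predefined macro; fall back to __has_feature where it exists.
#if defined(__PTRAUTH__)
#define HAS_PTRAUTH 1
#elif defined(__has_feature)
#if __has_feature(ptrauth_intrinsics)
#define HAS_PTRAUTH 1
#endif
#endif

#ifndef HAS_PTRAUTH
#define HAS_PTRAUTH 0
#endif

// True when the compiler reports pointer-authentication support.
constexpr bool ptrauthAvailable() { return HAS_PTRAUTH != 0; }
```

Because the check goes through `defined(__PTRAUTH__)` first, it does not depend on `__has_extension`, so it should be unaffected by `-pedantic-errors`.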
---------
Co-authored-by: Aaron Ballman <aaron at aaronballman.com>
---
clang/docs/ReleaseNotes.rst | 6 +++++
clang/include/clang/Basic/Features.def | 6 +++--
clang/lib/Frontend/InitPreprocessor.cpp | 3 +++
clang/lib/Headers/ptrauth.h | 4 +--
clang/test/Preprocessor/ptrauth_extension.c | 30 ++++++++++++++++++---
clang/test/Preprocessor/ptrauth_feature.c | 2 +-
clang/test/Sema/ptrauth-qualifier.c | 16 +++++++++--
clang/test/SemaObjC/ptrauth-qualifier.m | 16 +++++++++--
8 files changed, 70 insertions(+), 13 deletions(-)
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 9ea9fcdf889df..7f76f87ce6be0 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -137,6 +137,12 @@ Non-comprehensive list of changes in this release
- ``__builtin_elementwise_max`` and ``__builtin_elementwise_min`` functions for integer types can
now be used in constant expressions.
+- Use of ``__has_feature`` to detect the ``ptrauth_qualifier`` and ``ptrauth_intrinsics``
+ features has been deprecated, and is restricted to the arm64e target only. The
+ correct method to check for these features is to test for the ``__PTRAUTH__``
+ macro.
+
+
New Compiler Flags
------------------
- New option ``-fno-sanitize-annotate-debug-info-traps`` added to disable emitting trap reasons into the debug info when compiling with trapping UBSan (e.g. ``-fsanitize-trap=undefined``).
diff --git a/clang/include/clang/Basic/Features.def b/clang/include/clang/Basic/Features.def
index b9efc6a6a2e9d..7039844aaf270 100644
--- a/clang/include/clang/Basic/Features.def
+++ b/clang/include/clang/Basic/Features.def
@@ -147,8 +147,10 @@ FEATURE(type_sanitizer, LangOpts.Sanitize.has(SanitizerKind::Type))
FEATURE(thread_sanitizer, LangOpts.Sanitize.has(SanitizerKind::Thread))
FEATURE(dataflow_sanitizer, LangOpts.Sanitize.has(SanitizerKind::DataFlow))
FEATURE(scudo, LangOpts.Sanitize.hasOneOf(SanitizerKind::Scudo))
-FEATURE(ptrauth_intrinsics, LangOpts.PointerAuthIntrinsics)
-EXTENSION(ptrauth_qualifier, LangOpts.PointerAuthIntrinsics)
+FEATURE(ptrauth_intrinsics, LangOpts.PointerAuthIntrinsics &&
+ PP.getTargetInfo().getTriple().isOSDarwin())
+FEATURE(ptrauth_qualifier, LangOpts.PointerAuthIntrinsics &&
+ PP.getTargetInfo().getTriple().isOSDarwin())
FEATURE(ptrauth_calls, LangOpts.PointerAuthCalls)
FEATURE(ptrauth_returns, LangOpts.PointerAuthReturns)
FEATURE(ptrauth_vtable_pointer_address_discrimination, LangOpts.PointerAuthVTPtrAddressDiscrimination)
diff --git a/clang/lib/Frontend/InitPreprocessor.cpp b/clang/lib/Frontend/InitPreprocessor.cpp
index 5980806fba5e4..4865c0b889044 100644
--- a/clang/lib/Frontend/InitPreprocessor.cpp
+++ b/clang/lib/Frontend/InitPreprocessor.cpp
@@ -1535,6 +1535,9 @@ static void InitializePredefinedMacros(const TargetInfo &TI,
#undef TARGET_OS
}
+ if (LangOpts.PointerAuthIntrinsics)
+ Builder.defineMacro("__PTRAUTH__");
+
// Get other target #defines.
TI.getTargetDefines(LangOpts, Builder);
}
diff --git a/clang/lib/Headers/ptrauth.h b/clang/lib/Headers/ptrauth.h
index 7f7d387cbdfda..f902ca1e3bbd3 100644
--- a/clang/lib/Headers/ptrauth.h
+++ b/clang/lib/Headers/ptrauth.h
@@ -95,7 +95,7 @@ typedef __UINTPTR_TYPE__ ptrauth_generic_signature_t;
__ptrauth qualifier; the compiler will perform this check
automatically. */
-#if __has_feature(ptrauth_intrinsics)
+#if __has_feature(ptrauth_intrinsics) || defined(__PTRAUTH__)
/* Strip the signature from a value without authenticating it.
@@ -388,6 +388,6 @@ typedef __UINTPTR_TYPE__ ptrauth_generic_signature_t;
#define __ptrauth_objc_isa_uintptr
#define __ptrauth_objc_super_pointer
-#endif /* __has_feature(ptrauth_intrinsics) */
+#endif /* __has_feature(ptrauth_intrinsics) || defined(__PTRAUTH__) */
#endif /* __PTRAUTH_H */
diff --git a/clang/test/Preprocessor/ptrauth_extension.c b/clang/test/Preprocessor/ptrauth_extension.c
index d6b79187ba62d..3267b0786c28f 100644
--- a/clang/test/Preprocessor/ptrauth_extension.c
+++ b/clang/test/Preprocessor/ptrauth_extension.c
@@ -4,10 +4,32 @@
// RUN: %clang_cc1 -E %s -triple=aarch64 -fptrauth-calls | \
// RUN: FileCheck %s --check-prefixes=NOINTRIN
-#if __has_extension(ptrauth_qualifier)
-// INTRIN: has_ptrauth_qualifier
-void has_ptrauth_qualifier() {}
-#else
+// RUN: %clang_cc1 -E %s -DIS_DARWIN -triple=arm64e-apple-darwin -fptrauth-intrinsics | \
+// RUN: FileCheck %s --check-prefixes=INTRIN,INTRIN_MAC
+
+// RUN: %clang_cc1 -E %s -DIS_DARWIN -triple=arm64e-apple-darwin -fptrauth-calls | \
+// RUN: FileCheck %s --check-prefixes=NOINTRIN
+
+#if defined(IS_DARWIN) && __has_extension(ptrauth_qualifier)
+// INTRIN_MAC: has_ptrauth_qualifier1
+void has_ptrauth_qualifier1() {}
+#ifndef __PTRAUTH__
+#error ptrauth_qualifier extension present without predefined test macro
+#endif
+#endif
+#if defined(IS_DARWIN) && __has_feature(ptrauth_qualifier)
+// INTRIN_MAC: has_ptrauth_qualifier2
+void has_ptrauth_qualifier2() {}
+#ifndef __PTRAUTH__
+#error ptrauth_qualifier extension present without predefined test macro
+#endif
+#endif
+#if defined(__PTRAUTH__)
+// INTRIN: has_ptrauth_qualifier3
+void has_ptrauth_qualifier3() {}
+#endif
+
+#if !defined(__PTRAUTH__) && !__has_feature(ptrauth_qualifier) && !__has_extension(ptrauth_qualifier)
// NOINTRIN: no_ptrauth_qualifier
void no_ptrauth_qualifier() {}
#endif
diff --git a/clang/test/Preprocessor/ptrauth_feature.c b/clang/test/Preprocessor/ptrauth_feature.c
index a440791d6cc69..45d9cd4245dba 100644
--- a/clang/test/Preprocessor/ptrauth_feature.c
+++ b/clang/test/Preprocessor/ptrauth_feature.c
@@ -34,7 +34,7 @@
// RUN: %clang_cc1 -E %s -triple=aarch64 -fptrauth-elf-got | \
// RUN: FileCheck %s --check-prefixes=NOINTRIN,NOCALLS,NORETS,NOVPTR_ADDR_DISCR,NOVPTR_TYPE_DISCR,NOTYPE_INFO_DISCR,NOFUNC,NOINITFINI,NOINITFINI_ADDR_DISCR,NOGOTOS,ELFGOT
-#if __has_feature(ptrauth_intrinsics)
+#if defined(__PTRAUTH__)
// INTRIN: has_ptrauth_intrinsics
void has_ptrauth_intrinsics() {}
#else
diff --git a/clang/test/Sema/ptrauth-qualifier.c b/clang/test/Sema/ptrauth-qualifier.c
index 5d932b724f07a..3e568ce9f37e3 100644
--- a/clang/test/Sema/ptrauth-qualifier.c
+++ b/clang/test/Sema/ptrauth-qualifier.c
@@ -1,13 +1,25 @@
-// RUN: %clang_cc1 -triple arm64-apple-ios -std=c23 -fsyntax-only -verify -fptrauth-intrinsics %s
+// RUN: %clang_cc1 -triple arm64-apple-ios -DIS_DARWIN -std=c23 -fsyntax-only -verify -fptrauth-intrinsics %s
// RUN: %clang_cc1 -triple aarch64-linux-gnu -std=c23 -fsyntax-only -verify -fptrauth-intrinsics %s
-#if !__has_extension(ptrauth_qualifier)
+#if defined(IS_DARWIN) && !__has_extension(ptrauth_qualifier)
// This error means that the __ptrauth qualifier availability test says that it
// is not available. This error is not expected in the output, if it is seen
// there is a feature detection regression.
#error __ptrauth qualifier not enabled
#endif
+#if defined(IS_DARWIN) && !__has_feature(ptrauth_qualifier)
+// This error means that the __has_feature test for ptrauth_qualifier has
+// failed, despite it being expected on darwin.
+#error __ptrauth qualifier not enabled
+#elif !defined(IS_DARWIN) && (__has_feature(ptrauth_qualifier) || __has_extension(ptrauth_qualifier))
+#error ptrauth_qualifier labeled a feature on a non-darwin platform
+#endif
+
+#if !defined (__PTRAUTH__)
+#error __PTRAUTH__ test macro not defined when ptrauth is enabled
+#endif
+
#if __aarch64__
#define VALID_CODE_KEY 0
#define VALID_DATA_KEY 2
diff --git a/clang/test/SemaObjC/ptrauth-qualifier.m b/clang/test/SemaObjC/ptrauth-qualifier.m
index 74bbe6f09899b..67a73bbe45777 100644
--- a/clang/test/SemaObjC/ptrauth-qualifier.m
+++ b/clang/test/SemaObjC/ptrauth-qualifier.m
@@ -1,13 +1,25 @@
-// RUN: %clang_cc1 -triple arm64-apple-ios -fsyntax-only -verify -fptrauth-intrinsics %s
+// RUN: %clang_cc1 -triple arm64-apple-ios -DIS_DARWIN -fsyntax-only -verify -fptrauth-intrinsics %s
// RUN: %clang_cc1 -triple aarch64-linux-gnu -fsyntax-only -verify -fptrauth-intrinsics %s
-#if !__has_extension(ptrauth_qualifier)
+#if defined(IS_DARWIN) && !__has_extension(ptrauth_qualifier)
// This error means that the __ptrauth qualifier availability test says that it
// is not available. This error is not expected in the output, if it is seen
// there is a feature detection regression.
#error __ptrauth qualifier not enabled
#endif
+#if defined(IS_DARWIN) && !__has_feature(ptrauth_qualifier)
+// This error means that the __has_feature test for ptrauth_qualifier has
+// failed, despite it being expected on darwin.
+#error __ptrauth qualifier not enabled
+#elif !defined(IS_DARWIN) && (__has_feature(ptrauth_qualifier) || __has_extension(ptrauth_qualifier))
+#error ptrauth_qualifier labeled a feature on a non-darwin platform
+#endif
+
+#if !defined (__PTRAUTH__)
+#error __PTRAUTH__ test macro not defined when ptrauth is enabled
+#endif
+
@interface Foo
// expected-warning@-1 {{class 'Foo' defined without specifying a base class}}
// expected-note@-2 {{add a super class to fix this problem}}
From 191e7eba93d07ebbf46436a531258ca267a3aa34 Mon Sep 17 00:00:00 2001
From: Mehdi Amini <joker.eph at gmail.com>
Date: Mon, 18 Aug 2025 22:46:59 +0200
Subject: [PATCH 089/112] [MLIR] Stop visiting unreachable blocks in the
walkAndApplyPatterns driver (#154038)
This is similar to the fix to the greedy driver in #153957, except that
instead of removing unreachable code, we just ignore it.
Operations like:
```
%add = arith.addi %add, %add : i64
```
are legal in unreachable code.
Unfortunately many patterns would be unsafe to apply on such IR and can
lead to crashes or infinite loops.
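
The idea of ignoring unreachable blocks can be sketched outside MLIR as a plain worklist traversal from the entry block. This is a simplified model under stated assumptions, not the driver's code: `Cfg` and `findReachable` are hypothetical names, blocks are represented by indices, and block 0 is assumed to be the entry.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a region's CFG: successors[i] lists the blocks
// that block i can branch to. Block 0 is assumed to be the entry block.
using Cfg = std::vector<std::vector<std::size_t>>;

// Returns a vector where element i is true iff block i is reachable from
// the entry block.
std::vector<bool> findReachable(const Cfg &successors) {
  std::vector<bool> reachable(successors.size(), false);
  if (successors.empty())
    return reachable;
  std::vector<std::size_t> worklist{0};
  reachable[0] = true;
  while (!worklist.empty()) {
    std::size_t block = worklist.back();
    worklist.pop_back();
    for (std::size_t succ : successors[block]) {
      assert(succ < successors.size() && "successor index out of range");
      if (reachable[succ])
        continue; // Already queued or visited.
      reachable[succ] = true;
      worklist.push_back(succ);
    }
  }
  return reachable;
}
```

A walk that consults this result can skip any block whose flag is false, so patterns never see IR that is only legal because it is unreachable.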
---
.../Transforms/WalkPatternRewriteDriver.h | 2 ++
.../Utils/WalkPatternRewriteDriver.cpp | 27 +++++++++++++++++++
.../IR/test-walk-pattern-rewrite-driver.mlir | 20 ++++++++++++++
3 files changed, 49 insertions(+)
diff --git a/mlir/include/mlir/Transforms/WalkPatternRewriteDriver.h b/mlir/include/mlir/Transforms/WalkPatternRewriteDriver.h
index 6d62ae3dd43dc..7d5c1d5cebb26 100644
--- a/mlir/include/mlir/Transforms/WalkPatternRewriteDriver.h
+++ b/mlir/include/mlir/Transforms/WalkPatternRewriteDriver.h
@@ -27,6 +27,8 @@ namespace mlir {
/// This is intended as the simplest and most lightweight pattern rewriter in
/// cases when a simple walk gets the job done.
///
+/// The driver will skip unreachable blocks.
+///
/// Note: Does not apply patterns to the given operation itself.
void walkAndApplyPatterns(Operation *op,
const FrozenRewritePatternSet &patterns,
diff --git a/mlir/lib/Transforms/Utils/WalkPatternRewriteDriver.cpp b/mlir/lib/Transforms/Utils/WalkPatternRewriteDriver.cpp
index 2111e29120567..1382550e0f7e6 100644
--- a/mlir/lib/Transforms/Utils/WalkPatternRewriteDriver.cpp
+++ b/mlir/lib/Transforms/Utils/WalkPatternRewriteDriver.cpp
@@ -27,6 +27,26 @@
namespace mlir {
+// Find all reachable blocks in the region and add them to the visitedBlocks
+// set.
+static void findReachableBlocks(Region &region,
+ DenseSet<Block *> &reachableBlocks) {
+ Block *entryBlock = &region.front();
+ reachableBlocks.insert(entryBlock);
+ // Traverse the CFG and add all reachable blocks to the blockList.
+ SmallVector<Block *> worklist({entryBlock});
+ while (!worklist.empty()) {
+ Block *block = worklist.pop_back_val();
+ Operation *terminator = &block->back();
+ for (Block *successor : terminator->getSuccessors()) {
+ if (reachableBlocks.contains(successor))
+ continue;
+ worklist.push_back(successor);
+ reachableBlocks.insert(successor);
+ }
+ }
+}
+
namespace {
struct WalkAndApplyPatternsAction final
: tracing::ActionImpl<WalkAndApplyPatternsAction> {
@@ -98,6 +118,8 @@ void walkAndApplyPatterns(Operation *op,
regionIt = region->begin();
if (regionIt != region->end())
blockIt = regionIt->begin();
+ if (!llvm::hasSingleElement(*region))
+ findReachableBlocks(*region, reachableBlocks);
}
// Advance the iterator to the next reachable operation.
void advance() {
@@ -105,6 +127,9 @@ void walkAndApplyPatterns(Operation *op,
hasVisitedRegions = false;
if (blockIt == regionIt->end()) {
++regionIt;
+ while (regionIt != region->end() &&
+ !reachableBlocks.contains(&*regionIt))
+ ++regionIt;
if (regionIt != region->end())
blockIt = regionIt->begin();
return;
@@ -121,6 +146,8 @@ void walkAndApplyPatterns(Operation *op,
Region::iterator regionIt;
// The Operation currently being iterated over.
Block::iterator blockIt;
+ // The set of blocks that are reachable in the current region.
+ DenseSet<Block *> reachableBlocks;
// Whether we've visited the nested regions of the current op already.
bool hasVisitedRegions = false;
};
diff --git a/mlir/test/IR/test-walk-pattern-rewrite-driver.mlir b/mlir/test/IR/test-walk-pattern-rewrite-driver.mlir
index c75c478ec3734..c3063416b0360 100644
--- a/mlir/test/IR/test-walk-pattern-rewrite-driver.mlir
+++ b/mlir/test/IR/test-walk-pattern-rewrite-driver.mlir
@@ -119,3 +119,23 @@ func.func @erase_nested_block() -> i32 {
}): () -> (i32)
return %a : i32
}
+
+
+// CHECK-LABEL: func.func @unreachable_replace_with_new_op
+// CHECK: "test.new_op"
+// CHECK: "test.replace_with_new_op"
+// CHECK-SAME: unreachable
+// CHECK: "test.new_op"
+func.func @unreachable_replace_with_new_op() {
+ "test.br"()[^bb1] : () -> ()
+^bb1:
+ %a = "test.replace_with_new_op"() : () -> (i32)
+ "test.br"()[^end] : () -> () // Test jumping over the unreachable block is visited as well.
+^unreachable:
+ %b = "test.replace_with_new_op"() {test.unreachable} : () -> (i32)
+ return
+^end:
+ %c = "test.replace_with_new_op"() : () -> (i32)
+ return
+}
+
From dfaebe7f485f966fc7456ea8d372eaf9f1dc0306 Mon Sep 17 00:00:00 2001
From: Mehdi Amini <joker.eph at gmail.com>
Date: Mon, 18 Aug 2025 22:50:36 +0200
Subject: [PATCH 090/112] [MLIR] Fix Liveness analysis handling of unreachable
code (#153973)
This patch forces all values to be initialized by the
LivenessAnalysis, even in dead blocks. The dataflow framework skips
visiting values when it already knows that a block is dynamically
unreachable, so this requires specific handling.
Downstream code could consider that the absence of liveness is the same
as "dead".
However, as the code is mutated, new values can be introduced, and a
transformation like "RemoveDeadValue" must conservatively consider that
the absence of liveness information means that we are not sure whether a
value is dead (it could be a newly introduced value).
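
The distinction between "known dead" and "no information" can be sketched as a three-state query. This is an illustrative model only, assuming a simple name-keyed map in place of the solver's lattice states; `LivenessInfo`, `LivenessMap`, `query`, and `mayErase` are hypothetical names.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical three-state result; the real analysis uses lattice states.
enum class LivenessInfo { Live, Dead, Missing };

// Hypothetical store of analysis results keyed by value name (true = live).
using LivenessMap = std::unordered_map<std::string, bool>;

LivenessInfo query(const LivenessMap &results, const std::string &value) {
  auto it = results.find(value);
  if (it == results.end())
    return LivenessInfo::Missing; // e.g. a value created after the analysis ran
  return it->second ? LivenessInfo::Live : LivenessInfo::Dead;
}

// Conservative rule: erase only values the analysis positively marked dead.
bool mayErase(const LivenessMap &results, const std::string &value) {
  return query(results, value) == LivenessInfo::Dead;
}
```

Under this model, initializing every value (including those in dead blocks) is what lets a missing entry be read unambiguously as "newly introduced" rather than "unreachable".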
Fixes #153906
---
.../Analysis/DataFlow/LivenessAnalysis.cpp | 29 +++++++++++-
mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp | 46 +++++++++++++++++--
mlir/lib/Transforms/RemoveDeadValues.cpp | 46 ++++++++++++++++---
.../DataFlow/test-liveness-analysis.mlir | 20 ++++++++
mlir/test/Transforms/remove-dead-values.mlir | 21 +++++++++
.../DataFlow/TestLivenessAnalysis.cpp | 1 -
6 files changed, 150 insertions(+), 13 deletions(-)
diff --git a/mlir/lib/Analysis/DataFlow/LivenessAnalysis.cpp b/mlir/lib/Analysis/DataFlow/LivenessAnalysis.cpp
index 509f5202be08d..65df355216f74 100644
--- a/mlir/lib/Analysis/DataFlow/LivenessAnalysis.cpp
+++ b/mlir/lib/Analysis/DataFlow/LivenessAnalysis.cpp
@@ -294,7 +294,34 @@ RunLivenessAnalysis::RunLivenessAnalysis(Operation *op) {
solver.load<LivenessAnalysis>(symbolTable);
LDBG() << "Initializing and running solver";
(void)solver.initializeAndRun(op);
- LDBG() << "RunLivenessAnalysis initialized for op: " << op->getName();
+ LDBG() << "RunLivenessAnalysis initialized for op: " << op->getName()
+ << " check on unreachable code now:";
+ // The framework doesn't visit operations in dead blocks, so we need to
+ // explicitly mark them as dead.
+ op->walk([&](Operation *op) {
+ if (op->getNumResults() == 0)
+ return;
+ for (auto result : llvm::enumerate(op->getResults())) {
+ if (getLiveness(result.value()))
+ continue;
+ LDBG() << "Result: " << result.index() << " of "
+ << OpWithFlags(op, OpPrintingFlags().skipRegions())
+ << " has no liveness info (unreachable), mark dead";
+ solver.getOrCreateState<Liveness>(result.value());
+ }
+ for (auto &region : op->getRegions()) {
+ for (auto &block : region) {
+ for (auto blockArg : llvm::enumerate(block.getArguments())) {
+ if (getLiveness(blockArg.value()))
+ continue;
+ LDBG() << "Block argument: " << blockArg.index() << " of "
+ << OpWithFlags(op, OpPrintingFlags().skipRegions())
+ << " has no liveness info, mark dead";
+ solver.getOrCreateState<Liveness>(blockArg.value());
+ }
+ }
+ }
+ });
}
const Liveness *RunLivenessAnalysis::getLiveness(Value val) {
diff --git a/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp b/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp
index e625f626d12fd..13a3e1480c836 100644
--- a/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp
+++ b/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp
@@ -19,12 +19,15 @@
#include "mlir/Interfaces/ControlFlowInterfaces.h"
#include "mlir/Support/LLVM.h"
#include "llvm/ADT/STLExtras.h"
+#include "llvm/Support/DebugLog.h"
#include <cassert>
#include <optional>
using namespace mlir;
using namespace mlir::dataflow;
+#define DEBUG_TYPE "dataflow"
+
//===----------------------------------------------------------------------===//
// AbstractSparseLattice
//===----------------------------------------------------------------------===//
@@ -64,22 +67,36 @@ AbstractSparseForwardDataFlowAnalysis::initialize(Operation *top) {
LogicalResult
AbstractSparseForwardDataFlowAnalysis::initializeRecursively(Operation *op) {
+ LDBG() << "Initializing recursively for operation: " << op->getName();
+
// Initialize the analysis by visiting every owner of an SSA value (all
// operations and blocks).
- if (failed(visitOperation(op)))
+ if (failed(visitOperation(op))) {
+ LDBG() << "Failed to visit operation: " << op->getName();
return failure();
+ }
for (Region &region : op->getRegions()) {
+ LDBG() << "Processing region with " << region.getBlocks().size()
+ << " blocks";
for (Block &block : region) {
+ LDBG() << "Processing block with " << block.getNumArguments()
+ << " arguments";
getOrCreate<Executable>(getProgramPointBefore(&block))
->blockContentSubscribe(this);
visitBlock(&block);
- for (Operation &op : block)
- if (failed(initializeRecursively(&op)))
+ for (Operation &op : block) {
+ LDBG() << "Recursively initializing nested operation: " << op.getName();
+ if (failed(initializeRecursively(&op))) {
+ LDBG() << "Failed to initialize nested operation: " << op.getName();
return failure();
+ }
+ }
}
}
+ LDBG() << "Successfully completed recursive initialization for operation: "
+ << op->getName();
return success();
}
@@ -409,11 +426,20 @@ static MutableArrayRef<OpOperand> operandsToOpOperands(OperandRange &operands) {
LogicalResult
AbstractSparseBackwardDataFlowAnalysis::visitOperation(Operation *op) {
+ LDBG() << "Visiting operation: " << op->getName() << " with "
+ << op->getNumOperands() << " operands and " << op->getNumResults()
+ << " results";
+
// If we're in a dead block, bail out.
if (op->getBlock() != nullptr &&
- !getOrCreate<Executable>(getProgramPointBefore(op->getBlock()))->isLive())
+ !getOrCreate<Executable>(getProgramPointBefore(op->getBlock()))
+ ->isLive()) {
+ LDBG() << "Operation is in dead block, bailing out";
return success();
+ }
+ LDBG() << "Creating lattice elements for " << op->getNumOperands()
+ << " operands and " << op->getNumResults() << " results";
SmallVector<AbstractSparseLattice *> operandLattices =
getLatticeElements(op->getOperands());
SmallVector<const AbstractSparseLattice *> resultLattices =
@@ -422,11 +448,15 @@ AbstractSparseBackwardDataFlowAnalysis::visitOperation(Operation *op) {
// Block arguments of region branch operations flow back into the operands
// of the parent op
if (auto branch = dyn_cast<RegionBranchOpInterface>(op)) {
+ LDBG() << "Processing RegionBranchOpInterface operation";
visitRegionSuccessors(branch, operandLattices);
return success();
}
if (auto branch = dyn_cast<BranchOpInterface>(op)) {
+ LDBG() << "Processing BranchOpInterface operation with "
+ << op->getNumSuccessors() << " successors";
+
// Block arguments of successor blocks flow back into our operands.
// We remember all operands not forwarded to any block in a BitVector.
@@ -463,6 +493,7 @@ AbstractSparseBackwardDataFlowAnalysis::visitOperation(Operation *op) {
// For function calls, connect the arguments of the entry blocks to the
// operands of the call op that are forwarded to these arguments.
if (auto call = dyn_cast<CallOpInterface>(op)) {
+ LDBG() << "Processing CallOpInterface operation";
Operation *callableOp = call.resolveCallableInTable(&symbolTable);
if (auto callable = dyn_cast_or_null<CallableOpInterface>(callableOp)) {
// Not all operands of a call op forward to arguments. Such operands are
@@ -513,6 +544,7 @@ AbstractSparseBackwardDataFlowAnalysis::visitOperation(Operation *op) {
// of this op itself and the operands of the terminators of the regions of
// this op.
if (auto terminator = dyn_cast<RegionBranchTerminatorOpInterface>(op)) {
+ LDBG() << "Processing RegionBranchTerminatorOpInterface operation";
if (auto branch = dyn_cast<RegionBranchOpInterface>(op->getParentOp())) {
visitRegionSuccessorsFromTerminator(terminator, branch);
return success();
@@ -520,12 +552,16 @@ AbstractSparseBackwardDataFlowAnalysis::visitOperation(Operation *op) {
}
if (op->hasTrait<OpTrait::ReturnLike>()) {
+ LDBG() << "Processing ReturnLike operation";
// Going backwards, the operands of the return are derived from the
// results of all CallOps calling this CallableOp.
- if (auto callable = dyn_cast<CallableOpInterface>(op->getParentOp()))
+ if (auto callable = dyn_cast<CallableOpInterface>(op->getParentOp())) {
+ LDBG() << "Callable parent found, visiting callable operation";
return visitCallableOperation(op, callable, operandLattices);
+ }
}
+ LDBG() << "Using default visitOperationImpl for operation: " << op->getName();
return visitOperationImpl(op, operandLattices, resultLattices);
}
diff --git a/mlir/lib/Transforms/RemoveDeadValues.cpp b/mlir/lib/Transforms/RemoveDeadValues.cpp
index 4ccb83f3ad298..02dad69e49614 100644
--- a/mlir/lib/Transforms/RemoveDeadValues.cpp
+++ b/mlir/lib/Transforms/RemoveDeadValues.cpp
@@ -258,18 +258,17 @@ static SmallVector<OpOperand *> operandsToOpOperands(OperandRange operands) {
static void processSimpleOp(Operation *op, RunLivenessAnalysis &la,
DenseSet<Value> &nonLiveSet,
RDVFinalCleanupList &cl) {
- LDBG() << "Processing simple op: " << *op;
if (!isMemoryEffectFree(op) || hasLive(op->getResults(), nonLiveSet, la)) {
- LDBG()
- << "Simple op is not memory effect free or has live results, skipping: "
- << *op;
+ LDBG() << "Simple op is not memory effect free or has live results, "
+ "preserving it: "
+ << OpWithFlags(op, OpPrintingFlags().skipRegions());
return;
}
LDBG()
<< "Simple op has all dead results and is memory effect free, scheduling "
"for removal: "
- << *op;
+ << OpWithFlags(op, OpPrintingFlags().skipRegions());
cl.operations.push_back(op);
collectNonLiveValues(nonLiveSet, op->getResults(),
BitVector(op->getNumResults(), true));
@@ -728,19 +727,31 @@ static void processBranchOp(BranchOpInterface branchOp, RunLivenessAnalysis &la,
/// Removes dead values collected in RDVFinalCleanupList.
/// To be run once when all dead values have been collected.
static void cleanUpDeadVals(RDVFinalCleanupList &list) {
+ LDBG() << "Starting cleanup of dead values...";
+
// 1. Operations
+ LDBG() << "Cleaning up " << list.operations.size() << " operations";
for (auto &op : list.operations) {
+ LDBG() << "Erasing operation: "
+ << OpWithFlags(op, OpPrintingFlags().skipRegions());
op->dropAllUses();
op->erase();
}
// 2. Values
+ LDBG() << "Cleaning up " << list.values.size() << " values";
for (auto &v : list.values) {
+ LDBG() << "Dropping all uses of value: " << v;
v.dropAllUses();
}
// 3. Functions
+ LDBG() << "Cleaning up " << list.functions.size() << " functions";
for (auto &f : list.functions) {
+ LDBG() << "Cleaning up function: " << f.funcOp.getOperation()->getName();
+ LDBG() << " Erasing " << f.nonLiveArgs.count() << " non-live arguments";
+ LDBG() << " Erasing " << f.nonLiveRets.count()
+ << " non-live return values";
// Some functions may not allow erasing arguments or results. These calls
// return failure in such cases without modifying the function, so it's okay
// to proceed.
@@ -749,44 +760,67 @@ static void cleanUpDeadVals(RDVFinalCleanupList &list) {
}
// 4. Operands
+ LDBG() << "Cleaning up " << list.operands.size() << " operand lists";
for (OperationToCleanup &o : list.operands) {
- if (o.op->getNumOperands() > 0)
+ if (o.op->getNumOperands() > 0) {
+ LDBG() << "Erasing " << o.nonLive.count()
+ << " non-live operands from operation: "
+ << OpWithFlags(o.op, OpPrintingFlags().skipRegions());
o.op->eraseOperands(o.nonLive);
+ }
}
// 5. Results
+ LDBG() << "Cleaning up " << list.results.size() << " result lists";
for (auto &r : list.results) {
+ LDBG() << "Erasing " << r.nonLive.count()
+ << " non-live results from operation: "
+ << OpWithFlags(r.op, OpPrintingFlags().skipRegions());
dropUsesAndEraseResults(r.op, r.nonLive);
}
// 6. Blocks
+ LDBG() << "Cleaning up " << list.blocks.size() << " block argument lists";
for (auto &b : list.blocks) {
// blocks that are accessed via multiple codepaths processed once
if (b.b->getNumArguments() != b.nonLiveArgs.size())
continue;
+ LDBG() << "Erasing " << b.nonLiveArgs.count()
+ << " non-live arguments from block: " << b.b;
// it iterates backwards because erase invalidates all successor indexes
for (int i = b.nonLiveArgs.size() - 1; i >= 0; --i) {
if (!b.nonLiveArgs[i])
continue;
+ LDBG() << " Erasing block argument " << i << ": " << b.b->getArgument(i);
b.b->getArgument(i).dropAllUses();
b.b->eraseArgument(i);
}
}
// 7. Successor Operands
+ LDBG() << "Cleaning up " << list.successorOperands.size()
+ << " successor operand lists";
for (auto &op : list.successorOperands) {
SuccessorOperands successorOperands =
op.branch.getSuccessorOperands(op.successorIndex);
// blocks that are accessed via multiple codepaths processed once
if (successorOperands.size() != op.nonLiveOperands.size())
continue;
+ LDBG() << "Erasing " << op.nonLiveOperands.count()
+ << " non-live successor operands from successor "
+ << op.successorIndex << " of branch: "
+ << OpWithFlags(op.branch, OpPrintingFlags().skipRegions());
// it iterates backwards because erase invalidates all successor indexes
for (int i = successorOperands.size() - 1; i >= 0; --i) {
if (!op.nonLiveOperands[i])
continue;
+ LDBG() << " Erasing successor operand " << i << ": "
+ << successorOperands[i];
successorOperands.erase(i);
}
}
+
+ LDBG() << "Finished cleanup of dead values";
}
struct RemoveDeadValues : public impl::RemoveDeadValuesBase<RemoveDeadValues> {
diff --git a/mlir/test/Analysis/DataFlow/test-liveness-analysis.mlir b/mlir/test/Analysis/DataFlow/test-liveness-analysis.mlir
index a89a0f4084e99..3748be74eb0f3 100644
--- a/mlir/test/Analysis/DataFlow/test-liveness-analysis.mlir
+++ b/mlir/test/Analysis/DataFlow/test-liveness-analysis.mlir
@@ -283,3 +283,23 @@ func.func @test_10_negative() -> (i32) {
%0:2 = func.call @private_1() : () -> (i32, i32)
return %0#0 : i32
}
+
+// -----
+
+// Test that we correctly set a liveness value for operations in dead block.
+// These won't be visited by the dataflow framework so the analysis need to
+// explicitly manage them.
+// CHECK-LABEL: test_tag: dead_block_cmpi:
+// CHECK-NEXT: operand #0: not live
+// CHECK-NEXT: operand #1: not live
+// CHECK-NEXT: result #0: not live
+func.func @dead_block() {
+ %false = arith.constant false
+ %zero = arith.constant 0 : i64
+ cf.cond_br %false, ^bb1, ^bb4
+ ^bb1:
+ %3 = arith.cmpi eq, %zero, %zero {tag = "dead_block_cmpi"} : i64
+ cf.br ^bb1
+ ^bb4:
+ return
+}
diff --git a/mlir/test/Transforms/remove-dead-values.mlir b/mlir/test/Transforms/remove-dead-values.mlir
index 9ded6a30d9c95..0f8d757086e87 100644
--- a/mlir/test/Transforms/remove-dead-values.mlir
+++ b/mlir/test/Transforms/remove-dead-values.mlir
@@ -571,3 +571,24 @@ module @return_void_with_unused_argument {
}
}
+// -----
+
+// CHECK-LABEL: module @dynamically_unreachable
+module @dynamically_unreachable {
+ func.func @dynamically_unreachable() {
+ // This value is used by an operation in a dynamically unreachable block.
+ %zero = arith.constant 0 : i64
+
+ // Dataflow analysis knows from the constant condition that
+ // ^bb1 is unreachable
+ %false = arith.constant false
+ cf.cond_br %false, ^bb1, ^bb4
+ ^bb1:
+ // This unreachable operation should be removed.
+ // CHECK-NOT: arith.cmpi
+ %3 = arith.cmpi eq, %zero, %zero : i64
+ cf.br ^bb1
+ ^bb4:
+ return
+ }
+}
diff --git a/mlir/test/lib/Analysis/DataFlow/TestLivenessAnalysis.cpp b/mlir/test/lib/Analysis/DataFlow/TestLivenessAnalysis.cpp
index 43005e22584c2..8e2f03b644e49 100644
--- a/mlir/test/lib/Analysis/DataFlow/TestLivenessAnalysis.cpp
+++ b/mlir/test/lib/Analysis/DataFlow/TestLivenessAnalysis.cpp
@@ -33,7 +33,6 @@ struct TestLivenessAnalysisPass
void runOnOperation() override {
auto &livenessAnalysis = getAnalysis<RunLivenessAnalysis>();
-
Operation *op = getOperation();
raw_ostream &os = llvm::outs();
From d8208b0575c7fc03931b678b74acf9e7dedcea8e Mon Sep 17 00:00:00 2001
From: Jonas Devlieghere <jonas at devlieghere.com>
Date: Mon, 18 Aug 2025 15:53:12 -0500
Subject: [PATCH 091/112] Revert "[lldb] Relax the error message in
TestProcessCrashInfo.py" (#154197)
Reverts llvm/llvm-project#153653 because older versions of macOS do not
use the same prefix.
---
.../process_crash_info/TestProcessCrashInfo.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lldb/test/API/functionalities/process_crash_info/TestProcessCrashInfo.py b/lldb/test/API/functionalities/process_crash_info/TestProcessCrashInfo.py
index 4924937b4fe25..af05c2f3a0f62 100644
--- a/lldb/test/API/functionalities/process_crash_info/TestProcessCrashInfo.py
+++ b/lldb/test/API/functionalities/process_crash_info/TestProcessCrashInfo.py
@@ -38,7 +38,7 @@ def test_cli(self):
patterns=[
"Extended Crash Information",
"Crash-Info Annotations",
- "BUG IN CLIENT OF LIBMALLOC",
+ "pointer being freed was not allocated",
],
)
@@ -67,7 +67,7 @@ def test_api(self):
self.assertTrue(crash_info.IsValid())
- self.assertIn("BUG IN CLIENT OF LIBMALLOC", stream.GetData())
+ self.assertIn("pointer being freed was not allocated", stream.GetData())
# dyld leaves permanent crash_info records when testing on device.
@skipIfDarwinEmbedded
From 79be94c98412f899557cd06185167b980f563b64 Mon Sep 17 00:00:00 2001
From: Florian Hahn <flo at fhahn.com>
Date: Mon, 18 Aug 2025 21:56:54 +0100
Subject: [PATCH 092/112] [VPlan] Compute cost single-scalar calls in
computeCost. (NFC)
Compute the cost of non-intrinsic, single-scalar calls directly in
VPReplicateRecipe::computeCost.
This starts moving call cost computations to VPlan, handling the
simplest case first.
---
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 96ef6e7cf8243..40af5c9919783 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -2974,6 +2974,24 @@ InstructionCost VPReplicateRecipe::computeCost(ElementCount VF,
// is scalarized or not. Therefore, we handle GEPs with the memory
// instruction cost.
return 0;
+ case Instruction::Call: {
+ if (!isSingleScalar()) {
+ // TODO: Handle remaining call costs here as well.
+ if (VF.isScalable())
+ return InstructionCost::getInvalid();
+ break;
+ }
+
+ auto *CalledFn =
+ cast<Function>(getOperand(getNumOperands() - 1)->getLiveInIRValue());
+ if (CalledFn->isIntrinsic())
+ break;
+
+ SmallVector<Type *, 4> Tys;
+ for (VPValue *ArgOp : drop_end(operands()))
+ Tys.push_back(Ctx.Types.inferScalarType(ArgOp));
+ return Ctx.TTI.getCallInstrCost(CalledFn, ResultTy, Tys, Ctx.CostKind);
+ }
case Instruction::Add:
case Instruction::Sub:
case Instruction::FAdd:
From 906c9e9542f69cf01ef44408007ce77ae9ac70ae Mon Sep 17 00:00:00 2001
From: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date: Mon, 18 Aug 2025 13:58:54 -0700
Subject: [PATCH 093/112] [AMDGPU] Remove misplaced assert. (#154187)
The assert that a RegScavenger is required for long branching is now
placed below the code that uses s_add_pc64, where the scavenger is
actually used.
---
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | 1 -
1 file changed, 1 deletion(-)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 1f3943f6e1b27..10f29b3a4559b 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2905,7 +2905,6 @@ void SIInstrInfo::insertIndirectBranch(MachineBasicBlock &MBB,
MachineBasicBlock &RestoreBB,
const DebugLoc &DL, int64_t BrOffset,
RegScavenger *RS) const {
- assert(RS && "RegScavenger required for long branching");
assert(MBB.empty() &&
"new block should be inserted for expanding unconditional branch");
assert(MBB.pred_size() == 1);
From 8c605bd1f4087663acb78d6bd98d285fdb751e23 Mon Sep 17 00:00:00 2001
From: Mehdi Amini <joker.eph at gmail.com>
Date: Mon, 18 Aug 2025 23:02:53 +0200
Subject: [PATCH 094/112] [MLIR] Add logging to eraseUnreachableBlocks (NFC)
(#153968)
---
mlir/lib/Transforms/Utils/RegionUtils.cpp | 33 ++++++++++++++++++++---
1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/mlir/lib/Transforms/Utils/RegionUtils.cpp b/mlir/lib/Transforms/Utils/RegionUtils.cpp
index a1d975dfb1476..31ae1d1895b81 100644
--- a/mlir/lib/Transforms/Utils/RegionUtils.cpp
+++ b/mlir/lib/Transforms/Utils/RegionUtils.cpp
@@ -23,12 +23,15 @@
#include "llvm/ADT/DepthFirstIterator.h"
#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/ADT/STLExtras.h"
+#include "llvm/Support/DebugLog.h"
#include <deque>
#include <iterator>
using namespace mlir;
+#define DEBUG_TYPE "region-utils"
+
void mlir::replaceAllUsesInRegionWith(Value orig, Value replacement,
Region &region) {
for (auto &use : llvm::make_early_inc_range(orig.getUses())) {
@@ -182,19 +185,34 @@ SmallVector<Value> mlir::makeRegionIsolatedFromAbove(
// TODO: We could likely merge this with the DCE algorithm below.
LogicalResult mlir::eraseUnreachableBlocks(RewriterBase &rewriter,
MutableArrayRef<Region> regions) {
+ LDBG() << "Starting eraseUnreachableBlocks with " << regions.size()
+ << " regions";
+
// Set of blocks found to be reachable within a given region.
llvm::df_iterator_default_set<Block *, 16> reachable;
// If any blocks were found to be dead.
- bool erasedDeadBlocks = false;
+ int erasedDeadBlocks = 0;
SmallVector<Region *, 1> worklist;
worklist.reserve(regions.size());
for (Region &region : regions)
worklist.push_back(&region);
+
+ LDBG(2) << "Initial worklist size: " << worklist.size();
+
while (!worklist.empty()) {
Region *region = worklist.pop_back_val();
- if (region->empty())
+ if (region->empty()) {
+ LDBG(2) << "Skipping empty region";
continue;
+ }
+
+ LDBG(2) << "Processing region with " << region->getBlocks().size()
+ << " blocks";
+ if (region->getParentOp())
+ LDBG(2) << " -> for operation: "
+ << OpWithFlags(region->getParentOp(),
+ OpPrintingFlags().skipRegions());
// If this is a single block region, just collect the nested regions.
if (region->hasOneBlock()) {
@@ -209,13 +227,17 @@ LogicalResult mlir::eraseUnreachableBlocks(RewriterBase &rewriter,
for (Block *block : depth_first_ext(&region->front(), reachable))
(void)block /* Mark all reachable blocks */;
+ LDBG(2) << "Found " << reachable.size() << " reachable blocks out of "
+ << region->getBlocks().size() << " total blocks";
+
// Collect all of the dead blocks and push the live regions onto the
// worklist.
for (Block &block : llvm::make_early_inc_range(*region)) {
if (!reachable.count(&block)) {
+ LDBG() << "Erasing unreachable block: " << &block;
block.dropAllDefinedValueUses();
rewriter.eraseBlock(&block);
- erasedDeadBlocks = true;
+ ++erasedDeadBlocks;
continue;
}
@@ -226,7 +248,10 @@ LogicalResult mlir::eraseUnreachableBlocks(RewriterBase &rewriter,
}
}
- return success(erasedDeadBlocks);
+ LDBG() << "Finished eraseUnreachableBlocks, erased " << erasedDeadBlocks
+ << " dead blocks";
+
+ return success(erasedDeadBlocks > 0);
}
//===----------------------------------------------------------------------===//
From 89abccc9a6ed8e7263ce0f133961d6ff556754e7 Mon Sep 17 00:00:00 2001
From: Mehdi Amini <joker.eph at gmail.com>
Date: Mon, 18 Aug 2025 23:05:34 +0200
Subject: [PATCH 095/112] [MLIR] Update GreedyRewriter to use the LDBG() debug
log mechanism (NFC) (#153961)
Also slightly improve the LDBG() implementation.
---
llvm/include/llvm/Support/DebugLog.h | 39 +++++++++----------
llvm/unittests/Support/DebugLogTest.cpp | 14 ++++++-
.../Transforms/Utils/DialectConversion.cpp | 4 +-
.../Utils/GreedyPatternRewriteDriver.cpp | 29 +++++++-------
4 files changed, 48 insertions(+), 38 deletions(-)
diff --git a/llvm/include/llvm/Support/DebugLog.h b/llvm/include/llvm/Support/DebugLog.h
index a94e578c0aa1e..ead5dd2a4e8bd 100644
--- a/llvm/include/llvm/Support/DebugLog.h
+++ b/llvm/include/llvm/Support/DebugLog.h
@@ -71,11 +71,10 @@ namespace llvm {
for (bool _c = \
(::llvm::DebugFlag && ::llvm::isCurrentDebugType(TYPE, LEVEL)); \
_c; _c = false) \
- for (::llvm::impl::RAIINewLineStream NewLineStream{(STREAM)}; _c; \
- _c = false) \
- ::llvm::impl::raw_ldbg_ostream{ \
- ::llvm::impl::computePrefix(TYPE, FILE, LINE, LEVEL), NewLineStream} \
- .asLvalue()
+ for (::llvm::impl::raw_ldbg_ostream LdbgOS{ \
+ ::llvm::impl::computePrefix(TYPE, FILE, LINE, LEVEL), (STREAM)}; \
+ _c; _c = false) \
+ ::llvm::impl::RAIINewLineStream{LdbgOS}.asLvalue()
#define DEBUGLOG_WITH_STREAM_TYPE_AND_FILE(STREAM, LEVEL, TYPE, FILE) \
DEBUGLOG_WITH_STREAM_TYPE_FILE_AND_LINE(STREAM, LEVEL, TYPE, FILE, __LINE__)
@@ -89,22 +88,22 @@ namespace impl {
class LLVM_ABI raw_ldbg_ostream final : public raw_ostream {
std::string Prefix;
raw_ostream &Os;
- bool HasPendingNewline;
+ bool ShouldPrefixNextString;
/// Split the line on newlines and insert the prefix before each
/// newline. Forward everything to the underlying stream.
void write_impl(const char *Ptr, size_t Size) final {
auto Str = StringRef(Ptr, Size);
- // Handle the initial prefix.
- if (!Str.empty())
- writeWithPrefix(StringRef());
-
auto Eol = Str.find('\n');
+ // Handle `\n` occurring in the string, ensure to print the prefix at the
+ // beginning of each line.
while (Eol != StringRef::npos) {
+ // Take the line up to the newline (including the newline).
StringRef Line = Str.take_front(Eol + 1);
if (!Line.empty())
writeWithPrefix(Line);
- HasPendingNewline = true;
+ // We printed a newline, record here to print a prefix.
+ ShouldPrefixNextString = true;
Str = Str.drop_front(Eol + 1);
Eol = Str.find('\n');
}
@@ -113,24 +112,21 @@ class LLVM_ABI raw_ldbg_ostream final : public raw_ostream {
}
void emitPrefix() { Os.write(Prefix.c_str(), Prefix.size()); }
void writeWithPrefix(StringRef Str) {
- flushEol();
+ if (ShouldPrefixNextString) {
+ emitPrefix();
+ ShouldPrefixNextString = false;
+ }
Os.write(Str.data(), Str.size());
}
public:
explicit raw_ldbg_ostream(std::string Prefix, raw_ostream &Os,
- bool HasPendingNewline = true)
+ bool ShouldPrefixNextString = true)
: Prefix(std::move(Prefix)), Os(Os),
- HasPendingNewline(HasPendingNewline) {
+ ShouldPrefixNextString(ShouldPrefixNextString) {
SetUnbuffered();
}
- ~raw_ldbg_ostream() final { flushEol(); }
- void flushEol() {
- if (HasPendingNewline) {
- emitPrefix();
- HasPendingNewline = false;
- }
- }
+ ~raw_ldbg_ostream() final {}
/// Forward the current_pos method to the underlying stream.
uint64_t current_pos() const final { return Os.tell(); }
@@ -149,6 +145,7 @@ class RAIINewLineStream final : public raw_ostream {
~RAIINewLineStream() { Os << '\n'; }
void write_impl(const char *Ptr, size_t Size) final { Os.write(Ptr, Size); }
uint64_t current_pos() const final { return Os.tell(); }
+ RAIINewLineStream &asLvalue() { return *this; }
};
/// Remove the path prefix from the file name.
diff --git a/llvm/unittests/Support/DebugLogTest.cpp b/llvm/unittests/Support/DebugLogTest.cpp
index b28c59cf2bdd5..e087705b72586 100644
--- a/llvm/unittests/Support/DebugLogTest.cpp
+++ b/llvm/unittests/Support/DebugLogTest.cpp
@@ -115,8 +115,18 @@ TEST(DebugLogTest, StreamPrefix) {
ldbg_osA << "5";
EXPECT_EQ(os.str(), expected);
}
- // After destructors, there was a pending newline for stream B.
- EXPECT_EQ(os.str(), expected + "PrefixB ");
+ EXPECT_EQ(os.str(), expected);
+}
+
+TEST(DebugLogTest, DestructorPrefix) {
+ llvm::DebugFlag = true;
+ std::string str;
+ raw_string_ostream os(str);
+ {
+ llvm::impl::raw_ldbg_ostream ldbg_osB("PrefixB ", os);
+ }
+ // After destructors, nothing should have been printed.
+ EXPECT_EQ(os.str(), "");
}
#else
TEST(DebugLogTest, Basic) {
diff --git a/mlir/lib/Transforms/Utils/DialectConversion.cpp b/mlir/lib/Transforms/Utils/DialectConversion.cpp
index e48cfca486808..7494ca9ec3784 100644
--- a/mlir/lib/Transforms/Utils/DialectConversion.cpp
+++ b/mlir/lib/Transforms/Utils/DialectConversion.cpp
@@ -1138,8 +1138,8 @@ struct ConversionPatternRewriterImpl : public RewriterBase::Listener {
SmallPtrSet<Operation *, 1> pendingRootUpdates;
/// A raw output stream used to prefix the debug log.
- llvm::impl::raw_ldbg_ostream os{(Twine("[") + DEBUG_TYPE + "] ").str(),
- llvm::dbgs(), /*HasPendingNewline=*/false};
+ llvm::impl::raw_ldbg_ostream os{(Twine("[") + DEBUG_TYPE + ":1] ").str(),
+ llvm::dbgs()};
/// A logger used to emit diagnostics during the conversion process.
llvm::ScopedPrinter logger{os};
diff --git a/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp b/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp
index 0a2a0cc1d5c73..0324588ac6691 100644
--- a/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp
+++ b/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp
@@ -15,6 +15,8 @@
#include "mlir/Config/mlir-config.h"
#include "mlir/IR/Action.h"
#include "mlir/IR/Matchers.h"
+#include "mlir/IR/Operation.h"
+#include "mlir/IR/OperationSupport.h"
#include "mlir/IR/Verifier.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"
#include "mlir/Rewrite/PatternApplicator.h"
@@ -23,7 +25,7 @@
#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/ScopeExit.h"
-#include "llvm/Support/Debug.h"
+#include "llvm/Support/DebugLog.h"
#include "llvm/Support/ScopedPrinter.h"
#include "llvm/Support/raw_ostream.h"
@@ -178,9 +180,8 @@ static Operation *getDumpRootOp(Operation *op) {
return op;
}
static void logSuccessfulFolding(Operation *op) {
- llvm::dbgs() << "// *** IR Dump After Successful Folding ***\n";
- op->dump();
- llvm::dbgs() << "\n\n";
+ LDBG() << "// *** IR Dump After Successful Folding ***\n"
+ << OpWithFlags(op, OpPrintingFlags().elideLargeElementsAttrs());
}
#endif // NDEBUG
@@ -394,8 +395,12 @@ class GreedyPatternRewriteDriver : public RewriterBase::Listener {
function_ref<void(Diagnostic &)> reasonCallback) override;
#ifndef NDEBUG
+ /// A raw output stream used to prefix the debug log.
+
+ llvm::impl::raw_ldbg_ostream os{(Twine("[") + DEBUG_TYPE + ":1] ").str(),
+ llvm::dbgs()};
/// A logger used to emit information during the application process.
- llvm::ScopedPrinter logger{llvm::dbgs()};
+ llvm::ScopedPrinter logger{os};
#endif
/// The low-level pattern applicator.
mlir::applyPatternsGreedily(Region &region,
RegionPatternRewriteDriver driver(region.getContext(), patterns, config,
region);
LogicalResult converged = std::move(driver).simplify(changed);
- LLVM_DEBUG(if (failed(converged)) {
- llvm::dbgs() << "The pattern rewrite did not converge after scanning "
- << config.getMaxIterations() << " times\n";
- });
+ if (failed(converged))
+ LDBG() << "The pattern rewrite did not converge after scanning "
+ << config.getMaxIterations() << " times";
return converged;
}
@@ -1063,9 +1067,8 @@ LogicalResult mlir::applyOpPatternsGreedily(
LogicalResult converged = std::move(driver).simplify(ops, changed);
if (allErased)
*allErased = surviving.empty();
- LLVM_DEBUG(if (failed(converged)) {
- llvm::dbgs() << "The pattern rewrite did not converge after "
- << config.getMaxNumRewrites() << " rewrites";
- });
+ if (failed(converged))
+ LDBG() << "The pattern rewrite did not converge after "
+ << config.getMaxNumRewrites() << " rewrites";
return converged;
}
From 8a0b3cc0893377996cc3ede5c2b8398793d2ea43 Mon Sep 17 00:00:00 2001
From: Aiden Grossman <aidengrossman at google.com>
Date: Mon, 18 Aug 2025 14:10:23 -0700
Subject: [PATCH 096/112] [CI] Ignore upload artifact failures (#154196)
Some CI runs are seeing issues with failures running the artifact upload
step. They seem related to
https://github.com/actions/upload-artifact/issues/569. We should
continue the workflow and ignore errors in the upload artifact step if
it fails so that users do not see a red CI that is not due to their
changes.
Fixes #154155.
---
.github/workflows/premerge.yaml | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/.github/workflows/premerge.yaml b/.github/workflows/premerge.yaml
index 8ac57ec252ecf..9d925517a7211 100644
--- a/.github/workflows/premerge.yaml
+++ b/.github/workflows/premerge.yaml
@@ -69,6 +69,11 @@ jobs:
./.ci/monolithic-linux.sh "${projects_to_build}" "${project_check_targets}" "${runtimes_to_build}" "${runtimes_check_targets}" "${runtimes_check_targets_needs_reconfig}" "${enable_cir}"
- name: Upload Artifacts
+ # In some cases, Github will fail to upload the artifact. We want to
+ # continue anyways as a failed artifact upload is an infra failure, not
+ # a checks failure.
+ # https://github.com/actions/upload-artifact/issues/569
+ continue-on-error: true
if: '!cancelled()'
uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
with:
@@ -114,6 +119,11 @@ jobs:
# these environment variables.
bash -c "export SCCACHE_GCS_BUCKET=$CACHE_GCS_BUCKET; export SCCACHE_GCS_RW_MODE=READ_WRITE; export SCCACHE_IDLE_TIMEOUT=0; sccache --start-server; .ci/monolithic-windows.sh \"${{ steps.vars.outputs.windows-projects }}\" \"${{ steps.vars.outputs.windows-check-targets }}\""
- name: Upload Artifacts
+ # In some cases, Github will fail to upload the artifact. We want to
+ # continue anyways as a failed artifact upload is an infra failure, not
+ # a checks failure.
+ # https://github.com/actions/upload-artifact/issues/569
+ continue-on-error: true
if: '!cancelled()'
uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
with:
From c2eb895c200220c8a870b046c5b05957131b40e2 Mon Sep 17 00:00:00 2001
From: "Oleksandr T." <oleksandr.tarasiuk at outlook.com>
Date: Tue, 19 Aug 2025 00:10:53 +0300
Subject: [PATCH 097/112] [Clang] improve -Wstring-concatenation to warn on
every missing comma in initializer lists (#154018)
Fixes #153745
---
This PR addresses a limitation in `-Wstring-concatenation`, where only
the first missing comma in an initializer list was diagnosed.
---
clang/docs/ReleaseNotes.rst | 2 ++
clang/lib/Sema/SemaDecl.cpp | 43 ++++++++++++++++++---------------
clang/test/Sema/string-concat.c | 28 +++++++++++++++++++++
3 files changed, 54 insertions(+), 19 deletions(-)
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 7f76f87ce6be0..b86a9c437ffb1 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -168,6 +168,8 @@ Improvements to Clang's diagnostics
an override of a virtual method.
- Fixed fix-it hint for fold expressions. Clang now correctly places the suggested right
parenthesis when diagnosing malformed fold expressions. (#GH151787)
+- ``-Wstring-concatenation`` now diagnoses every missing comma in an initializer list,
+ rather than stopping after the first. (#GH153745)
- Fixed an issue where emitted format-signedness diagnostics were not associated with an appropriate
diagnostic id. Besides being incorrect from an API standpoint, this was user visible, e.g.:
diff --git a/clang/lib/Sema/SemaDecl.cpp b/clang/lib/Sema/SemaDecl.cpp
index 8ddbaf34a7f47..98485cf9e72be 100644
--- a/clang/lib/Sema/SemaDecl.cpp
+++ b/clang/lib/Sema/SemaDecl.cpp
@@ -14708,7 +14708,14 @@ void Sema::CheckCompleteVariableDeclaration(VarDecl *var) {
isa<InitListExpr>(var->getInit())) {
const auto *ILE = cast<InitListExpr>(var->getInit());
unsigned NumInits = ILE->getNumInits();
- if (NumInits > 2)
+ if (NumInits > 2) {
+ auto concatenatedPartsAt = [&](unsigned Index) -> unsigned {
+ if (const Expr *E = ILE->getInit(Index))
+ if (const auto *S = dyn_cast<StringLiteral>(E->IgnoreImpCasts()))
+ return S->getNumConcatenated();
+ return 0;
+ };
+
for (unsigned I = 0; I < NumInits; ++I) {
const auto *Init = ILE->getInit(I);
if (!Init)
@@ -14721,24 +14728,23 @@ void Sema::CheckCompleteVariableDeclaration(VarDecl *var) {
// Diagnose missing comma in string array initialization.
// Do not warn when all the elements in the initializer are concatenated
// together. Do not warn for macros too.
- if (NumConcat == 2 && !SL->getBeginLoc().isMacroID()) {
- bool OnlyOneMissingComma = true;
- for (unsigned J = I + 1; J < NumInits; ++J) {
- const auto *Init = ILE->getInit(J);
- if (!Init)
- break;
- const auto *SLJ = dyn_cast<StringLiteral>(Init->IgnoreImpCasts());
- if (!SLJ || SLJ->getNumConcatenated() > 1) {
- OnlyOneMissingComma = false;
- break;
- }
- }
+ if (NumConcat == 2) {
+ if (SL->getBeginLoc().isMacroID())
+ continue;
+
+ unsigned L = I > 0 ? concatenatedPartsAt(I - 1) : 0;
+ unsigned R = I + 1 < NumInits ? concatenatedPartsAt(I + 1) : 0;
+
+ // Skip neighbors with multi-part concatenations.
+ if (R > 1)
+ continue;
- if (OnlyOneMissingComma) {
+ // Diagnose when at least one neighbor is a single literal.
+ if (R == 1 || L == 1) {
SmallVector<FixItHint, 1> Hints;
- for (unsigned i = 0; i < NumConcat - 1; ++i)
- Hints.push_back(FixItHint::CreateInsertion(
- PP.getLocForEndOfToken(SL->getStrTokenLoc(i)), ","));
+ // Insert a comma between the two tokens of this element.
+ Hints.push_back(FixItHint::CreateInsertion(
+ PP.getLocForEndOfToken(SL->getStrTokenLoc(0)), ", "));
Diag(SL->getStrTokenLoc(1),
diag::warn_concatenated_literal_array_init)
@@ -14746,10 +14752,9 @@ void Sema::CheckCompleteVariableDeclaration(VarDecl *var) {
Diag(SL->getBeginLoc(),
diag::note_concatenated_string_literal_silence);
}
- // In any case, stop now.
- break;
}
}
+ }
}
diff --git a/clang/test/Sema/string-concat.c b/clang/test/Sema/string-concat.c
index 63abf100c020f..4b52a74116b49 100644
--- a/clang/test/Sema/string-concat.c
+++ b/clang/test/Sema/string-concat.c
@@ -168,3 +168,31 @@ const char *extra_parens_to_suppress_warning[] = {
"promise"),
"shared_future"
};
+
+const char *multiple_missing_commas1[] = {
+ "1",
+ "2" // expected-note {{place parentheses around the string literal to silence warning}}
+ "3", // expected-warning {{suspicious concatenation of string literals in an array initialization; did you mean to separate the elements with a comma?}}
+ "4",
+ "5",
+ "6" // expected-note {{place parentheses around the string literal to silence warning}}
+ "7", // expected-warning {{suspicious concatenation of string literals in an array initialization; did you mean to separate the elements with a comma?}}
+ "8",
+ "9",
+ "10",
+ "11",
+};
+
+const char *multiple_missing_commas2[] = {
+ "1",
+ "2"
+ "3"
+ "4"
+ "5",
+ "6" // expected-note {{place parentheses around the string literal to silence warning}}
+ "7", // expected-warning {{suspicious concatenation of string literals in an array initialization; did you mean to separate the elements with a comma?}}
+ "8",
+ "9",
+ "10",
+ "11",
+};
From 13dd65096b5311c01aa67ed34f85d4b03b57426b Mon Sep 17 00:00:00 2001
From: Sergei Barannikov <barannikov88 at gmail.com>
Date: Tue, 19 Aug 2025 00:16:56 +0300
Subject: [PATCH 098/112] [TableGen][DecoderEmitter] Rename some variables for
clarity (NFC)
---
llvm/utils/TableGen/DecoderEmitter.cpp | 83 ++++++++++++++------------
1 file changed, 44 insertions(+), 39 deletions(-)
diff --git a/llvm/utils/TableGen/DecoderEmitter.cpp b/llvm/utils/TableGen/DecoderEmitter.cpp
index 2b44577253982..e2b6248a77ef1 100644
--- a/llvm/utils/TableGen/DecoderEmitter.cpp
+++ b/llvm/utils/TableGen/DecoderEmitter.cpp
@@ -486,10 +486,10 @@ class FilterChooser {
protected:
friend class Filter;
- // Vector of codegen instructions to choose our filter.
- ArrayRef<EncodingAndInst> AllInstructions;
+ // Vector of encodings to choose our filter.
+ ArrayRef<EncodingAndInst> Encodings;
- // Vector of uid's for this filter chooser to work on.
+ // Vector of encoding IDs for this filter chooser to work on.
ArrayRef<unsigned> EncodingIDs;
// Lookup table for the operand decoding of instructions.
@@ -518,20 +518,22 @@ class FilterChooser {
};
public:
- FilterChooser(ArrayRef<EncodingAndInst> Insts, ArrayRef<unsigned> EncodingIDs,
+ FilterChooser(ArrayRef<EncodingAndInst> Encodings,
+ ArrayRef<unsigned> EncodingIDs,
const std::map<unsigned, std::vector<OperandInfo>> &Ops,
unsigned BW, const DecoderEmitter *E)
- : AllInstructions(Insts), EncodingIDs(EncodingIDs), Operands(Ops),
+ : Encodings(Encodings), EncodingIDs(EncodingIDs), Operands(Ops),
FilterBitValues(BW, BitValue::BIT_UNFILTERED), Parent(nullptr),
BitWidth(BW), Emitter(E) {
doFilter();
}
- FilterChooser(ArrayRef<EncodingAndInst> Insts, ArrayRef<unsigned> EncodingIDs,
+ FilterChooser(ArrayRef<EncodingAndInst> Encodings,
+ ArrayRef<unsigned> EncodingIDs,
const std::map<unsigned, std::vector<OperandInfo>> &Ops,
const std::vector<BitValue> &ParentFilterBitValues,
const FilterChooser &parent)
- : AllInstructions(Insts), EncodingIDs(EncodingIDs), Operands(Ops),
+ : Encodings(Encodings), EncodingIDs(EncodingIDs), Operands(Ops),
FilterBitValues(ParentFilterBitValues), Parent(&parent),
BitWidth(parent.BitWidth), Emitter(parent.Emitter) {
doFilter();
@@ -544,8 +546,8 @@ class FilterChooser {
protected:
// Populates the insn given the uid.
- insn_t insnWithID(unsigned Opcode) const {
- const Record *EncodingDef = AllInstructions[Opcode].EncodingDef;
+ insn_t insnWithID(unsigned EncodingID) const {
+ const Record *EncodingDef = Encodings[EncodingID].EncodingDef;
const BitsInit &Bits = getBitsField(*EncodingDef, "Inst");
insn_t Insn(std::max(BitWidth, Bits.getNumBits()), BitValue::BIT_UNSET);
// We may have a SoftFail bitmask, which specifies a mask where an encoding
@@ -584,15 +586,17 @@ class FilterChooser {
// Emits code to check the Predicates member of an instruction are true.
// Returns true if predicate matches were emitted, false otherwise.
- bool emitPredicateMatch(raw_ostream &OS, unsigned Opc) const;
+ bool emitPredicateMatch(raw_ostream &OS, unsigned EncodingID) const;
bool emitPredicateMatchAux(const Init &Val, bool ParenIfBinOp,
raw_ostream &OS) const;
- bool doesOpcodeNeedPredicate(unsigned Opc) const;
+ bool doesOpcodeNeedPredicate(unsigned EncodingID) const;
unsigned getPredicateIndex(DecoderTableInfo &TableInfo, StringRef P) const;
- void emitPredicateTableEntry(DecoderTableInfo &TableInfo, unsigned Opc) const;
+ void emitPredicateTableEntry(DecoderTableInfo &TableInfo,
+ unsigned EncodingID) const;
- void emitSoftFailTableEntry(DecoderTableInfo &TableInfo, unsigned Opc) const;
+ void emitSoftFailTableEntry(DecoderTableInfo &TableInfo,
+ unsigned EncodingID) const;
// Emits table entries to decode the singleton.
void emitSingletonTableEntry(DecoderTableInfo &TableInfo,
@@ -605,9 +609,9 @@ class FilterChooser {
bool emitBinaryParser(raw_ostream &OS, indent Indent,
const OperandInfo &OpInfo) const;
- bool emitDecoder(raw_ostream &OS, indent Indent, unsigned Opc) const;
+ bool emitDecoder(raw_ostream &OS, indent Indent, unsigned EncodingID) const;
std::pair<unsigned, bool> getDecoderIndex(DecoderSet &Decoders,
- unsigned Opc) const;
+ unsigned EncodingID) const;
// Assign a single filter and run with it.
void runSingleFilter(unsigned startBit, unsigned numBit);
@@ -694,9 +698,8 @@ void Filter::recurse() {
// Delegates to an inferior filter chooser for further processing on this
// group of instructions whose segment values are variable.
- VariableFC =
- std::make_unique<FilterChooser>(Owner.AllInstructions, VariableIDs,
- Owner.Operands, BitValueArray, Owner);
+ VariableFC = std::make_unique<FilterChooser>(
+ Owner.Encodings, VariableIDs, Owner.Operands, BitValueArray, Owner);
}
// No need to recurse for a singleton filtered instruction.
@@ -718,7 +721,7 @@ void Filter::recurse() {
// category of instructions.
FilterChooserMap.try_emplace(
FilterVal,
- std::make_unique<FilterChooser>(Owner.AllInstructions, EncodingIDs,
+ std::make_unique<FilterChooser>(Owner.Encodings, EncodingIDs,
Owner.Operands, BitValueArray, Owner));
}
}
@@ -1197,10 +1200,10 @@ bool FilterChooser::emitBinaryParser(raw_ostream &OS, indent Indent,
}
bool FilterChooser::emitDecoder(raw_ostream &OS, indent Indent,
- unsigned Opc) const {
+ unsigned EncodingID) const {
bool HasCompleteDecoder = true;
- for (const auto &Op : Operands.find(Opc)->second) {
+ for (const OperandInfo &Op : Operands.find(EncodingID)->second) {
// If a custom instruction decoder was specified, use that.
if (Op.numFields() == 0 && !Op.Decoder.empty()) {
HasCompleteDecoder = Op.HasCompleteDecoder;
@@ -1216,15 +1219,16 @@ bool FilterChooser::emitDecoder(raw_ostream &OS, indent Indent,
return HasCompleteDecoder;
}
-std::pair<unsigned, bool> FilterChooser::getDecoderIndex(DecoderSet &Decoders,
- unsigned Opc) const {
+std::pair<unsigned, bool>
+FilterChooser::getDecoderIndex(DecoderSet &Decoders,
+ unsigned EncodingID) const {
// Build up the predicate string.
SmallString<256> Decoder;
// FIXME: emitDecoder() function can take a buffer directly rather than
// a stream.
raw_svector_ostream S(Decoder);
indent Indent(UseFnTableInDecodeToMCInst ? 2 : 4);
- bool HasCompleteDecoder = emitDecoder(S, Indent, Opc);
+ bool HasCompleteDecoder = emitDecoder(S, Indent, EncodingID);
// Using the full decoder string as the key value here is a bit
// heavyweight, but is effective. If the string comparisons become a
@@ -1273,9 +1277,10 @@ bool FilterChooser::emitPredicateMatchAux(const Init &Val, bool ParenIfBinOp,
return true;
}
-bool FilterChooser::emitPredicateMatch(raw_ostream &OS, unsigned Opc) const {
+bool FilterChooser::emitPredicateMatch(raw_ostream &OS,
+ unsigned EncodingID) const {
const ListInit *Predicates =
- AllInstructions[Opc].EncodingDef->getValueAsListInit("Predicates");
+ Encodings[EncodingID].EncodingDef->getValueAsListInit("Predicates");
bool IsFirstEmission = true;
for (unsigned i = 0; i < Predicates->size(); ++i) {
const Record *Pred = Predicates->getElementAsRecord(i);
@@ -1295,9 +1300,9 @@ bool FilterChooser::emitPredicateMatch(raw_ostream &OS, unsigned Opc) const {
return !Predicates->empty();
}
-bool FilterChooser::doesOpcodeNeedPredicate(unsigned Opc) const {
+bool FilterChooser::doesOpcodeNeedPredicate(unsigned EncodingID) const {
const ListInit *Predicates =
- AllInstructions[Opc].EncodingDef->getValueAsListInit("Predicates");
+ Encodings[EncodingID].EncodingDef->getValueAsListInit("Predicates");
for (unsigned i = 0; i < Predicates->size(); ++i) {
const Record *Pred = Predicates->getElementAsRecord(i);
if (!Pred->getValue("AssemblerMatcherPredicate"))
@@ -1325,8 +1330,8 @@ unsigned FilterChooser::getPredicateIndex(DecoderTableInfo &TableInfo,
}
void FilterChooser::emitPredicateTableEntry(DecoderTableInfo &TableInfo,
- unsigned Opc) const {
- if (!doesOpcodeNeedPredicate(Opc))
+ unsigned EncodingID) const {
+ if (!doesOpcodeNeedPredicate(EncodingID))
return;
// Build up the predicate string.
@@ -1334,7 +1339,7 @@ void FilterChooser::emitPredicateTableEntry(DecoderTableInfo &TableInfo,
// FIXME: emitPredicateMatch() functions can take a buffer directly rather
// than a stream.
raw_svector_ostream PS(Predicate);
- emitPredicateMatch(PS, Opc);
+ emitPredicateMatch(PS, EncodingID);
// Figure out the index into the predicate table for the predicate just
// computed.
@@ -1353,8 +1358,8 @@ void FilterChooser::emitPredicateTableEntry(DecoderTableInfo &TableInfo,
}
void FilterChooser::emitSoftFailTableEntry(DecoderTableInfo &TableInfo,
- unsigned Opc) const {
- const Record *EncodingDef = AllInstructions[Opc].EncodingDef;
+ unsigned EncodingID) const {
+ const Record *EncodingDef = Encodings[EncodingID].EncodingDef;
const RecordVal *RV = EncodingDef->getValue("SoftFail");
const BitsInit *SFBits = RV ? dyn_cast<BitsInit>(RV->getValue()) : nullptr;
@@ -1380,7 +1385,7 @@ void FilterChooser::emitSoftFailTableEntry(DecoderTableInfo &TableInfo,
} else {
// The bit is not set; this must be an error!
errs() << "SoftFail Conflict: bit SoftFail{" << i << "} in "
- << AllInstructions[Opc] << " is set but Inst{" << i
+ << Encodings[EncodingID] << " is set but Inst{" << i
<< "} is unset!\n"
<< " - You can only mark a bit as SoftFail if it is fully defined"
<< " (1/0 - not '?') in Inst\n";
@@ -1453,7 +1458,7 @@ void FilterChooser::emitSingletonTableEntry(DecoderTableInfo &TableInfo,
: MCD::OPC_TryDecode);
TableInfo.Table.push_back(DecoderOp);
NumEncodingsSupported++;
- const Record *InstDef = AllInstructions[EncodingID].Inst->TheDef;
+ const Record *InstDef = Encodings[EncodingID].Inst->TheDef;
TableInfo.Table.insertULEB128(Emitter->getTarget().getInstrIntValue(InstDef));
TableInfo.Table.insertULEB128(DIdx);
@@ -1748,7 +1753,7 @@ void FilterChooser::doFilter() {
// Dump encodings.
for (unsigned EncodingID : EncodingIDs) {
- const EncodingAndInst &Enc = AllInstructions[EncodingID];
+ const EncodingAndInst &Enc = Encodings[EncodingID];
errs() << Indent;
dumpBits(errs(), getBitsField(*Enc.EncodingDef, "Inst"), BitWidth);
errs() << " " << Enc << '\n';
@@ -1919,7 +1924,7 @@ static void addOneOperandFields(const Record &EncodingDef, const BitsInit &Bits,
static unsigned
populateInstruction(const CodeGenTarget &Target, const Record &EncodingDef,
- const CodeGenInstruction &CGI, unsigned Opc,
+ const CodeGenInstruction &CGI, unsigned EncodingID,
std::map<unsigned, std::vector<OperandInfo>> &Operands,
bool IsVarLenInst) {
const Record &Def = *CGI.TheDef;
@@ -1941,7 +1946,7 @@ populateInstruction(const CodeGenTarget &Target, const Record &EncodingDef,
EncodingDef.getValueAsBit("hasCompleteDecoder");
InsnOperands.push_back(
OperandInfo(InstDecoder.str(), HasCompleteInstDecoder));
- Operands[Opc] = std::move(InsnOperands);
+ Operands[EncodingID] = std::move(InsnOperands);
return Bits.getNumBits();
}
@@ -2063,7 +2068,7 @@ populateInstruction(const CodeGenTarget &Target, const Record &EncodingDef,
InsnOperands.push_back(std::move(OpInfo));
}
}
- Operands[Opc] = std::move(InsnOperands);
+ Operands[EncodingID] = std::move(InsnOperands);
#if 0
LLVM_DEBUG({
From 668e6492b833fc3f329d3e772ab7c52a4d3fec93 Mon Sep 17 00:00:00 2001
From: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date: Mon, 18 Aug 2025 14:31:41 -0700
Subject: [PATCH 099/112] [AMDGPU] Support merging of flat GVS ops (#154200)
---
.../Target/AMDGPU/SILoadStoreOptimizer.cpp | 62 ++++
.../AMDGPU/merge-flat-saddr-load-store.mir | 338 ++++++++++++++++++
2 files changed, 400 insertions(+)
create mode 100644 llvm/test/CodeGen/AMDGPU/merge-flat-saddr-load-store.mir
diff --git a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
index b49c5a997af78..e204d6ba356b8 100644
--- a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
@@ -87,6 +87,8 @@ enum InstClassEnum {
GLOBAL_STORE_SADDR,
FLAT_LOAD,
FLAT_STORE,
+ FLAT_LOAD_SADDR,
+ FLAT_STORE_SADDR,
GLOBAL_LOAD, // GLOBAL_LOAD/GLOBAL_STORE are never used as the InstClass of
GLOBAL_STORE // any CombineInfo, they are only ever returned by
// getCommonInstClass.
@@ -354,6 +356,8 @@ static unsigned getOpcodeWidth(const MachineInstr &MI, const SIInstrInfo &TII) {
case AMDGPU::GLOBAL_STORE_DWORD_SADDR:
case AMDGPU::FLAT_LOAD_DWORD:
case AMDGPU::FLAT_STORE_DWORD:
+ case AMDGPU::FLAT_LOAD_DWORD_SADDR:
+ case AMDGPU::FLAT_STORE_DWORD_SADDR:
return 1;
case AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM:
case AMDGPU::S_BUFFER_LOAD_DWORDX2_SGPR_IMM:
@@ -367,6 +371,8 @@ static unsigned getOpcodeWidth(const MachineInstr &MI, const SIInstrInfo &TII) {
case AMDGPU::GLOBAL_STORE_DWORDX2_SADDR:
case AMDGPU::FLAT_LOAD_DWORDX2:
case AMDGPU::FLAT_STORE_DWORDX2:
+ case AMDGPU::FLAT_LOAD_DWORDX2_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX2_SADDR:
return 2;
case AMDGPU::S_BUFFER_LOAD_DWORDX3_IMM:
case AMDGPU::S_BUFFER_LOAD_DWORDX3_SGPR_IMM:
@@ -380,6 +386,8 @@ static unsigned getOpcodeWidth(const MachineInstr &MI, const SIInstrInfo &TII) {
case AMDGPU::GLOBAL_STORE_DWORDX3_SADDR:
case AMDGPU::FLAT_LOAD_DWORDX3:
case AMDGPU::FLAT_STORE_DWORDX3:
+ case AMDGPU::FLAT_LOAD_DWORDX3_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX3_SADDR:
return 3;
case AMDGPU::S_BUFFER_LOAD_DWORDX4_IMM:
case AMDGPU::S_BUFFER_LOAD_DWORDX4_SGPR_IMM:
@@ -393,6 +401,8 @@ static unsigned getOpcodeWidth(const MachineInstr &MI, const SIInstrInfo &TII) {
case AMDGPU::GLOBAL_STORE_DWORDX4_SADDR:
case AMDGPU::FLAT_LOAD_DWORDX4:
case AMDGPU::FLAT_STORE_DWORDX4:
+ case AMDGPU::FLAT_LOAD_DWORDX4_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX4_SADDR:
return 4;
case AMDGPU::S_BUFFER_LOAD_DWORDX8_IMM:
case AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM:
@@ -575,6 +585,16 @@ static InstClassEnum getInstClass(unsigned Opc, const SIInstrInfo &TII) {
case AMDGPU::GLOBAL_STORE_DWORDX3_SADDR:
case AMDGPU::GLOBAL_STORE_DWORDX4_SADDR:
return GLOBAL_STORE_SADDR;
+ case AMDGPU::FLAT_LOAD_DWORD_SADDR:
+ case AMDGPU::FLAT_LOAD_DWORDX2_SADDR:
+ case AMDGPU::FLAT_LOAD_DWORDX3_SADDR:
+ case AMDGPU::FLAT_LOAD_DWORDX4_SADDR:
+ return FLAT_LOAD_SADDR;
+ case AMDGPU::FLAT_STORE_DWORD_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX2_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX3_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX4_SADDR:
+ return FLAT_STORE_SADDR;
}
}
@@ -661,6 +681,16 @@ static unsigned getInstSubclass(unsigned Opc, const SIInstrInfo &TII) {
case AMDGPU::GLOBAL_STORE_DWORDX3_SADDR:
case AMDGPU::GLOBAL_STORE_DWORDX4_SADDR:
return AMDGPU::GLOBAL_STORE_DWORD_SADDR;
+ case AMDGPU::FLAT_LOAD_DWORD_SADDR:
+ case AMDGPU::FLAT_LOAD_DWORDX2_SADDR:
+ case AMDGPU::FLAT_LOAD_DWORDX3_SADDR:
+ case AMDGPU::FLAT_LOAD_DWORDX4_SADDR:
+ return AMDGPU::FLAT_LOAD_DWORD_SADDR;
+ case AMDGPU::FLAT_STORE_DWORD_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX2_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX3_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX4_SADDR:
+ return AMDGPU::FLAT_STORE_DWORD_SADDR;
}
}
@@ -776,6 +806,14 @@ static AddressRegs getRegs(unsigned Opc, const SIInstrInfo &TII) {
case AMDGPU::GLOBAL_STORE_DWORDX2_SADDR:
case AMDGPU::GLOBAL_STORE_DWORDX3_SADDR:
case AMDGPU::GLOBAL_STORE_DWORDX4_SADDR:
+ case AMDGPU::FLAT_LOAD_DWORD_SADDR:
+ case AMDGPU::FLAT_LOAD_DWORDX2_SADDR:
+ case AMDGPU::FLAT_LOAD_DWORDX3_SADDR:
+ case AMDGPU::FLAT_LOAD_DWORDX4_SADDR:
+ case AMDGPU::FLAT_STORE_DWORD_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX2_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX3_SADDR:
+ case AMDGPU::FLAT_STORE_DWORDX4_SADDR:
Result.SAddr = true;
[[fallthrough]];
case AMDGPU::GLOBAL_LOAD_DWORD:
@@ -1875,6 +1913,28 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI,
case 4:
return AMDGPU::FLAT_STORE_DWORDX4;
}
+ case FLAT_LOAD_SADDR:
+ switch (Width) {
+ default:
+ return 0;
+ case 2:
+ return AMDGPU::FLAT_LOAD_DWORDX2_SADDR;
+ case 3:
+ return AMDGPU::FLAT_LOAD_DWORDX3_SADDR;
+ case 4:
+ return AMDGPU::FLAT_LOAD_DWORDX4_SADDR;
+ }
+ case FLAT_STORE_SADDR:
+ switch (Width) {
+ default:
+ return 0;
+ case 2:
+ return AMDGPU::FLAT_STORE_DWORDX2_SADDR;
+ case 3:
+ return AMDGPU::FLAT_STORE_DWORDX3_SADDR;
+ case 4:
+ return AMDGPU::FLAT_STORE_DWORDX4_SADDR;
+ }
case MIMG:
assert(((unsigned)llvm::popcount(CI.DMask | Paired.DMask) == Width) &&
"No overlaps");
@@ -2508,12 +2568,14 @@ SILoadStoreOptimizer::optimizeInstsWithSameBaseAddr(
OptimizeListAgain |= CI.Width + Paired.Width < 4;
break;
case FLAT_LOAD:
+ case FLAT_LOAD_SADDR:
case GLOBAL_LOAD:
case GLOBAL_LOAD_SADDR:
NewMI = mergeFlatLoadPair(CI, Paired, Where->I);
OptimizeListAgain |= CI.Width + Paired.Width < 4;
break;
case FLAT_STORE:
+ case FLAT_STORE_SADDR:
case GLOBAL_STORE:
case GLOBAL_STORE_SADDR:
NewMI = mergeFlatStorePair(CI, Paired, Where->I);
diff --git a/llvm/test/CodeGen/AMDGPU/merge-flat-saddr-load-store.mir b/llvm/test/CodeGen/AMDGPU/merge-flat-saddr-load-store.mir
new file mode 100644
index 0000000000000..1c133c6114ec2
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/merge-flat-saddr-load-store.mir
@@ -0,0 +1,338 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 -run-pass=si-load-store-opt -o - %s | FileCheck -check-prefix=GCN %s
+
+---
+name: merge_flat_load_dword_saddr_2
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: merge_flat_load_dword_saddr_2
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[FLAT_LOAD_DWORDX2_SADDR:%[0-9]+]]:vreg_64_align2 = FLAT_LOAD_DWORDX2_SADDR [[DEF]], [[DEF1]], 0, 0, implicit $exec, implicit $flat_scr :: (load (s64) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ ; GCN-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY [[FLAT_LOAD_DWORDX2_SADDR]].sub0
+ ; GCN-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY killed [[FLAT_LOAD_DWORDX2_SADDR]].sub1
+ ; GCN-NEXT: S_NOP 0, implicit [[COPY]], implicit [[COPY1]]
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %3:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ S_NOP 0, implicit %2, implicit %3
+...
+
+---
+name: merge_flat_load_dword_saddr_3
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: merge_flat_load_dword_saddr_3
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[FLAT_LOAD_DWORDX3_SADDR:%[0-9]+]]:vreg_96_align2 = FLAT_LOAD_DWORDX3_SADDR [[DEF]], [[DEF1]], 0, 1, implicit $exec, implicit $flat_scr :: (load (s96) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ ; GCN-NEXT: [[COPY:%[0-9]+]]:vreg_64_align2 = COPY [[FLAT_LOAD_DWORDX3_SADDR]].sub0_sub1
+ ; GCN-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY killed [[FLAT_LOAD_DWORDX3_SADDR]].sub2
+ ; GCN-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[COPY]].sub0
+ ; GCN-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY killed [[COPY]].sub1
+ ; GCN-NEXT: S_NOP 0, implicit [[COPY2]], implicit [[COPY3]], implicit [[COPY1]]
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 0, 1, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %3:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 4, 1, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %4:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 8, 1, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ S_NOP 0, implicit %2, implicit %3, implicit %4
+...
+
+---
+name: merge_flat_load_dword_saddr_4
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: merge_flat_load_dword_saddr_4
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[FLAT_LOAD_DWORDX4_SADDR:%[0-9]+]]:vreg_128_align2 = FLAT_LOAD_DWORDX4_SADDR [[DEF]], [[DEF1]], 0, 2, implicit $exec, implicit $flat_scr :: (load (s128) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ ; GCN-NEXT: [[COPY:%[0-9]+]]:vreg_96_align2 = COPY [[FLAT_LOAD_DWORDX4_SADDR]].sub0_sub1_sub2
+ ; GCN-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY killed [[FLAT_LOAD_DWORDX4_SADDR]].sub3
+ ; GCN-NEXT: [[COPY2:%[0-9]+]]:vreg_64_align2 = COPY [[COPY]].sub0_sub1
+ ; GCN-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY killed [[COPY]].sub2
+ ; GCN-NEXT: [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[COPY2]].sub0
+ ; GCN-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY killed [[COPY2]].sub1
+ ; GCN-NEXT: S_NOP 0, implicit [[COPY4]], implicit [[COPY5]], implicit [[COPY3]], implicit [[COPY1]]
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 0, 2, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %3:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 4, 2, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %4:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 8, 2, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %5:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 12, 2, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ S_NOP 0, implicit %2, implicit %3, implicit %4, implicit %5
+...
+
+---
+name: merge_flat_load_dword_saddr_6
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: merge_flat_load_dword_saddr_6
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[FLAT_LOAD_DWORDX4_SADDR:%[0-9]+]]:vreg_128_align2 = FLAT_LOAD_DWORDX4_SADDR [[DEF]], [[DEF1]], 4, 3, implicit $exec, implicit $flat_scr :: (load (s128) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ ; GCN-NEXT: [[COPY:%[0-9]+]]:vreg_96_align2 = COPY [[FLAT_LOAD_DWORDX4_SADDR]].sub0_sub1_sub2
+ ; GCN-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY killed [[FLAT_LOAD_DWORDX4_SADDR]].sub3
+ ; GCN-NEXT: [[COPY2:%[0-9]+]]:vreg_64_align2 = COPY [[COPY]].sub0_sub1
+ ; GCN-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY killed [[COPY]].sub2
+ ; GCN-NEXT: [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[COPY2]].sub0
+ ; GCN-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY killed [[COPY2]].sub1
+ ; GCN-NEXT: [[FLAT_LOAD_DWORDX2_SADDR:%[0-9]+]]:vreg_64_align2 = FLAT_LOAD_DWORDX2_SADDR [[DEF]], [[DEF1]], 20, 3, implicit $exec, implicit $flat_scr :: (load (s64) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ ; GCN-NEXT: [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[FLAT_LOAD_DWORDX2_SADDR]].sub0
+ ; GCN-NEXT: [[COPY7:%[0-9]+]]:vgpr_32 = COPY killed [[FLAT_LOAD_DWORDX2_SADDR]].sub1
+ ; GCN-NEXT: S_NOP 0, implicit [[COPY4]], implicit [[COPY5]], implicit [[COPY3]], implicit [[COPY1]], implicit [[COPY6]], implicit [[COPY7]]
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 4, 3, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %3:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 8, 3, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %4:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 12, 3, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %5:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 16, 3, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %6:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 20, 3, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %7:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1, 24, 3, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ S_NOP 0, implicit %2, implicit %3, implicit %4, implicit %5, implicit %6, implicit %7
+...
+
+---
+name: merge_flat_load_dwordx2_saddr
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: merge_flat_load_dwordx2_saddr
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[FLAT_LOAD_DWORDX4_SADDR:%[0-9]+]]:vreg_128_align2 = FLAT_LOAD_DWORDX4_SADDR [[DEF]], [[DEF1]], 0, 0, implicit $exec, implicit $flat_scr :: (load (s128) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ ; GCN-NEXT: [[COPY:%[0-9]+]]:vreg_64_align2 = COPY [[FLAT_LOAD_DWORDX4_SADDR]].sub0_sub1
+ ; GCN-NEXT: [[COPY1:%[0-9]+]]:vreg_64_align2 = COPY killed [[FLAT_LOAD_DWORDX4_SADDR]].sub2_sub3
+ ; GCN-NEXT: S_NOP 0, implicit [[COPY]], implicit [[COPY1]]
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vreg_64_align2 = FLAT_LOAD_DWORDX2_SADDR %0, %1, 0, 0, implicit $exec, implicit $flat_scr :: (load (s64) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %3:vreg_64_align2 = FLAT_LOAD_DWORDX2_SADDR %0, %1, 8, 0, implicit $exec, implicit $flat_scr :: (load (s64) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ S_NOP 0, implicit %2, implicit %3
+...
+
+---
+name: no_merge_flat_load_dword_and_flat_load_dword_saddr
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: no_merge_flat_load_dword_and_flat_load_dword_saddr
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
+ ; GCN-NEXT: [[FLAT_LOAD_DWORD:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD [[DEF1]], 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, addrspace 1)
+ ; GCN-NEXT: [[FLAT_LOAD_DWORD_SADDR:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD_SADDR [[DEF]], [[DEF1]].sub0, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, addrspace 1)
+ ; GCN-NEXT: S_NOP 0, implicit [[FLAT_LOAD_DWORD]], implicit [[FLAT_LOAD_DWORD_SADDR]]
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vreg_64_align2 = IMPLICIT_DEF
+ %2:vgpr_32 = FLAT_LOAD_DWORD %1, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %3:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1.sub0, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ S_NOP 0, implicit %2, implicit %3
+...
+
+---
+name: no_merge_flat_load_dword_saddr_different_saddr
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: no_merge_flat_load_dword_saddr_different_saddr
+ ; GCN: [[DEF:%[0-9]+]]:sgpr_128 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[FLAT_LOAD_DWORD_SADDR:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD_SADDR [[DEF]].sub0_sub1, [[DEF1]], 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, addrspace 1)
+ ; GCN-NEXT: [[FLAT_LOAD_DWORD_SADDR1:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD_SADDR [[DEF]].sub2_sub3, [[DEF1]], 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, addrspace 1)
+ ; GCN-NEXT: S_NOP 0, implicit [[FLAT_LOAD_DWORD_SADDR]], implicit [[FLAT_LOAD_DWORD_SADDR1]]
+ %0:sgpr_128 = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0.sub0_sub1, %1, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %3:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0.sub2_sub3, %1, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ S_NOP 0, implicit %2, implicit %3
+...
+
+---
+name: no_merge_flat_load_dword_saddr_different_vaddr
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: no_merge_flat_load_dword_saddr_different_vaddr
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
+ ; GCN-NEXT: [[FLAT_LOAD_DWORD_SADDR:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD_SADDR [[DEF]], [[DEF1]].sub0, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, addrspace 1)
+ ; GCN-NEXT: [[FLAT_LOAD_DWORD_SADDR1:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD_SADDR [[DEF]], [[DEF1]].sub1, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, addrspace 1)
+ ; GCN-NEXT: S_NOP 0, implicit [[FLAT_LOAD_DWORD_SADDR]], implicit [[FLAT_LOAD_DWORD_SADDR1]]
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vreg_64_align2 = IMPLICIT_DEF
+ %2:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1.sub0, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %3:vgpr_32 = FLAT_LOAD_DWORD_SADDR %0, %1.sub1, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `ptr addrspace(1) undef`, align 4, addrspace 1)
+ S_NOP 0, implicit %2, implicit %3
+...
+---
+name: merge_flat_store_dword_saddr_2
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: merge_flat_store_dword_saddr_2
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF2:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF3:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[DEF2]], %subreg.sub0, [[DEF3]], %subreg.sub1
+ ; GCN-NEXT: FLAT_STORE_DWORDX2_SADDR [[DEF1]], killed [[REG_SEQUENCE]], [[DEF]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s64) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vgpr_32 = IMPLICIT_DEF
+ %3:vgpr_32 = IMPLICIT_DEF
+ FLAT_STORE_DWORD_SADDR %1, %2, %0, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %3, %0, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+...
+
+---
+name: merge_flat_store_dword_saddr_3
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: merge_flat_store_dword_saddr_3
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF2:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF3:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF4:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[DEF2]], %subreg.sub0, [[DEF3]], %subreg.sub1
+ ; GCN-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:vreg_96_align2 = REG_SEQUENCE killed [[REG_SEQUENCE]], %subreg.sub0_sub1, [[DEF4]], %subreg.sub2
+ ; GCN-NEXT: FLAT_STORE_DWORDX3_SADDR [[DEF1]], killed [[REG_SEQUENCE1]], [[DEF]], 4, 1, implicit $exec, implicit $flat_scr :: (store (s96) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vgpr_32 = IMPLICIT_DEF
+ %3:vgpr_32 = IMPLICIT_DEF
+ %4:vgpr_32 = IMPLICIT_DEF
+ FLAT_STORE_DWORD_SADDR %1, %2, %0, 4, 1, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %3, %0, 8, 1, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %4, %0, 12, 1, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+...
+
+---
+name: merge_flat_store_dword_saddr_4
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: merge_flat_store_dword_saddr_4
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF2:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF3:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF4:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF5:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[DEF2]], %subreg.sub0, [[DEF3]], %subreg.sub1
+ ; GCN-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:vreg_96_align2 = REG_SEQUENCE killed [[REG_SEQUENCE]], %subreg.sub0_sub1, [[DEF4]], %subreg.sub2
+ ; GCN-NEXT: [[REG_SEQUENCE2:%[0-9]+]]:vreg_128_align2 = REG_SEQUENCE killed [[REG_SEQUENCE1]], %subreg.sub0_sub1_sub2, [[DEF5]], %subreg.sub3
+ ; GCN-NEXT: FLAT_STORE_DWORDX4_SADDR [[DEF1]], killed [[REG_SEQUENCE2]], [[DEF]], 4, 2, implicit $exec, implicit $flat_scr :: (store (s128) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vgpr_32 = IMPLICIT_DEF
+ %3:vgpr_32 = IMPLICIT_DEF
+ %4:vgpr_32 = IMPLICIT_DEF
+ %5:vgpr_32 = IMPLICIT_DEF
+ FLAT_STORE_DWORD_SADDR %1, %2, %0, 4, 2, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %3, %0, 8, 2, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %4, %0, 12, 2, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %5, %0, 16, 2, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+...
+
+---
+name: merge_flat_store_dword_saddr_6
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: merge_flat_store_dword_saddr_6
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF2:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF3:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF4:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF5:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF6:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF7:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[DEF2]], %subreg.sub0, [[DEF3]], %subreg.sub1
+ ; GCN-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:vreg_96_align2 = REG_SEQUENCE killed [[REG_SEQUENCE]], %subreg.sub0_sub1, [[DEF4]], %subreg.sub2
+ ; GCN-NEXT: [[REG_SEQUENCE2:%[0-9]+]]:vreg_128_align2 = REG_SEQUENCE killed [[REG_SEQUENCE1]], %subreg.sub0_sub1_sub2, [[DEF5]], %subreg.sub3
+ ; GCN-NEXT: FLAT_STORE_DWORDX4_SADDR [[DEF1]], killed [[REG_SEQUENCE2]], [[DEF]], 4, 3, implicit $exec, implicit $flat_scr :: (store (s128) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ ; GCN-NEXT: [[REG_SEQUENCE3:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[DEF6]], %subreg.sub0, [[DEF7]], %subreg.sub1
+ ; GCN-NEXT: FLAT_STORE_DWORDX2_SADDR [[DEF1]], killed [[REG_SEQUENCE3]], [[DEF]], 20, 3, implicit $exec, implicit $flat_scr :: (store (s64) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vgpr_32 = IMPLICIT_DEF
+ %3:vgpr_32 = IMPLICIT_DEF
+ %4:vgpr_32 = IMPLICIT_DEF
+ %5:vgpr_32 = IMPLICIT_DEF
+ %6:vgpr_32 = IMPLICIT_DEF
+ %7:vgpr_32 = IMPLICIT_DEF
+ FLAT_STORE_DWORD_SADDR %1, %2, %0, 4, 3, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %3, %0, 8, 3, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %4, %0, 12, 3, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %5, %0, 16, 3, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %6, %0, 20, 3, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %7, %0, 24, 3, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+...
+
+---
+name: no_merge_flat_store_dword_saddr_with_flat_store_dword
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: no_merge_flat_store_dword_saddr_with_flat_store_dword
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF2:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF3:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: FLAT_STORE_DWORD_SADDR [[DEF1]].sub0, [[DEF2]], [[DEF]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, addrspace 1)
+ ; GCN-NEXT: FLAT_STORE_DWORD [[DEF1]], [[DEF3]], 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, addrspace 1)
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vreg_64_align2 = IMPLICIT_DEF
+ %2:vgpr_32 = IMPLICIT_DEF
+ %3:vgpr_32 = IMPLICIT_DEF
+ FLAT_STORE_DWORD_SADDR %1.sub0, %2, %0, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD %1, %3, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+...
+
+---
+name: no_merge_flat_store_dword_saddr_different_vaddr
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: no_merge_flat_store_dword_saddr_different_vaddr
+ ; GCN: [[DEF:%[0-9]+]]:sreg_64_xexec_xnull = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF2:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF3:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: FLAT_STORE_DWORD_SADDR [[DEF1]].sub0, [[DEF2]], [[DEF]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, addrspace 1)
+ ; GCN-NEXT: FLAT_STORE_DWORD_SADDR [[DEF1]].sub1, [[DEF3]], [[DEF]], 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, addrspace 1)
+ %0:sreg_64_xexec_xnull = IMPLICIT_DEF
+ %1:vreg_64_align2 = IMPLICIT_DEF
+ %2:vgpr_32 = IMPLICIT_DEF
+ %3:vgpr_32 = IMPLICIT_DEF
+ FLAT_STORE_DWORD_SADDR %1.sub0, %2, %0, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1.sub1, %3, %0, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+...
+
+---
+name: no_merge_flat_store_dword_saddr_different_saddr
+body: |
+ bb.0.entry:
+
+ ; GCN-LABEL: name: no_merge_flat_store_dword_saddr_different_saddr
+ ; GCN: [[DEF:%[0-9]+]]:sgpr_128 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF2:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: [[DEF3:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
+ ; GCN-NEXT: FLAT_STORE_DWORD_SADDR [[DEF1]], [[DEF2]], [[DEF]].sub0_sub1, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, addrspace 1)
+ ; GCN-NEXT: FLAT_STORE_DWORD_SADDR [[DEF1]], [[DEF3]], [[DEF]].sub2_sub3, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, addrspace 1)
+ %0:sgpr_128 = IMPLICIT_DEF
+ %1:vgpr_32 = IMPLICIT_DEF
+ %2:vgpr_32 = IMPLICIT_DEF
+ %3:vgpr_32 = IMPLICIT_DEF
+ FLAT_STORE_DWORD_SADDR %1, %2, %0.sub0_sub1, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+ FLAT_STORE_DWORD_SADDR %1, %3, %0.sub2_sub3, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into `ptr addrspace(1) undef`, align 4, addrspace 1)
+...
>From 5612dc533a9222a0f5561b2ba7c897115f26673f Mon Sep 17 00:00:00 2001
From: Shubham Sandeep Rastogi <srastogi22 at apple.com>
Date: Mon, 18 Aug 2025 14:36:15 -0700
Subject: [PATCH 100/112] Revert "[TableGen][DecoderEmitter] Store HW mode ID
instead of name (NFC) (#154052)"
This reverts commit b20bbd48e8b1966731a284b4208e048e060e97c2.
Reverted due to greendragon failures:
20:34:43 In file included from /Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/llvm-project/llvm/utils/TableGen/DecoderEmitter.cpp:14:
20:34:43 In file included from /Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/llvm-project/llvm/utils/TableGen/Common/CodeGenHwModes.h:14:
20:34:43 In file included from /Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/llvm-project/llvm/include/llvm/ADT/DenseMap.h:20:
20:34:43 In file included from /Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/llvm-project/llvm/include/llvm/ADT/STLExtras.h:21:
20:34:43 In file included from /Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/llvm-project/llvm/include/llvm/ADT/Hashing.h:53:
20:34:43 In file included from /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/c++/v1/algorithm:1913:
20:34:43 In file included from /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/c++/v1/chrono:746:
20:34:43 In file included from /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/c++/v1/__chrono/convert_to_tm.h:19:
20:34:43 In file included from /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/c++/v1/__chrono/statically_widen.h:17:
20:34:43 In file included from /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/c++/v1/__format/concepts.h:17:
20:34:43 In file included from /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/c++/v1/__format/format_parse_context.h:15:
20:34:43 In file included from /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/c++/v1/string_view:1027:
20:34:43 In file included from /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/c++/v1/functional:515:
20:34:43 In file included from /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/c++/v1/__functional/boyer_moore_searcher.h:26:
20:34:43 /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/c++/v1/vector:1376:19: error: object of type 'llvm::const_set_bits_iterator_impl<llvm::SmallBitVector>' cannot be assigned because its copy assignment operator is implicitly deleted
20:34:43 __mid = __first;
20:34:43 ^
20:34:43 /Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/llvm-project/llvm/utils/TableGen/DecoderEmitter.cpp:2404:13: note: in instantiation of function template specialization 'std::vector<unsigned int>::assign<llvm::const_set_bits_iterator_impl<llvm::SmallBitVector>, 0>' requested here
20:34:43 HwModeIDs.assign(BV.set_bits_begin(), BV.set_bits_end());
20:34:43 ^
20:34:43 /Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/llvm-project/llvm/include/llvm/ADT/BitVector.h:35:21: note: copy assignment operator of 'const_set_bits_iterator_impl<llvm::SmallBitVector>' is implicitly deleted because field 'Parent' is of reference type 'const llvm::SmallBitVector &'
20:34:43 const BitVectorT &Parent;
20:34:43 ^
20:34:43 1 warning and 1 error generated.
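The failure above comes down to a general C++ rule: a class with a reference
member has its copy assignment operator implicitly deleted, so any library
code that assigns one iterator to another (as the `__mid = __first` line in
the log does) will not compile. A minimal sketch of that rule follows. The
`RefIter` and `PtrIter` types are hypothetical stand-ins, not LLVM classes,
and the pointer variant is shown only for contrast, not as the intended fix:

```cpp
#include <type_traits>
#include <vector>

// Hypothetical iterator-like type that refers to its parent by reference,
// mirroring the 'const BitVectorT &Parent' field named in the log.
struct RefIter {
  const std::vector<unsigned> &Parent;
  unsigned Index;
};

// Same data, but the parent is held by pointer (illustration only).
struct PtrIter {
  const std::vector<unsigned> *Parent;
  unsigned Index;
};

// Copy construction still works with a reference member...
static_assert(std::is_copy_constructible<RefIter>::value,
              "reference members do not block copy construction");
// ...but copy assignment is implicitly deleted.
static_assert(!std::is_copy_assignable<RefIter>::value,
              "a reference member deletes copy assignment");
// A pointer member keeps the type assignable.
static_assert(std::is_copy_assignable<PtrIter>::value,
              "pointer members keep copy assignment available");
```

Whether a given standard library's `std::vector::assign` needs iterator
assignment is an implementation detail, which may explain why the original
change built on some configurations and failed on this one.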
---
llvm/utils/TableGen/DecoderEmitter.cpp | 71 +++++++++++++++-----------
1 file changed, 42 insertions(+), 29 deletions(-)
diff --git a/llvm/utils/TableGen/DecoderEmitter.cpp b/llvm/utils/TableGen/DecoderEmitter.cpp
index e2b6248a77ef1..496c5390625aa 100644
--- a/llvm/utils/TableGen/DecoderEmitter.cpp
+++ b/llvm/utils/TableGen/DecoderEmitter.cpp
@@ -208,14 +208,14 @@ struct DecoderTableInfo {
struct EncodingAndInst {
const Record *EncodingDef;
const CodeGenInstruction *Inst;
- unsigned HwModeID;
+ StringRef HwModeName;
EncodingAndInst(const Record *EncodingDef, const CodeGenInstruction *Inst,
- unsigned HwModeID = DefaultMode)
- : EncodingDef(EncodingDef), Inst(Inst), HwModeID(HwModeID) {}
+ StringRef HwModeName = "")
+ : EncodingDef(EncodingDef), Inst(Inst), HwModeName(HwModeName) {}
};
-using NamespacesHwModesMap = std::map<std::string, std::set<unsigned>>;
+using NamespacesHwModesMap = std::map<std::string, std::set<StringRef>>;
class DecoderEmitter {
const RecordKeeper &RK;
@@ -2391,9 +2391,10 @@ static bool Check(DecodeStatus &Out, DecodeStatus In) {
)";
}
-// Collect all HwModes referenced by the target for encoding purposes.
+// Collect all HwModes referenced by the target for encoding purposes,
+// returning a vector of corresponding names.
static void collectHwModesReferencedForEncodings(
- const CodeGenHwModes &HWM, std::vector<unsigned> &HwModeIDs,
+ const CodeGenHwModes &HWM, std::vector<StringRef> &Names,
NamespacesHwModesMap &NamespacesWithHwModes) {
SmallBitVector BV(HWM.getNumModeIds());
for (const auto &MS : HWM.getHwModeSelects()) {
@@ -2401,25 +2402,34 @@ static void collectHwModesReferencedForEncodings(
if (EncodingDef->isSubClassOf("InstructionEncoding")) {
std::string DecoderNamespace =
EncodingDef->getValueAsString("DecoderNamespace").str();
- NamespacesWithHwModes[DecoderNamespace].insert(HwModeID);
+ if (HwModeID == DefaultMode) {
+ NamespacesWithHwModes[DecoderNamespace].insert("");
+ } else {
+ NamespacesWithHwModes[DecoderNamespace].insert(
+ HWM.getMode(HwModeID).Name);
+ }
BV.set(HwModeID);
}
}
}
- HwModeIDs.assign(BV.set_bits_begin(), BV.set_bits_end());
+ transform(BV.set_bits(), std::back_inserter(Names), [&HWM](const int &M) {
+ if (M == DefaultMode)
+ return StringRef("");
+ return HWM.getModeName(M, /*IncludeDefault=*/true);
+ });
}
static void
handleHwModesUnrelatedEncodings(const CodeGenInstruction *Instr,
- ArrayRef<unsigned> HwModeIDs,
+ ArrayRef<StringRef> HwModeNames,
NamespacesHwModesMap &NamespacesWithHwModes,
std::vector<EncodingAndInst> &GlobalEncodings) {
const Record *InstDef = Instr->TheDef;
switch (DecoderEmitterSuppressDuplicates) {
case SUPPRESSION_DISABLE: {
- for (unsigned HwModeID : HwModeIDs)
- GlobalEncodings.emplace_back(InstDef, Instr, HwModeID);
+ for (StringRef HwModeName : HwModeNames)
+ GlobalEncodings.emplace_back(InstDef, Instr, HwModeName);
break;
}
case SUPPRESSION_LEVEL1: {
@@ -2427,17 +2437,17 @@ handleHwModesUnrelatedEncodings(const CodeGenInstruction *Instr,
InstDef->getValueAsString("DecoderNamespace").str();
auto It = NamespacesWithHwModes.find(DecoderNamespace);
if (It != NamespacesWithHwModes.end()) {
- for (unsigned HwModeID : It->second)
- GlobalEncodings.emplace_back(InstDef, Instr, HwModeID);
+ for (StringRef HwModeName : It->second)
+ GlobalEncodings.emplace_back(InstDef, Instr, HwModeName);
} else {
// Only emit the encoding once, as it's DecoderNamespace doesn't
// contain any HwModes.
- GlobalEncodings.emplace_back(InstDef, Instr, DefaultMode);
+ GlobalEncodings.emplace_back(InstDef, Instr, "");
}
break;
}
case SUPPRESSION_LEVEL2:
- GlobalEncodings.emplace_back(InstDef, Instr, DefaultMode);
+ GlobalEncodings.emplace_back(InstDef, Instr, "");
break;
}
}
@@ -2468,13 +2478,13 @@ namespace {
// First, collect all encoding-related HwModes referenced by the target.
// And establish a mapping table between DecoderNamespace and HwMode.
- // If HwModeNames is empty, add the default mode so we always have one HwMode.
+ // If HwModeNames is empty, add the empty string so we always have one HwMode.
const CodeGenHwModes &HWM = Target.getHwModes();
- std::vector<unsigned> HwModeIDs;
+ std::vector<StringRef> HwModeNames;
NamespacesHwModesMap NamespacesWithHwModes;
- collectHwModesReferencedForEncodings(HWM, HwModeIDs, NamespacesWithHwModes);
- if (HwModeIDs.empty())
- HwModeIDs.push_back(DefaultMode);
+ collectHwModesReferencedForEncodings(HWM, HwModeNames, NamespacesWithHwModes);
+ if (HwModeNames.empty())
+ HwModeNames.push_back("");
const auto &NumberedInstructions = Target.getInstructions();
NumberedEncodings.reserve(NumberedInstructions.size());
@@ -2482,14 +2492,20 @@ namespace {
const Record *InstDef = NumberedInstruction->TheDef;
if (const Record *RV = InstDef->getValueAsOptionalDef("EncodingInfos")) {
EncodingInfoByHwMode EBM(RV, HWM);
- for (auto [HwModeID, EncodingDef] : EBM)
- NumberedEncodings.emplace_back(EncodingDef, NumberedInstruction,
- HwModeID);
+ for (auto [HwModeID, EncodingDef] : EBM) {
+ // DecoderTables with DefaultMode should not have any suffix.
+ if (HwModeID == DefaultMode) {
+ NumberedEncodings.emplace_back(EncodingDef, NumberedInstruction, "");
+ } else {
+ NumberedEncodings.emplace_back(EncodingDef, NumberedInstruction,
+ HWM.getMode(HwModeID).Name);
+ }
+ }
continue;
}
// This instruction is encoded the same on all HwModes.
// According to user needs, provide varying degrees of suppression.
- handleHwModesUnrelatedEncodings(NumberedInstruction, HwModeIDs,
+ handleHwModesUnrelatedEncodings(NumberedInstruction, HwModeNames,
NamespacesWithHwModes, NumberedEncodings);
}
for (const Record *NumberedAlias :
@@ -2536,11 +2552,8 @@ namespace {
}
std::string DecoderNamespace =
EncodingDef->getValueAsString("DecoderNamespace").str();
- // DecoderTables with DefaultMode should not have any suffix.
- if (NumberedEncoding.HwModeID != DefaultMode) {
- StringRef HwModeName = HWM.getModeName(NumberedEncoding.HwModeID);
- DecoderNamespace += ("_" + HwModeName).str();
- }
+ if (!NumberedEncoding.HwModeName.empty())
+ DecoderNamespace += "_" + NumberedEncoding.HwModeName.str();
EncMap[{DecoderNamespace, Size}].push_back(NEI);
} else {
NumEncodingsOmitted++;
>From ec4e6aaac4612af26322b2b10b8f518ecf053c74 Mon Sep 17 00:00:00 2001
From: Oliver Hunt <oliver at apple.com>
Date: Mon, 18 Aug 2025 14:38:50 -0700
Subject: [PATCH 101/112] [clang][ObjC] Fix incorrect return type inference for
discarded blocks (#154109)
When parsing a block expression we were not entering a new evaluation
context. As a result, when the block appeared inside a discarded statement,
we kept treating the return statements in its body as discarded and so
inferred a `void` result.
This fixes the problem by introducing an evaluation context around the
parsing of the body.
---
clang/lib/Parse/ParseExpr.cpp | 3 ++-
.../SemaObjCXX/discarded-block-type-inference.mm | 15 +++++++++++++++
2 files changed, 17 insertions(+), 1 deletion(-)
create mode 100644 clang/test/SemaObjCXX/discarded-block-type-inference.mm
diff --git a/clang/lib/Parse/ParseExpr.cpp b/clang/lib/Parse/ParseExpr.cpp
index bc238a9517a37..3515343202de1 100644
--- a/clang/lib/Parse/ParseExpr.cpp
+++ b/clang/lib/Parse/ParseExpr.cpp
@@ -3342,7 +3342,8 @@ ExprResult Parser::ParseBlockLiteralExpression() {
Actions.ActOnBlockError(CaretLoc, getCurScope());
return ExprError();
}
-
+ EnterExpressionEvaluationContextForFunction PotentiallyEvaluated(
+ Actions, Sema::ExpressionEvaluationContext::PotentiallyEvaluated);
StmtResult Stmt(ParseCompoundStatementBody());
BlockScope.Exit();
if (!Stmt.isInvalid())
diff --git a/clang/test/SemaObjCXX/discarded-block-type-inference.mm b/clang/test/SemaObjCXX/discarded-block-type-inference.mm
new file mode 100644
index 0000000000000..8e2587724a7f6
--- /dev/null
+++ b/clang/test/SemaObjCXX/discarded-block-type-inference.mm
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -std=c++23 -fsyntax-only -fobjc-arc -fblocks %s
+
+void block_receiver(int (^)() );
+
+int f1() {
+ if constexpr (0)
+ (block_receiver)(^{ return 2; });
+ return 1;
+}
+
+int f2() {
+ if constexpr (0)
+ return (^{ return 2; })();
+ return 1;
+}
>From 50b55a5ee9c6fd0999c71aeab85c10f1430acb27 Mon Sep 17 00:00:00 2001
From: Peter Klausler <pklausler at nvidia.com>
Date: Mon, 18 Aug 2025 14:42:16 -0700
Subject: [PATCH 102/112] [flang][runtime] Fix AllocateAssignmentLHS for
monomorphic LHS (#153073)
When the left-hand side of an assignment statement is an allocatable
that has a monomorphic derived type, and the right-hand side of the
assignment has a type that is an extension of that type, *don't* change
the incoming type or element size of the descriptor before allocating
it.
Fixes https://github.com/llvm/llvm-project/issues/152758.
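The type selection described above can be sketched in isolation. This is a
simplified model, not the runtime code: `TypeDesc` and `chooseAllocatedType`
are hypothetical names, and the descriptor addenda are reduced to plain
pointers and flags.

```cpp
#include <cstddef>

// Hypothetical, simplified stand-in for a derived type description.
struct TypeDesc {
  std::size_t sizeInBytes;
};

// Choose the derived type that the freshly allocated LHS should carry.
//   lhsType         type already recorded on the LHS (null if none)
//   rhsType         type recorded on the RHS (may be an extension of lhsType)
//   rhsHasAddendum  whether the RHS carries derived type information at all
//   polymorphicLHS  whether the LHS is polymorphic
inline const TypeDesc *chooseAllocatedType(const TypeDesc *lhsType,
                                           const TypeDesc *rhsType,
                                           bool rhsHasAddendum,
                                           bool polymorphicLHS) {
  if (!rhsHasAddendum)
    return nullptr; // no derived type information to propagate
  if (!lhsType || polymorphicLHS)
    return rhsType; // LHS adopts the type of the RHS
  return lhsType;   // monomorphic LHS keeps its own type and element size
}
```

In this model a monomorphic LHS assigned from an extended type keeps its own
type, so its element size would come from `lhsType->sizeInBytes` rather than
from the larger RHS element.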
---
flang-rt/lib/runtime/assign.cpp | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)
diff --git a/flang-rt/lib/runtime/assign.cpp b/flang-rt/lib/runtime/assign.cpp
index 6aeb103208785..2c29a98d5a5cb 100644
--- a/flang-rt/lib/runtime/assign.cpp
+++ b/flang-rt/lib/runtime/assign.cpp
@@ -88,23 +88,32 @@ static inline RT_API_ATTRS bool MustDeallocateLHS(
// originally deallocated or because it required reallocation
static RT_API_ATTRS int AllocateAssignmentLHS(
Descriptor &to, const Descriptor &from, Terminator &terminator, int flags) {
- to.raw().type = from.raw().type;
- if (!(flags & ExplicitLengthCharacterLHS)) {
- to.raw().elem_len = from.ElementBytes();
- }
- const typeInfo::DerivedType *derived{nullptr};
DescriptorAddendum *toAddendum{to.Addendum()};
+ const typeInfo::DerivedType *derived{nullptr};
+ if (toAddendum) {
+ derived = toAddendum->derivedType();
+ }
if (const DescriptorAddendum * fromAddendum{from.Addendum()}) {
- derived = fromAddendum->derivedType();
- if (toAddendum) {
- toAddendum->set_derivedType(derived);
- std::size_t lenParms{derived ? derived->LenParameters() : 0};
+ if (!derived || (flags & PolymorphicLHS)) {
+ derived = fromAddendum->derivedType();
+ }
+ if (toAddendum && derived) {
+ std::size_t lenParms{derived->LenParameters()};
for (std::size_t j{0}; j < lenParms; ++j) {
toAddendum->SetLenParameterValue(j, fromAddendum->LenParameterValue(j));
}
}
- } else if (toAddendum) {
- toAddendum->set_derivedType(nullptr);
+ } else {
+ derived = nullptr;
+ }
+ if (toAddendum) {
+ toAddendum->set_derivedType(derived);
+ }
+ to.raw().type = from.raw().type;
+ if (derived) {
+ to.raw().elem_len = derived->sizeInBytes();
+ } else if (!(flags & ExplicitLengthCharacterLHS)) {
+ to.raw().elem_len = from.ElementBytes();
}
// subtle: leave bounds in place when "from" is scalar (10.2.1.3(3))
int rank{from.rank()};
>From 48232594a030f17729b9d21606f816b04e81a926 Mon Sep 17 00:00:00 2001
From: Matthias Braun <matze at braunis.de>
Date: Mon, 18 Aug 2025 14:42:55 -0700
Subject: [PATCH 103/112] llvm-profgen: Options cleanup / fixes (#147632)
- Add `cl::cat(ProfGenCategory)` to non-hidden options so they show up
in `--help` output.
- Introduce `Options.h` for options referenced in multiple files.
---
.../llvm-profgen/MissingFrameInferrer.cpp | 4 +-
llvm/tools/llvm-profgen/Options.h | 28 ++++++++
llvm/tools/llvm-profgen/PerfReader.cpp | 27 +++++---
llvm/tools/llvm-profgen/ProfileGenerator.cpp | 68 ++++++++++---------
llvm/tools/llvm-profgen/ProfiledBinary.cpp | 35 ++++++----
llvm/tools/llvm-profgen/ProfiledBinary.h | 4 --
llvm/tools/llvm-profgen/llvm-profgen.cpp | 11 +--
7 files changed, 109 insertions(+), 68 deletions(-)
create mode 100644 llvm/tools/llvm-profgen/Options.h
diff --git a/llvm/tools/llvm-profgen/MissingFrameInferrer.cpp b/llvm/tools/llvm-profgen/MissingFrameInferrer.cpp
index edfe8979c7121..7ebca23ba7956 100644
--- a/llvm/tools/llvm-profgen/MissingFrameInferrer.cpp
+++ b/llvm/tools/llvm-profgen/MissingFrameInferrer.cpp
@@ -7,6 +7,7 @@
//===----------------------------------------------------------------------===//
#include "MissingFrameInferrer.h"
+#include "Options.h"
#include "PerfReader.h"
#include "ProfiledBinary.h"
#include "llvm/ADT/SCCIterator.h"
@@ -37,7 +38,8 @@ STATISTIC(TailCallMaxTailCallPath, "Length of the longest tail call path");
static cl::opt<uint32_t>
MaximumSearchDepth("max-search-depth", cl::init(UINT32_MAX - 1),
cl::desc("The maximum levels the DFS-based missing "
- "frame search should go with"));
+ "frame search should go with"),
+ cl::cat(ProfGenCategory));
void MissingFrameInferrer::initialize(
const ContextSampleCounterMap *SampleCounters) {
diff --git a/llvm/tools/llvm-profgen/Options.h b/llvm/tools/llvm-profgen/Options.h
new file mode 100644
index 0000000000000..f94cf9118c06a
--- /dev/null
+++ b/llvm/tools/llvm-profgen/Options.h
@@ -0,0 +1,28 @@
+//===-- Options.h -----------------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+#ifndef LLVM_TOOLS_LLVM_PROFGEN_OPTIONS_H
+#define LLVM_TOOLS_LLVM_PROFGEN_OPTIONS_H
+
+#include "llvm/Support/CommandLine.h"
+
+namespace llvm {
+
+extern cl::OptionCategory ProfGenCategory;
+
+extern cl::opt<std::string> OutputFilename;
+extern cl::opt<bool> ShowDisassemblyOnly;
+extern cl::opt<bool> ShowSourceLocations;
+extern cl::opt<bool> SkipSymbolization;
+extern cl::opt<bool> ShowDetailedWarning;
+extern cl::opt<bool> InferMissingFrames;
+extern cl::opt<bool> EnableCSPreInliner;
+extern cl::opt<bool> UseContextCostForPreInliner;
+
+} // end namespace llvm
+
+#endif
diff --git a/llvm/tools/llvm-profgen/PerfReader.cpp b/llvm/tools/llvm-profgen/PerfReader.cpp
index 4ab5f2e63fd12..9a805f2941753 100644
--- a/llvm/tools/llvm-profgen/PerfReader.cpp
+++ b/llvm/tools/llvm-profgen/PerfReader.cpp
@@ -6,6 +6,7 @@
//
//===----------------------------------------------------------------------===//
#include "PerfReader.h"
+#include "Options.h"
#include "ProfileGenerator.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/DebugInfo/Symbolize/SymbolizableModule.h"
@@ -15,43 +16,47 @@
#define DEBUG_TYPE "perf-reader"
-using namespace llvm;
+namespace llvm {
cl::opt<bool> SkipSymbolization("skip-symbolization",
cl::desc("Dump the unsymbolized profile to the "
"output file. It will show unwinder "
- "output for CS profile generation."));
+ "output for CS profile generation."),
+ cl::cat(ProfGenCategory));
static cl::opt<bool> ShowMmapEvents("show-mmap-events",
- cl::desc("Print binary load events."));
+ cl::desc("Print binary load events."),
+ cl::cat(ProfGenCategory));
static cl::opt<bool>
UseOffset("use-offset", cl::init(true),
cl::desc("Work with `--skip-symbolization` or "
"`--unsymbolized-profile` to write/read the "
- "offset instead of virtual address."));
+ "offset instead of virtual address."),
+ cl::cat(ProfGenCategory));
static cl::opt<bool> UseLoadableSegmentAsBase(
"use-first-loadable-segment-as-base",
cl::desc("Use first loadable segment address as base address "
"for offsets in unsymbolized profile. By default "
- "first executable segment address is used"));
+ "first executable segment address is used"),
+ cl::cat(ProfGenCategory));
static cl::opt<bool>
IgnoreStackSamples("ignore-stack-samples",
cl::desc("Ignore call stack samples for hybrid samples "
- "and produce context-insensitive profile."));
+ "and produce context-insensitive profile."),
+ cl::cat(ProfGenCategory));
cl::opt<bool> ShowDetailedWarning("show-detailed-warning",
- cl::desc("Show detailed warning message."));
+ cl::desc("Show detailed warning message."),
+ cl::cat(ProfGenCategory));
static cl::opt<int> CSProfMaxUnsymbolizedCtxDepth(
"csprof-max-unsymbolized-context-depth", cl::init(-1),
cl::desc("Keep the last K contexts while merging unsymbolized profile. -1 "
- "means no depth limit."));
-
-extern cl::opt<std::string> OutputFilename;
+ "means no depth limit."),
+ cl::cat(ProfGenCategory));
-namespace llvm {
namespace sampleprof {
void VirtualUnwinder::unwindCall(UnwindState &State) {
diff --git a/llvm/tools/llvm-profgen/ProfileGenerator.cpp b/llvm/tools/llvm-profgen/ProfileGenerator.cpp
index 9468228acc427..33575b9c67625 100644
--- a/llvm/tools/llvm-profgen/ProfileGenerator.cpp
+++ b/llvm/tools/llvm-profgen/ProfileGenerator.cpp
@@ -8,6 +8,7 @@
#include "ProfileGenerator.h"
#include "ErrorHandling.h"
#include "MissingFrameInferrer.h"
+#include "Options.h"
#include "PerfReader.h"
#include "ProfiledBinary.h"
#include "llvm/DebugInfo/Symbolize/SymbolizableModule.h"
@@ -17,23 +18,24 @@
#include <unordered_set>
#include <utility>
-using namespace llvm;
-using namespace sampleprof;
+namespace llvm {
cl::opt<std::string> OutputFilename("output", cl::value_desc("output"),
cl::Required,
- cl::desc("Output profile file"));
+ cl::desc("Output profile file"),
+ cl::cat(ProfGenCategory));
static cl::alias OutputA("o", cl::desc("Alias for --output"),
cl::aliasopt(OutputFilename));
static cl::opt<SampleProfileFormat> OutputFormat(
"format", cl::desc("Format of output profile"), cl::init(SPF_Ext_Binary),
- cl::values(
- clEnumValN(SPF_Binary, "binary", "Binary encoding (default)"),
- clEnumValN(SPF_Ext_Binary, "extbinary", "Extensible binary encoding"),
- clEnumValN(SPF_Text, "text", "Text encoding"),
- clEnumValN(SPF_GCC, "gcc",
- "GCC encoding (only meaningful for -sample)")));
+ cl::values(clEnumValN(SPF_Binary, "binary", "Binary encoding (default)"),
+ clEnumValN(SPF_Ext_Binary, "extbinary",
+ "Extensible binary encoding"),
+ clEnumValN(SPF_Text, "text", "Text encoding"),
+ clEnumValN(SPF_GCC, "gcc",
+ "GCC encoding (only meaningful for -sample)")),
+ cl::cat(ProfGenCategory));
static cl::opt<bool> UseMD5(
"use-md5", cl::Hidden,
@@ -59,55 +61,57 @@ static cl::opt<int32_t, true> RecursionCompression(
static cl::opt<bool>
TrimColdProfile("trim-cold-profile",
cl::desc("If the total count of the profile is smaller "
- "than threshold, it will be trimmed."));
+ "than threshold, it will be trimmed."),
+ cl::cat(ProfGenCategory));
static cl::opt<bool> CSProfMergeColdContext(
"csprof-merge-cold-context", cl::init(true),
cl::desc("If the total count of context profile is smaller than "
"the threshold, it will be merged into context-less base "
- "profile."));
+ "profile."),
+ cl::cat(ProfGenCategory));
static cl::opt<uint32_t> CSProfMaxColdContextDepth(
"csprof-max-cold-context-depth", cl::init(1),
cl::desc("Keep the last K contexts while merging cold profile. 1 means the "
- "context-less base profile"));
+ "context-less base profile"),
+ cl::cat(ProfGenCategory));
static cl::opt<int, true> CSProfMaxContextDepth(
"csprof-max-context-depth",
cl::desc("Keep the last K contexts while merging profile. -1 means no "
"depth limit."),
- cl::location(llvm::sampleprof::CSProfileGenerator::MaxContextDepth));
+ cl::location(llvm::sampleprof::CSProfileGenerator::MaxContextDepth),
+ cl::cat(ProfGenCategory));
static cl::opt<double> ProfileDensityThreshold(
- "profile-density-threshold", llvm::cl::init(50),
- llvm::cl::desc("If the profile density is below the given threshold, it "
- "will be suggested to increase the sampling rate."),
- llvm::cl::Optional);
-static cl::opt<bool> ShowDensity("show-density", llvm::cl::init(false),
- llvm::cl::desc("show profile density details"),
- llvm::cl::Optional);
+ "profile-density-threshold", cl::init(50),
+ cl::desc("If the profile density is below the given threshold, it "
+ "will be suggested to increase the sampling rate."),
+ cl::Optional, cl::cat(ProfGenCategory));
+static cl::opt<bool> ShowDensity("show-density", cl::init(false),
+ cl::desc("show profile density details"),
+ cl::Optional, cl::cat(ProfGenCategory));
static cl::opt<int> ProfileDensityCutOffHot(
- "profile-density-cutoff-hot", llvm::cl::init(990000),
- llvm::cl::desc("Total samples cutoff for functions used to calculate "
- "profile density."));
+ "profile-density-cutoff-hot", cl::init(990000),
+ cl::desc("Total samples cutoff for functions used to calculate "
+ "profile density."),
+ cl::cat(ProfGenCategory));
static cl::opt<bool> UpdateTotalSamples(
- "update-total-samples", llvm::cl::init(false),
- llvm::cl::desc(
- "Update total samples by accumulating all its body samples."),
- llvm::cl::Optional);
+ "update-total-samples", cl::init(false),
+ cl::desc("Update total samples by accumulating all its body samples."),
+ cl::Optional, cl::cat(ProfGenCategory));
static cl::opt<bool> GenCSNestedProfile(
"gen-cs-nested-profile", cl::Hidden, cl::init(true),
cl::desc("Generate nested function profiles for CSSPGO"));
cl::opt<bool> InferMissingFrames(
- "infer-missing-frames", llvm::cl::init(true),
- llvm::cl::desc(
+ "infer-missing-frames", cl::init(true),
+ cl::desc(
"Infer missing call frames due to compiler tail call elimination."),
- llvm::cl::Optional);
-
-namespace llvm {
+ cl::Optional, cl::cat(ProfGenCategory));
namespace sampleprof {
diff --git a/llvm/tools/llvm-profgen/ProfiledBinary.cpp b/llvm/tools/llvm-profgen/ProfiledBinary.cpp
index beef4338d5f89..31cac4d5c7721 100644
--- a/llvm/tools/llvm-profgen/ProfiledBinary.cpp
+++ b/llvm/tools/llvm-profgen/ProfiledBinary.cpp
@@ -9,6 +9,7 @@
#include "ProfiledBinary.h"
#include "ErrorHandling.h"
#include "MissingFrameInferrer.h"
+#include "Options.h"
#include "ProfileGenerator.h"
#include "llvm/DebugInfo/Symbolize/SymbolizableModule.h"
#include "llvm/Demangle/Demangle.h"
@@ -24,47 +25,51 @@
#define DEBUG_TYPE "load-binary"
-using namespace llvm;
-using namespace llvm::object;
-using namespace sampleprof;
+namespace llvm {
+
+using namespace object;
cl::opt<bool> ShowDisassemblyOnly("show-disassembly-only",
- cl::desc("Print disassembled code."));
+ cl::desc("Print disassembled code."),
+ cl::cat(ProfGenCategory));
cl::opt<bool> ShowSourceLocations("show-source-locations",
- cl::desc("Print source locations."));
+ cl::desc("Print source locations."),
+ cl::cat(ProfGenCategory));
static cl::opt<bool>
ShowCanonicalFnName("show-canonical-fname",
- cl::desc("Print canonical function name."));
+ cl::desc("Print canonical function name."),
+ cl::cat(ProfGenCategory));
static cl::opt<bool> ShowPseudoProbe(
"show-pseudo-probe",
- cl::desc("Print pseudo probe section and disassembled info."));
+ cl::desc("Print pseudo probe section and disassembled info."),
+ cl::cat(ProfGenCategory));
static cl::opt<bool> UseDwarfCorrelation(
"use-dwarf-correlation",
cl::desc("Use dwarf for profile correlation even when binary contains "
- "pseudo probe."));
+ "pseudo probe."),
+ cl::cat(ProfGenCategory));
static cl::opt<std::string>
DWPPath("dwp", cl::init(""),
cl::desc("Path of .dwp file. When not specified, it will be "
- "<binary>.dwp in the same directory as the main binary."));
+ "<binary>.dwp in the same directory as the main binary."),
+ cl::cat(ProfGenCategory));
static cl::list<std::string> DisassembleFunctions(
"disassemble-functions", cl::CommaSeparated,
cl::desc("List of functions to print disassembly for. Accept demangled "
- "names only. Only work with show-disassembly-only"));
+ "names only. Only work with show-disassembly-only"),
+ cl::cat(ProfGenCategory));
static cl::opt<bool>
KernelBinary("kernel",
- cl::desc("Generate the profile for Linux kernel binary."));
+ cl::desc("Generate the profile for Linux kernel binary."),
+ cl::cat(ProfGenCategory));
-extern cl::opt<bool> ShowDetailedWarning;
-extern cl::opt<bool> InferMissingFrames;
-
-namespace llvm {
namespace sampleprof {
static const Target *getTarget(const ObjectFile *Obj) {
diff --git a/llvm/tools/llvm-profgen/ProfiledBinary.h b/llvm/tools/llvm-profgen/ProfiledBinary.h
index 5b35c040b2c4b..9c0bff591337a 100644
--- a/llvm/tools/llvm-profgen/ProfiledBinary.h
+++ b/llvm/tools/llvm-profgen/ProfiledBinary.h
@@ -42,10 +42,6 @@
#include <vector>
namespace llvm {
-
-extern cl::opt<bool> EnableCSPreInliner;
-extern cl::opt<bool> UseContextCostForPreInliner;
-
namespace sampleprof {
class ProfiledBinary;
diff --git a/llvm/tools/llvm-profgen/llvm-profgen.cpp b/llvm/tools/llvm-profgen/llvm-profgen.cpp
index 5464888e77ad5..7e070a1ea6489 100644
--- a/llvm/tools/llvm-profgen/llvm-profgen.cpp
+++ b/llvm/tools/llvm-profgen/llvm-profgen.cpp
@@ -11,6 +11,7 @@
//===----------------------------------------------------------------------===//
#include "ErrorHandling.h"
+#include "Options.h"
#include "PerfReader.h"
#include "ProfileGenerator.h"
#include "ProfiledBinary.h"
@@ -24,7 +25,9 @@
using namespace llvm;
using namespace sampleprof;
-static cl::OptionCategory ProfGenCategory("ProfGen Options");
+namespace llvm {
+
+cl::OptionCategory ProfGenCategory("ProfGen Options");
static cl::opt<std::string> PerfScriptFilename(
"perfscript", cl::value_desc("perfscript"),
@@ -70,10 +73,6 @@ static cl::opt<std::string> DebugBinPath(
"from it instead of the executable binary."),
cl::cat(ProfGenCategory));
-extern cl::opt<bool> ShowDisassemblyOnly;
-extern cl::opt<bool> ShowSourceLocations;
-extern cl::opt<bool> SkipSymbolization;
-
// Validate the command line input.
static void validateCommandLine() {
// Allow the missing perfscript if we only use to show binary disassembly.
@@ -138,6 +137,8 @@ static PerfInputFile getPerfInputFile() {
return File;
}
+} // end namespace llvm
+
int main(int argc, const char *argv[]) {
InitLLVM X(argc, argv);
>From 2cf982c0f5f44d0f0920a48c94a64687763de22b Mon Sep 17 00:00:00 2001
From: Peter Klausler <pklausler at nvidia.com>
Date: Mon, 18 Aug 2025 14:43:13 -0700
Subject: [PATCH 104/112] [flang] Don't duplicate impure function call for
UBOUND() (#153648)
Because the per-dimension information in a descriptor holds an extent
and a lower bound, but not an upper bound, the calculation of the upper
bound sometimes requires that the extent and lower bound be extracted
from a descriptor and added together, minus 1. This shouldn't be
attempted when the NamedEntity of the descriptor is something that
shouldn't be duplicated and used twice; specifically, it shouldn't apply
to NamedEntities containing references to impure functions as parts of
subscript expressions.
Fixes https://github.com/llvm/llvm-project/issues/153031.
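The hazard can be shown with a small standalone model. Everything here is
hypothetical (`Dim`, `impureSelect`, and the two helper functions are not
flang APIs). It only illustrates why forming `lower bound + extent - 1` from
two copies of the same entity is wrong when that entity contains an impure
call:

```cpp
#include <cstdint>

// Hypothetical per-dimension record: a lower bound and an extent, no upper bound.
struct Dim {
  std::int64_t lowerBound;
  std::int64_t extent;
};

// Counts how often the "impure function" below has been invoked.
static int callCount = 0;

// Stand-in for an entity whose evaluation calls an impure function.
inline const Dim &impureSelect(const Dim *dims) {
  ++callCount; // observable side effect
  return dims[0];
}

// Duplicates the entity: the impure call is evaluated twice.
inline std::int64_t duplicatedUpperBound(const Dim *dims) {
  return impureSelect(dims).lowerBound + impureSelect(dims).extent - 1;
}

// Evaluates the entity once, then derives the upper bound from it.
inline std::int64_t singleEvalUpperBound(const Dim *dims) {
  const Dim &d = impureSelect(dims);
  return d.lowerBound + d.extent - 1;
}
```

Both helpers return the same value here, but the first one doubles the side
effect. Note that the patch itself does not introduce a temporary; per the
diff, it simply declines to build the duplicated expression when the entity
is not safely copyable.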
---
flang/include/flang/Evaluate/tools.h | 27 +++++++++++++++++----------
flang/lib/Evaluate/shape.cpp | 10 ++++++----
flang/test/Evaluate/bug153031.f90 | 18 ++++++++++++++++++
3 files changed, 41 insertions(+), 14 deletions(-)
create mode 100644 flang/test/Evaluate/bug153031.f90
diff --git a/flang/include/flang/Evaluate/tools.h b/flang/include/flang/Evaluate/tools.h
index 212356136d6ee..74c6acbcb1ed5 100644
--- a/flang/include/flang/Evaluate/tools.h
+++ b/flang/include/flang/Evaluate/tools.h
@@ -1144,15 +1144,14 @@ std::optional<std::string> FindImpureCall(
std::optional<std::string> FindImpureCall(
FoldingContext &, const ProcedureRef &);
-// Predicate: is a scalar expression suitable for naive scalar expansion
-// in the flattening of an array expression?
-// TODO: capture such scalar expansions in temporaries, flatten everything
-class UnexpandabilityFindingVisitor
- : public AnyTraverse<UnexpandabilityFindingVisitor> {
+// Predicate: does an expression contain anything that would prevent it from
+// being duplicated so that two instances of it then appear in the same
+// expression?
+class UnsafeToCopyVisitor : public AnyTraverse<UnsafeToCopyVisitor> {
public:
- using Base = AnyTraverse<UnexpandabilityFindingVisitor>;
+ using Base = AnyTraverse<UnsafeToCopyVisitor>;
using Base::operator();
- explicit UnexpandabilityFindingVisitor(bool admitPureCall)
+ explicit UnsafeToCopyVisitor(bool admitPureCall)
: Base{*this}, admitPureCall_{admitPureCall} {}
template <typename T> bool operator()(const FunctionRef<T> &procRef) {
return !admitPureCall_ || !procRef.proc().IsPure();
@@ -1163,14 +1162,22 @@ class UnexpandabilityFindingVisitor
bool admitPureCall_{false};
};
+template <typename A>
+bool IsSafelyCopyable(const A &x, bool admitPureCall = false) {
+ return !UnsafeToCopyVisitor{admitPureCall}(x);
+}
+
+// Predicate: is a scalar expression suitable for naive scalar expansion
+// in the flattening of an array expression?
+// TODO: capture such scalar expansions in temporaries, flatten everything
template <typename T>
bool IsExpandableScalar(const Expr<T> &expr, FoldingContext &context,
const Shape &shape, bool admitPureCall = false) {
- if (UnexpandabilityFindingVisitor{admitPureCall}(expr)) {
+ if (IsSafelyCopyable(expr, admitPureCall)) {
+ return true;
+ } else {
auto extents{AsConstantExtents(context, shape)};
return extents && !HasNegativeExtent(*extents) && GetSize(*extents) == 1;
- } else {
- return true;
}
}
diff --git a/flang/lib/Evaluate/shape.cpp b/flang/lib/Evaluate/shape.cpp
index 776866d1416d2..894049f32a6bf 100644
--- a/flang/lib/Evaluate/shape.cpp
+++ b/flang/lib/Evaluate/shape.cpp
@@ -623,7 +623,7 @@ MaybeExtentExpr GetRawUpperBound(
} else if (semantics::IsAssumedSizeArray(symbol) &&
dimension + 1 == symbol.Rank()) {
return std::nullopt;
- } else {
+ } else if (IsSafelyCopyable(base, /*admitPureCall=*/true)) {
return ComputeUpperBound(
GetRawLowerBound(base, dimension), GetExtent(base, dimension));
}
@@ -678,9 +678,11 @@ static MaybeExtentExpr GetUBOUND(FoldingContext *context,
} else if (semantics::IsAssumedSizeArray(symbol) &&
dimension + 1 == symbol.Rank()) {
return std::nullopt; // UBOUND() folding replaces with -1
- } else if (auto lb{GetLBOUND(base, dimension, invariantOnly)}) {
- return ComputeUpperBound(
- std::move(*lb), GetExtent(base, dimension, invariantOnly));
+ } else if (IsSafelyCopyable(base, /*admitPureCall=*/true)) {
+ if (auto lb{GetLBOUND(base, dimension, invariantOnly)}) {
+ return ComputeUpperBound(
+ std::move(*lb), GetExtent(base, dimension, invariantOnly));
+ }
}
}
} else if (const auto *assoc{
diff --git a/flang/test/Evaluate/bug153031.f90 b/flang/test/Evaluate/bug153031.f90
new file mode 100644
index 0000000000000..a717954ecaed1
--- /dev/null
+++ b/flang/test/Evaluate/bug153031.f90
@@ -0,0 +1,18 @@
+! RUN: %flang_fc1 -fdebug-unparse %s 2>&1 | FileCheck %s
+! Ensure that UBOUND() calculation from LBOUND()+SIZE() isn't applied to
+! variables containing references to impure functions.
+type t
+ real, allocatable :: a(:)
+end type
+interface
+ pure integer function pure(n)
+ integer, intent(in) :: n
+ end
+end interface
+type(t) :: x(10)
+allocate(x(1)%a(2))
+!CHECK: PRINT *, ubound(x(int(impure(1_4),kind=8))%a,dim=1_4)
+print *, ubound(x(impure(1))%a, dim=1)
+!CHECK: PRINT *, int(size(x(int(pure(1_4),kind=8))%a,dim=1,kind=8)+lbound(x(int(pure(1_4),kind=8))%a,dim=1,kind=8)-1_8,kind=4)
+print *, ubound(x(pure(1))%a, dim=1)
+end
From c53792b278f1b0415b0071607b31818248222187 Mon Sep 17 00:00:00 2001
From: Peter Klausler <pklausler at nvidia.com>
Date: Mon, 18 Aug 2025 14:44:00 -0700
Subject: [PATCH 105/112] [flang][runtime] OPEN(existingUnit,POSITION=)
(#153688)
Ensure that when a connected unit is reopened with POSITION='REWIND' or
'APPEND' and with STATUS='OLD' or no STATUS= specifier, it is actually
repositioned as requested.
Fixes https://github.com/llvm/llvm-project/issues/153426.
---
flang-rt/include/flang-rt/runtime/file.h | 4 ++--
flang-rt/lib/runtime/external-unit.cpp | 14 +++++++-------
flang-rt/lib/runtime/file.cpp | 24 ++++++++++++++++--------
flang-rt/lib/runtime/unit.h | 4 ++++
4 files changed, 29 insertions(+), 17 deletions(-)
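
Note (illustration only, not part of the patch): the intended repositioning
of an already-connected unit can be summarized as a small pure function. The
helper name `RepositionOnReopen` and its parameters are hypothetical; the real
runtime performs seeks on the file descriptor rather than computing an
offset.

```cpp
#include <cassert>
#include <cstdint>

enum class Position { AsIs, Rewind, Append };

// Hypothetical helper: the offset a connected unit should have after
// OPEN(existingUnit, POSITION=...) with STATUS='OLD' or no STATUS=.
std::int64_t RepositionOnReopen(
    Position position, std::int64_t currentOffset, std::int64_t fileSize) {
  switch (position) {
  case Position::Rewind:
    return 0; // back to the start of the file
  case Position::Append:
    return fileSize; // just past the last byte
  case Position::AsIs:
    return currentOffset; // leave the unit where it is
  }
  return currentOffset;
}
```

Before this change the early-return path for an existing unit skipped the
Rewind and Append cases entirely, so the unit stayed at its current offset.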
diff --git a/flang-rt/include/flang-rt/runtime/file.h b/flang-rt/include/flang-rt/runtime/file.h
index 3bba29722b3b8..6e35fe89b5341 100644
--- a/flang-rt/include/flang-rt/runtime/file.h
+++ b/flang-rt/include/flang-rt/runtime/file.h
@@ -68,7 +68,7 @@ class OpenFile {
void WaitAll(IoErrorHandler &);
// INQUIRE(POSITION=)
- Position InquirePosition() const;
+ Position InquirePosition(FileOffset offset) const;
private:
struct Pending {
@@ -80,7 +80,7 @@ class OpenFile {
void CheckOpen(const Terminator &);
bool Seek(FileOffset, IoErrorHandler &);
bool RawSeek(FileOffset);
- bool RawSeekToEnd();
+ bool SeekToEnd(IoErrorHandler &);
int PendingResult(const Terminator &, int);
void SetPosition(FileOffset pos) {
position_ = pos;
diff --git a/flang-rt/lib/runtime/external-unit.cpp b/flang-rt/lib/runtime/external-unit.cpp
index b8004d6315994..42441e59d9bb6 100644
--- a/flang-rt/lib/runtime/external-unit.cpp
+++ b/flang-rt/lib/runtime/external-unit.cpp
@@ -57,7 +57,7 @@ ExternalFileUnit *ExternalFileUnit::LookUpOrCreate(
}
ExternalFileUnit *ExternalFileUnit::LookUpOrCreateAnonymous(int unit,
- Direction dir, Fortran::common::optional<bool> isUnformatted,
+ Direction dir, common::optional<bool> isUnformatted,
IoErrorHandler &handler) {
// Make sure that the returned anonymous unit has been opened,
// not just created in the unitMap.
@@ -109,8 +109,8 @@ ExternalFileUnit &ExternalFileUnit::NewUnit(
return unit;
}
-bool ExternalFileUnit::OpenUnit(Fortran::common::optional<OpenStatus> status,
- Fortran::common::optional<Action> action, Position position,
+bool ExternalFileUnit::OpenUnit(common::optional<OpenStatus> status,
+ common::optional<Action> action, Position position,
OwningPtr<char> &&newPath, std::size_t newPathLength, Convert convert,
IoErrorHandler &handler) {
if (convert == Convert::Unknown) {
@@ -131,6 +131,7 @@ bool ExternalFileUnit::OpenUnit(Fortran::common::optional<OpenStatus> status,
if (!newPath.get() || isSamePath) {
// OPEN of existing unit, STATUS='OLD' or unspecified, not new FILE=
newPath.reset();
+ Open(status.value_or(OpenStatus::Old), action, position, handler);
return impliedClose;
}
// Otherwise, OPEN on open unit with new FILE= implies CLOSE
@@ -194,10 +195,9 @@ bool ExternalFileUnit::OpenUnit(Fortran::common::optional<OpenStatus> status,
return impliedClose;
}
-bool ExternalFileUnit::OpenAnonymousUnit(
- Fortran::common::optional<OpenStatus> status,
- Fortran::common::optional<Action> action, Position position,
- Convert convert, IoErrorHandler &handler) {
+bool ExternalFileUnit::OpenAnonymousUnit(common::optional<OpenStatus> status,
+ common::optional<Action> action, Position position, Convert convert,
+ IoErrorHandler &handler) {
// I/O to an unconnected unit reads/creates a local file, e.g. fort.7
std::size_t pathMaxLen{32};
auto path{SizedNew<char>{handler}(pathMaxLen)};
diff --git a/flang-rt/lib/runtime/file.cpp b/flang-rt/lib/runtime/file.cpp
index 16e73db488727..8255ec8691886 100644
--- a/flang-rt/lib/runtime/file.cpp
+++ b/flang-rt/lib/runtime/file.cpp
@@ -60,10 +60,16 @@ static int openfile_mkstemp(IoErrorHandler &handler) {
return fd;
}
-void OpenFile::Open(OpenStatus status, Fortran::common::optional<Action> action,
+void OpenFile::Open(OpenStatus status, common::optional<Action> action,
Position position, IoErrorHandler &handler) {
if (fd_ >= 0 &&
(status == OpenStatus::Old || status == OpenStatus::Unknown)) {
+ if (position == Position::Rewind) {
+ Seek(0, handler);
+ } else if (position == Position::Append) {
+ SeekToEnd(handler);
+ }
+ openPosition_ = position; // for INQUIRE(POSITION=)
return;
}
CloseFd(handler);
@@ -131,8 +137,8 @@ void OpenFile::Open(OpenStatus status, Fortran::common::optional<Action> action,
}
RUNTIME_CHECK(handler, action.has_value());
pending_.reset();
- if (fd_ >= 0 && position == Position::Append && !RawSeekToEnd()) {
- handler.SignalError(IostatOpenBadAppend);
+ if (fd_ >= 0 && position == Position::Append) {
+ SeekToEnd(handler);
}
isTerminal_ = fd_ >= 0 && IsATerminal(fd_);
mayRead_ = *action != Action::Write;
@@ -322,7 +328,7 @@ int OpenFile::WriteAsynchronously(FileOffset at, const char *buffer,
}
void OpenFile::Wait(int id, IoErrorHandler &handler) {
- Fortran::common::optional<int> ioStat;
+ common::optional<int> ioStat;
Pending *prev{nullptr};
for (Pending *p{pending_.get()}; p; p = (prev = p)->next.get()) {
if (p->id == id) {
@@ -353,13 +359,13 @@ void OpenFile::WaitAll(IoErrorHandler &handler) {
}
}
-Position OpenFile::InquirePosition() const {
+Position OpenFile::InquirePosition(FileOffset offset) const {
if (openPosition_) { // from OPEN statement
return *openPosition_;
} else { // unit has been repositioned since opening
- if (position_ == knownSize_.value_or(position_ + 1)) {
+ if (offset == knownSize_.value_or(offset + 1)) {
return Position::Append;
- } else if (position_ == 0 && mayPosition_) {
+ } else if (offset == 0 && mayPosition_) {
return Position::Rewind;
} else {
return Position::AsIs; // processor-dependent & no common behavior
@@ -391,7 +397,7 @@ bool OpenFile::RawSeek(FileOffset at) {
#endif
}
-bool OpenFile::RawSeekToEnd() {
+bool OpenFile::SeekToEnd(IoErrorHandler &handler) {
#ifdef _LARGEFILE64_SOURCE
std::int64_t at{::lseek64(fd_, 0, SEEK_END)};
#else
@@ -399,8 +405,10 @@ bool OpenFile::RawSeekToEnd() {
#endif
if (at >= 0) {
knownSize_ = at;
+ SetPosition(at);
return true;
} else {
+ handler.SignalError(IostatOpenBadAppend);
return false;
}
}
diff --git a/flang-rt/lib/runtime/unit.h b/flang-rt/lib/runtime/unit.h
index f266a486bb708..34b7c3972bd96 100644
--- a/flang-rt/lib/runtime/unit.h
+++ b/flang-rt/lib/runtime/unit.h
@@ -197,6 +197,10 @@ class ExternalFileUnit : public ConnectionState,
RT_API_ATTRS int GetAsynchronousId(IoErrorHandler &);
RT_API_ATTRS bool Wait(int);
+ RT_API_ATTRS Position InquirePosition() const {
+ return OpenFile::InquirePosition(
+ static_cast<FileOffset>(frameOffsetInFile_ + recordOffsetInFrame_));
+ }
private:
static RT_API_ATTRS UnitMap &CreateUnitMap();
From ffec26698080b3db8ef7726e4e5cf6029f07b02b Mon Sep 17 00:00:00 2001
From: Peter Klausler <pklausler at nvidia.com>
Date: Mon, 18 Aug 2025 14:44:23 -0700
Subject: [PATCH 106/112] [flang][runtime] Catch bad OPEN specifiers for
unformatted files (#153707)
When an OPEN statement has specifiers that are allowed only for
formatted files, detect an error when the file turns out to be
unformatted.
Fixes https://github.com/llvm/llvm-project/issues/153480.
---
flang-rt/include/flang-rt/runtime/io-error.h | 3 +++
flang-rt/include/flang-rt/runtime/io-stmt.h | 4 ++++
flang-rt/lib/runtime/io-api.cpp | 24 ++++++++++++++++++--
flang-rt/lib/runtime/io-stmt.cpp | 11 +++++++++
4 files changed, 40 insertions(+), 2 deletions(-)
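
Note (illustration only, not part of the patch): the condition under which
the new error is reported can be sketched as a predicate over three pieces of
state. The `OpenRequest` struct and `ShouldSignalFormError` are hypothetical
names; the fields loosely correspond to `isUnformatted`, `mustBeFormatted_`,
and `HasErrorRecovery()` in the patch.

```cpp
#include <cassert>
#include <optional>

// Hypothetical summary of what an OPEN statement has established so far.
struct OpenRequest {
  std::optional<bool> isUnformatted; // empty until FORM= or a default is known
  bool mustBeFormatted = false;      // set by BLANK=, DECIMAL=, PAD=, etc.
  bool hasErrorRecovery = false;     // IOSTAT= or ERR= is present
};

// The error is reported only when the unit is unformatted, a formatted-only
// specifier appeared, and the program asked to be told about errors.
bool ShouldSignalFormError(const OpenRequest &request) {
  return request.isUnformatted.value_or(false) && request.mustBeFormatted &&
      request.hasErrorRecovery;
}
```

Gating on error recovery keeps programs without IOSTAT= or ERR= running, in
line with the patch comment that most other compilers let this case slide.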
diff --git a/flang-rt/include/flang-rt/runtime/io-error.h b/flang-rt/include/flang-rt/runtime/io-error.h
index 1cef6a208f374..3e8401036f289 100644
--- a/flang-rt/include/flang-rt/runtime/io-error.h
+++ b/flang-rt/include/flang-rt/runtime/io-error.h
@@ -38,6 +38,9 @@ class IoErrorHandler : public Terminator {
RT_API_ATTRS bool InError() const {
return ioStat_ != IostatOk || pendingError_ != IostatOk;
}
+ RT_API_ATTRS bool HasErrorRecovery() const {
+ return (flags_ & (hasIoStat | hasErr)) != 0;
+ }
// For I/O statements that detect fatal errors in their
// Begin...() API routines before it is known whether they
diff --git a/flang-rt/include/flang-rt/runtime/io-stmt.h b/flang-rt/include/flang-rt/runtime/io-stmt.h
index 9f71d515cb615..7693b60cccfc9 100644
--- a/flang-rt/include/flang-rt/runtime/io-stmt.h
+++ b/flang-rt/include/flang-rt/runtime/io-stmt.h
@@ -729,6 +729,9 @@ class OpenStatementState : public ExternalIoStatementBase {
RT_API_ATTRS void set_isUnformatted(bool yes = true) {
isUnformatted_ = yes;
} // FORM=
+ RT_API_ATTRS void set_mustBeFormatted(bool yes = true) {
+ mustBeFormatted_ = yes;
+ }
RT_API_ATTRS void CompleteOperation();
RT_API_ATTRS int EndIoStatement();
@@ -743,6 +746,7 @@ class OpenStatementState : public ExternalIoStatementBase {
OwningPtr<char> path_;
std::size_t pathLength_{};
Fortran::common::optional<bool> isUnformatted_;
+ Fortran::common::optional<bool> mustBeFormatted_;
Fortran::common::optional<Access> access_;
};
diff --git a/flang-rt/lib/runtime/io-api.cpp b/flang-rt/lib/runtime/io-api.cpp
index 6af0121437cd5..c7c15e77c0770 100644
--- a/flang-rt/lib/runtime/io-api.cpp
+++ b/flang-rt/lib/runtime/io-api.cpp
@@ -528,6 +528,9 @@ bool IODEF(SetAdvance)(Cookie cookie, const char *keyword, std::size_t length) {
bool IODEF(SetBlank)(Cookie cookie, const char *keyword, std::size_t length) {
IoStatementState &io{*cookie};
+ if (auto *open{io.get_if<OpenStatementState>()}) {
+ open->set_mustBeFormatted();
+ }
static const char *keywords[]{"NULL", "ZERO", nullptr};
switch (IdentifyValue(keyword, length, keywords)) {
case 0:
@@ -545,6 +548,9 @@ bool IODEF(SetBlank)(Cookie cookie, const char *keyword, std::size_t length) {
bool IODEF(SetDecimal)(Cookie cookie, const char *keyword, std::size_t length) {
IoStatementState &io{*cookie};
+ if (auto *open{io.get_if<OpenStatementState>()}) {
+ open->set_mustBeFormatted();
+ }
static const char *keywords[]{"COMMA", "POINT", nullptr};
switch (IdentifyValue(keyword, length, keywords)) {
case 0:
@@ -562,6 +568,9 @@ bool IODEF(SetDecimal)(Cookie cookie, const char *keyword, std::size_t length) {
bool IODEF(SetDelim)(Cookie cookie, const char *keyword, std::size_t length) {
IoStatementState &io{*cookie};
+ if (auto *open{io.get_if<OpenStatementState>()}) {
+ open->set_mustBeFormatted();
+ }
static const char *keywords[]{"APOSTROPHE", "QUOTE", "NONE", nullptr};
switch (IdentifyValue(keyword, length, keywords)) {
case 0:
@@ -583,6 +592,9 @@ bool IODEF(SetDelim)(Cookie cookie, const char *keyword, std::size_t length) {
bool IODEF(SetPad)(Cookie cookie, const char *keyword, std::size_t length) {
IoStatementState &io{*cookie};
IoErrorHandler &handler{io.GetIoErrorHandler()};
+ if (auto *open{io.get_if<OpenStatementState>()}) {
+ open->set_mustBeFormatted();
+ }
io.mutableModes().pad = YesOrNo(keyword, length, "PAD", handler);
return !handler.InError();
}
@@ -617,6 +629,9 @@ bool IODEF(SetRec)(Cookie cookie, std::int64_t rec) {
bool IODEF(SetRound)(Cookie cookie, const char *keyword, std::size_t length) {
IoStatementState &io{*cookie};
+ if (auto *open{io.get_if<OpenStatementState>()}) {
+ open->set_mustBeFormatted();
+ }
static const char *keywords[]{"UP", "DOWN", "ZERO", "NEAREST", "COMPATIBLE",
"PROCESSOR_DEFINED", nullptr};
switch (IdentifyValue(keyword, length, keywords)) {
@@ -647,6 +662,9 @@ bool IODEF(SetRound)(Cookie cookie, const char *keyword, std::size_t length) {
bool IODEF(SetSign)(Cookie cookie, const char *keyword, std::size_t length) {
IoStatementState &io{*cookie};
+ if (auto *open{io.get_if<OpenStatementState>()}) {
+ open->set_mustBeFormatted();
+ }
static const char *keywords[]{
"PLUS", "SUPPRESS", "PROCESSOR_DEFINED", nullptr};
switch (IdentifyValue(keyword, length, keywords)) {
@@ -784,6 +802,7 @@ bool IODEF(SetCarriagecontrol)(
io.GetIoErrorHandler().Crash(
"SetCarriageControl() called after GetNewUnit() for an OPEN statement");
}
+ open->set_mustBeFormatted();
static const char *keywords[]{"LIST", "FORTRAN", "NONE", nullptr};
switch (IdentifyValue(keyword, length, keywords)) {
case 0:
@@ -840,6 +859,7 @@ bool IODEF(SetEncoding)(
io.GetIoErrorHandler().Crash(
"SetEncoding() called after GetNewUnit() for an OPEN statement");
}
+ open->set_mustBeFormatted();
// Allow the encoding to be changed on an open unit -- it's
// useful and safe.
static const char *keywords[]{"UTF-8", "DEFAULT", nullptr};
@@ -872,10 +892,10 @@ bool IODEF(SetForm)(Cookie cookie, const char *keyword, std::size_t length) {
}
static const char *keywords[]{"FORMATTED", "UNFORMATTED", "BINARY", nullptr};
switch (IdentifyValue(keyword, length, keywords)) {
- case 0:
+ case 0: // FORM='FORMATTED'
open->set_isUnformatted(false);
break;
- case 1:
+ case 1: // FORM='UNFORMATTED'
open->set_isUnformatted(true);
break;
case 2: // legacy FORM='BINARY' means an unformatted stream
diff --git a/flang-rt/lib/runtime/io-stmt.cpp b/flang-rt/lib/runtime/io-stmt.cpp
index e08088fab4311..c462f60b6b019 100644
--- a/flang-rt/lib/runtime/io-stmt.cpp
+++ b/flang-rt/lib/runtime/io-stmt.cpp
@@ -352,6 +352,17 @@ void OpenStatementState::CompleteOperation() {
// Set default format (C.7.4 point 2).
unit().isUnformatted = unit().access != Access::Sequential;
}
+ if (unit().isUnformatted.value_or(false) && mustBeFormatted_) {
+ // This is an unformatted unit, but the OPEN statement contained at least
+ // one specifier that is not permitted unless the unit is formatted
+ // (e.g., BLANK=). Programs that want to detect this error (i.e., tests)
+ // should be informed about it, but don't crash the program otherwise
+ // since most other compilers let it slide.
+ if (HasErrorRecovery()) {
+ SignalError("FORM='UNFORMATTED' is not allowed with OPEN specifiers that "
+ "apply only to formatted units");
+ }
+ }
if (!wasExtant_ && InError()) {
// Release the new unit on failure
set_destroy();
From 9a7a16c8d5a5927bfadd05e01c288a4fada00830 Mon Sep 17 00:00:00 2001
From: Peter Klausler <pklausler at nvidia.com>
Date: Mon, 18 Aug 2025 14:44:48 -0700
Subject: [PATCH 107/112] [flang][runtime] Allow child NAMELIST input to
advance records (#153963)
NAMELIST input in child I/O is rare, and it's not clear in the standard
whether it should be allowed to advance to later records in the parent
unit. But GNU Fortran supports it, and there's no good reason not to do
so since a NAMELIST input group that isn't terminated on the same line
is otherwise going to be a fatal error.
Fixes https://github.com/llvm/llvm-project/issues/153416.
---
flang-rt/include/flang-rt/runtime/io-stmt.h | 1 +
flang-rt/lib/runtime/io-stmt.cpp | 20 +++++++++++++++++---
2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/flang-rt/include/flang-rt/runtime/io-stmt.h b/flang-rt/include/flang-rt/runtime/io-stmt.h
index 7693b60cccfc9..3d1ca5091e923 100644
--- a/flang-rt/include/flang-rt/runtime/io-stmt.h
+++ b/flang-rt/include/flang-rt/runtime/io-stmt.h
@@ -696,6 +696,7 @@ class ChildListIoStatementState : public ChildIoStatementState<DIR>,
RT_API_ATTRS ChildListIoStatementState(
ChildIo &, const char *sourceFile = nullptr, int sourceLine = 0);
using ListDirectedStatementState<DIR>::GetNextDataEdit;
+ RT_API_ATTRS bool AdvanceRecord(int = 1);
RT_API_ATTRS int EndIoStatement();
};
diff --git a/flang-rt/lib/runtime/io-stmt.cpp b/flang-rt/lib/runtime/io-stmt.cpp
index c462f60b6b019..28149090eb169 100644
--- a/flang-rt/lib/runtime/io-stmt.cpp
+++ b/flang-rt/lib/runtime/io-stmt.cpp
@@ -1106,10 +1106,14 @@ ChildListIoStatementState<DIR>::ChildListIoStatementState(
}
template <Direction DIR>
-bool ChildUnformattedIoStatementState<DIR>::Receive(
- char *data, std::size_t bytes, std::size_t elementBytes) {
+bool ChildListIoStatementState<DIR>::AdvanceRecord(int n) {
#if !defined(RT_DEVICE_AVOID_RECURSION)
- return this->child().parent().Receive(data, bytes, elementBytes);
+ // Allow child NAMELIST input to advance
+ if (DIR == Direction::Input && this->mutableModes().inNamelist) {
+ return this->child().parent().AdvanceRecord(n);
+ } else {
+ return false;
+ }
#else
this->ReportUnsupportedChildIo();
#endif
@@ -1125,6 +1129,16 @@ template <Direction DIR> int ChildListIoStatementState<DIR>::EndIoStatement() {
return ChildIoStatementState<DIR>::EndIoStatement();
}
+template <Direction DIR>
+bool ChildUnformattedIoStatementState<DIR>::Receive(
+ char *data, std::size_t bytes, std::size_t elementBytes) {
+#if !defined(RT_DEVICE_AVOID_RECURSION)
+ return this->child().parent().Receive(data, bytes, elementBytes);
+#else
+ this->ReportUnsupportedChildIo();
+#endif
+}
+
template class InternalIoStatementState<Direction::Output>;
template class InternalIoStatementState<Direction::Input>;
template class InternalFormattedIoStatementState<Direction::Output>;
From 638f8636df801271833daf41184992c9ec329704 Mon Sep 17 00:00:00 2001
From: Peter Klausler <pklausler at nvidia.com>
Date: Mon, 18 Aug 2025 14:45:14 -0700
Subject: [PATCH 108/112] [flang][runtime] Account for missing READ(SIZE=)
characters (#153967)
One of the two formatted real input paths was failing to call GotChar()
to account for the characters that it consumes.
Fixes https://github.com/llvm/llvm-project/issues/153830.
---
flang-rt/lib/runtime/edit-input.cpp | 1 +
1 file changed, 1 insertion(+)
diff --git a/flang-rt/lib/runtime/edit-input.cpp b/flang-rt/lib/runtime/edit-input.cpp
index 4f01623c6cf19..1bfc16cbc966d 100644
--- a/flang-rt/lib/runtime/edit-input.cpp
+++ b/flang-rt/lib/runtime/edit-input.cpp
@@ -663,6 +663,7 @@ static RT_API_ATTRS bool TryFastPathRealDecimalInput(
*reinterpret_cast<decimal::BinaryFloatingPointNumber<PRECISION> *>(n) =
converted.binary;
io.HandleRelativePosition(p - str);
+ io.GotChar(p - str);
// Set FP exception flags
if (converted.flags != decimal::ConversionResultFlags::Exact) {
RaiseFPExceptions(converted.flags);
From 50a40738d65e6c3df83777f39503684eedd1a559 Mon Sep 17 00:00:00 2001
From: Peter Klausler <pklausler at nvidia.com>
Date: Mon, 18 Aug 2025 14:45:38 -0700
Subject: [PATCH 109/112] [flang] Catch semantic error with LBOUND/UBOUND
(#154184)
The "ARRAY=" argument to these intrinsics cannot be scalar, whether
"DIM=" is present or not. (Allowing the "ARRAY=" argument to be scalar
when "DIM=" is absent would be a conceivable extension returning an
empty result array, like SHAPE() does with extents, but it doesn't seem
useful in a programming language without compilation-time rank
polymorphism apart from assumed-rank dummy arguments, and those are
supported.)
Fixes https://github.com/llvm/llvm-project/issues/154044.
---
flang/lib/Evaluate/intrinsics.cpp | 4 ++--
flang/test/Evaluate/errors01.f90 | 10 +++++++---
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/flang/lib/Evaluate/intrinsics.cpp b/flang/lib/Evaluate/intrinsics.cpp
index c37a7f908d4d1..0f79ba6ed62b6 100644
--- a/flang/lib/Evaluate/intrinsics.cpp
+++ b/flang/lib/Evaluate/intrinsics.cpp
@@ -666,7 +666,7 @@ static const IntrinsicInterface genericIntrinsicFunction[]{
{ArgFlag::canBeMoldNull, ArgFlag::onlyConstantInquiry}}},
DefaultInt, Rank::elemental, IntrinsicClass::inquiryFunction},
{"lbound",
- {{"array", AnyData, Rank::anyOrAssumedRank}, RequiredDIM,
+ {{"array", AnyData, Rank::arrayOrAssumedRank}, RequiredDIM,
SizeDefaultKIND},
KINDInt, Rank::scalar, IntrinsicClass::inquiryFunction},
{"lbound", {{"array", AnyData, Rank::arrayOrAssumedRank}, SizeDefaultKIND},
@@ -1034,7 +1034,7 @@ static const IntrinsicInterface genericIntrinsicFunction[]{
{"trim", {{"string", SameCharNoLen, Rank::scalar}}, SameCharNoLen,
Rank::scalar, IntrinsicClass::transformationalFunction},
{"ubound",
- {{"array", AnyData, Rank::anyOrAssumedRank}, RequiredDIM,
+ {{"array", AnyData, Rank::arrayOrAssumedRank}, RequiredDIM,
SizeDefaultKIND},
KINDInt, Rank::scalar, IntrinsicClass::inquiryFunction},
{"ubound", {{"array", AnyData, Rank::arrayOrAssumedRank}, SizeDefaultKIND},
diff --git a/flang/test/Evaluate/errors01.f90 b/flang/test/Evaluate/errors01.f90
index b20922237f240..90a0c300e3567 100644
--- a/flang/test/Evaluate/errors01.f90
+++ b/flang/test/Evaluate/errors01.f90
@@ -6,8 +6,8 @@ module m
real x
end type t
contains
- subroutine s1(a,b,c)
- real :: a(*), b(:), c(..)
+ subroutine s1(a,b,c,d)
+ real :: a(*), b(:), c(..), d
!CHECK: error: DIM=1 dimension is out of range for rank-1 assumed-size array
integer :: ub1(ubound(a,1))
!CHECK-NOT: error: DIM=1 dimension is out of range for rank-1 assumed-size array
@@ -23,7 +23,11 @@ subroutine s1(a,b,c)
!CHECK: error: DIM=0 dimension must be positive
integer :: lb4(lbound(c,0))
!CHECK: error: DIM=666 dimension is too large for any array (maximum rank 15)
- integer :: lb4(lbound(c,666))
+ integer :: lb5(lbound(c,666))
+ !CHECK: error: 'array=' argument has unacceptable rank 0
+ integer :: lb6(lbound(d,1))
+ !CHECK: error: 'array=' argument has unacceptable rank 0
+ integer :: ub4(ubound(d,1))
end subroutine
subroutine s2
integer, parameter :: array(2,3) = reshape([(j, j=1, 6)], shape(array))
From a4cff34f3f5717d18e7dfccbc38a14cccee8afd9 Mon Sep 17 00:00:00 2001
From: Daniel Paoliello <danpao at microsoft.com>
Date: Mon, 18 Aug 2025 15:00:59 -0700
Subject: [PATCH 110/112] [win][x64] Permit lea to adjust the stack when using
unwind v2 (#154171)
In some cases `leaq` may be used to adjust the stack in an epilog; this
is permitted by unwind v2 and shouldn't raise an error.
---
llvm/lib/Target/X86/X86WinEHUnwindV2.cpp | 4 +++-
.../CodeGen/X86/win64-eh-unwindv2-errors.mir | 2 +-
llvm/test/CodeGen/X86/win64-eh-unwindv2.ll | 19 +++++++++++++++++++
3 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/llvm/lib/Target/X86/X86WinEHUnwindV2.cpp b/llvm/lib/Target/X86/X86WinEHUnwindV2.cpp
index 7fa77ee8204a9..ea8b88f41bb87 100644
--- a/llvm/lib/Target/X86/X86WinEHUnwindV2.cpp
+++ b/llvm/lib/Target/X86/X86WinEHUnwindV2.cpp
@@ -190,6 +190,7 @@ bool X86WinEHUnwindV2::runOnMachineFunction(MachineFunction &MF) {
State = FunctionState::FinishedEpilog;
break;
+ case X86::LEA64r:
case X86::MOV64rr:
case X86::ADD64ri32:
if (State == FunctionState::InEpilog) {
@@ -210,7 +211,8 @@ bool X86WinEHUnwindV2::runOnMachineFunction(MachineFunction &MF) {
HasStackDealloc = true;
} else if (State == FunctionState::FinishedEpilog)
return rejectCurrentFunctionInternalError(
- MF, Mode, "Unexpected mov or add instruction after the epilog");
+ MF, Mode,
+ "Unexpected lea, mov or add instruction after the epilog");
break;
case X86::POP64r:
diff --git a/llvm/test/CodeGen/X86/win64-eh-unwindv2-errors.mir b/llvm/test/CodeGen/X86/win64-eh-unwindv2-errors.mir
index ed97e52f2d5c5..de76d90bf6b6c 100644
--- a/llvm/test/CodeGen/X86/win64-eh-unwindv2-errors.mir
+++ b/llvm/test/CodeGen/X86/win64-eh-unwindv2-errors.mir
@@ -106,7 +106,7 @@ body: |
# RUN: -x86-wineh-unwindv2-force-mode=1 | FileCheck %s \
# RUN: --check-prefix=BESTEFFORT
# DEALLOC-AFTER-EPILOG: LLVM ERROR: Windows x64 Unwind v2 is required, but LLVM has generated incompatible code in function 'dealloc_after_epilog':
-# DEALLOC-AFTER-EPILOG-SAME: Unexpected mov or add instruction after the epilog
+# DEALLOC-AFTER-EPILOG-SAME: Unexpected lea, mov or add instruction after the epilog
--- |
define dso_local void @dealloc_after_epilog() local_unnamed_addr {
diff --git a/llvm/test/CodeGen/X86/win64-eh-unwindv2.ll b/llvm/test/CodeGen/X86/win64-eh-unwindv2.ll
index a9fd1b9ac2acd..326127a919f3a 100644
--- a/llvm/test/CodeGen/X86/win64-eh-unwindv2.ll
+++ b/llvm/test/CodeGen/X86/win64-eh-unwindv2.ll
@@ -152,6 +152,25 @@ entry:
; CHECK-NEXT: retq
; CHECK-NEXT: .seh_endproc
+define dso_local void @large_aligned_alloc() align 16 {
+ %1 = alloca [128 x i8], align 64
+ ret void
+}
+; CHECK-LABEL: large_aligned_alloc:
+; CHECK: .seh_unwindversion 2
+; CHECK: .seh_pushreg %rbp
+; CHECK: .seh_stackalloc 176
+; CHECK: .seh_setframe %rbp, 128
+; CHECK: .seh_endprologue
+; CHECK-NOT: .seh_endproc
+; CHECK: .seh_startepilogue
+; CHECK-NEXT: leaq 48(%rbp), %rsp
+; CHECK-NEXT: .seh_unwindv2start
+; CHECK-NEXT: popq %rbp
+; CHECK-NEXT: .seh_endepilogue
+; CHECK-NEXT: retq
+; CHECK-NEXT: .seh_endproc
+
declare void @a() local_unnamed_addr
declare i32 @b() local_unnamed_addr
declare i32 @c(i32) local_unnamed_addr
From a26c3e9491a040e59df787c56974985e471192db Mon Sep 17 00:00:00 2001
From: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date: Mon, 18 Aug 2025 15:04:56 -0700
Subject: [PATCH 111/112] [AMDGPU] User SGPR count increased to 32 on gfx1250
(#154205)
---
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp | 6 +++++-
.../test/CodeGen/AMDGPU/preload-implicit-kernargs.ll | 4 +---
llvm/test/CodeGen/AMDGPU/preload-kernargs.ll | 12 +++++-------
3 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
index c41d62748c4be..42a64f0601cb8 100644
--- a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
@@ -2417,7 +2417,11 @@ unsigned getNSAMaxSize(const MCSubtargetInfo &STI, bool HasSampler) {
return 0;
}
-unsigned getMaxNumUserSGPRs(const MCSubtargetInfo &STI) { return 16; }
+unsigned getMaxNumUserSGPRs(const MCSubtargetInfo &STI) {
+ if (isGFX1250(STI))
+ return 32;
+ return 16;
+}
bool isSI(const MCSubtargetInfo &STI) {
return STI.hasFeature(AMDGPU::FeatureSouthernIslands);
diff --git a/llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs.ll b/llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs.ll
index c87f723086a41..546054cba4700 100644
--- a/llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs.ll
+++ b/llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs.ll
@@ -117,9 +117,7 @@ define amdgpu_kernel void @no_free_sgprs_block_count_x(ptr addrspace(1) inreg %o
;
; GFX1250-LABEL: no_free_sgprs_block_count_x:
; GFX1250: ; %bb.0:
-; GFX1250-NEXT: s_load_b32 s0, s[4:5], 0x28
-; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s0
+; GFX1250-NEXT: v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s18
; GFX1250-NEXT: global_store_b32 v0, v1, s[8:9]
; GFX1250-NEXT: s_endpgm
%imp_arg_ptr = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
diff --git a/llvm/test/CodeGen/AMDGPU/preload-kernargs.ll b/llvm/test/CodeGen/AMDGPU/preload-kernargs.ll
index d5edfb42fa6d1..be86fd18cd737 100644
--- a/llvm/test/CodeGen/AMDGPU/preload-kernargs.ll
+++ b/llvm/test/CodeGen/AMDGPU/preload-kernargs.ll
@@ -471,13 +471,11 @@ define amdgpu_kernel void @v8i32_arg(ptr addrspace(1) nocapture inreg %out, <8 x
;
; GFX1250-LABEL: v8i32_arg:
; GFX1250: ; %bb.0:
-; GFX1250-NEXT: s_load_b256 s[4:11], s[0:1], 0x20
-; GFX1250-NEXT: s_wait_kmcnt 0x0
-; GFX1250-NEXT: v_dual_mov_b32 v8, 0 :: v_dual_mov_b32 v0, s8
-; GFX1250-NEXT: v_dual_mov_b32 v1, s9 :: v_dual_mov_b32 v2, s10
-; GFX1250-NEXT: v_dual_mov_b32 v3, s11 :: v_dual_mov_b32 v4, s4
-; GFX1250-NEXT: v_dual_mov_b32 v5, s5 :: v_dual_mov_b32 v6, s6
-; GFX1250-NEXT: v_mov_b32_e32 v7, s7
+; GFX1250-NEXT: v_dual_mov_b32 v8, 0 :: v_dual_mov_b32 v0, s14
+; GFX1250-NEXT: v_dual_mov_b32 v1, s15 :: v_dual_mov_b32 v2, s16
+; GFX1250-NEXT: v_dual_mov_b32 v3, s17 :: v_dual_mov_b32 v4, s10
+; GFX1250-NEXT: v_dual_mov_b32 v5, s11 :: v_dual_mov_b32 v6, s12
+; GFX1250-NEXT: v_mov_b32_e32 v7, s13
; GFX1250-NEXT: s_clause 0x1
; GFX1250-NEXT: global_store_b128 v8, v[0:3], s[2:3] offset:16
; GFX1250-NEXT: global_store_b128 v8, v[4:7], s[2:3]
From 9e7697a59fffe5541fc8c1d24c85f6f2ee64803a Mon Sep 17 00:00:00 2001
From: Utkarsh Saxena <usx at google.com>
Date: Sat, 16 Aug 2025 11:45:30 +0000
Subject: [PATCH 112/112] Add decl/expr name to Origin's debug output
---
clang/lib/Analysis/LifetimeSafety.cpp | 48 ++++--
.../Sema/warn-lifetime-safety-dataflow.cpp | 153 +++++++++---------
2 files changed, 111 insertions(+), 90 deletions(-)
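
Note (illustration only, not part of the patch): the new debug format for an
origin is the numeric ID followed by a parenthesized description. The sketch
below reproduces that format with standard streams; `ToyOrigin` and
`DumpOrigin` are hypothetical stand-ins for the analysis's `Origin` and
`OriginManager::dump`, and plain strings replace the Clang AST nodes.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for an Origin: at most one of the fields is set.
struct ToyOrigin {
  std::string declName;  // name of the declaration, if the origin has one
  std::string exprClass; // statement class name, if it is an expression
};

// Formats an origin the way the patch's dump() does: "<id> (<description>)".
std::string DumpOrigin(unsigned id, const ToyOrigin &origin) {
  std::ostringstream os;
  os << id << " (";
  if (!origin.declName.empty()) {
    os << "Decl: " << origin.declName;
  } else if (!origin.exprClass.empty()) {
    os << "Expr: " << origin.exprClass;
  } else {
    os << "Unknown";
  }
  os << ")";
  return os.str();
}
```

A fact line then reads, for example, `Use (0 (Decl: p))` instead of the
earlier `Use (OriginID: 0)`, assuming a pointer variable named `p`.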
diff --git a/clang/lib/Analysis/LifetimeSafety.cpp b/clang/lib/Analysis/LifetimeSafety.cpp
index ba9f7d0f6ee36..c2e6dd74d0758 100644
--- a/clang/lib/Analysis/LifetimeSafety.cpp
+++ b/clang/lib/Analysis/LifetimeSafety.cpp
@@ -175,6 +175,18 @@ class OriginManager {
return NewID;
}
+ void dump(OriginID OID, llvm::raw_ostream &OS) const {
+ OS << OID << " (";
+ Origin O = getOrigin(OID);
+ if (const ValueDecl *VD = O.getDecl())
+ OS << "Decl: " << VD->getNameAsString();
+ else if (const Expr *E = O.getExpr())
+ OS << "Expr: " << E->getStmtClassName();
+ else
+ OS << "Unknown";
+ OS << ")";
+ }
+
private:
OriginID getNextOriginID() { return NextOriginID++; }
@@ -222,7 +234,7 @@ class Fact {
return nullptr;
}
- virtual void dump(llvm::raw_ostream &OS) const {
+ virtual void dump(llvm::raw_ostream &OS, const OriginManager &) const {
OS << "Fact (Kind: " << static_cast<int>(K) << ")\n";
}
};
@@ -237,9 +249,10 @@ class IssueFact : public Fact {
IssueFact(LoanID LID, OriginID OID) : Fact(Kind::Issue), LID(LID), OID(OID) {}
LoanID getLoanID() const { return LID; }
OriginID getOriginID() const { return OID; }
- void dump(llvm::raw_ostream &OS) const override {
- OS << "Issue (LoanID: " << getLoanID() << ", OriginID: " << getOriginID()
- << ")\n";
+ void dump(llvm::raw_ostream &OS, const OriginManager &OM) const override {
+ OS << "Issue (LoanID: " << getLoanID() << ", ToOrigin: ";
+ OM.dump(getOriginID(), OS);
+ OS << ")\n";
}
};
@@ -256,7 +269,7 @@ class ExpireFact : public Fact {
LoanID getLoanID() const { return LID; }
SourceLocation getExpiryLoc() const { return ExpiryLoc; }
- void dump(llvm::raw_ostream &OS) const override {
+ void dump(llvm::raw_ostream &OS, const OriginManager &OM) const override {
OS << "Expire (LoanID: " << getLoanID() << ")\n";
}
};
@@ -274,9 +287,12 @@ class AssignOriginFact : public Fact {
: Fact(Kind::AssignOrigin), OIDDest(OIDDest), OIDSrc(OIDSrc) {}
OriginID getDestOriginID() const { return OIDDest; }
OriginID getSrcOriginID() const { return OIDSrc; }
- void dump(llvm::raw_ostream &OS) const override {
- OS << "AssignOrigin (DestID: " << getDestOriginID()
- << ", SrcID: " << getSrcOriginID() << ")\n";
+ void dump(llvm::raw_ostream &OS, const OriginManager &OM) const override {
+ OS << "AssignOrigin (Dest: ";
+ OM.dump(getDestOriginID(), OS);
+ OS << ", Src: ";
+ OM.dump(getSrcOriginID(), OS);
+ OS << ")\n";
}
};
@@ -290,8 +306,10 @@ class ReturnOfOriginFact : public Fact {
ReturnOfOriginFact(OriginID OID) : Fact(Kind::ReturnOfOrigin), OID(OID) {}
OriginID getReturnedOriginID() const { return OID; }
- void dump(llvm::raw_ostream &OS) const override {
- OS << "ReturnOfOrigin (OriginID: " << getReturnedOriginID() << ")\n";
+ void dump(llvm::raw_ostream &OS, const OriginManager &OM) const override {
+ OS << "ReturnOfOrigin (";
+ OM.dump(getReturnedOriginID(), OS);
+ OS << ")\n";
}
};
@@ -308,8 +326,10 @@ class UseFact : public Fact {
OriginID getUsedOrigin() const { return UsedOrigin; }
const Expr *getUseExpr() const { return UseExpr; }
- void dump(llvm::raw_ostream &OS) const override {
- OS << "Use (OriginID: " << UsedOrigin << ")\n";
+ void dump(llvm::raw_ostream &OS, const OriginManager &OM) const override {
+ OS << "Use (";
+ OM.dump(getUsedOrigin(), OS);
+ OS << ")\n";
}
};
@@ -326,7 +346,7 @@ class TestPointFact : public Fact {
StringRef getAnnotation() const { return Annotation; }
- void dump(llvm::raw_ostream &OS) const override {
+ void dump(llvm::raw_ostream &OS, const OriginManager &) const override {
OS << "TestPoint (Annotation: \"" << getAnnotation() << "\")\n";
}
};
@@ -365,7 +385,7 @@ class FactManager {
if (It != BlockToFactsMap.end()) {
for (const Fact *F : It->second) {
llvm::dbgs() << " ";
- F->dump(llvm::dbgs());
+ F->dump(llvm::dbgs(), OriginMgr);
}
}
llvm::dbgs() << " End of Block\n";
diff --git a/clang/test/Sema/warn-lifetime-safety-dataflow.cpp b/clang/test/Sema/warn-lifetime-safety-dataflow.cpp
index 2b934ac23b92d..bcde9adf25ca5 100644
--- a/clang/test/Sema/warn-lifetime-safety-dataflow.cpp
+++ b/clang/test/Sema/warn-lifetime-safety-dataflow.cpp
@@ -12,11 +12,11 @@ MyObj* return_local_addr() {
MyObj x {10};
MyObj* p = &x;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_X:[0-9]+]], OriginID: [[O_ADDR_X:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P:[0-9]+]], SrcID: [[O_ADDR_X]])
+// CHECK: Issue (LoanID: [[L_X:[0-9]+]], ToOrigin: [[O_ADDR_X:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P:[0-9]+]] (Decl: p), Src: [[O_ADDR_X]] (Expr: UnaryOperator))
return p;
-// CHECK: AssignOrigin (DestID: [[O_RET_VAL:[0-9]+]], SrcID: [[O_P]])
-// CHECK: ReturnOfOrigin (OriginID: [[O_RET_VAL]])
+// CHECK: AssignOrigin (Dest: [[O_RET_VAL:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_P]] (Decl: p))
+// CHECK: ReturnOfOrigin ([[O_RET_VAL]] (Expr: ImplicitCastExpr))
// CHECK: Expire (LoanID: [[L_X]])
}
@@ -27,20 +27,20 @@ MyObj* return_local_addr() {
MyObj* assign_and_return_local_addr() {
MyObj y{20};
MyObj* ptr1 = &y;
-// CHECK: Issue (LoanID: [[L_Y:[0-9]+]], OriginID: [[O_ADDR_Y:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_PTR1:[0-9]+]], SrcID: [[O_ADDR_Y]])
+// CHECK: Issue (LoanID: [[L_Y:[0-9]+]], ToOrigin: [[O_ADDR_Y:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_PTR1:[0-9]+]] (Decl: ptr1), Src: [[O_ADDR_Y]] (Expr: UnaryOperator))
MyObj* ptr2 = ptr1;
-// CHECK: AssignOrigin (DestID: [[O_PTR1_RVAL:[0-9]+]], SrcID: [[O_PTR1]])
-// CHECK: AssignOrigin (DestID: [[O_PTR2:[0-9]+]], SrcID: [[O_PTR1_RVAL]])
+// CHECK: AssignOrigin (Dest: [[O_PTR1_RVAL:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_PTR1]] (Decl: ptr1))
+// CHECK: AssignOrigin (Dest: [[O_PTR2:[0-9]+]] (Decl: ptr2), Src: [[O_PTR1_RVAL]] (Expr: ImplicitCastExpr))
ptr2 = ptr1;
-// CHECK: AssignOrigin (DestID: [[O_PTR1_RVAL_2:[0-9]+]], SrcID: [[O_PTR1]])
-// CHECK: AssignOrigin (DestID: [[O_PTR2]], SrcID: [[O_PTR1_RVAL_2]])
+// CHECK: AssignOrigin (Dest: [[O_PTR1_RVAL_2:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_PTR1]] (Decl: ptr1))
+// CHECK: AssignOrigin (Dest: [[O_PTR2]] (Decl: ptr2), Src: [[O_PTR1_RVAL_2]] (Expr: ImplicitCastExpr))
ptr2 = ptr2; // Self assignment.
-// CHECK: AssignOrigin (DestID: [[O_PTR2_RVAL:[0-9]+]], SrcID: [[O_PTR2]])
-// CHECK: AssignOrigin (DestID: [[O_PTR2]], SrcID: [[O_PTR2_RVAL]])
+// CHECK: AssignOrigin (Dest: [[O_PTR2_RVAL:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_PTR2]] (Decl: ptr2))
+// CHECK: AssignOrigin (Dest: [[O_PTR2]] (Decl: ptr2), Src: [[O_PTR2_RVAL]] (Expr: ImplicitCastExpr))
return ptr2;
-// CHECK: AssignOrigin (DestID: [[O_PTR2_RVAL_2:[0-9]+]], SrcID: [[O_PTR2]])
-// CHECK: ReturnOfOrigin (OriginID: [[O_PTR2_RVAL_2]])
+// CHECK: AssignOrigin (Dest: [[O_PTR2_RVAL_2:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_PTR2]] (Decl: ptr2))
+// CHECK: ReturnOfOrigin ([[O_PTR2_RVAL_2]] (Expr: ImplicitCastExpr))
// CHECK: Expire (LoanID: [[L_Y]])
}
@@ -60,8 +60,8 @@ int return_int_val() {
void loan_expires_cpp() {
MyObj obj{1};
MyObj* pObj = &obj;
-// CHECK: Issue (LoanID: [[L_OBJ:[0-9]+]], OriginID: [[O_ADDR_OBJ:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_POBJ:[0-9]+]], SrcID: [[O_ADDR_OBJ]])
+// CHECK: Issue (LoanID: [[L_OBJ:[0-9]+]], ToOrigin: [[O_ADDR_OBJ:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_POBJ:[0-9]+]] (Decl: pObj), Src: [[O_ADDR_OBJ]] (Expr: UnaryOperator))
// CHECK: Expire (LoanID: [[L_OBJ]])
}
@@ -72,8 +72,8 @@ void loan_expires_cpp() {
void loan_expires_trivial() {
int trivial_obj = 1;
int* pTrivialObj = &trivial_obj;
-// CHECK: Issue (LoanID: [[L_TRIVIAL_OBJ:[0-9]+]], OriginID: [[O_ADDR_TRIVIAL_OBJ:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_PTOBJ:[0-9]+]], SrcID: [[O_ADDR_TRIVIAL_OBJ]])
+// CHECK: Issue (LoanID: [[L_TRIVIAL_OBJ:[0-9]+]], ToOrigin: [[O_ADDR_TRIVIAL_OBJ:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_PTOBJ:[0-9]+]] (Decl: pTrivialObj), Src: [[O_ADDR_TRIVIAL_OBJ]] (Expr: UnaryOperator))
// CHECK-NOT: Expire (LoanID: [[L_TRIVIAL_OBJ]])
// CHECK-NEXT: End of Block
// FIXME: Add check for Expire once trivial destructors are handled for expiration.
@@ -87,15 +87,15 @@ void conditional(bool condition) {
if (condition)
p = &a;
- // CHECK: Issue (LoanID: [[L_A:[0-9]+]], OriginID: [[O_ADDR_A:[0-9]+]])
- // CHECK: AssignOrigin (DestID: [[O_P:[0-9]+]], SrcID: [[O_ADDR_A]])
+// CHECK: Issue (LoanID: [[L_A:[0-9]+]], ToOrigin: [[O_ADDR_A:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P:[0-9]+]] (Decl: p), Src: [[O_ADDR_A]] (Expr: UnaryOperator))
else
p = &b;
- // CHECK: Issue (LoanID: [[L_B:[0-9]+]], OriginID: [[O_ADDR_B:[0-9]+]])
- // CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_ADDR_B]])
+// CHECK: Issue (LoanID: [[L_B:[0-9]+]], ToOrigin: [[O_ADDR_B:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_ADDR_B]] (Expr: UnaryOperator))
int *q = p;
- // CHECK: AssignOrigin (DestID: [[O_P_RVAL:[0-9]+]], SrcID: [[O_P]])
- // CHECK: AssignOrigin (DestID: [[O_Q:[0-9]+]], SrcID: [[O_P_RVAL]])
+// CHECK: AssignOrigin (Dest: [[O_P_RVAL:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_P]] (Decl: p))
+// CHECK: AssignOrigin (Dest: [[O_Q:[0-9]+]] (Decl: q), Src: [[O_P_RVAL]] (Expr: ImplicitCastExpr))
}
@@ -109,12 +109,12 @@ void pointers_in_a_cycle(bool condition) {
MyObj* p2 = &v2;
MyObj* p3 = &v3;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_V1:[0-9]+]], OriginID: [[O_ADDR_V1:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P1:[0-9]+]], SrcID: [[O_ADDR_V1]])
-// CHECK: Issue (LoanID: [[L_V2:[0-9]+]], OriginID: [[O_ADDR_V2:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P2:[0-9]+]], SrcID: [[O_ADDR_V2]])
-// CHECK: Issue (LoanID: [[L_V3:[0-9]+]], OriginID: [[O_ADDR_V3:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P3:[0-9]+]], SrcID: [[O_ADDR_V3]])
+// CHECK: Issue (LoanID: [[L_V1:[0-9]+]], ToOrigin: [[O_ADDR_V1:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P1:[0-9]+]] (Decl: p1), Src: [[O_ADDR_V1]] (Expr: UnaryOperator))
+// CHECK: Issue (LoanID: [[L_V2:[0-9]+]], ToOrigin: [[O_ADDR_V2:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P2:[0-9]+]] (Decl: p2), Src: [[O_ADDR_V2]] (Expr: UnaryOperator))
+// CHECK: Issue (LoanID: [[L_V3:[0-9]+]], ToOrigin: [[O_ADDR_V3:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P3:[0-9]+]] (Decl: p3), Src: [[O_ADDR_V3]] (Expr: UnaryOperator))
while (condition) {
MyObj* temp = p1;
@@ -122,14 +122,14 @@ void pointers_in_a_cycle(bool condition) {
p2 = p3;
p3 = temp;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: AssignOrigin (DestID: [[O_P1_RVAL:[0-9]+]], SrcID: [[O_P1]])
-// CHECK: AssignOrigin (DestID: [[O_TEMP:[0-9]+]], SrcID: [[O_P1_RVAL]])
-// CHECK: AssignOrigin (DestID: [[O_P2_RVAL:[0-9]+]], SrcID: [[O_P2]])
-// CHECK: AssignOrigin (DestID: [[O_P1]], SrcID: [[O_P2_RVAL]])
-// CHECK: AssignOrigin (DestID: [[O_P3_RVAL:[0-9]+]], SrcID: [[O_P3]])
-// CHECK: AssignOrigin (DestID: [[O_P2]], SrcID: [[O_P3_RVAL]])
-// CHECK: AssignOrigin (DestID: [[O_TEMP_RVAL:[0-9]+]], SrcID: [[O_TEMP]])
-// CHECK: AssignOrigin (DestID: [[O_P3]], SrcID: [[O_TEMP_RVAL]])
+// CHECK: AssignOrigin (Dest: [[O_P1_RVAL:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_P1]] (Decl: p1))
+// CHECK: AssignOrigin (Dest: [[O_TEMP:[0-9]+]] (Decl: temp), Src: [[O_P1_RVAL]] (Expr: ImplicitCastExpr))
+// CHECK: AssignOrigin (Dest: [[O_P2_RVAL:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_P2]] (Decl: p2))
+// CHECK: AssignOrigin (Dest: [[O_P1]] (Decl: p1), Src: [[O_P2_RVAL]] (Expr: ImplicitCastExpr))
+// CHECK: AssignOrigin (Dest: [[O_P3_RVAL:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_P3]] (Decl: p3))
+// CHECK: AssignOrigin (Dest: [[O_P2]] (Decl: p2), Src: [[O_P3_RVAL]] (Expr: ImplicitCastExpr))
+// CHECK: AssignOrigin (Dest: [[O_TEMP_RVAL:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_TEMP]] (Decl: temp))
+// CHECK: AssignOrigin (Dest: [[O_P3]] (Decl: p3), Src: [[O_TEMP_RVAL]] (Expr: ImplicitCastExpr))
}
}
@@ -139,11 +139,11 @@ void overwrite_origin() {
MyObj s2;
MyObj* p = &s1;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_S1:[0-9]+]], OriginID: [[O_ADDR_S1:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P:[0-9]+]], SrcID: [[O_ADDR_S1]])
+// CHECK: Issue (LoanID: [[L_S1:[0-9]+]], ToOrigin: [[O_ADDR_S1:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P:[0-9]+]] (Decl: p), Src: [[O_ADDR_S1]] (Expr: UnaryOperator))
p = &s2;
-// CHECK: Issue (LoanID: [[L_S2:[0-9]+]], OriginID: [[O_ADDR_S2:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_ADDR_S2]])
+// CHECK: Issue (LoanID: [[L_S2:[0-9]+]], ToOrigin: [[O_ADDR_S2:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_ADDR_S2]] (Expr: UnaryOperator))
// CHECK: Expire (LoanID: [[L_S2]])
// CHECK: Expire (LoanID: [[L_S1]])
}
@@ -153,10 +153,11 @@ void reassign_to_null() {
MyObj s1;
MyObj* p = &s1;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_S1:[0-9]+]], OriginID: [[O_ADDR_S1:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P:[0-9]+]], SrcID: [[O_ADDR_S1]])
+// CHECK: Issue (LoanID: [[L_S1:[0-9]+]], ToOrigin: [[O_ADDR_S1:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P:[0-9]+]] (Decl: p), Src: [[O_ADDR_S1]] (Expr: UnaryOperator))
p = nullptr;
-// CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_NULLPTR:[0-9]+]])
+// CHECK: AssignOrigin (Dest: [[O_NULLPTR_CAST:[0-9]+]] (Expr: ImplicitCastExpr), Src: {{[0-9]+}} (Expr: CXXNullPtrLiteralExpr))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_NULLPTR_CAST]] (Expr: ImplicitCastExpr))
// CHECK: Expire (LoanID: [[L_S1]])
}
// FIXME: Have a better representation for nullptr than just an empty origin.
@@ -169,13 +170,13 @@ void reassign_in_if(bool condition) {
MyObj s2;
MyObj* p = &s1;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_S1:[0-9]+]], OriginID: [[O_ADDR_S1:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P:[0-9]+]], SrcID: [[O_ADDR_S1]])
+// CHECK: Issue (LoanID: [[L_S1:[0-9]+]], ToOrigin: [[O_ADDR_S1:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P:[0-9]+]] (Decl: p), Src: [[O_ADDR_S1]] (Expr: UnaryOperator))
if (condition) {
p = &s2;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_S2:[0-9]+]], OriginID: [[O_ADDR_S2:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_ADDR_S2]])
+// CHECK: Issue (LoanID: [[L_S2:[0-9]+]], ToOrigin: [[O_ADDR_S2:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_ADDR_S2]] (Expr: UnaryOperator))
}
// CHECK: Block B{{[0-9]+}}:
// CHECK: Expire (LoanID: [[L_S2]])
@@ -190,26 +191,26 @@ void assign_in_switch(int mode) {
MyObj s3;
MyObj* p = nullptr;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: AssignOrigin (DestID: [[O_NULLPTR_CAST:[0-9]+]], SrcID: [[O_NULLPTR:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P:[0-9]+]], SrcID: [[O_NULLPTR_CAST]])
+// CHECK: AssignOrigin (Dest: [[O_NULLPTR_CAST:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_NULLPTR:[0-9]+]] (Expr: CXXNullPtrLiteralExpr))
+// CHECK: AssignOrigin (Dest: [[O_P:[0-9]+]] (Decl: p), Src: [[O_NULLPTR_CAST]] (Expr: ImplicitCastExpr))
switch (mode) {
case 1:
p = &s1;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_S1:[0-9]+]], OriginID: [[O_ADDR_S1:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_ADDR_S1]])
+// CHECK: Issue (LoanID: [[L_S1:[0-9]+]], ToOrigin: [[O_ADDR_S1:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_ADDR_S1]] (Expr: UnaryOperator))
break;
case 2:
p = &s2;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_S2:[0-9]+]], OriginID: [[O_ADDR_S2:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_ADDR_S2]])
+// CHECK: Issue (LoanID: [[L_S2:[0-9]+]], ToOrigin: [[O_ADDR_S2:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_ADDR_S2]] (Expr: UnaryOperator))
break;
default:
p = &s3;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_S3:[0-9]+]], OriginID: [[O_ADDR_S3:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_ADDR_S3]])
+// CHECK: Issue (LoanID: [[L_S3:[0-9]+]], ToOrigin: [[O_ADDR_S3:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_ADDR_S3]] (Expr: UnaryOperator))
break;
}
// CHECK: Block B{{[0-9]+}}:
@@ -221,14 +222,14 @@ void assign_in_switch(int mode) {
// CHECK-LABEL: Function: loan_in_loop
void loan_in_loop(bool condition) {
MyObj* p = nullptr;
- // CHECK: AssignOrigin (DestID: [[O_NULLPTR_CAST:[0-9]+]], SrcID: [[O_NULLPTR:[0-9]+]])
- // CHECK: AssignOrigin (DestID: [[O_P:[0-9]+]], SrcID: [[O_NULLPTR_CAST]])
+ // CHECK: AssignOrigin (Dest: [[O_NULLPTR_CAST:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_NULLPTR:[0-9]+]] (Expr: CXXNullPtrLiteralExpr))
+ // CHECK: AssignOrigin (Dest: [[O_P:[0-9]+]] (Decl: p), Src: [[O_NULLPTR_CAST]] (Expr: ImplicitCastExpr))
while (condition) {
MyObj inner;
p = &inner;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_INNER:[0-9]+]], OriginID: [[O_ADDR_INNER:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_ADDR_INNER]])
+// CHECK: Issue (LoanID: [[L_INNER:[0-9]+]], ToOrigin: [[O_ADDR_INNER:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_ADDR_INNER]] (Expr: UnaryOperator))
// CHECK: Expire (LoanID: [[L_INNER]])
}
}
@@ -239,14 +240,14 @@ void loop_with_break(int count) {
MyObj s2;
MyObj* p = &s1;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_S1:[0-9]+]], OriginID: [[O_ADDR_S1:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P:[0-9]+]], SrcID: [[O_ADDR_S1]])
+// CHECK: Issue (LoanID: [[L_S1:[0-9]+]], ToOrigin: [[O_ADDR_S1:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P:[0-9]+]] (Decl: p), Src: [[O_ADDR_S1]] (Expr: UnaryOperator))
for (int i = 0; i < count; ++i) {
if (i == 5) {
p = &s2;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: Issue (LoanID: [[L_S2:[0-9]+]], OriginID: [[O_ADDR_S2:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_ADDR_S2]])
+// CHECK: Issue (LoanID: [[L_S2:[0-9]+]], ToOrigin: [[O_ADDR_S2:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_ADDR_S2]] (Expr: UnaryOperator))
break;
}
}
@@ -259,18 +260,18 @@ void loop_with_break(int count) {
void nested_scopes() {
MyObj* p = nullptr;
// CHECK: Block B{{[0-9]+}}:
-// CHECK: AssignOrigin (DestID: [[O_NULLPTR_CAST:[0-9]+]], SrcID: [[O_NULLPTR:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P:[0-9]+]], SrcID: [[O_NULLPTR_CAST]])
+// CHECK: AssignOrigin (Dest: [[O_NULLPTR_CAST:[0-9]+]] (Expr: ImplicitCastExpr), Src: [[O_NULLPTR:[0-9]+]] (Expr: CXXNullPtrLiteralExpr))
+// CHECK: AssignOrigin (Dest: [[O_P:[0-9]+]] (Decl: p), Src: [[O_NULLPTR_CAST]] (Expr: ImplicitCastExpr))
{
MyObj outer;
p = &outer;
-// CHECK: Issue (LoanID: [[L_OUTER:[0-9]+]], OriginID: [[O_ADDR_OUTER:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_ADDR_OUTER]])
+// CHECK: Issue (LoanID: [[L_OUTER:[0-9]+]], ToOrigin: [[O_ADDR_OUTER:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_ADDR_OUTER]] (Expr: UnaryOperator))
{
MyObj inner;
p = &inner;
-// CHECK: Issue (LoanID: [[L_INNER:[0-9]+]], OriginID: [[O_ADDR_INNER:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P]], SrcID: [[O_ADDR_INNER]])
+// CHECK: Issue (LoanID: [[L_INNER:[0-9]+]], ToOrigin: [[O_ADDR_INNER:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P]] (Decl: p), Src: [[O_ADDR_INNER]] (Expr: UnaryOperator))
}
// CHECK: Expire (LoanID: [[L_INNER]])
}
@@ -282,13 +283,13 @@ void pointer_indirection() {
int a;
int *p = &a;
// CHECK: Block B1:
-// CHECK: Issue (LoanID: [[L_A:[0-9]+]], OriginID: [[O_ADDR_A:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_P:[0-9]+]], SrcID: [[O_ADDR_A]])
+// CHECK: Issue (LoanID: [[L_A:[0-9]+]], ToOrigin: [[O_ADDR_A:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_P:[0-9]+]] (Decl: p), Src: [[O_ADDR_A]] (Expr: UnaryOperator))
int **pp = &p;
-// CHECK: Issue (LoanID: [[L_P:[0-9]+]], OriginID: [[O_ADDR_P:[0-9]+]])
-// CHECK: AssignOrigin (DestID: [[O_PP:[0-9]+]], SrcID: [[O_ADDR_P]])
+// CHECK: Issue (LoanID: [[L_P:[0-9]+]], ToOrigin: [[O_ADDR_P:[0-9]+]] (Expr: UnaryOperator))
+// CHECK: AssignOrigin (Dest: [[O_PP:[0-9]+]] (Decl: pp), Src: [[O_ADDR_P]] (Expr: UnaryOperator))
// FIXME: The Origin for the RHS is broken
int *q = *pp;
-// CHECK: AssignOrigin (DestID: [[O_Q:[0-9]+]], SrcID: {{[0-9]+}})
+// CHECK: AssignOrigin (Dest: {{[0-9]+}} (Decl: q), Src: {{[0-9]+}} (Expr: ImplicitCastExpr))
}