[llvm] [AMDGPU] ISel & PEI for whole wave functions (PR #145858)

Diana Picus via llvm-commits llvm-commits at lists.llvm.org
Fri Jun 27 04:59:30 PDT 2025


https://github.com/rovka updated https://github.com/llvm/llvm-project/pull/145858

From 0c7bdd19eba7c43a6ea18ff02c3e7552c5b88362 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Fri, 24 Jan 2025 10:18:23 +0100
Subject: [PATCH 01/24] Add subtarget feature

---
 llvm/lib/Target/AMDGPU/AMDGPU.td      | 6 ++++++
 llvm/lib/Target/AMDGPU/GCNSubtarget.h | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index 72d6a78539ada..37b109851bd16 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1319,6 +1319,12 @@ def FeatureMemToLDSLoad : SubtargetFeature<"vmem-to-lds-load-insts",
   "The platform has memory to lds instructions (global_load w/lds bit set, buffer_load w/lds bit set or global_load_lds. This does not include scratch_load_lds."
 >;
 
+def FeatureWholeWaveFunction : SubtargetFeature<"whole-wave-function",
+  "IsWholeWaveFunction",
+  "true",
+  "Current function is a whole wave function (runs with all lanes enabled)"
+  >;
+
 // Dummy feature used to disable assembler instructions.
 def FeatureDisable : SubtargetFeature<"",
   "FeatureDisable","true",
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index 2f79599091faf..ab1a57485bae3 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -268,6 +268,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   bool RequiresCOV6 = false;
   bool UseBlockVGPROpsForCSR = false;
 
+  bool IsWholeWaveFunction = false;
+
   // Dummy feature to use for assembler in tablegen.
   bool FeatureDisable = false;
 
@@ -1481,6 +1483,10 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   // of sign-extending.
   bool hasGetPCZeroExtension() const { return GFX12Insts; }
 
+  /// \returns true if the current function is a whole wave function (i.e. it
+  /// runs with all the lanes enabled).
+  bool isWholeWaveFunction() const { return IsWholeWaveFunction; }
+
   /// \returns SGPR allocation granularity supported by the subtarget.
   unsigned getSGPRAllocGranule() const {
     return AMDGPU::IsaInfo::getSGPRAllocGranule(this);

From 51b1f9270505d2c655a419180561a2560ffe5684 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Mon, 27 Jan 2025 13:17:19 +0100
Subject: [PATCH 02/24] [AMDGPU] ISel & PEI for whole wave functions

Whole wave functions are functions that run with a full EXEC mask.
They are not invoked directly; instead, they are launched through a new
intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in a future
patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, which is mapped to EXEC. They may have either
the default calling convention or amdgpu_gfx. The inactive lanes must
be preserved for all registers used; the active lanes only need to be
preserved for the CSRs.

At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.
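
A minimal IR-level sketch (mirroring the patterns in the new tests; the
function and value names are only illustrative):

  define amdgpu_gfx i32 @wwf(i1 %active, i32 %a) {
    ; Inactive lanes of %a are poison, so give them a defined value first.
    %x = select i1 %active, i32 %a, i32 0
    ; update.dpp reads across lanes, so all lanes must hold defined values.
    %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %x, i32 1, i32 1, i32 1, i1 false)
    ret i32 %ret
  }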

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN,
  used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC returns
  a SReg_1 representing `%active`, which needs to be passed into
  SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
  special handling of %active. Since the return may be in a different
  basic block, it's difficult to add the virtual reg for %active to
  SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
  which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
  marks any used VGPRs as WWM registers, which are then spilled and
  restored with the usual logic. (A short MIR example follows the list.)
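
Concretely, before PEI a simple whole wave function body looks like
this in MIR (taken from the save_inactive_lanes_non_csr_vgpr test
below):

  renamable $sgpr0 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
  $vgpr0 = V_MOV_B32_e32 14, implicit $exec
  SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0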

I'm still working on the GlobalISel support and on adding some docs in
AMDGPUUsage.rst.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic,
a codegen prepare patch that looks for the callees of that intrinsic and
marks them as whole wave functions, and probably a lot of optimization
work.
---
 llvm/lib/Target/AMDGPU/AMDGPU.td              |   2 +
 llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp |   2 +
 llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h   |   6 +
 llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td     |  11 +
 llvm/lib/Target/AMDGPU/SIFrameLowering.cpp    |  81 +++-
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp     |  30 +-
 llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp   |   1 +
 llvm/lib/Target/AMDGPU/SIInstrInfo.cpp        |  11 +
 llvm/lib/Target/AMDGPU/SIInstrInfo.h          |   2 +
 llvm/lib/Target/AMDGPU/SIInstructions.td      |  29 ++
 .../AMDGPU/isel-whole-wave-functions.ll       | 116 +++++
 .../AMDGPU/whole-wave-functions-pei.mir       | 439 ++++++++++++++++++
 .../CodeGen/AMDGPU/whole-wave-functions.ll    | 285 ++++++++++++
 13 files changed, 1002 insertions(+), 13 deletions(-)
 create mode 100644 llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
 create mode 100644 llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
 create mode 100644 llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index 37b109851bd16..e80559da97b1e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -2696,6 +2696,8 @@ def HasLshlAddU64Inst : Predicate<"Subtarget->hasLshlAddU64Inst()">,
 def HasSetPrioIncWgInst : Predicate<"Subtarget->hasSetPrioIncWgInst()">,
  AssemblerPredicate<(all_of FeatureSetPrioIncWgInst)>;
 
+def IsWholeWaveFunction : Predicate<"Subtarget->isWholeWaveFunction()">;
+
 // Include AMDGPU TD files
 include "SISchedule.td"
 include "GCNProcessors.td"
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index d75c7a178b4a8..f6e0b93f85643 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -5777,6 +5777,8 @@ const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
   NODE_NAME_CASE(BUFFER_ATOMIC_FMIN)
   NODE_NAME_CASE(BUFFER_ATOMIC_FMAX)
   NODE_NAME_CASE(BUFFER_ATOMIC_COND_SUB_U32)
+  NODE_NAME_CASE(WHOLE_WAVE_SETUP)
+  NODE_NAME_CASE(WHOLE_WAVE_RETURN)
   }
   return nullptr;
 }
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
index 0dd2183b72b24..5716711de3402 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
@@ -607,6 +607,12 @@ enum NodeType : unsigned {
   BUFFER_ATOMIC_FMAX,
   BUFFER_ATOMIC_COND_SUB_U32,
   LAST_MEMORY_OPCODE = BUFFER_ATOMIC_COND_SUB_U32,
+
+  // Set up a whole wave function.
+  WHOLE_WAVE_SETUP,
+
+  // Return from a whole wave function.
+  WHOLE_WAVE_RETURN,
 };
 
 } // End namespace AMDGPUISD
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index ce58e93a15207..e305f08925cc6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -348,6 +348,17 @@ def AMDGPUfdot2_impl : SDNode<"AMDGPUISD::FDOT2",
 
 def AMDGPUperm_impl : SDNode<"AMDGPUISD::PERM", AMDGPUDTIntTernaryOp, []>;
 
+// Marks the entry into a whole wave function.
+def AMDGPUwhole_wave_setup : SDNode<
+  "AMDGPUISD::WHOLE_WAVE_SETUP", SDTypeProfile<1, 0, [SDTCisInt<0>]>,
+  [SDNPHasChain, SDNPSideEffect]>;
+
+// Marks the return from a whole wave function.
+def AMDGPUwhole_wave_return : SDNode<
+  "AMDGPUISD::WHOLE_WAVE_RETURN", SDTNone,
+  [SDNPHasChain, SDNPOptInGlue, SDNPVariadic]
+>;
+
 // SI+ export
 def AMDGPUExportOp : SDTypeProfile<0, 8, [
   SDTCisInt<0>,       // i8 tgt
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index 6a3867937d57f..fda1a09759259 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -946,8 +946,18 @@ static Register buildScratchExecCopy(LiveRegUnits &LiveUnits,
 
   initLiveUnits(LiveUnits, TRI, FuncInfo, MF, MBB, MBBI, IsProlog);
 
-  ScratchExecCopy = findScratchNonCalleeSaveRegister(
-      MRI, LiveUnits, *TRI.getWaveMaskRegClass());
+  if (ST.isWholeWaveFunction()) {
+    // Whole wave functions already have a copy of the original EXEC mask that
+    // we can use.
+    assert(IsProlog && "Epilog should look at return, not setup");
+    ScratchExecCopy =
+        TII->getWholeWaveFunctionSetup(MBB)->getOperand(0).getReg();
+    assert(ScratchExecCopy && "Couldn't find copy of EXEC");
+  } else {
+    ScratchExecCopy = findScratchNonCalleeSaveRegister(
+        MRI, LiveUnits, *TRI.getWaveMaskRegClass());
+  }
+
   if (!ScratchExecCopy)
     report_fatal_error("failed to find free scratch register");
 
@@ -996,10 +1006,15 @@ void SIFrameLowering::emitCSRSpillStores(
       };
 
   StoreWWMRegisters(WWMScratchRegs);
+
+  auto EnableAllLanes = [&]() {
+    unsigned MovOpc = ST.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
+    BuildMI(MBB, MBBI, DL, TII->get(MovOpc), TRI.getExec()).addImm(-1);
+  };
+
   if (!WWMCalleeSavedRegs.empty()) {
     if (ScratchExecCopy) {
-      unsigned MovOpc = ST.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
-      BuildMI(MBB, MBBI, DL, TII->get(MovOpc), TRI.getExec()).addImm(-1);
+      EnableAllLanes();
     } else {
       ScratchExecCopy = buildScratchExecCopy(LiveUnits, MF, MBB, MBBI, DL,
                                              /*IsProlog*/ true,
@@ -1008,7 +1023,15 @@ void SIFrameLowering::emitCSRSpillStores(
   }
 
   StoreWWMRegisters(WWMCalleeSavedRegs);
-  if (ScratchExecCopy) {
+  if (ST.isWholeWaveFunction()) {
+    // SI_SETUP_WHOLE_WAVE_FUNC has outlived its purpose, so we can remove it
+    // now. If we have already saved some WWM CSR registers, then EXEC is
+    // already -1 and we don't need to do anything else. Otherwise, set EXEC
+    // to -1 here.
+    if (WWMCalleeSavedRegs.empty())
+      EnableAllLanes();
+    TII->getWholeWaveFunctionSetup(MBB)->eraseFromParent();
+  } else if (ScratchExecCopy) {
     // FIXME: Split block and make terminator.
     unsigned ExecMov = ST.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
     BuildMI(MBB, MBBI, DL, TII->get(ExecMov), TRI.getExec())
@@ -1083,11 +1106,6 @@ void SIFrameLowering::emitCSRSpillRestores(
   Register ScratchExecCopy;
   SmallVector<std::pair<Register, int>, 2> WWMCalleeSavedRegs, WWMScratchRegs;
   FuncInfo->splitWWMSpillRegisters(MF, WWMCalleeSavedRegs, WWMScratchRegs);
-  if (!WWMScratchRegs.empty())
-    ScratchExecCopy =
-        buildScratchExecCopy(LiveUnits, MF, MBB, MBBI, DL,
-                             /*IsProlog*/ false, /*EnableInactiveLanes*/ true);
-
   auto RestoreWWMRegisters =
       [&](SmallVectorImpl<std::pair<Register, int>> &WWMRegs) {
         for (const auto &Reg : WWMRegs) {
@@ -1098,6 +1116,36 @@ void SIFrameLowering::emitCSRSpillRestores(
         }
       };
 
+  if (ST.isWholeWaveFunction()) {
+    // For whole wave functions, the EXEC is already -1 at this point.
+    // Therefore, we can restore the CSR WWM registers right away.
+    RestoreWWMRegisters(WWMCalleeSavedRegs);
+
+    // The original EXEC is the first operand of the return instruction.
+    const MachineInstr &Return = MBB.instr_back();
+    assert(Return.getOpcode() == AMDGPU::SI_WHOLE_WAVE_FUNC_RETURN &&
+           "Unexpected return inst");
+    Register OrigExec = Return.getOperand(0).getReg();
+
+    if (!WWMScratchRegs.empty()) {
+      unsigned XorOpc = ST.isWave32() ? AMDGPU::S_XOR_B32 : AMDGPU::S_XOR_B64;
+      BuildMI(MBB, MBBI, DL, TII->get(XorOpc), TRI.getExec())
+          .addReg(OrigExec)
+          .addImm(-1);
+      RestoreWWMRegisters(WWMScratchRegs);
+    }
+
+    // Restore original EXEC.
+    unsigned MovOpc = ST.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
+    BuildMI(MBB, MBBI, DL, TII->get(MovOpc), TRI.getExec()).addReg(OrigExec);
+    return;
+  }
+
+  if (!WWMScratchRegs.empty())
+    ScratchExecCopy =
+        buildScratchExecCopy(LiveUnits, MF, MBB, MBBI, DL,
+                             /*IsProlog*/ false, /*EnableInactiveLanes*/ true);
+
   RestoreWWMRegisters(WWMScratchRegs);
   if (!WWMCalleeSavedRegs.empty()) {
     if (ScratchExecCopy) {
@@ -1634,6 +1682,7 @@ void SIFrameLowering::determineCalleeSaves(MachineFunction &MF,
         NeedExecCopyReservedReg = true;
       else if (MI.getOpcode() == AMDGPU::SI_RETURN ||
                MI.getOpcode() == AMDGPU::SI_RETURN_TO_EPILOG ||
+               MI.getOpcode() == AMDGPU::SI_WHOLE_WAVE_FUNC_RETURN ||
                (MFI->isChainFunction() &&
                 TII->isChainCallOpcode(MI.getOpcode()))) {
         // We expect all return to be the same size.
@@ -1662,6 +1711,18 @@ void SIFrameLowering::determineCalleeSaves(MachineFunction &MF,
   if (MFI->isEntryFunction())
     return;
 
+  if (ST.isWholeWaveFunction()) {
+    // In practice, all the VGPRs are WWM registers, and we will need to save at
+    // least their inactive lanes. Add them to WWMReservedRegs.
+    assert(!NeedExecCopyReservedReg && "Whole wave functions can use the reg mapped for their i1 argument");
+    for (MCRegister Reg : AMDGPU::VGPR_32RegClass)
+      if (MF.getRegInfo().isPhysRegModified(Reg)) {
+        MFI->reserveWWMRegister(Reg);
+        MF.begin()->addLiveIn(Reg);
+      }
+    MF.begin()->sortUniqueLiveIns();
+  }
+
   // Remove any VGPRs used in the return value because these do not need to be saved.
   // This prevents CSR restore from clobbering return VGPRs.
   if (ReturnMI) {
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 8d7dcf8c4a064..a6867b690013f 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -2914,6 +2914,8 @@ SDValue SITargetLowering::LowerFormalArguments(
              !Info->hasWorkGroupIDZ());
   }
 
+  bool IsWholeWaveFunc = getSubtarget()->isWholeWaveFunction();
+
   if (CallConv == CallingConv::AMDGPU_PS) {
     processPSInputArgs(Splits, CallConv, Ins, Skipped, FType, Info);
 
@@ -2954,7 +2956,8 @@ SDValue SITargetLowering::LowerFormalArguments(
   } else if (IsKernel) {
     assert(Info->hasWorkGroupIDX() && Info->hasWorkItemIDX());
   } else {
-    Splits.append(Ins.begin(), Ins.end());
+    Splits.append(IsWholeWaveFunc ? std::next(Ins.begin()) : Ins.begin(),
+                  Ins.end());
   }
 
   if (IsKernel)
@@ -2985,6 +2988,13 @@ SDValue SITargetLowering::LowerFormalArguments(
 
   SmallVector<SDValue, 16> Chains;
 
+  if (IsWholeWaveFunc) {
+    SDValue Setup = DAG.getNode(AMDGPUISD::WHOLE_WAVE_SETUP, DL,
+                                {MVT::i1, MVT::Other}, Chain);
+    InVals.push_back(Setup.getValue(0));
+    Chains.push_back(Setup.getValue(1));
+  }
+
   // FIXME: This is the minimum kernel argument alignment. We should improve
   // this to the maximum alignment of the arguments.
   //
@@ -2992,7 +3002,8 @@ SDValue SITargetLowering::LowerFormalArguments(
   // kern arg offset.
   const Align KernelArgBaseAlign = Align(16);
 
-  for (unsigned i = 0, e = Ins.size(), ArgIdx = 0; i != e; ++i) {
+  for (unsigned i = IsWholeWaveFunc ? 1 : 0, e = Ins.size(), ArgIdx = 0; i != e;
+       ++i) {
     const ISD::InputArg &Arg = Ins[i];
     if ((Arg.isOrigArg() && Skipped[Arg.getOrigArgIndex()]) || IsError) {
       InVals.push_back(DAG.getPOISON(Arg.VT));
@@ -3340,7 +3351,9 @@ SITargetLowering::LowerReturn(SDValue Chain, CallingConv::ID CallConv,
 
   unsigned Opc = AMDGPUISD::ENDPGM;
   if (!IsWaveEnd)
-    Opc = IsShader ? AMDGPUISD::RETURN_TO_EPILOG : AMDGPUISD::RET_GLUE;
+    Opc = Subtarget->isWholeWaveFunction() ? AMDGPUISD::WHOLE_WAVE_RETURN
+          : IsShader                       ? AMDGPUISD::RETURN_TO_EPILOG
+                                           : AMDGPUISD::RET_GLUE;
   return DAG.getNode(Opc, DL, MVT::Other, RetOps);
 }
 
@@ -5856,6 +5869,17 @@ SITargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
     MI.eraseFromParent();
     return SplitBB;
   }
+  case AMDGPU::SI_WHOLE_WAVE_FUNC_RETURN: {
+    assert(Subtarget->isWholeWaveFunction());
+
+    // During ISel, it's difficult to propagate the original EXEC mask to use as
+    // an input to SI_WHOLE_WAVE_FUNC_RETURN. Set it up here instead.
+    MachineInstr *Setup =
+        TII->getWholeWaveFunctionSetup(*BB->getParent()->begin());
+    assert(Setup && "Couldn't find SI_SETUP_WHOLE_WAVE_FUNC");
+    MI.getOperand(0).setReg(Setup->getOperand(0).getReg());
+    return BB;
+  }
   default:
     if (TII->isImage(MI) || TII->isMUBUF(MI)) {
       if (!MI.mayStore())
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 69ea8aa6122aa..00de7422b7948 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -1808,6 +1808,7 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
   //   with knowledge of the called routines.
   if (MI.getOpcode() == AMDGPU::SI_RETURN_TO_EPILOG ||
       MI.getOpcode() == AMDGPU::SI_RETURN ||
+      MI.getOpcode() == AMDGPU::SI_WHOLE_WAVE_FUNC_RETURN ||
       MI.getOpcode() == AMDGPU::S_SETPC_B64_return ||
       (MI.isReturn() && MI.isCall() && !callWaitsOnFunctionEntry(MI))) {
     Wait = Wait.combined(WCG->getAllZeroWaitcnt(/*IncludeVSCnt=*/false));
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 5962556db62eb..4933001ebc520 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2515,6 +2515,7 @@ bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
     MI.setDesc(get(ST.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64));
     break;
   }
+  case AMDGPU::SI_WHOLE_WAVE_FUNC_RETURN:
   case AMDGPU::SI_RETURN: {
     const MachineFunction *MF = MBB.getParent();
     const GCNSubtarget &ST = MF->getSubtarget<GCNSubtarget>();
@@ -5769,6 +5770,16 @@ void SIInstrInfo::restoreExec(MachineFunction &MF, MachineBasicBlock &MBB,
     Indexes->insertMachineInstrInMaps(*ExecRestoreMI);
 }
 
+MachineInstr *
+SIInstrInfo::getWholeWaveFunctionSetup(MachineBasicBlock &MBB) const {
+  assert(ST.isWholeWaveFunction() && "Not a whole wave func");
+  for (MachineInstr &MI : MBB)
+    if (MI.getOpcode() == AMDGPU::SI_SETUP_WHOLE_WAVE_FUNC)
+      return &MI;
+
+  llvm_unreachable("Couldn't find instruction. Wrong MBB?");
+}
+
 static const TargetRegisterClass *
 adjustAllocatableRegClass(const GCNSubtarget &ST, const SIRegisterInfo &RI,
                           const MachineRegisterInfo &MRI,
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index 01dd3c9f4119e..13ec9791bfa58 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -1200,6 +1200,8 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
                    MachineBasicBlock::iterator MBBI, const DebugLoc &DL,
                    Register Reg, SlotIndexes *Indexes = nullptr) const;
 
+  MachineInstr *getWholeWaveFunctionSetup(MachineBasicBlock &MBB) const;
+
   /// Return the correct register class for \p OpNo.  For target-specific
   /// instructions, this will return the register class that has been defined
   /// in tablegen.  For generic instructions, like REG_SEQUENCE it will return
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 7b45023dd3c77..27584c60c2c2e 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -644,6 +644,35 @@ def SI_INIT_WHOLE_WAVE : SPseudoInstSI <
   let isConvergent = 1;
 }
 
+let SubtargetPredicate = IsWholeWaveFunction in {
+// Sets EXEC to all lanes and returns the previous EXEC.
+def SI_SETUP_WHOLE_WAVE_FUNC : SPseudoInstSI <
+  (outs SReg_1:$dst), (ins), [(set i1:$dst, (AMDGPUwhole_wave_setup))]> {
+  let Defs = [EXEC];
+  let Uses = [EXEC];
+
+  let isConvergent = 1;
+}
+
+// Restores the previous EXEC and otherwise behaves entirely like a SI_RETURN.
+def SI_WHOLE_WAVE_FUNC_RETURN : SPseudoInstSI <
+  (outs), (ins SReg_1:$orig_exec)> {
+  let isTerminator = 1;
+  let isBarrier = 1;
+  let isReturn = 1;
+  let SchedRW = [WriteBranch];
+
+  // We're going to use custom handling to set the $orig_exec to the correct value.
+  let usesCustomInserter = 1;
+}
+
+// Generate a SI_WHOLE_WAVE_FUNC_RETURN pseudo with a placeholder for its
+// argument. It will be filled in by the custom inserter.
+def : GCNPat<
+  (AMDGPUwhole_wave_return), (SI_WHOLE_WAVE_FUNC_RETURN (i1 (IMPLICIT_DEF)))>;
+
+} // SubtargetPredicate = IsWholeWaveFunction
+
 // Return for returning shaders to a shader variant epilog.
 def SI_RETURN_TO_EPILOG : SPseudoInstSI <
   (outs), (ins variable_ops), [(AMDGPUreturn_to_epilog)]> {
diff --git a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
new file mode 100644
index 0000000000000..9e41b4e4dd614
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
@@ -0,0 +1,116 @@
+; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+whole-wave-function -stop-after=finalize-isel -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL %s
+; TODO: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+whole-wave-function < %s | FileCheck --check-prefix=GISEL %s
+
+define amdgpu_gfx i32 @basic_test(i1 %active, i32 %a, i32 %b) {
+  ; DAGISEL-LABEL: name: basic_test
+  ; DAGISEL: bb.0 (%ir-block.0):
+  ; DAGISEL-NEXT:   liveins: $vgpr0, $vgpr1
+  ; DAGISEL-NEXT: {{  $}}
+  ; DAGISEL-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+  ; DAGISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; DAGISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; DAGISEL-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 5
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_]], 0, [[COPY1]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 3
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_1]], 0, [[COPY]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[V_MOV_B32_dpp:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_]], killed [[V_CNDMASK_B32_e64_1]], 1, 1, 1, 0, implicit $exec
+  ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_dpp]]
+  ; DAGISEL-NEXT:   [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  %x = select i1 %active, i32 %a, i32 5
+  %y = select i1 %active, i32 %b, i32 3
+  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false) #0
+  ret i32 %ret
+}
+
+; Make sure we don't crash if %active is not used at all.
+define amdgpu_gfx i32 @unused_active(i1 %active, i32 %a, i32 %b) {
+  ; DAGISEL-LABEL: name: unused_active
+  ; DAGISEL: bb.0 (%ir-block.0):
+  ; DAGISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; DAGISEL-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 14, implicit $exec
+  ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_e32_]]
+  ; DAGISEL-NEXT:   [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  ret i32 14
+}
+
+define amdgpu_gfx i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
+  ; DAGISEL-LABEL: name: multiple_blocks
+  ; DAGISEL: bb.0 (%ir-block.0):
+  ; DAGISEL-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
+  ; DAGISEL-NEXT:   liveins: $vgpr0, $vgpr1
+  ; DAGISEL-NEXT: {{  $}}
+  ; DAGISEL-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+  ; DAGISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; DAGISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; DAGISEL-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY [[SI_SETUP_WHOLE_WAVE_FUNC]]
+  ; DAGISEL-NEXT:   [[V_CMP_EQ_U32_e64_:%[0-9]+]]:sreg_32 = V_CMP_EQ_U32_e64 [[COPY1]], [[COPY]], implicit $exec
+  ; DAGISEL-NEXT:   [[SI_IF:%[0-9]+]]:sreg_32 = SI_IF killed [[V_CMP_EQ_U32_e64_]], %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+  ; DAGISEL-NEXT:   S_BRANCH %bb.1
+  ; DAGISEL-NEXT: {{  $}}
+  ; DAGISEL-NEXT: bb.1.if.then:
+  ; DAGISEL-NEXT:   successors: %bb.2(0x80000000)
+  ; DAGISEL-NEXT: {{  $}}
+  ; DAGISEL-NEXT:   [[V_ADD_U32_e64_:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[COPY1]], [[COPY]], 0, implicit $exec
+  ; DAGISEL-NEXT: {{  $}}
+  ; DAGISEL-NEXT: bb.2.if.end:
+  ; DAGISEL-NEXT:   [[PHI:%[0-9]+]]:vgpr_32 = PHI [[COPY]], %bb.0, [[V_ADD_U32_e64_]], %bb.1
+  ; DAGISEL-NEXT:   SI_END_CF [[SI_IF]], implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+  ; DAGISEL-NEXT:   [[COPY3:%[0-9]+]]:sreg_32_xm0_xexec = COPY [[COPY2]]
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[PHI]], 0, [[COPY1]], [[COPY3]], implicit $exec
+  ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_CNDMASK_B32_e64_]]
+  ; DAGISEL-NEXT:   [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  %c = icmp eq i32 %a, %b
+  br i1 %c, label %if.then, label %if.end
+
+if.then:                                          ; preds = %0
+  %d = add i32 %a, %b
+  br label %if.end
+
+if.end:
+  %f = phi i32 [ %d, %if.then ], [ %b, %0 ]
+  %e = select i1 %active, i32 %a, i32 %f
+  ret i32 %e
+}
+
+define amdgpu_gfx i64 @ret_64(i1 %active, i64 %a, i64 %b) {
+  ; DAGISEL-LABEL: name: ret_64
+  ; DAGISEL: bb.0 (%ir-block.0):
+  ; DAGISEL-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+  ; DAGISEL-NEXT: {{  $}}
+  ; DAGISEL-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr3
+  ; DAGISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr2
+  ; DAGISEL-NEXT:   [[COPY2:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+  ; DAGISEL-NEXT:   [[COPY3:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; DAGISEL-NEXT:   [[DEF:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+  ; DAGISEL-NEXT:   [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+  ; DAGISEL-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY1]], %subreg.sub0, [[COPY]], %subreg.sub1
+  ; DAGISEL-NEXT:   [[DEF2:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+  ; DAGISEL-NEXT:   [[DEF3:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+  ; DAGISEL-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY3]], %subreg.sub0, [[COPY2]], %subreg.sub1
+  ; DAGISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; DAGISEL-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE1]].sub1
+  ; DAGISEL-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[S_MOV_B32_]], 0, killed [[COPY4]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE1]].sub0
+  ; DAGISEL-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 5
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_1]], 0, killed [[COPY5]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub1
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_2:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[S_MOV_B32_]], 0, killed [[COPY6]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub0
+  ; DAGISEL-NEXT:   [[S_MOV_B32_2:%[0-9]+]]:sreg_32 = S_MOV_B32 3
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_3:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_2]], 0, killed [[COPY7]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[V_MOV_B32_dpp:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_1]], killed [[V_CNDMASK_B32_e64_3]], 1, 1, 1, 0, implicit $exec
+  ; DAGISEL-NEXT:   [[V_MOV_B32_dpp1:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_]], killed [[V_CNDMASK_B32_e64_2]], 1, 1, 1, 0, implicit $exec
+  ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_dpp]]
+  ; DAGISEL-NEXT:   $vgpr1 = COPY [[V_MOV_B32_dpp1]]
+  ; DAGISEL-NEXT:   [[DEF4:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0, implicit $vgpr1
+  %x = select i1 %active, i64 %a, i64 5
+  %y = select i1 %active, i64 %b, i64 3
+  %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false) #0
+  ret i64 %ret
+}
diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir b/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
new file mode 100644
index 0000000000000..d62e90441284c
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
@@ -0,0 +1,439 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -mattr=+whole-wave-function -run-pass=prologepilog -o - %s | FileCheck %s
+
+---
+name:            save_inactive_lanes_non_csr_vgpr
+alignment:       1
+tracksRegLiveness: true
+noPhis:          true
+isSSA:           false
+noVRegs:         true
+hasFakeUses:     false
+tracksDebugUserValues: true
+frameInfo:
+  maxAlignment:    1
+  isCalleeSavedInfoValid: true
+machineFunctionInfo:
+  maxKernArgAlign: 1
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  returnsVoid:     false
+  occupancy:       16
+  sgprForEXECCopy: '$sgpr105'
+body:             |
+  bb.0:
+    ; CHECK-LABEL: name: save_inactive_lanes_non_csr_vgpr
+    ; CHECK: liveins: $vgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $sgpr0 = S_XOR_SAVEEXEC_B32 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr0, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.0, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 -1
+    ; CHECK-NEXT: $vgpr0 = V_MOV_B32_e32 14, implicit $exec
+    ; CHECK-NEXT: $exec_lo = S_XOR_B32 $sgpr0, -1, implicit-def $scc
+    ; CHECK-NEXT: $vgpr0 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit $vgpr0(tied-def 0) :: (load (s32) from %stack.0, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr0
+    ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
+    renamable $sgpr0 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    $vgpr0 = V_MOV_B32_e32 14, implicit $exec
+    SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
+
+...
+---
+name:            save_all_lanes_csr_vgpr
+alignment:       1
+tracksRegLiveness: true
+noPhis:          true
+isSSA:           false
+noVRegs:         true
+hasFakeUses:     false
+tracksDebugUserValues: true
+frameInfo:
+  maxAlignment:    1
+  isCalleeSavedInfoValid: true
+machineFunctionInfo:
+  maxKernArgAlign: 1
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  returnsVoid:     false
+  occupancy:       16
+  sgprForEXECCopy: '$sgpr105'
+body:             |
+  bb.0:
+    ; CHECK-LABEL: name: save_all_lanes_csr_vgpr
+    ; CHECK: liveins: $vgpr40
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $sgpr0 = S_OR_SAVEEXEC_B32 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr40, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.0, addrspace 5)
+    ; CHECK-NEXT: $vgpr40 = V_MOV_B32_e32 14, implicit $exec
+    ; CHECK-NEXT: $vgpr40 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.0, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr0
+    ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0
+    renamable $sgpr0 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    $vgpr40 = V_MOV_B32_e32 14, implicit $exec
+    SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0
+
+...
+---
+name:            save_csr_sgpr_to_non_csr_vgpr
+alignment:       1
+tracksRegLiveness: true
+noPhis:          true
+isSSA:           false
+noVRegs:         true
+hasFakeUses:     false
+tracksDebugUserValues: true
+frameInfo:
+  maxAlignment:    1
+  isCalleeSavedInfoValid: true
+machineFunctionInfo:
+  maxKernArgAlign: 1
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  returnsVoid:     false
+  occupancy:       16
+  sgprForEXECCopy: '$sgpr105'
+body:             |
+  bb.0:
+    liveins: $sgpr20, $vgpr191
+    ; CHECK-LABEL: name: save_csr_sgpr_to_non_csr_vgpr
+    ; CHECK: liveins: $sgpr20, $vgpr191, $vgpr192
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $vcc_lo = S_XOR_SAVEEXEC_B32 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr192, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.0, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 -1
+    ; CHECK-NEXT: $vgpr192 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr192
+    ; CHECK-NEXT: $sgpr20 = S_MOV_B32 14, implicit $exec
+    ; CHECK-NEXT: $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr192, 0
+    ; CHECK-NEXT: $exec_lo = S_XOR_B32 $vcc_lo, -1, implicit-def $scc
+    ; CHECK-NEXT: $vgpr192 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.0, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 $vcc_lo
+    ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
+    $vgpr192 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr192
+    renamable $vcc_lo = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    $sgpr20 = S_MOV_B32 14, implicit $exec
+    $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr192, 0
+    SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
+
+...
+---
+name:            save_csr_sgpr_to_csr_vgpr
+alignment:       1
+tracksRegLiveness: true
+noPhis:          true
+isSSA:           false
+noVRegs:         true
+hasFakeUses:     false
+tracksDebugUserValues: true
+frameInfo:
+  maxAlignment:    1
+  isCalleeSavedInfoValid: true
+machineFunctionInfo:
+  maxKernArgAlign: 1
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  returnsVoid:     false
+  occupancy:       16
+  sgprForEXECCopy: '$sgpr105'
+body:             |
+  bb.0:
+    liveins: $sgpr20, $vgpr191
+    ; CHECK-LABEL: name: save_csr_sgpr_to_csr_vgpr
+    ; CHECK: liveins: $sgpr20, $vgpr191
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $vcc_lo = S_OR_SAVEEXEC_B32 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr191, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.0, addrspace 5)
+    ; CHECK-NEXT: $vgpr191 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr191
+    ; CHECK-NEXT: $sgpr20 = S_MOV_B32 14, implicit $exec
+    ; CHECK-NEXT: $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr191, 0
+    ; CHECK-NEXT: $vgpr191 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.0, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 $vcc_lo
+    ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
+    $vgpr191 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr191
+    renamable $vcc_lo = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    $sgpr20 = S_MOV_B32 14, implicit $exec
+    $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr191, 0
+    SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
+
+...
+---
+name:            vgpr_and_sgpr_csr
+alignment:       1
+tracksRegLiveness: true
+noPhis:          true
+isSSA:           false
+noVRegs:         true
+hasFakeUses:     false
+tracksDebugUserValues: true
+liveins:
+  - { reg: '$vgpr0' }
+  - { reg: '$vgpr1' }
+frameInfo:
+  maxAlignment:    4
+  isCalleeSavedInfoValid: true
+machineFunctionInfo:
+  maxKernArgAlign: 1
+  hasSpilledSGPRs: true
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  returnsVoid:     false
+  occupancy:       16
+  spillPhysVGPRs:
+    - '$vgpr191'
+  wwmReservedRegs:
+    - '$vgpr191'
+body:             |
+  bb.0:
+    liveins: $sgpr20, $vgpr0, $vgpr1, $vgpr191
+
+    ; CHECK-LABEL: name: vgpr_and_sgpr_csr
+    ; CHECK: liveins: $sgpr20, $vgpr0, $vgpr1, $vgpr40, $vgpr49
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $vcc_lo = S_XOR_SAVEEXEC_B32 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr0, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.0, addrspace 5)
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr49, $sgpr32, 8, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.2, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 -1
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr40, $sgpr32, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.1, addrspace 5)
+    ; CHECK-NEXT: $vgpr0 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr0
+    ; CHECK-NEXT: S_NOP 0, implicit-def $vgpr40, implicit-def $sgpr20
+    ; CHECK-NEXT: S_NOP 0, implicit-def $vgpr49, implicit-def $sgpr40
+    ; CHECK-NEXT: $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr0, 0
+    ; CHECK-NEXT: $vgpr40 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.1, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_XOR_B32 $vcc_lo, -1, implicit-def $scc
+    ; CHECK-NEXT: $vgpr0 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.0, addrspace 5)
+    ; CHECK-NEXT: $vgpr49 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 8, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.2, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 $vcc_lo
+    ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
+    $vgpr191 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr191
+    renamable $vcc_lo = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    S_NOP 0, implicit-def $vgpr40, implicit-def $sgpr20
+    S_NOP 0, implicit-def $vgpr49, implicit-def $sgpr40
+    $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr191, 0
+    SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
+
+...
+---
+name:            split_orig_exec
+alignment:       1
+tracksRegLiveness: true
+noPhis:          true
+isSSA:           false
+noVRegs:         true
+hasFakeUses:     false
+tracksDebugUserValues: true
+liveins:
+  - { reg: '$vgpr0' }
+  - { reg: '$vgpr1' }
+frameInfo:
+  maxAlignment:    4
+  isCalleeSavedInfoValid: true
+machineFunctionInfo:
+  maxKernArgAlign: 1
+  hasSpilledSGPRs: true
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  returnsVoid:     false
+  occupancy:       16
+  spillPhysVGPRs:
+    - '$vgpr191'
+  wwmReservedRegs:
+    - '$vgpr191'
+body:             |
+  bb.0:
+    liveins: $sgpr20, $vgpr0, $vgpr1, $vgpr191
+
+    ; CHECK-LABEL: name: split_orig_exec
+    ; CHECK: liveins: $sgpr20, $vgpr0, $vgpr1, $vgpr40, $vgpr49
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $vcc_lo = S_XOR_SAVEEXEC_B32 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr0, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.0, addrspace 5)
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr49, $sgpr32, 8, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.2, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 -1
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr40, $sgpr32, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.1, addrspace 5)
+    ; CHECK-NEXT: $vgpr0 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr0
+    ; CHECK-NEXT: S_NOP 0, implicit-def $vgpr40, implicit-def $sgpr20
+    ; CHECK-NEXT: $sgpr3 = COPY $vcc_lo
+    ; CHECK-NEXT: S_NOP 0, implicit-def $vgpr49, implicit-def $sgpr40
+    ; CHECK-NEXT: $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr0, 0
+    ; CHECK-NEXT: $vgpr40 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.1, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_XOR_B32 $sgpr3, -1, implicit-def $scc
+    ; CHECK-NEXT: $vgpr0 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.0, addrspace 5)
+    ; CHECK-NEXT: $vgpr49 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 8, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.2, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr3
+    ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr3
+    $vgpr191 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr191
+    renamable $vcc_lo = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    S_NOP 0, implicit-def $vgpr40, implicit-def $sgpr20
+    $sgpr3 = COPY $vcc_lo
+    S_NOP 0, implicit-def $vgpr49, implicit-def $sgpr40
+    $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr191, 0
+    SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr3
+
+...
+---
+name:            vgpr_superregs
+alignment:       1
+tracksRegLiveness: true
+noPhis:          true
+isSSA:           false
+noVRegs:         true
+hasFakeUses:     false
+tracksDebugUserValues: true
+frameInfo:
+  maxAlignment:    1
+  isCalleeSavedInfoValid: true
+machineFunctionInfo:
+  maxKernArgAlign: 1
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  returnsVoid:     false
+  occupancy:       16
+  sgprForEXECCopy: '$sgpr105'
+body:             |
+  bb.0:
+    ; CHECK-LABEL: name: vgpr_superregs
+    ; CHECK: liveins: $vgpr0, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr40, $vgpr41, $vgpr42
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $sgpr0 = S_XOR_SAVEEXEC_B32 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr0, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.0, addrspace 5)
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr2, $sgpr32, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.1, addrspace 5)
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr3, $sgpr32, 8, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.2, addrspace 5)
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr4, $sgpr32, 12, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.3, addrspace 5)
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr5, $sgpr32, 16, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.4, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 -1
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr40, $sgpr32, 20, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.5, addrspace 5)
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr41, $sgpr32, 24, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.6, addrspace 5)
+    ; CHECK-NEXT: SCRATCH_STORE_DWORD_SADDR $vgpr42, $sgpr32, 28, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.7, addrspace 5)
+    ; CHECK-NEXT: $vgpr0 = V_MOV_B32_e32 14, implicit $exec
+    ; CHECK-NEXT: S_NOP 0, implicit-def $vgpr2_vgpr3_vgpr4_vgpr5, implicit-def $vgpr40_vgpr41_vgpr42
+    ; CHECK-NEXT: $vgpr40 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 20, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.5, addrspace 5)
+    ; CHECK-NEXT: $vgpr41 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 24, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.6, addrspace 5)
+    ; CHECK-NEXT: $vgpr42 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 28, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.7, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_XOR_B32 $sgpr0, -1, implicit-def $scc
+    ; CHECK-NEXT: $vgpr0 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit $vgpr0(tied-def 0) :: (load (s32) from %stack.0, addrspace 5)
+    ; CHECK-NEXT: $vgpr2 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.1, addrspace 5)
+    ; CHECK-NEXT: $vgpr3 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 8, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.2, addrspace 5)
+    ; CHECK-NEXT: $vgpr4 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 12, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.3, addrspace 5)
+    ; CHECK-NEXT: $vgpr5 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 16, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.4, addrspace 5)
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr0
+    ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
+    renamable $sgpr0 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    $vgpr0 = V_MOV_B32_e32 14, implicit $exec
+    S_NOP 0, implicit-def $vgpr2_vgpr3_vgpr4_vgpr5, implicit-def $vgpr40_vgpr41_vgpr42
+    SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
+
+...
+---
+name:            dont_restore_used_vgprs
+alignment:       1
+tracksRegLiveness: true
+noPhis:          true
+isSSA:           false
+noVRegs:         true
+hasFakeUses:     false
+tracksDebugUserValues: true
+liveins:
+  - { reg: '$vgpr0' }
+  - { reg: '$vgpr20' }
+  - { reg: '$vgpr40' }
+frameInfo:
+  maxAlignment:    1
+  isCalleeSavedInfoValid: true
+machineFunctionInfo:
+  maxKernArgAlign: 1
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  returnsVoid:     false
+  occupancy:       16
+  sgprForEXECCopy: '$sgpr105'
+body:             |
+  bb.0:
+    liveins: $vgpr0, $vgpr20, $vgpr40
+
+    ; CHECK-LABEL: name: dont_restore_used_vgprs
+    ; CHECK: liveins: $vgpr0, $vgpr20, $vgpr40
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 -1
+    ; CHECK-NEXT: S_NOP 0, implicit $vgpr0, implicit $vgpr20, implicit $vgpr40
+    ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr0
+    ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
+    renamable $sgpr0 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    S_NOP 0, implicit $vgpr0, implicit $vgpr20, implicit $vgpr40
+    SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
+
+...
+---
+name:            multiple_blocks
+alignment:       1
+tracksRegLiveness: true
+noPhis:          true
+isSSA:           false
+noVRegs:         true
+hasFakeUses:     false
+tracksDebugUserValues: true
+liveins:
+  - { reg: '$vgpr0' }
+  - { reg: '$vgpr1' }
+frameInfo:
+  maxAlignment:    1
+  isCalleeSavedInfoValid: true
+machineFunctionInfo:
+  maxKernArgAlign: 1
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  returnsVoid:     false
+  occupancy:       16
+  sgprForEXECCopy: '$sgpr105'
+body:             |
+  ; CHECK-LABEL: name: multiple_blocks
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   $vcc_lo = S_XOR_SAVEEXEC_B32 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
+  ; CHECK-NEXT:   SCRATCH_STORE_DWORD_SADDR $vgpr0, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.0, addrspace 5)
+  ; CHECK-NEXT:   SCRATCH_STORE_DWORD_SADDR $vgpr1, $sgpr32, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.1, addrspace 5)
+  ; CHECK-NEXT:   $exec_lo = S_MOV_B32 -1
+  ; CHECK-NEXT:   $sgpr1 = S_MOV_B32 $exec_lo
+  ; CHECK-NEXT:   V_CMPX_EQ_U32_nosdst_e64 $vgpr0, $vgpr1, implicit-def $exec, implicit $exec
+  ; CHECK-NEXT:   S_CBRANCH_EXECZ %bb.2, implicit $exec
+  ; CHECK-NEXT:   S_BRANCH %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.2(0x80000000)
+  ; CHECK-NEXT:   liveins: $vcc_lo, $sgpr1, $vgpr0, $vgpr1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   renamable $vgpr1 = V_ADD_U32_e64 $vgpr0, $vgpr1, 0, implicit $exec
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   liveins: $vcc_lo, $sgpr1, $vgpr0, $vgpr1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   $exec_lo = S_OR_B32 $exec_lo, killed renamable $sgpr1, implicit-def $scc
+  ; CHECK-NEXT:   renamable $vgpr0 = V_CNDMASK_B32_e64 0, $vgpr1, 0, $vgpr0, $vcc_lo, implicit $exec
+  ; CHECK-NEXT:   $exec_lo = S_XOR_B32 $vcc_lo, -1, implicit-def $scc
+  ; CHECK-NEXT:   $vgpr0 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit $vgpr0(tied-def 0) :: (load (s32) from %stack.0, addrspace 5)
+  ; CHECK-NEXT:   $vgpr1 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.1, addrspace 5)
+  ; CHECK-NEXT:   $exec_lo = S_MOV_B32 $vcc_lo
+  ; CHECK-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo, implicit $vgpr0
+  bb.0:
+    successors: %bb.1, %bb.2
+    liveins: $vgpr0, $vgpr1
+
+    renamable $vcc_lo = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    $sgpr1 = S_MOV_B32 $exec_lo
+    V_CMPX_EQ_U32_nosdst_e64 $vgpr0, $vgpr1, implicit-def $exec, implicit $exec
+    S_CBRANCH_EXECZ %bb.2, implicit $exec
+    S_BRANCH %bb.1
+
+  bb.1:
+    liveins: $vcc_lo, $sgpr1, $vgpr0, $vgpr1
+
+    renamable $vgpr1 = V_ADD_U32_e64 $vgpr0, $vgpr1, 0, implicit $exec
+
+  bb.2:
+    liveins: $vcc_lo, $sgpr1, $vgpr0, $vgpr1
+
+    $exec_lo = S_OR_B32 $exec_lo, killed renamable $sgpr1, implicit-def $scc
+    renamable $vgpr0 = V_CNDMASK_B32_e64 0, $vgpr1, 0, $vgpr0, $vcc_lo, implicit $exec
+    SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo, implicit $vgpr0
+
+...
diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
new file mode 100644
index 0000000000000..9a951e95f3983
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
@@ -0,0 +1,285 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+whole-wave-function -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL %s
+; TODO: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+whole-wave-function < %s | FileCheck --check-prefix=GISEL %s
+
+; Make sure the i1 %active is passed through EXEC.
+; The EXEC mask should be set to -1 for the duration of the function
+; and restored to its original value in the epilogue.
+; We will also need to restore the inactive lanes for any allocated VGPRs.
+define i32 @basic_test(i1 %active, i32 %a, i32 %b) {
+; DAGISEL-LABEL: basic_test:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x1
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32
+; DAGISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    v_dual_cndmask_b32 v0, 5, v0 :: v_dual_cndmask_b32 v1, 3, v1
+; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x1
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32
+; DAGISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+  %x = select i1 %active, i32 %a, i32 5
+  %y = select i1 %active, i32 %b, i32 3
+  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false) #0
+  ret i32 %ret
+}
+
+; Make sure we don't crash if %active is not used at all.
+define i32 @unused_active(i1 %active, i32 %a, i32 %b) {
+; DAGISEL-LABEL: unused_active:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_xor_saveexec_b32 s0, -1
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32 ; 4-byte Folded Spill
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
+; DAGISEL-NEXT:    v_mov_b32_e32 v0, 14
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, s0, -1
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32 ; 4-byte Folded Reload
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, s0
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+  ret i32 14
+}
+
+; For any used VGPRs (including those used for SGPR spills), we need to restore the inactive lanes.
+; For CSR VGPRs, we need to restore all lanes.
+define i32 @csr_default_cc(i1 %active, i32 %a, i32 %b) {
+; DAGISEL-LABEL: csr_default_cc:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x3
+; DAGISEL-NEXT:    scratch_store_b32 off, v2, s32
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32 offset:4
+; DAGISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:8
+; DAGISEL-NEXT:    scratch_store_b32 off, v49, s32 offset:16
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
+; DAGISEL-NEXT:    scratch_store_b32 off, v40, s32 offset:12 ; 4-byte Folded Spill
+; DAGISEL-NEXT:    v_writelane_b32 v2, s48, 0
+; DAGISEL-NEXT:    ;;#ASMSTART
+; DAGISEL-NEXT:    ; clobber CSR
+; DAGISEL-NEXT:    ;;#ASMEND
+; DAGISEL-NEXT:    ;;#ASMSTART
+; DAGISEL-NEXT:    ; clobber non-CSR
+; DAGISEL-NEXT:    ;;#ASMEND
+; DAGISEL-NEXT:    scratch_load_b32 v40, off, s32 offset:12 ; 4-byte Folded Reload
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    v_dual_cndmask_b32 v0, 5, v0 :: v_dual_cndmask_b32 v1, 3, v1
+; DAGISEL-NEXT:    v_readlane_b32 s48, v2, 0
+; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; DAGISEL-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x3
+; DAGISEL-NEXT:    scratch_load_b32 v2, off, s32
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32 offset:4
+; DAGISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:8
+; DAGISEL-NEXT:    scratch_load_b32 v49, off, s32 offset:16
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_wait_alu 0xf1ff
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+  %x = select i1 %active, i32 %a, i32 5
+  %y = select i1 %active, i32 %b, i32 3
+  call void asm sideeffect "; clobber CSR", "~{v40},~{s48}"()
+  call void asm sideeffect "; clobber non-CSR", "~{v49},~{s40}"()
+  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false) #0
+  ret i32 %ret
+}
+
+; Same as above, but with the amdgpu_gfx calling convention.
+define amdgpu_gfx i32 @csr_amdgpu_gfx(i1 %active, i32 %a, i32 %b) {
+; DAGISEL-LABEL: csr_amdgpu_gfx:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x3
+; DAGISEL-NEXT:    scratch_store_b32 off, v2, s32
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32 offset:4
+; DAGISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:8
+; DAGISEL-NEXT:    scratch_store_b32 off, v49, s32 offset:16
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
+; DAGISEL-NEXT:    scratch_store_b32 off, v40, s32 offset:12 ; 4-byte Folded Spill
+; DAGISEL-NEXT:    v_writelane_b32 v2, s28, 0
+; DAGISEL-NEXT:    ;;#ASMSTART
+; DAGISEL-NEXT:    ; clobber CSR
+; DAGISEL-NEXT:    ;;#ASMEND
+; DAGISEL-NEXT:    ;;#ASMSTART
+; DAGISEL-NEXT:    ; clobber non-CSR
+; DAGISEL-NEXT:    ;;#ASMEND
+; DAGISEL-NEXT:    scratch_load_b32 v40, off, s32 offset:12 ; 4-byte Folded Reload
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    v_dual_cndmask_b32 v0, 5, v0 :: v_dual_cndmask_b32 v1, 3, v1
+; DAGISEL-NEXT:    v_readlane_b32 s28, v2, 0
+; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; DAGISEL-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x3
+; DAGISEL-NEXT:    scratch_load_b32 v2, off, s32
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32 offset:4
+; DAGISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:8
+; DAGISEL-NEXT:    scratch_load_b32 v49, off, s32 offset:16
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_wait_alu 0xf1ff
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+  %x = select i1 %active, i32 %a, i32 5
+  %y = select i1 %active, i32 %b, i32 3
+  call void asm sideeffect "; clobber CSR", "~{v40},~{s28}"()
+  call void asm sideeffect "; clobber non-CSR", "~{v49},~{s40}"()
+  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false) #0
+  ret i32 %ret
+}
+
+; Save and restore all lanes of v40.
+define void @csr_vgpr_only(i1 %active, i32 %a, i32 %b) {
+; DAGISEL-LABEL: csr_vgpr_only:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_or_saveexec_b32 s0, -1
+; DAGISEL-NEXT:    scratch_store_b32 off, v40, s32 ; 4-byte Folded Spill
+; DAGISEL-NEXT:    ;;#ASMSTART
+; DAGISEL-NEXT:    ; clobber CSR VGPR
+; DAGISEL-NEXT:    ;;#ASMEND
+; DAGISEL-NEXT:    scratch_load_b32 v40, off, s32 ; 4-byte Folded Reload
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, s0
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+  call void asm sideeffect "; clobber CSR VGPR", "~{v40}"()
+  ret void
+}
+
+define void @sgpr_spill_only(i1 %active, i32 %a, i32 %b) {
+; DAGISEL-LABEL: sgpr_spill_only:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_xor_saveexec_b32 s0, -1
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32 ; 4-byte Folded Spill
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
+; DAGISEL-NEXT:    v_writelane_b32 v0, s48, 0
+; DAGISEL-NEXT:    ;;#ASMSTART
+; DAGISEL-NEXT:    ; clobber CSR SGPR
+; DAGISEL-NEXT:    ;;#ASMEND
+; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL-NEXT:    v_readlane_b32 s48, v0, 0
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, s0, -1
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32 ; 4-byte Folded Reload
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, s0
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+  call void asm sideeffect "; clobber CSR SGPR", "~{s48}"()
+  ret void
+}
+
+define i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
+; DAGISEL-LABEL: multiple_blocks:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x1
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32
+; DAGISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
+; DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; DAGISEL-NEXT:    s_mov_b32 s1, exec_lo
+; DAGISEL-NEXT:    v_cmpx_eq_u32_e64 v0, v1
+; DAGISEL-NEXT:  ; %bb.1: ; %if.then
+; DAGISEL-NEXT:    v_add_nc_u32_e32 v1, v0, v1
+; DAGISEL-NEXT:  ; %bb.2: ; %if.end
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s1
+; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL-NEXT:    v_cndmask_b32_e32 v0, v1, v0, vcc_lo
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x1
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32
+; DAGISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+  %c = icmp eq i32 %a, %b
+  br i1 %c, label %if.then, label %if.end
+
+if.then:                                          ; preds = %0
+  %d = add i32 %a, %b
+  br label %if.end
+
+if.end:
+  %f = phi i32 [ %d, %if.then ], [ %b, %0 ]
+  %e = select i1 %active, i32 %a, i32 %f
+  ret i32 %e
+}
+
+define i64 @ret_64(i1 %active, i64 %a, i64 %b) {
+; DAGISEL-LABEL: ret_64:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x3
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32
+; DAGISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; DAGISEL-NEXT:    scratch_store_b32 off, v2, s32 offset:8
+; DAGISEL-NEXT:    scratch_store_b32 off, v3, s32 offset:12
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    v_dual_cndmask_b32 v1, 0, v1 :: v_dual_cndmask_b32 v0, 5, v0
+; DAGISEL-NEXT:    v_dual_cndmask_b32 v2, 3, v2 :: v_dual_cndmask_b32 v3, 0, v3
+; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
+; DAGISEL-NEXT:    v_mov_b32_dpp v0, v2 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL-NEXT:    v_mov_b32_dpp v1, v3 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x3
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32
+; DAGISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; DAGISEL-NEXT:    scratch_load_b32 v2, off, s32 offset:8
+; DAGISEL-NEXT:    scratch_load_b32 v3, off, s32 offset:12
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+  %x = select i1 %active, i64 %a, i64 5
+  %y = select i1 %active, i64 %b, i64 3
+  %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false) #0
+  ret i64 %ret
+}
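
The checks above all share the same frame shape: save the original EXEC while flipping to the inactive lanes, spill the inactive lanes of any used VGPR, switch to a full EXEC mask for the body, and mirror the sequence in the epilogue. Below is a minimal C++ sketch of the two prologue EXEC transitions, assuming wave32 and an already-chosen scratch SGPR; the helper name and its parameter list are illustrative, not the actual SIFrameLowering entry points.

  #include "SIInstrInfo.h"
  #include "llvm/CodeGen/MachineInstrBuilder.h"
  using namespace llvm;

  // Illustrative sketch only, not the in-tree implementation.
  static void emitWholeWaveExecSwitch(MachineBasicBlock &MBB,
                                      MachineBasicBlock::iterator MBBI,
                                      const DebugLoc &DL,
                                      const SIInstrInfo *TII,
                                      Register ScratchExec) {
    // ScratchExec := EXEC; EXEC := EXEC ^ -1, i.e. only the lanes that were
    // inactive on entry are enabled while we spill their VGPR contents.
    BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_XOR_SAVEEXEC_B32), ScratchExec)
        .addImm(-1);
    // ... scratch_store the inactive lanes of the used VGPRs here ...
    // EXEC := -1, so the body runs with all lanes enabled.
    BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_MOV_B32), AMDGPU::EXEC_LO)
        .addImm(-1);
  }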

>From e1df3a085fb493bbd5f4f73801a94fb16bbe93f3 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Mon, 17 Mar 2025 12:47:21 +0100
Subject: [PATCH 03/24] Use MF instead of MBB

---
 llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | 4 ++--
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp  | 3 +--
 llvm/lib/Target/AMDGPU/SIInstrInfo.cpp     | 3 ++-
 llvm/lib/Target/AMDGPU/SIInstrInfo.h       | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index fda1a09759259..da56facba03e1 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -951,7 +951,7 @@ static Register buildScratchExecCopy(LiveRegUnits &LiveUnits,
     // we can use.
     assert(IsProlog && "Epilog should look at return, not setup");
     ScratchExecCopy =
-        TII->getWholeWaveFunctionSetup(MBB)->getOperand(0).getReg();
+        TII->getWholeWaveFunctionSetup(MF)->getOperand(0).getReg();
     assert(ScratchExecCopy && "Couldn't find copy of EXEC");
   } else {
     ScratchExecCopy = findScratchNonCalleeSaveRegister(
@@ -1030,7 +1030,7 @@ void SIFrameLowering::emitCSRSpillStores(
     // -1 here.
     if (WWMCalleeSavedRegs.empty())
       EnableAllLanes();
-    TII->getWholeWaveFunctionSetup(MBB)->eraseFromParent();
+    TII->getWholeWaveFunctionSetup(MF)->eraseFromParent();
   } else if (ScratchExecCopy) {
     // FIXME: Split block and make terminator.
     unsigned ExecMov = ST.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index a6867b690013f..be8cbc092a006 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -5874,8 +5874,7 @@ SITargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
 
     // During ISel, it's difficult to propagate the original EXEC mask to use as
     // an input to SI_WHOLE_WAVE_FUNC_RETURN. Set it up here instead.
-    MachineInstr *Setup =
-        TII->getWholeWaveFunctionSetup(*BB->getParent()->begin());
+    MachineInstr *Setup = TII->getWholeWaveFunctionSetup(*BB->getParent());
     assert(Setup && "Couldn't find SI_SETUP_WHOLE_WAVE_FUNC");
     MI.getOperand(0).setReg(Setup->getOperand(0).getReg());
     return BB;
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 4933001ebc520..18a1bf75328fb 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -5771,8 +5771,9 @@ void SIInstrInfo::restoreExec(MachineFunction &MF, MachineBasicBlock &MBB,
 }
 
 MachineInstr *
-SIInstrInfo::getWholeWaveFunctionSetup(MachineBasicBlock &MBB) const {
+SIInstrInfo::getWholeWaveFunctionSetup(MachineFunction &MF) const {
   assert(ST.isWholeWaveFunction() && "Not a whole wave func");
+  MachineBasicBlock &MBB = *MF.begin();
   for (MachineInstr &MI : MBB)
     if (MI.getOpcode() == AMDGPU::SI_SETUP_WHOLE_WAVE_FUNC)
       return &MI;
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index 13ec9791bfa58..f4fa0ddfa04e3 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -1200,7 +1200,7 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
                    MachineBasicBlock::iterator MBBI, const DebugLoc &DL,
                    Register Reg, SlotIndexes *Indexes = nullptr) const;
 
-  MachineInstr *getWholeWaveFunctionSetup(MachineBasicBlock &MBB) const;
+  MachineInstr *getWholeWaveFunctionSetup(MachineFunction &MF) const;
 
   /// Return the correct register class for \p OpNo.  For target-specific
   /// instructions, this will return the register class that has been defined
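
Since SI_SETUP_WHOLE_WAVE_FUNC is always emitted into the entry block, taking the MachineFunction spares every call site from having to name a block. A sketch of the resulting usage pattern, mirroring the updated call sites above (the wrapper itself is illustrative):

  #include "SIInstrInfo.h"
  using namespace llvm;

  // Illustrative wrapper around the updated helper.
  static Register getOriginalExec(MachineFunction &MF, const SIInstrInfo *TII) {
    MachineInstr *Setup = TII->getWholeWaveFunctionSetup(MF);
    return Setup->getOperand(0).getReg();
  }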

>From ede4ca82661c7b21015a1ac0f0eb7302d6df4891 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 11 Mar 2025 12:27:47 +0100
Subject: [PATCH 04/24] Revert "Add subtarget feature"

This reverts commit c6e9211d5644061521cbce8edac7c475c83b01d6.
---
 llvm/lib/Target/AMDGPU/AMDGPU.td      | 6 ------
 llvm/lib/Target/AMDGPU/GCNSubtarget.h | 6 ------
 2 files changed, 12 deletions(-)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index e80559da97b1e..2ffb33c58e46b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1319,12 +1319,6 @@ def FeatureMemToLDSLoad : SubtargetFeature<"vmem-to-lds-load-insts",
   "The platform has memory to lds instructions (global_load w/lds bit set, buffer_load w/lds bit set or global_load_lds. This does not include scratch_load_lds."
 >;
 
-def FeatureWholeWaveFunction : SubtargetFeature<"whole-wave-function",
-  "IsWholeWaveFunction",
-  "true",
-  "Current function is a whole wave function (runs with all lanes enabled)"
-  >;
-
 // Dummy feature used to disable assembler instructions.
 def FeatureDisable : SubtargetFeature<"",
   "FeatureDisable","true",
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index ab1a57485bae3..2f79599091faf 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -268,8 +268,6 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   bool RequiresCOV6 = false;
   bool UseBlockVGPROpsForCSR = false;
 
-  bool IsWholeWaveFunction = false;
-
   // Dummy feature to use for assembler in tablegen.
   bool FeatureDisable = false;
 
@@ -1483,10 +1481,6 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   // of sign-extending.
   bool hasGetPCZeroExtension() const { return GFX12Insts; }
 
-  /// \returns true if the current function is a whole wave function (i.e. it
-  /// runs with all the lanes enabled).
-  bool isWholeWaveFunction() const { return IsWholeWaveFunction; }
-
   /// \returns SGPR allocation granularity supported by the subtarget.
   unsigned getSGPRAllocGranule() const {
     return AMDGPU::IsaInfo::getSGPRAllocGranule(this);
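
This revert makes room for modelling the property per function rather than per target: being a whole wave function is a fact about one function's calling convention, not about the subtarget, and a later patch in this series moves the flag onto SIMachineFunctionInfo. A sketch of the query that replaces ST.isWholeWaveFunction(), assuming the MFI flag added later in the series:

  #include "SIMachineFunctionInfo.h"
  using namespace llvm;

  // Illustrative per-function query; replaces the reverted subtarget check.
  static bool isWholeWave(const MachineFunction &MF) {
    return MF.getInfo<SIMachineFunctionInfo>()->isWholeWaveFunction();
  }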

>From b56965e1120a8a0f20835d358723fb81fffc3877 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Wed, 19 Mar 2025 14:50:47 +0100
Subject: [PATCH 05/24] Add new CC. Do nothing

---
 llvm/include/llvm/IR/CallingConv.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/llvm/include/llvm/IR/CallingConv.h b/llvm/include/llvm/IR/CallingConv.h
index d68491eb5535c..7b0c9054f6aa2 100644
--- a/llvm/include/llvm/IR/CallingConv.h
+++ b/llvm/include/llvm/IR/CallingConv.h
@@ -284,6 +284,9 @@ namespace CallingConv {
     RISCV_VLSCall_32768 = 122,
     RISCV_VLSCall_65536 = 123,
 
+    // Calling convention for AMDGPU whole wave functions.
+    AMDGPU_Whole_Wave = 124,
+
     /// The highest possible ID. Must be some 2^k - 1.
     MaxID = 1023
   };
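
Nothing consumes the new ID yet; the following patches rename it to AMDGPU_WholeWave and wire it through the lowering. For orientation, a minimal sketch of how target code will eventually test for it, using this patch's spelling:

  #include "llvm/IR/Function.h"
  using namespace llvm;

  // Illustrative check; the enumerator is renamed in the next patch.
  static bool hasWholeWaveCC(const Function &F) {
    return F.getCallingConv() == CallingConv::AMDGPU_Whole_Wave;
  }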

>From 6baf7da08ddb5d0f7d63f90bcbfb4d5c5a538262 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 11 Mar 2025 12:32:09 +0100
Subject: [PATCH 06/24] Replace SubtargetFeature with CallingConv

---
 llvm/include/llvm/AsmParser/LLToken.h         |  1 +
 llvm/include/llvm/IR/CallingConv.h            |  2 +-
 llvm/lib/AsmParser/LLLexer.cpp                |  1 +
 llvm/lib/AsmParser/LLParser.cpp               |  3 +
 llvm/lib/IR/AsmWriter.cpp                     |  3 +
 llvm/lib/Target/AMDGPU/AMDGPU.td              |  2 -
 llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp |  6 +-
 llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp |  2 +
 llvm/lib/Target/AMDGPU/SIFrameLowering.cpp    |  8 +-
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp     | 13 +--
 llvm/lib/Target/AMDGPU/SIInstrInfo.cpp        |  3 +-
 llvm/lib/Target/AMDGPU/SIInstructions.td      |  3 -
 .../Target/AMDGPU/SIMachineFunctionInfo.cpp   |  7 +-
 .../lib/Target/AMDGPU/SIMachineFunctionInfo.h |  6 ++
 llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp     |  2 +
 .../Target/AMDGPU/Utils/AMDGPUPALMetadata.cpp |  1 +
 .../AMDGPU/isel-whole-wave-functions.ll       | 15 ++--
 .../AMDGPU/whole-wave-functions-pei.mir       | 11 ++-
 .../CodeGen/AMDGPU/whole-wave-functions.ll    | 84 ++++---------------
 19 files changed, 78 insertions(+), 95 deletions(-)

diff --git a/llvm/include/llvm/AsmParser/LLToken.h b/llvm/include/llvm/AsmParser/LLToken.h
index c7e4bdf3ff811..2b23225471944 100644
--- a/llvm/include/llvm/AsmParser/LLToken.h
+++ b/llvm/include/llvm/AsmParser/LLToken.h
@@ -181,6 +181,7 @@ enum Kind {
   kw_amdgpu_cs_chain_preserve,
   kw_amdgpu_kernel,
   kw_amdgpu_gfx,
+  kw_amdgpu_whole_wave,
   kw_tailcc,
   kw_m68k_rtdcc,
   kw_graalcc,
diff --git a/llvm/include/llvm/IR/CallingConv.h b/llvm/include/llvm/IR/CallingConv.h
index 7b0c9054f6aa2..417057fc1112e 100644
--- a/llvm/include/llvm/IR/CallingConv.h
+++ b/llvm/include/llvm/IR/CallingConv.h
@@ -285,7 +285,7 @@ namespace CallingConv {
     RISCV_VLSCall_65536 = 123,
 
     // Calling convention for AMDGPU whole wave functions.
-    AMDGPU_Whole_Wave = 124,
+    AMDGPU_WholeWave = 124,
 
     /// The highest possible ID. Must be some 2^k - 1.
     MaxID = 1023
diff --git a/llvm/lib/AsmParser/LLLexer.cpp b/llvm/lib/AsmParser/LLLexer.cpp
index ce813e1d7b1c4..158aa1d333c15 100644
--- a/llvm/lib/AsmParser/LLLexer.cpp
+++ b/llvm/lib/AsmParser/LLLexer.cpp
@@ -679,6 +679,7 @@ lltok::Kind LLLexer::LexIdentifier() {
   KEYWORD(amdgpu_cs_chain_preserve);
   KEYWORD(amdgpu_kernel);
   KEYWORD(amdgpu_gfx);
+  KEYWORD(amdgpu_whole_wave);
   KEYWORD(tailcc);
   KEYWORD(m68k_rtdcc);
   KEYWORD(graalcc);
diff --git a/llvm/lib/AsmParser/LLParser.cpp b/llvm/lib/AsmParser/LLParser.cpp
index c5e166cef6da6..a2866c551f8fc 100644
--- a/llvm/lib/AsmParser/LLParser.cpp
+++ b/llvm/lib/AsmParser/LLParser.cpp
@@ -2274,6 +2274,9 @@ bool LLParser::parseOptionalCallingConv(unsigned &CC) {
     CC = CallingConv::AMDGPU_CS_ChainPreserve;
     break;
   case lltok::kw_amdgpu_kernel:  CC = CallingConv::AMDGPU_KERNEL; break;
+  case lltok::kw_amdgpu_whole_wave:
+    CC = CallingConv::AMDGPU_WholeWave;
+    break;
   case lltok::kw_tailcc:         CC = CallingConv::Tail; break;
   case lltok::kw_m68k_rtdcc:     CC = CallingConv::M68k_RTD; break;
   case lltok::kw_graalcc:        CC = CallingConv::GRAAL; break;
diff --git a/llvm/lib/IR/AsmWriter.cpp b/llvm/lib/IR/AsmWriter.cpp
index 7828ba45ec27f..5a9083d8bf888 100644
--- a/llvm/lib/IR/AsmWriter.cpp
+++ b/llvm/lib/IR/AsmWriter.cpp
@@ -404,6 +404,9 @@ static void PrintCallingConv(unsigned cc, raw_ostream &Out) {
     break;
   case CallingConv::AMDGPU_KERNEL: Out << "amdgpu_kernel"; break;
   case CallingConv::AMDGPU_Gfx:    Out << "amdgpu_gfx"; break;
+  case CallingConv::AMDGPU_WholeWave:
+    Out << "amdgpu_whole_wave";
+    break;
   case CallingConv::M68k_RTD:      Out << "m68k_rtdcc"; break;
   case CallingConv::RISCV_VectorCall:
     Out << "riscv_vector_cc";
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index 2ffb33c58e46b..72d6a78539ada 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -2690,8 +2690,6 @@ def HasLshlAddU64Inst : Predicate<"Subtarget->hasLshlAddU64Inst()">,
 def HasSetPrioIncWgInst : Predicate<"Subtarget->hasSetPrioIncWgInst()">,
  AssemblerPredicate<(all_of FeatureSetPrioIncWgInst)>;
 
-def IsWholeWaveFunction : Predicate<"Subtarget->isWholeWaveFunction()">;
-
 // Include AMDGPU TD files
 include "SISchedule.td"
 include "GCNProcessors.td"
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
index 14101e57f5143..f9c12b475e557 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
@@ -1347,7 +1347,8 @@ bool AMDGPUCallLowering::lowerTailCall(
   SmallVector<std::pair<MCRegister, Register>, 12> ImplicitArgRegs;
 
   if (Info.CallConv != CallingConv::AMDGPU_Gfx &&
-      !AMDGPU::isChainCC(Info.CallConv)) {
+      !AMDGPU::isChainCC(Info.CallConv) &&
+      Info.CallConv != CallingConv::AMDGPU_WholeWave) {
     // With a fixed ABI, allocate fixed registers before user arguments.
     if (!passSpecialInputs(MIRBuilder, CCInfo, ImplicitArgRegs, Info))
       return false;
@@ -1524,7 +1525,8 @@ bool AMDGPUCallLowering::lowerCall(MachineIRBuilder &MIRBuilder,
   // after the ordinary user argument registers.
   SmallVector<std::pair<MCRegister, Register>, 12> ImplicitArgRegs;
 
-  if (Info.CallConv != CallingConv::AMDGPU_Gfx) {
+  if (Info.CallConv != CallingConv::AMDGPU_Gfx &&
+      Info.CallConv != CallingConv::AMDGPU_WholeWave) {
     // With a fixed ABI, allocate fixed registers before user arguments.
     if (!passSpecialInputs(MIRBuilder, CCInfo, ImplicitArgRegs, Info))
       return false;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index f6e0b93f85643..2eb061e56b45c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -1138,6 +1138,7 @@ CCAssignFn *AMDGPUCallLowering::CCAssignFnForCall(CallingConv::ID CC,
   case CallingConv::Cold:
     return CC_AMDGPU_Func;
   case CallingConv::AMDGPU_Gfx:
+  case CallingConv::AMDGPU_WholeWave:
     return CC_SI_Gfx;
   case CallingConv::AMDGPU_KERNEL:
   case CallingConv::SPIR_KERNEL:
@@ -1163,6 +1164,7 @@ CCAssignFn *AMDGPUCallLowering::CCAssignFnForReturn(CallingConv::ID CC,
   case CallingConv::AMDGPU_LS:
     return RetCC_SI_Shader;
   case CallingConv::AMDGPU_Gfx:
+  case CallingConv::AMDGPU_WholeWave:
     return RetCC_SI_Gfx;
   case CallingConv::C:
   case CallingConv::Fast:
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index da56facba03e1..44e5d8ef2bca4 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -946,7 +946,7 @@ static Register buildScratchExecCopy(LiveRegUnits &LiveUnits,
 
   initLiveUnits(LiveUnits, TRI, FuncInfo, MF, MBB, MBBI, IsProlog);
 
-  if (ST.isWholeWaveFunction()) {
+  if (FuncInfo->isWholeWaveFunction()) {
     // Whole wave functions already have a copy of the original EXEC mask that
     // we can use.
     assert(IsProlog && "Epilog should look at return, not setup");
@@ -1023,7 +1023,7 @@ void SIFrameLowering::emitCSRSpillStores(
   }
 
   StoreWWMRegisters(WWMCalleeSavedRegs);
-  if (ST.isWholeWaveFunction()) {
+  if (FuncInfo->isWholeWaveFunction()) {
     // SI_SETUP_WHOLE_WAVE_FUNC has outlived its purpose, so we can remove
     // it now. If we have already saved some WWM CSR registers, then the EXEC is
     // already -1 and we don't need to do anything else. Otherwise, set EXEC to
@@ -1116,7 +1116,7 @@ void SIFrameLowering::emitCSRSpillRestores(
         }
       };
 
-  if (ST.isWholeWaveFunction()) {
+  if (FuncInfo->isWholeWaveFunction()) {
     // For whole wave functions, the EXEC is already -1 at this point.
     // Therefore, we can restore the CSR WWM registers right away.
     RestoreWWMRegisters(WWMCalleeSavedRegs);
@@ -1711,7 +1711,7 @@ void SIFrameLowering::determineCalleeSaves(MachineFunction &MF,
   if (MFI->isEntryFunction())
     return;
 
-  if (ST.isWholeWaveFunction()) {
+  if (MFI->isWholeWaveFunction()) {
     // In practice, all the VGPRs are WWM registers, and we will need to save at
     // least their inactive lanes. Add them to WWMReservedRegs.
     assert(!NeedExecCopyReservedReg && "Whole wave functions can use the reg mapped for their i1 argument");
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index be8cbc092a006..9be50c89ecff2 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -2914,7 +2914,7 @@ SDValue SITargetLowering::LowerFormalArguments(
              !Info->hasWorkGroupIDZ());
   }
 
-  bool IsWholeWaveFunc = getSubtarget()->isWholeWaveFunction();
+  bool IsWholeWaveFunc = Info->isWholeWaveFunction();
 
   if (CallConv == CallingConv::AMDGPU_PS) {
     processPSInputArgs(Splits, CallConv, Ins, Skipped, FType, Info);
@@ -3351,9 +3351,9 @@ SITargetLowering::LowerReturn(SDValue Chain, CallingConv::ID CallConv,
 
   unsigned Opc = AMDGPUISD::ENDPGM;
   if (!IsWaveEnd)
-    Opc = Subtarget->isWholeWaveFunction() ? AMDGPUISD::WHOLE_WAVE_RETURN
-          : IsShader                       ? AMDGPUISD::RETURN_TO_EPILOG
-                                           : AMDGPUISD::RET_GLUE;
+    Opc = Info->isWholeWaveFunction() ? AMDGPUISD::WHOLE_WAVE_RETURN
+          : IsShader                  ? AMDGPUISD::RETURN_TO_EPILOG
+                                      : AMDGPUISD::RET_GLUE;
   return DAG.getNode(Opc, DL, MVT::Other, RetOps);
 }
 
@@ -3855,7 +3855,8 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
   CCState CCInfo(CallConv, IsVarArg, MF, ArgLocs, *DAG.getContext());
   CCAssignFn *AssignFn = CCAssignFnForCall(CallConv, IsVarArg);
 
-  if (CallConv != CallingConv::AMDGPU_Gfx && !AMDGPU::isChainCC(CallConv)) {
+  if (CallConv != CallingConv::AMDGPU_Gfx && !AMDGPU::isChainCC(CallConv) &&
+      CallConv != CallingConv::AMDGPU_WholeWave) {
     // With a fixed ABI, allocate fixed registers before user arguments.
     passSpecialInputs(CLI, CCInfo, *Info, RegsToPass, MemOpChains, Chain);
   }
@@ -5870,7 +5871,7 @@ SITargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
     return SplitBB;
   }
   case AMDGPU::SI_WHOLE_WAVE_FUNC_RETURN: {
-    assert(Subtarget->isWholeWaveFunction());
+    assert(MFI->isWholeWaveFunction());
 
     // During ISel, it's difficult to propagate the original EXEC mask to use as
     // an input to SI_WHOLE_WAVE_FUNC_RETURN. Set it up here instead.
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 18a1bf75328fb..0a9a57b574805 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -5772,7 +5772,8 @@ void SIInstrInfo::restoreExec(MachineFunction &MF, MachineBasicBlock &MBB,
 
 MachineInstr *
 SIInstrInfo::getWholeWaveFunctionSetup(MachineFunction &MF) const {
-  assert(ST.isWholeWaveFunction() && "Not a whole wave func");
+  assert(MF.getInfo<SIMachineFunctionInfo>()->isWholeWaveFunction() &&
+         "Not a whole wave func");
   MachineBasicBlock &MBB = *MF.begin();
   for (MachineInstr &MI : MBB)
     if (MI.getOpcode() == AMDGPU::SI_SETUP_WHOLE_WAVE_FUNC)
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 27584c60c2c2e..4c1b76a4fc163 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -644,7 +644,6 @@ def SI_INIT_WHOLE_WAVE : SPseudoInstSI <
   let isConvergent = 1;
 }
 
-let SubtargetPredicate = IsWholeWaveFunction in {
 // Sets EXEC to all lanes and returns the previous EXEC.
 def SI_SETUP_WHOLE_WAVE_FUNC : SPseudoInstSI <
   (outs SReg_1:$dst), (ins), [(set i1:$dst, (AMDGPUwhole_wave_setup))]> {
@@ -671,8 +670,6 @@ def SI_WHOLE_WAVE_FUNC_RETURN : SPseudoInstSI <
 def : GCNPat<
   (AMDGPUwhole_wave_return), (SI_WHOLE_WAVE_FUNC_RETURN (i1 (IMPLICIT_DEF)))>;
 
-} // SubtargetPredicate = IsWholeWaveFunction
-
 // Return for returning shaders to a shader variant epilog.
 def SI_RETURN_TO_EPILOG : SPseudoInstSI <
   (outs), (ins variable_ops), [(AMDGPUreturn_to_epilog)]> {
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index 67ad28661da43..8ffe3a70041eb 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -41,7 +41,8 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
       WorkGroupIDZ(false), WorkGroupInfo(false), LDSKernelId(false),
       PrivateSegmentWaveByteOffset(false), WorkItemIDX(false),
       WorkItemIDY(false), WorkItemIDZ(false), ImplicitArgPtr(false),
-      GITPtrHigh(0xffffffff), HighBitsOf32BitAddress(0) {
+      GITPtrHigh(0xffffffff), HighBitsOf32BitAddress(0),
+      IsWholeWaveFunction(F.getCallingConv() == CallingConv::AMDGPU_WholeWave) {
   const GCNSubtarget &ST = *static_cast<const GCNSubtarget *>(STI);
   FlatWorkGroupSizes = ST.getFlatWorkGroupSizes(F);
   WavesPerEU = ST.getWavesPerEU(F);
@@ -89,7 +90,7 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
 
     ImplicitArgPtr = false;
   } else if (!isEntryFunction()) {
-    if (CC != CallingConv::AMDGPU_Gfx)
+    if (CC != CallingConv::AMDGPU_Gfx && CC != CallingConv::AMDGPU_WholeWave)
       ArgInfo = AMDGPUArgumentUsageInfo::FixedABIFunctionInfo;
 
     FrameOffsetReg = AMDGPU::SGPR33;
@@ -722,6 +723,7 @@ yaml::SIMachineFunctionInfo::SIMachineFunctionInfo(
       PSInputAddr(MFI.getPSInputAddr()), PSInputEnable(MFI.getPSInputEnable()),
       MaxMemoryClusterDWords(MFI.getMaxMemoryClusterDWords()),
       Mode(MFI.getMode()), HasInitWholeWave(MFI.hasInitWholeWave()),
+      IsWholeWaveFunction(MFI.isWholeWaveFunction()),
       DynamicVGPRBlockSize(MFI.getDynamicVGPRBlockSize()),
       ScratchReservedForDynamicVGPRs(MFI.getScratchReservedForDynamicVGPRs()) {
   for (Register Reg : MFI.getSGPRSpillPhysVGPRs())
@@ -768,6 +770,7 @@ bool SIMachineFunctionInfo::initializeBaseYamlFields(
   HasSpilledVGPRs = YamlMFI.HasSpilledVGPRs;
   BytesInStackArgArea = YamlMFI.BytesInStackArgArea;
   ReturnsVoid = YamlMFI.ReturnsVoid;
+  IsWholeWaveFunction = YamlMFI.IsWholeWaveFunction;
 
   if (YamlMFI.ScavengeFI) {
     auto FIOrErr = YamlMFI.ScavengeFI->getFI(MF.getFrameInfo());
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
index 274a60adb8d07..08b0206d244fb 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
@@ -298,6 +298,7 @@ struct SIMachineFunctionInfo final : public yaml::MachineFunctionInfo {
   StringValue LongBranchReservedReg;
 
   bool HasInitWholeWave = false;
+  bool IsWholeWaveFunction = false;
 
   unsigned DynamicVGPRBlockSize = 0;
   unsigned ScratchReservedForDynamicVGPRs = 0;
@@ -356,6 +357,7 @@ template <> struct MappingTraits<SIMachineFunctionInfo> {
     YamlIO.mapOptional("dynamicVGPRBlockSize", MFI.DynamicVGPRBlockSize, false);
     YamlIO.mapOptional("scratchReservedForDynamicVGPRs",
                        MFI.ScratchReservedForDynamicVGPRs, 0);
+    YamlIO.mapOptional("isWholeWaveFunction", MFI.IsWholeWaveFunction, false);
   }
 };
 
@@ -565,6 +567,8 @@ class SIMachineFunctionInfo final : public AMDGPUMachineFunction,
   // the serialization easier.
   ReservedRegSet WWMReservedRegs;
 
+  bool IsWholeWaveFunction = false;
+
   using PrologEpilogSGPRSpill =
       std::pair<Register, PrologEpilogSGPRSaveRestoreInfo>;
   // To track the SGPR spill method used for a CSR SGPR register during
@@ -670,6 +674,8 @@ class SIMachineFunctionInfo final : public AMDGPUMachineFunction,
     return WWMReservedRegs.contains(Reg);
   }
 
+  bool isWholeWaveFunction() const { return IsWholeWaveFunction; }
+
   ArrayRef<PrologEpilogSGPRSpill> getPrologEpilogSGPRSpills() const {
     assert(is_sorted(PrologEpilogSGPRSpills, llvm::less_first()));
     return PrologEpilogSGPRSpills;
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index 8c3873d23419f..02488426df369 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -407,6 +407,7 @@ const MCPhysReg *SIRegisterInfo::getCalleeSavedRegs(
     return ST.hasGFX90AInsts() ? CSR_AMDGPU_GFX90AInsts_SaveList
                                : CSR_AMDGPU_SaveList;
   case CallingConv::AMDGPU_Gfx:
+  case CallingConv::AMDGPU_WholeWave:
     return ST.hasGFX90AInsts() ? CSR_AMDGPU_SI_Gfx_GFX90AInsts_SaveList
                                : CSR_AMDGPU_SI_Gfx_SaveList;
   case CallingConv::AMDGPU_CS_ChainPreserve:
@@ -433,6 +434,7 @@ const uint32_t *SIRegisterInfo::getCallPreservedMask(const MachineFunction &MF,
     return ST.hasGFX90AInsts() ? CSR_AMDGPU_GFX90AInsts_RegMask
                                : CSR_AMDGPU_RegMask;
   case CallingConv::AMDGPU_Gfx:
+  case CallingConv::AMDGPU_WholeWave:
     return ST.hasGFX90AInsts() ? CSR_AMDGPU_SI_Gfx_GFX90AInsts_RegMask
                                : CSR_AMDGPU_SI_Gfx_RegMask;
   case CallingConv::AMDGPU_CS_Chain:
diff --git a/llvm/lib/Target/AMDGPU/Utils/AMDGPUPALMetadata.cpp b/llvm/lib/Target/AMDGPU/Utils/AMDGPUPALMetadata.cpp
index 79152e707a25c..f622c367ae204 100644
--- a/llvm/lib/Target/AMDGPU/Utils/AMDGPUPALMetadata.cpp
+++ b/llvm/lib/Target/AMDGPU/Utils/AMDGPUPALMetadata.cpp
@@ -44,6 +44,7 @@ static const char *getStageName(CallingConv::ID CC) {
   case CallingConv::AMDGPU_LS:
     return ".ls";
   case CallingConv::AMDGPU_Gfx:
+  case CallingConv::AMDGPU_WholeWave:
     llvm_unreachable("Callable shader has no hardware stage");
   default:
     return ".cs";
diff --git a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
index 9e41b4e4dd614..f3f16f4659a5b 100644
--- a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+whole-wave-function -stop-after=finalize-isel -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -stop-after=finalize-isel -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL %s
 ; TODO: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+whole-wave-function < %s | FileCheck --check-prefix=GISEL %s
 
-define amdgpu_gfx i32 @basic_test(i1 %active, i32 %a, i32 %b) {
+define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-LABEL: name: basic_test
   ; DAGISEL: bb.0 (%ir-block.0):
   ; DAGISEL-NEXT:   liveins: $vgpr0, $vgpr1
@@ -20,12 +20,12 @@ define amdgpu_gfx i32 @basic_test(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
   %x = select i1 %active, i32 %a, i32 5
   %y = select i1 %active, i32 %b, i32 3
-  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false) #0
+  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false)
   ret i32 %ret
 }
 
 ; Make sure we don't crash if %active is not used at all.
-define amdgpu_gfx i32 @unused_active(i1 %active, i32 %a, i32 %b) {
+define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-LABEL: name: unused_active
   ; DAGISEL: bb.0 (%ir-block.0):
   ; DAGISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
@@ -36,7 +36,7 @@ define amdgpu_gfx i32 @unused_active(i1 %active, i32 %a, i32 %b) {
   ret i32 14
 }
 
-define amdgpu_gfx i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
+define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-LABEL: name: multiple_blocks
   ; DAGISEL: bb.0 (%ir-block.0):
   ; DAGISEL-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
@@ -76,7 +76,7 @@ if.end:
   ret i32 %e
 }
 
-define amdgpu_gfx i64 @ret_64(i1 %active, i64 %a, i64 %b) {
+define amdgpu_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
   ; DAGISEL-LABEL: name: ret_64
   ; DAGISEL: bb.0 (%ir-block.0):
   ; DAGISEL-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
@@ -111,6 +111,7 @@ define amdgpu_gfx i64 @ret_64(i1 %active, i64 %a, i64 %b) {
   ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0, implicit $vgpr1
   %x = select i1 %active, i64 %a, i64 5
   %y = select i1 %active, i64 %b, i64 3
-  %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false) #0
+  %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false)
   ret i64 %ret
 }
+
diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir b/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
index d62e90441284c..a5a35c40b719c 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
@@ -1,5 +1,5 @@
 # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
-# RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -mattr=+whole-wave-function -run-pass=prologepilog -o - %s | FileCheck %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -run-pass=prologepilog -o - %s | FileCheck %s
 
 ---
 name:            save_inactive_lanes_non_csr_vgpr
@@ -20,6 +20,7 @@ machineFunctionInfo:
   returnsVoid:     false
   occupancy:       16
   sgprForEXECCopy: '$sgpr105'
+  isWholeWaveFunction: true
 body:             |
   bb.0:
     ; CHECK-LABEL: name: save_inactive_lanes_non_csr_vgpr
@@ -57,6 +58,7 @@ machineFunctionInfo:
   returnsVoid:     false
   occupancy:       16
   sgprForEXECCopy: '$sgpr105'
+  isWholeWaveFunction: true
 body:             |
   bb.0:
     ; CHECK-LABEL: name: save_all_lanes_csr_vgpr
@@ -92,6 +94,7 @@ machineFunctionInfo:
   returnsVoid:     false
   occupancy:       16
   sgprForEXECCopy: '$sgpr105'
+  isWholeWaveFunction: true
 body:             |
   bb.0:
     liveins: $sgpr20, $vgpr191
@@ -134,6 +137,7 @@ machineFunctionInfo:
   returnsVoid:     false
   occupancy:       16
   sgprForEXECCopy: '$sgpr105'
+  isWholeWaveFunction: true
 body:             |
   bb.0:
     liveins: $sgpr20, $vgpr191
@@ -181,6 +185,7 @@ machineFunctionInfo:
     - '$vgpr191'
   wwmReservedRegs:
     - '$vgpr191'
+  isWholeWaveFunction: true
 body:             |
   bb.0:
     liveins: $sgpr20, $vgpr0, $vgpr1, $vgpr191
@@ -237,6 +242,7 @@ machineFunctionInfo:
     - '$vgpr191'
   wwmReservedRegs:
     - '$vgpr191'
+  isWholeWaveFunction: true
 body:             |
   bb.0:
     liveins: $sgpr20, $vgpr0, $vgpr1, $vgpr191
@@ -288,6 +294,7 @@ machineFunctionInfo:
   returnsVoid:     false
   occupancy:       16
   sgprForEXECCopy: '$sgpr105'
+  isWholeWaveFunction: true
 body:             |
   bb.0:
     ; CHECK-LABEL: name: vgpr_superregs
@@ -345,6 +352,7 @@ machineFunctionInfo:
   returnsVoid:     false
   occupancy:       16
   sgprForEXECCopy: '$sgpr105'
+  isWholeWaveFunction: true
 body:             |
   bb.0:
     liveins: $vgpr0, $vgpr20, $vgpr40
@@ -383,6 +391,7 @@ machineFunctionInfo:
   returnsVoid:     false
   occupancy:       16
   sgprForEXECCopy: '$sgpr105'
+  isWholeWaveFunction: true
 body:             |
   ; CHECK-LABEL: name: multiple_blocks
   ; CHECK: bb.0:
diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
index 9a951e95f3983..c6890414ed5bc 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
@@ -1,12 +1,12 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+whole-wave-function -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL %s
 ; TODO: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+whole-wave-function < %s | FileCheck --check-prefix=GISEL %s
 
 ; Make sure the i1 %active is passed through EXEC.
 ; The EXEC mask should be set to -1 for the duration of the function
 ; and restored to its original value in the epilogue.
 ; We will also need to restore the inactive lanes for any allocated VGPRs.
-define i32 @basic_test(i1 %active, i32 %a, i32 %b) {
+define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: basic_test:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -32,12 +32,12 @@ define i32 @basic_test(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
   %x = select i1 %active, i32 %a, i32 5
   %y = select i1 %active, i32 %b, i32 3
-  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false) #0
+  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false)
   ret i32 %ret
 }
 
 ; Make sure we don't crash if %active is not used at all.
-define i32 @unused_active(i1 %active, i32 %a, i32 %b) {
+define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: unused_active:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -60,8 +60,8 @@ define i32 @unused_active(i1 %active, i32 %a, i32 %b) {
 
 ; For any used VGPRs (including those used for SGPR spills), we need to restore the inactive lanes.
 ; For CSR VGPRs, we need to restore all lanes.
-define i32 @csr_default_cc(i1 %active, i32 %a, i32 %b) {
-; DAGISEL-LABEL: csr_default_cc:
+define amdgpu_whole_wave i32 @csr(i1 %active, i32 %a, i32 %b) {
+; DAGISEL-LABEL: csr:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; DAGISEL-NEXT:    s_wait_expcnt 0x0
@@ -76,17 +76,17 @@ define i32 @csr_default_cc(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-NEXT:    scratch_store_b32 off, v49, s32 offset:16
 ; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
 ; DAGISEL-NEXT:    scratch_store_b32 off, v40, s32 offset:12 ; 4-byte Folded Spill
-; DAGISEL-NEXT:    v_writelane_b32 v2, s48, 0
 ; DAGISEL-NEXT:    ;;#ASMSTART
 ; DAGISEL-NEXT:    ; clobber CSR
 ; DAGISEL-NEXT:    ;;#ASMEND
+; DAGISEL-NEXT:    v_writelane_b32 v2, s20, 0
 ; DAGISEL-NEXT:    ;;#ASMSTART
 ; DAGISEL-NEXT:    ; clobber non-CSR
 ; DAGISEL-NEXT:    ;;#ASMEND
 ; DAGISEL-NEXT:    scratch_load_b32 v40, off, s32 offset:12 ; 4-byte Folded Reload
 ; DAGISEL-NEXT:    s_wait_alu 0xfffe
 ; DAGISEL-NEXT:    v_dual_cndmask_b32 v0, 5, v0 :: v_dual_cndmask_b32 v1, 3, v1
-; DAGISEL-NEXT:    v_readlane_b32 s48, v2, 0
+; DAGISEL-NEXT:    v_readlane_b32 s20, v2, 0
 ; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; DAGISEL-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
 ; DAGISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
@@ -102,61 +102,13 @@ define i32 @csr_default_cc(i1 %active, i32 %a, i32 %b) {
   %x = select i1 %active, i32 %a, i32 5
   %y = select i1 %active, i32 %b, i32 3
   call void asm sideeffect "; clobber CSR", "~{v40},~{s48}"()
-  call void asm sideeffect "; clobber non-CSR", "~{v49},~{s40}"()
-  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false) #0
-  ret i32 %ret
-}
-
-; Same as above, but with the amdgpu_gfx calling convention.
-define amdgpu_gfx i32 @csr_amdgpu_gfx(i1 %active, i32 %a, i32 %b) {
-; DAGISEL-LABEL: csr_amdgpu_gfx:
-; DAGISEL:       ; %bb.0:
-; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
-; DAGISEL-NEXT:    s_wait_expcnt 0x0
-; DAGISEL-NEXT:    s_wait_samplecnt 0x0
-; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
-; DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; DAGISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
-; DAGISEL-NEXT:    s_clause 0x3
-; DAGISEL-NEXT:    scratch_store_b32 off, v2, s32
-; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32 offset:4
-; DAGISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:8
-; DAGISEL-NEXT:    scratch_store_b32 off, v49, s32 offset:16
-; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
-; DAGISEL-NEXT:    scratch_store_b32 off, v40, s32 offset:12 ; 4-byte Folded Spill
-; DAGISEL-NEXT:    v_writelane_b32 v2, s28, 0
-; DAGISEL-NEXT:    ;;#ASMSTART
-; DAGISEL-NEXT:    ; clobber CSR
-; DAGISEL-NEXT:    ;;#ASMEND
-; DAGISEL-NEXT:    ;;#ASMSTART
-; DAGISEL-NEXT:    ; clobber non-CSR
-; DAGISEL-NEXT:    ;;#ASMEND
-; DAGISEL-NEXT:    scratch_load_b32 v40, off, s32 offset:12 ; 4-byte Folded Reload
-; DAGISEL-NEXT:    s_wait_alu 0xfffe
-; DAGISEL-NEXT:    v_dual_cndmask_b32 v0, 5, v0 :: v_dual_cndmask_b32 v1, 3, v1
-; DAGISEL-NEXT:    v_readlane_b32 s28, v2, 0
-; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_2)
-; DAGISEL-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
-; DAGISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
-; DAGISEL-NEXT:    s_clause 0x3
-; DAGISEL-NEXT:    scratch_load_b32 v2, off, s32
-; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32 offset:4
-; DAGISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:8
-; DAGISEL-NEXT:    scratch_load_b32 v49, off, s32 offset:16
-; DAGISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
-; DAGISEL-NEXT:    s_wait_loadcnt 0x0
-; DAGISEL-NEXT:    s_wait_alu 0xf1ff
-; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-  %x = select i1 %active, i32 %a, i32 5
-  %y = select i1 %active, i32 %b, i32 3
-  call void asm sideeffect "; clobber CSR", "~{v40},~{s28}"()
-  call void asm sideeffect "; clobber non-CSR", "~{v49},~{s40}"()
-  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false) #0
+  call void asm sideeffect "; clobber non-CSR", "~{v49},~{s20}"()
+  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false)
   ret i32 %ret
 }
 
 ; Save and restore all lanes of v40.
-define void @csr_vgpr_only(i1 %active, i32 %a, i32 %b) {
+define amdgpu_whole_wave void @csr_vgpr_only(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: csr_vgpr_only:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -178,7 +130,7 @@ define void @csr_vgpr_only(i1 %active, i32 %a, i32 %b) {
   ret void
 }
 
-define void @sgpr_spill_only(i1 %active, i32 %a, i32 %b) {
+define amdgpu_whole_wave void @sgpr_spill_only(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: sgpr_spill_only:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -189,23 +141,23 @@ define void @sgpr_spill_only(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-NEXT:    s_xor_saveexec_b32 s0, -1
 ; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32 ; 4-byte Folded Spill
 ; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
-; DAGISEL-NEXT:    v_writelane_b32 v0, s48, 0
+; DAGISEL-NEXT:    v_writelane_b32 v0, s68, 0
 ; DAGISEL-NEXT:    ;;#ASMSTART
 ; DAGISEL-NEXT:    ; clobber CSR SGPR
 ; DAGISEL-NEXT:    ;;#ASMEND
 ; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; DAGISEL-NEXT:    v_readlane_b32 s48, v0, 0
+; DAGISEL-NEXT:    v_readlane_b32 s68, v0, 0
 ; DAGISEL-NEXT:    s_wait_alu 0xfffe
 ; DAGISEL-NEXT:    s_xor_b32 exec_lo, s0, -1
 ; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32 ; 4-byte Folded Reload
 ; DAGISEL-NEXT:    s_mov_b32 exec_lo, s0
 ; DAGISEL-NEXT:    s_wait_loadcnt 0x0
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-  call void asm sideeffect "; clobber CSR SGPR", "~{s48}"()
+  call void asm sideeffect "; clobber CSR SGPR", "~{s68}"()
   ret void
 }
 
-define i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
+define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: multiple_blocks:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -248,7 +200,7 @@ if.end:
   ret i32 %e
 }
 
-define i64 @ret_64(i1 %active, i64 %a, i64 %b) {
+define amdgpu_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
 ; DAGISEL-LABEL: ret_64:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -280,6 +232,6 @@ define i64 @ret_64(i1 %active, i64 %a, i64 %b) {
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
   %x = select i1 %active, i64 %a, i64 5
   %y = select i1 %active, i64 %b, i64 3
-  %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false) #0
+  %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false)
   ret i64 %ret
 }
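
With the lexer, parser, printer and MachineFunctionInfo plumbing in place, the convention is usable end to end. A hedged sketch of building such a function through the C++ API (the function name is illustrative; the leading i1 parameter models the %active argument the tests above rely on):

  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  // Sketch: programmatic equivalent of
  //   define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b)
  static Function *makeWholeWaveFunc(Module &M) {
    LLVMContext &Ctx = M.getContext();
    FunctionType *FTy = FunctionType::get(
        Type::getInt32Ty(Ctx),
        {Type::getInt1Ty(Ctx), Type::getInt32Ty(Ctx), Type::getInt32Ty(Ctx)},
        /*isVarArg=*/false);
    Function *F = Function::Create(FTy, GlobalValue::ExternalLinkage,
                                   "basic_test", &M);
    F->setCallingConv(CallingConv::AMDGPU_WholeWave);
    return F;
  }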

>From 39a24b36eba80317183a5e1aebe63d0c013c3fa9 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Mon, 17 Mar 2025 14:00:49 +0100
Subject: [PATCH 07/24] Enable gisel in tests

---
 llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll | 2 +-
 llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll      | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
index f3f16f4659a5b..23e97dd2e2fdf 100644
--- a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -stop-after=finalize-isel -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL %s
-; TODO: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+whole-wave-function < %s | FileCheck --check-prefix=GISEL %s
+; RUN: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -stop-after=finalize-isel -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL %s
 
 define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-LABEL: name: basic_test
diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
index c6890414ed5bc..6663fe89f0bc7 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL %s
-; TODO: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+whole-wave-function < %s | FileCheck --check-prefix=GISEL %s
+; RUN: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL %s
 
 ; Make sure the i1 %active is passed through EXEC.
 ; The EXEC mask should be set to -1 for the duration of the function

>From 06e10ebe469d171c9f2e5c80757343c579eb5f8d Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 11 Mar 2025 12:26:55 +0100
Subject: [PATCH 08/24] GISel support

---
 llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp |  24 ++-
 llvm/lib/Target/AMDGPU/AMDGPUGISel.td         |   4 +
 .../AMDGPU/AMDGPUInstructionSelector.cpp      |   4 +
 .../Target/AMDGPU/AMDGPURegisterBankInfo.cpp  |   4 +
 llvm/lib/Target/AMDGPU/SIInstrInfo.cpp        |   5 +-
 llvm/lib/Target/AMDGPU/SIInstructions.td      |  14 ++
 .../regbankselect-whole-wave-functions.mir    |  40 ++++
 .../irtranslator-whole-wave-functions.ll      | 103 ++++++++++
 .../AMDGPU/isel-whole-wave-functions.ll       |  73 +++++++
 .../CodeGen/AMDGPU/whole-wave-functions.ll    | 182 ++++++++++++++++++
 10 files changed, 449 insertions(+), 4 deletions(-)
 create mode 100644 llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-whole-wave-functions.mir
 create mode 100644 llvm/test/CodeGen/AMDGPU/irtranslator-whole-wave-functions.ll

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
index f9c12b475e557..47d9aab81cb35 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
@@ -374,8 +374,10 @@ bool AMDGPUCallLowering::lowerReturn(MachineIRBuilder &B, const Value *Val,
     return true;
   }
 
-  unsigned ReturnOpc =
-      IsShader ? AMDGPU::SI_RETURN_TO_EPILOG : AMDGPU::SI_RETURN;
+  const bool IsWholeWave = MFI->isWholeWaveFunction();
+  unsigned ReturnOpc = IsWholeWave ? AMDGPU::G_AMDGPU_WHOLE_WAVE_FUNC_RETURN
+                       : IsShader  ? AMDGPU::SI_RETURN_TO_EPILOG
+                                   : AMDGPU::SI_RETURN;
   auto Ret = B.buildInstrNoInsert(ReturnOpc);
 
   if (!FLI.CanLowerReturn)
@@ -383,6 +385,13 @@ bool AMDGPUCallLowering::lowerReturn(MachineIRBuilder &B, const Value *Val,
   else if (!lowerReturnVal(B, Val, VRegs, Ret))
     return false;
 
+  if (IsWholeWave) {
+    const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
+    const SIInstrInfo *TII = ST.getInstrInfo();
+    const MachineInstr *Setup = TII->getWholeWaveFunctionSetup(MF);
+    Ret.addReg(Setup->getOperand(0).getReg());
+  }
+
   // TODO: Handle CalleeSavedRegsViaCopy.
 
   B.insertInstr(Ret);
@@ -632,6 +641,17 @@ bool AMDGPUCallLowering::lowerFormalArguments(
     if (DL.getTypeStoreSize(Arg.getType()) == 0)
       continue;
 
+    if (Info->isWholeWaveFunction() && Idx == 0) {
+      assert(VRegs[Idx].size() == 1 && "Expected only one register");
+
+      // The first argument for whole wave functions is the original EXEC value.
+      B.buildInstr(AMDGPU::G_AMDGPU_WHOLE_WAVE_FUNC_SETUP)
+          .addDef(VRegs[Idx][0]);
+
+      ++Idx;
+      continue;
+    }
+
     const bool InReg = Arg.hasAttribute(Attribute::InReg);
 
     if (Arg.hasAttribute(Attribute::SwiftSelf) ||
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
index 1b909568fc555..c5063c4de4ad3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
@@ -300,6 +300,10 @@ def : GINodeEquiv<G_AMDGPU_S_BUFFER_LOAD_SSHORT, SIsbuffer_load_short>;
 def : GINodeEquiv<G_AMDGPU_S_BUFFER_LOAD_USHORT, SIsbuffer_load_ushort>;
 def : GINodeEquiv<G_AMDGPU_S_BUFFER_PREFETCH, SIsbuffer_prefetch>;
 
+def : GINodeEquiv<G_AMDGPU_WHOLE_WAVE_FUNC_SETUP, AMDGPUwhole_wave_setup>;
+// G_AMDGPU_WHOLE_WAVE_FUNC_RETURN is simpler than AMDGPUwhole_wave_return,
+// so we don't mark it as equivalent.
+
 class GISelSop2Pat <
   SDPatternOperator node,
   Instruction inst,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index b632b16f5c198..d86e7735a07bc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -4141,6 +4141,10 @@ bool AMDGPUInstructionSelector::select(MachineInstr &I) {
     return true;
   case AMDGPU::G_AMDGPU_WAVE_ADDRESS:
     return selectWaveAddress(I);
+  case AMDGPU::G_AMDGPU_WHOLE_WAVE_FUNC_RETURN: {
+    I.setDesc(TII.get(AMDGPU::SI_WHOLE_WAVE_FUNC_RETURN));
+    return true;
+  }
   case AMDGPU::G_STACKRESTORE:
     return selectStackRestore(I);
   case AMDGPU::G_PHI:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index b20760c356263..a07699ae1eb23 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -5458,6 +5458,10 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
   case AMDGPU::G_PREFETCH:
     OpdsMapping[0] = getSGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
     break;
+  case AMDGPU::G_AMDGPU_WHOLE_WAVE_FUNC_SETUP:
+  case AMDGPU::G_AMDGPU_WHOLE_WAVE_FUNC_RETURN:
+    OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1);
+    break;
   }
 
   return getInstructionMapping(/*ID*/1, /*Cost*/1,
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 0a9a57b574805..23413edb6e18a 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -5776,10 +5776,11 @@ SIInstrInfo::getWholeWaveFunctionSetup(MachineFunction &MF) const {
          "Not a whole wave func");
   MachineBasicBlock &MBB = *MF.begin();
   for (MachineInstr &MI : MBB)
-    if (MI.getOpcode() == AMDGPU::SI_SETUP_WHOLE_WAVE_FUNC)
+    if (MI.getOpcode() == AMDGPU::SI_SETUP_WHOLE_WAVE_FUNC ||
+        MI.getOpcode() == AMDGPU::G_AMDGPU_WHOLE_WAVE_FUNC_SETUP)
       return &MI;
 
-  llvm_unreachable("Couldn't find instruction. Wrong MBB?");
+  llvm_unreachable("Couldn't find SI_SETUP_WHOLE_WAVE_FUNC instruction");
 }
 
 static const TargetRegisterClass *
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 4c1b76a4fc163..575ebbea034c5 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -4316,6 +4316,20 @@ def G_AMDGPU_S_MUL_I64_I32 : AMDGPUGenericInstruction {
   let hasSideEffects = 0;
 }
 
+def G_AMDGPU_WHOLE_WAVE_FUNC_SETUP : AMDGPUGenericInstruction {
+  let OutOperandList = (outs type0:$origExec);
+  let InOperandList = (ins);
+  let isConvergent = 1;
+}
+
+def G_AMDGPU_WHOLE_WAVE_FUNC_RETURN : AMDGPUGenericInstruction {
+  let OutOperandList = (outs);
+  let InOperandList = (ins type0:$origExec);
+  let isTerminator = 1;
+  let isBarrier = 1;
+  let isReturn = 1;
+}
+
 // This is equivalent to the G_INTRINSIC*, but the operands may have
 // been legalized depending on the subtarget requirements.
 def G_AMDGPU_INTRIN_IMAGE_LOAD : AMDGPUGenericInstruction {
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-whole-wave-functions.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-whole-wave-functions.mir
new file mode 100644
index 0000000000000..beca901945753
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-whole-wave-functions.mir
@@ -0,0 +1,40 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 -run-pass=regbankselect %s -verify-machineinstrs -o - -regbankselect-fast | FileCheck %s
+# RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 -run-pass=regbankselect %s -verify-machineinstrs -o - -regbankselect-greedy | FileCheck %s
+# RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+wavefrontsize64 -run-pass=regbankselect %s -verify-machineinstrs -o - -regbankselect-greedy | FileCheck %s
+---
+name:            basic_test
+legalized:       true
+machineFunctionInfo:
+  isWholeWaveFunction: true
+body:             |
+  bb.1:
+    liveins: $vgpr0, $vgpr1
+
+    ; CHECK-LABEL: name: basic_test
+    ; CHECK: liveins: $vgpr0, $vgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr(s32) = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr(s32) = COPY $vgpr1
+    ; CHECK-NEXT: [[AMDGPU_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:vcc(s1) = G_AMDGPU_WHOLE_WAVE_FUNC_SETUP
+    ; CHECK-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 5
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr(s32) = COPY [[C]](s32)
+    ; CHECK-NEXT: [[SELECT:%[0-9]+]]:vgpr(s32) = G_SELECT [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), [[COPY]], [[COPY2]]
+    ; CHECK-NEXT: [[C1:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 3
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr(s32) = COPY [[C1]](s32)
+    ; CHECK-NEXT: [[SELECT1:%[0-9]+]]:vgpr(s32) = G_SELECT [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), [[COPY1]], [[COPY3]]
+    ; CHECK-NEXT: [[INTRINSIC_CONVERGENT:%[0-9]+]]:vgpr(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.update.dpp), [[SELECT]](s32), [[SELECT1]](s32), 1, 1, 1, 0
+    ; CHECK-NEXT: $vgpr0 = COPY [[INTRINSIC_CONVERGENT]](s32)
+    ; CHECK-NEXT: G_AMDGPU_WHOLE_WAVE_FUNC_RETURN [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), implicit $vgpr0
+    %1:_(s32) = COPY $vgpr0
+    %2:_(s32) = COPY $vgpr1
+    %0:_(s1) = G_AMDGPU_WHOLE_WAVE_FUNC_SETUP
+    %12:_(s32) = G_CONSTANT i32 5
+    %11:_(s32) = G_SELECT %0(s1), %1, %12
+    %14:_(s32) = G_CONSTANT i32 3
+    %13:_(s32) = G_SELECT %0(s1), %2, %14
+    %15:_(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.update.dpp), %11(s32), %13(s32), 1, 1, 1, 0
+    $vgpr0 = COPY %15(s32)
+    G_AMDGPU_WHOLE_WAVE_FUNC_RETURN %0(s1), implicit $vgpr0
+
+...
diff --git a/llvm/test/CodeGen/AMDGPU/irtranslator-whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/irtranslator-whole-wave-functions.ll
new file mode 100644
index 0000000000000..f18d8128a91ff
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/irtranslator-whole-wave-functions.ll
@@ -0,0 +1,103 @@
+; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -stop-after=irtranslator -verify-machineinstrs < %s | FileCheck %s
+
+define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
+  ; CHECK-LABEL: name: basic_test
+  ; CHECK: bb.1 (%ir-block.0):
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+  ; CHECK-NEXT:   [[AMDGPU_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:_(s1) = G_AMDGPU_WHOLE_WAVE_FUNC_SETUP
+  ; CHECK-NEXT:   [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 5
+  ; CHECK-NEXT:   [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 3
+  ; CHECK-NEXT:   [[SELECT:%[0-9]+]]:_(s32) = G_SELECT [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), [[COPY]], [[C]]
+  ; CHECK-NEXT:   [[SELECT1:%[0-9]+]]:_(s32) = G_SELECT [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), [[COPY1]], [[C1]]
+  ; CHECK-NEXT:   [[INTRINSIC_CONVERGENT:%[0-9]+]]:_(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.update.dpp), [[SELECT]](s32), [[SELECT1]](s32), 1, 1, 1, 0
+  ; CHECK-NEXT:   $vgpr0 = COPY [[INTRINSIC_CONVERGENT]](s32)
+  ; CHECK-NEXT:   G_AMDGPU_WHOLE_WAVE_FUNC_RETURN [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), implicit $vgpr0
+  %x = select i1 %active, i32 %a, i32 5
+  %y = select i1 %active, i32 %b, i32 3
+  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false)
+  ret i32 %ret
+}
+
+; Make sure we don't crash if %active is not used at all.
+define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
+  ; CHECK-LABEL: name: unused_active
+  ; CHECK: bb.1 (%ir-block.0):
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+  ; CHECK-NEXT:   [[AMDGPU_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:_(s1) = G_AMDGPU_WHOLE_WAVE_FUNC_SETUP
+  ; CHECK-NEXT:   [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 14
+  ; CHECK-NEXT:   $vgpr0 = COPY [[C]](s32)
+  ; CHECK-NEXT:   G_AMDGPU_WHOLE_WAVE_FUNC_RETURN [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), implicit $vgpr0
+  ret i32 14
+}
+
+define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
+  ; CHECK-LABEL: name: multiple_blocks
+  ; CHECK: bb.1 (%ir-block.0):
+  ; CHECK-NEXT:   successors: %bb.2(0x40000000), %bb.3(0x40000000)
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+  ; CHECK-NEXT:   [[AMDGPU_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:_(s1) = G_AMDGPU_WHOLE_WAVE_FUNC_SETUP
+  ; CHECK-NEXT:   [[ICMP:%[0-9]+]]:_(s1) = G_ICMP intpred(eq), [[COPY]](s32), [[COPY1]]
+  ; CHECK-NEXT:   [[INT:%[0-9]+]]:_(s1), [[INT1:%[0-9]+]]:_(s32) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.if), [[ICMP]](s1)
+  ; CHECK-NEXT:   G_BRCOND [[INT]](s1), %bb.2
+  ; CHECK-NEXT:   G_BR %bb.3
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2.if.then:
+  ; CHECK-NEXT:   successors: %bb.3(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[ADD:%[0-9]+]]:_(s32) = G_ADD [[COPY]], [[COPY1]]
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.3.if.end:
+  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:_(s32) = G_PHI [[COPY1]](s32), %bb.1, [[ADD]](s32), %bb.2
+  ; CHECK-NEXT:   G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.end.cf), [[INT1]](s32)
+  ; CHECK-NEXT:   [[SELECT:%[0-9]+]]:_(s32) = G_SELECT [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), [[COPY]], [[PHI]]
+  ; CHECK-NEXT:   $vgpr0 = COPY [[SELECT]](s32)
+  ; CHECK-NEXT:   G_AMDGPU_WHOLE_WAVE_FUNC_RETURN [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), implicit $vgpr0
+  %c = icmp eq i32 %a, %b
+  br i1 %c, label %if.then, label %if.end
+
+if.then:                                          ; preds = %0
+  %d = add i32 %a, %b
+  br label %if.end
+
+if.end:
+  %f = phi i32 [ %d, %if.then ], [ %b, %0 ]
+  %e = select i1 %active, i32 %a, i32 %f
+  ret i32 %e
+}
+
+define amdgpu_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
+  ; CHECK-LABEL: name: ret_64
+  ; CHECK: bb.1 (%ir-block.0):
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:_(s32) = COPY $vgpr1
+  ; CHECK-NEXT:   [[MV:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY]](s32), [[COPY1]](s32)
+  ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:_(s32) = COPY $vgpr2
+  ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr3
+  ; CHECK-NEXT:   [[MV1:%[0-9]+]]:_(s64) = G_MERGE_VALUES [[COPY2]](s32), [[COPY3]](s32)
+  ; CHECK-NEXT:   [[AMDGPU_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:_(s1) = G_AMDGPU_WHOLE_WAVE_FUNC_SETUP
+  ; CHECK-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 5
+  ; CHECK-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 3
+  ; CHECK-NEXT:   [[SELECT:%[0-9]+]]:_(s64) = G_SELECT [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), [[MV]], [[C]]
+  ; CHECK-NEXT:   [[SELECT1:%[0-9]+]]:_(s64) = G_SELECT [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), [[MV1]], [[C1]]
+  ; CHECK-NEXT:   [[INTRINSIC_CONVERGENT:%[0-9]+]]:_(s64) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.update.dpp), [[SELECT]](s64), [[SELECT1]](s64), 1, 1, 1, 0
+  ; CHECK-NEXT:   [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[INTRINSIC_CONVERGENT]](s64)
+  ; CHECK-NEXT:   $vgpr0 = COPY [[UV]](s32)
+  ; CHECK-NEXT:   $vgpr1 = COPY [[UV1]](s32)
+  ; CHECK-NEXT:   G_AMDGPU_WHOLE_WAVE_FUNC_RETURN [[AMDGPU_WHOLE_WAVE_FUNC_SETUP]](s1), implicit $vgpr0, implicit $vgpr1
+  %x = select i1 %active, i64 %a, i64 5
+  %y = select i1 %active, i64 %b, i64 3
+  %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false)
+  ret i64 %ret
+}
diff --git a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
index 23e97dd2e2fdf..300c7863b6966 100644
--- a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
@@ -18,6 +18,23 @@ define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_dpp]]
   ; DAGISEL-NEXT:   [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
   ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  ;
+  ; GISEL-LABEL: name: basic_test
+  ; GISEL: bb.1 (%ir-block.0):
+  ; GISEL-NEXT:   liveins: $vgpr0, $vgpr1
+  ; GISEL-NEXT: {{  $}}
+  ; GISEL-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+  ; GISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; GISEL-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 5
+  ; GISEL-NEXT:   [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_]]
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[COPY2]], 0, [[COPY]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 3
+  ; GISEL-NEXT:   [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_1]]
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[COPY3]], 0, [[COPY1]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[V_MOV_B32_dpp:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_]], [[V_CNDMASK_B32_e64_1]], 1, 1, 1, 0, implicit $exec
+  ; GISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_dpp]]
+  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
   %x = select i1 %active, i32 %a, i32 5
   %y = select i1 %active, i32 %b, i32 3
   %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false)
@@ -33,6 +50,15 @@ define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_e32_]]
   ; DAGISEL-NEXT:   [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
   ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  ;
+  ; GISEL-LABEL: name: unused_active
+  ; GISEL: bb.1 (%ir-block.0):
+  ; GISEL-NEXT:   liveins: $vgpr0, $vgpr1
+  ; GISEL-NEXT: {{  $}}
+  ; GISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; GISEL-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 14
+  ; GISEL-NEXT:   $vgpr0 = COPY [[S_MOV_B32_]]
+  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
   ret i32 14
 }
 
@@ -63,6 +89,30 @@ define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_CNDMASK_B32_e64_]]
   ; DAGISEL-NEXT:   [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
   ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  ;
+  ; GISEL-LABEL: name: multiple_blocks
+  ; GISEL: bb.1 (%ir-block.0):
+  ; GISEL-NEXT:   successors: %bb.2(0x40000000), %bb.3(0x40000000)
+  ; GISEL-NEXT:   liveins: $vgpr0, $vgpr1
+  ; GISEL-NEXT: {{  $}}
+  ; GISEL-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+  ; GISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; GISEL-NEXT:   [[V_CMP_EQ_U32_e64_:%[0-9]+]]:sreg_32_xm0_xexec = V_CMP_EQ_U32_e64 [[COPY]], [[COPY1]], implicit $exec
+  ; GISEL-NEXT:   [[SI_IF:%[0-9]+]]:sreg_32_xm0_xexec = SI_IF [[V_CMP_EQ_U32_e64_]], %bb.3, implicit-def $exec, implicit-def $scc, implicit $exec
+  ; GISEL-NEXT:   S_BRANCH %bb.2
+  ; GISEL-NEXT: {{  $}}
+  ; GISEL-NEXT: bb.2.if.then:
+  ; GISEL-NEXT:   successors: %bb.3(0x80000000)
+  ; GISEL-NEXT: {{  $}}
+  ; GISEL-NEXT:   [[V_ADD_U32_e64_:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[COPY]], [[COPY1]], 0, implicit $exec
+  ; GISEL-NEXT: {{  $}}
+  ; GISEL-NEXT: bb.3.if.end:
+  ; GISEL-NEXT:   [[PHI:%[0-9]+]]:vgpr_32 = PHI [[COPY1]], %bb.1, [[V_ADD_U32_e64_]], %bb.2
+  ; GISEL-NEXT:   SI_END_CF [[SI_IF]], implicit-def $exec, implicit-def $scc, implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[PHI]], 0, [[COPY]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   $vgpr0 = COPY [[V_CNDMASK_B32_e64_]]
+  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
   %c = icmp eq i32 %a, %b
   br i1 %c, label %if.then, label %if.end
 
@@ -109,6 +159,29 @@ define amdgpu_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
   ; DAGISEL-NEXT:   $vgpr1 = COPY [[V_MOV_B32_dpp1]]
   ; DAGISEL-NEXT:   [[DEF4:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
   ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0, implicit $vgpr1
+  ;
+  ; GISEL-LABEL: name: ret_64
+  ; GISEL: bb.1 (%ir-block.0):
+  ; GISEL-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+  ; GISEL-NEXT: {{  $}}
+  ; GISEL-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+  ; GISEL-NEXT:   [[COPY2:%[0-9]+]]:vgpr_32 = COPY $vgpr2
+  ; GISEL-NEXT:   [[COPY3:%[0-9]+]]:vgpr_32 = COPY $vgpr3
+  ; GISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; GISEL-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 5, implicit $exec
+  ; GISEL-NEXT:   [[V_MOV_B32_e32_1:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_]], 0, [[COPY]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_1]], 0, [[COPY1]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[V_MOV_B32_e32_2:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 3, implicit $exec
+  ; GISEL-NEXT:   [[V_MOV_B32_e32_3:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_2:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_2]], 0, [[COPY2]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_3:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_3]], 0, [[COPY3]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[V_MOV_B32_dpp:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_]], [[V_CNDMASK_B32_e64_2]], 1, 1, 1, 0, implicit $exec
+  ; GISEL-NEXT:   [[V_MOV_B32_dpp1:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_1]], [[V_CNDMASK_B32_e64_3]], 1, 1, 1, 0, implicit $exec
+  ; GISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_dpp]]
+  ; GISEL-NEXT:   $vgpr1 = COPY [[V_MOV_B32_dpp1]]
+  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0, implicit $vgpr1
   %x = select i1 %active, i64 %a, i64 5
   %y = select i1 %active, i64 %b, i64 3
   %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false)
diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
index 6663fe89f0bc7..715244d39765f 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
@@ -30,6 +30,30 @@ define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
 ; DAGISEL-NEXT:    s_wait_loadcnt 0x0
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-LABEL: basic_test:
+; GISEL:       ; %bb.0:
+; GISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-NEXT:    s_wait_expcnt 0x0
+; GISEL-NEXT:    s_wait_samplecnt 0x0
+; GISEL-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-NEXT:    s_wait_kmcnt 0x0
+; GISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; GISEL-NEXT:    s_clause 0x1
+; GISEL-NEXT:    scratch_store_b32 off, v0, s32
+; GISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; GISEL-NEXT:    s_mov_b32 exec_lo, -1
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    v_dual_cndmask_b32 v0, 5, v0 :: v_dual_cndmask_b32 v1, 3, v1
+; GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GISEL-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; GISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; GISEL-NEXT:    s_clause 0x1
+; GISEL-NEXT:    scratch_load_b32 v0, off, s32
+; GISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; GISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; GISEL-NEXT:    s_wait_loadcnt 0x0
+; GISEL-NEXT:    s_setpc_b64 s[30:31]
   %x = select i1 %active, i32 %a, i32 5
   %y = select i1 %active, i32 %b, i32 3
   %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false)
@@ -55,6 +79,24 @@ define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-NEXT:    s_mov_b32 exec_lo, s0
 ; DAGISEL-NEXT:    s_wait_loadcnt 0x0
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-LABEL: unused_active:
+; GISEL:       ; %bb.0:
+; GISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-NEXT:    s_wait_expcnt 0x0
+; GISEL-NEXT:    s_wait_samplecnt 0x0
+; GISEL-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-NEXT:    s_wait_kmcnt 0x0
+; GISEL-NEXT:    s_xor_saveexec_b32 s0, -1
+; GISEL-NEXT:    scratch_store_b32 off, v0, s32 ; 4-byte Folded Spill
+; GISEL-NEXT:    s_mov_b32 exec_lo, -1
+; GISEL-NEXT:    v_mov_b32_e32 v0, 14
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    s_xor_b32 exec_lo, s0, -1
+; GISEL-NEXT:    scratch_load_b32 v0, off, s32 ; 4-byte Folded Reload
+; GISEL-NEXT:    s_mov_b32 exec_lo, s0
+; GISEL-NEXT:    s_wait_loadcnt 0x0
+; GISEL-NEXT:    s_setpc_b64 s[30:31]
   ret i32 14
 }
 
@@ -99,6 +141,45 @@ define amdgpu_whole_wave i32 @csr(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-NEXT:    s_wait_loadcnt 0x0
 ; DAGISEL-NEXT:    s_wait_alu 0xf1ff
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-LABEL: csr:
+; GISEL:       ; %bb.0:
+; GISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-NEXT:    s_wait_expcnt 0x0
+; GISEL-NEXT:    s_wait_samplecnt 0x0
+; GISEL-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-NEXT:    s_wait_kmcnt 0x0
+; GISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; GISEL-NEXT:    s_clause 0x3
+; GISEL-NEXT:    scratch_store_b32 off, v2, s32
+; GISEL-NEXT:    scratch_store_b32 off, v0, s32 offset:4
+; GISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:8
+; GISEL-NEXT:    scratch_store_b32 off, v49, s32 offset:16
+; GISEL-NEXT:    s_mov_b32 exec_lo, -1
+; GISEL-NEXT:    scratch_store_b32 off, v40, s32 offset:12 ; 4-byte Folded Spill
+; GISEL-NEXT:    ;;#ASMSTART
+; GISEL-NEXT:    ; clobber CSR
+; GISEL-NEXT:    ;;#ASMEND
+; GISEL-NEXT:    v_writelane_b32 v2, s20, 0
+; GISEL-NEXT:    ;;#ASMSTART
+; GISEL-NEXT:    ; clobber non-CSR
+; GISEL-NEXT:    ;;#ASMEND
+; GISEL-NEXT:    scratch_load_b32 v40, off, s32 offset:12 ; 4-byte Folded Reload
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    v_dual_cndmask_b32 v0, 5, v0 :: v_dual_cndmask_b32 v1, 3, v1
+; GISEL-NEXT:    v_readlane_b32 s20, v2, 0
+; GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GISEL-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; GISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; GISEL-NEXT:    s_clause 0x3
+; GISEL-NEXT:    scratch_load_b32 v2, off, s32
+; GISEL-NEXT:    scratch_load_b32 v0, off, s32 offset:4
+; GISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:8
+; GISEL-NEXT:    scratch_load_b32 v49, off, s32 offset:16
+; GISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; GISEL-NEXT:    s_wait_loadcnt 0x0
+; GISEL-NEXT:    s_wait_alu 0xf1ff
+; GISEL-NEXT:    s_setpc_b64 s[30:31]
   %x = select i1 %active, i32 %a, i32 5
   %y = select i1 %active, i32 %b, i32 3
   call void asm sideeffect "; clobber CSR", "~{v40},~{s48}"()
@@ -126,6 +207,24 @@ define amdgpu_whole_wave void @csr_vgpr_only(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-NEXT:    s_mov_b32 exec_lo, s0
 ; DAGISEL-NEXT:    s_wait_loadcnt 0x0
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-LABEL: csr_vgpr_only:
+; GISEL:       ; %bb.0:
+; GISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-NEXT:    s_wait_expcnt 0x0
+; GISEL-NEXT:    s_wait_samplecnt 0x0
+; GISEL-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-NEXT:    s_wait_kmcnt 0x0
+; GISEL-NEXT:    s_or_saveexec_b32 s0, -1
+; GISEL-NEXT:    scratch_store_b32 off, v40, s32 ; 4-byte Folded Spill
+; GISEL-NEXT:    ;;#ASMSTART
+; GISEL-NEXT:    ; clobber CSR VGPR
+; GISEL-NEXT:    ;;#ASMEND
+; GISEL-NEXT:    scratch_load_b32 v40, off, s32 ; 4-byte Folded Reload
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    s_mov_b32 exec_lo, s0
+; GISEL-NEXT:    s_wait_loadcnt 0x0
+; GISEL-NEXT:    s_setpc_b64 s[30:31]
   call void asm sideeffect "; clobber CSR VGPR", "~{v40}"()
   ret void
 }
@@ -153,6 +252,29 @@ define amdgpu_whole_wave void @sgpr_spill_only(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-NEXT:    s_mov_b32 exec_lo, s0
 ; DAGISEL-NEXT:    s_wait_loadcnt 0x0
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-LABEL: sgpr_spill_only:
+; GISEL:       ; %bb.0:
+; GISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-NEXT:    s_wait_expcnt 0x0
+; GISEL-NEXT:    s_wait_samplecnt 0x0
+; GISEL-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-NEXT:    s_wait_kmcnt 0x0
+; GISEL-NEXT:    s_xor_saveexec_b32 s0, -1
+; GISEL-NEXT:    scratch_store_b32 off, v0, s32 ; 4-byte Folded Spill
+; GISEL-NEXT:    s_mov_b32 exec_lo, -1
+; GISEL-NEXT:    v_writelane_b32 v0, s68, 0
+; GISEL-NEXT:    ;;#ASMSTART
+; GISEL-NEXT:    ; clobber CSR SGPR
+; GISEL-NEXT:    ;;#ASMEND
+; GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GISEL-NEXT:    v_readlane_b32 s68, v0, 0
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    s_xor_b32 exec_lo, s0, -1
+; GISEL-NEXT:    scratch_load_b32 v0, off, s32 ; 4-byte Folded Reload
+; GISEL-NEXT:    s_mov_b32 exec_lo, s0
+; GISEL-NEXT:    s_wait_loadcnt 0x0
+; GISEL-NEXT:    s_setpc_b64 s[30:31]
   call void asm sideeffect "; clobber CSR SGPR", "~{s68}"()
   ret void
 }
@@ -187,6 +309,36 @@ define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
 ; DAGISEL-NEXT:    s_wait_loadcnt 0x0
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-LABEL: multiple_blocks:
+; GISEL:       ; %bb.0:
+; GISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-NEXT:    s_wait_expcnt 0x0
+; GISEL-NEXT:    s_wait_samplecnt 0x0
+; GISEL-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-NEXT:    s_wait_kmcnt 0x0
+; GISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; GISEL-NEXT:    s_clause 0x1
+; GISEL-NEXT:    scratch_store_b32 off, v0, s32
+; GISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; GISEL-NEXT:    s_mov_b32 exec_lo, -1
+; GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GISEL-NEXT:    s_mov_b32 s1, exec_lo
+; GISEL-NEXT:    v_cmpx_eq_u32_e64 v0, v1
+; GISEL-NEXT:  ; %bb.1: ; %if.then
+; GISEL-NEXT:    v_add_nc_u32_e32 v1, v0, v1
+; GISEL-NEXT:  ; %bb.2: ; %if.end
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s1
+; GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GISEL-NEXT:    v_cndmask_b32_e32 v0, v1, v0, vcc_lo
+; GISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; GISEL-NEXT:    s_clause 0x1
+; GISEL-NEXT:    scratch_load_b32 v0, off, s32
+; GISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; GISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; GISEL-NEXT:    s_wait_loadcnt 0x0
+; GISEL-NEXT:    s_setpc_b64 s[30:31]
   %c = icmp eq i32 %a, %b
   br i1 %c, label %if.then, label %if.end
 
@@ -230,6 +382,36 @@ define amdgpu_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
 ; DAGISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
 ; DAGISEL-NEXT:    s_wait_loadcnt 0x0
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-LABEL: ret_64:
+; GISEL:       ; %bb.0:
+; GISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-NEXT:    s_wait_expcnt 0x0
+; GISEL-NEXT:    s_wait_samplecnt 0x0
+; GISEL-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-NEXT:    s_wait_kmcnt 0x0
+; GISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; GISEL-NEXT:    s_clause 0x3
+; GISEL-NEXT:    scratch_store_b32 off, v0, s32
+; GISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; GISEL-NEXT:    scratch_store_b32 off, v2, s32 offset:8
+; GISEL-NEXT:    scratch_store_b32 off, v3, s32 offset:12
+; GISEL-NEXT:    s_mov_b32 exec_lo, -1
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    v_dual_cndmask_b32 v0, 5, v0 :: v_dual_cndmask_b32 v1, 0, v1
+; GISEL-NEXT:    v_dual_cndmask_b32 v2, 3, v2 :: v_dual_cndmask_b32 v3, 0, v3
+; GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GISEL-NEXT:    v_mov_b32_dpp v0, v2 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; GISEL-NEXT:    v_mov_b32_dpp v1, v3 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; GISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; GISEL-NEXT:    s_clause 0x3
+; GISEL-NEXT:    scratch_load_b32 v0, off, s32
+; GISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; GISEL-NEXT:    scratch_load_b32 v2, off, s32 offset:8
+; GISEL-NEXT:    scratch_load_b32 v3, off, s32 offset:12
+; GISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; GISEL-NEXT:    s_wait_loadcnt 0x0
+; GISEL-NEXT:    s_setpc_b64 s[30:31]
   %x = select i1 %active, i64 %a, i64 5
   %y = select i1 %active, i64 %b, i64 3
   %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false)

>From 44095f6f490d4b2479415d5bd07bf35a97531c1a Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Wed, 19 Mar 2025 10:56:02 +0100
Subject: [PATCH 09/24] Rename pseudo to match others

---
 llvm/lib/Target/AMDGPU/SIFrameLowering.cpp    |  2 +-
 llvm/lib/Target/AMDGPU/SIInstrInfo.cpp        |  2 +-
 llvm/lib/Target/AMDGPU/SIInstructions.td      |  2 +-
 .../AMDGPU/isel-whole-wave-functions.ll       | 60 +++++++++----------
 .../AMDGPU/whole-wave-functions-pei.mir       | 18 +++---
 5 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index 44e5d8ef2bca4..702e090b5c93e 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -1024,7 +1024,7 @@ void SIFrameLowering::emitCSRSpillStores(
 
   StoreWWMRegisters(WWMCalleeSavedRegs);
   if (FuncInfo->isWholeWaveFunction()) {
-    // SI_SETUP_WHOLE_WAVE_FUNCTION has outlived its purpose, so we can remove
+    // SI_WHOLE_WAVE_FUNC_SETUP has outlived its purpose, so we can remove
     // it now. If we have already saved some WWM CSR registers, then the EXEC is
     // already -1 and we don't need to do anything else. Otherwise, set EXEC to
     // -1 here.
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 23413edb6e18a..fc469f19c7808 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -5776,7 +5776,7 @@ SIInstrInfo::getWholeWaveFunctionSetup(MachineFunction &MF) const {
          "Not a whole wave func");
   MachineBasicBlock &MBB = *MF.begin();
   for (MachineInstr &MI : MBB)
-    if (MI.getOpcode() == AMDGPU::SI_SETUP_WHOLE_WAVE_FUNC ||
+    if (MI.getOpcode() == AMDGPU::SI_WHOLE_WAVE_FUNC_SETUP ||
         MI.getOpcode() == AMDGPU::G_AMDGPU_WHOLE_WAVE_FUNC_SETUP)
       return &MI;
 
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 575ebbea034c5..225a073db33d1 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -645,7 +645,7 @@ def SI_INIT_WHOLE_WAVE : SPseudoInstSI <
 }
 
 // Sets EXEC to all lanes and returns the previous EXEC.
-def SI_SETUP_WHOLE_WAVE_FUNC : SPseudoInstSI <
+def SI_WHOLE_WAVE_FUNC_SETUP : SPseudoInstSI <
   (outs SReg_1:$dst), (ins), [(set i1:$dst, (AMDGPUwhole_wave_setup))]> {
   let Defs = [EXEC];
   let Uses = [EXEC];
diff --git a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
index 300c7863b6966..851dc5107a8a1 100644
--- a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
@@ -9,15 +9,15 @@ define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-NEXT: {{  $}}
   ; DAGISEL-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr1
   ; DAGISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr0
-  ; DAGISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; DAGISEL-NEXT:   [[SI_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:sreg_32_xm0_xexec = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
   ; DAGISEL-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 5
-  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_]], 0, [[COPY1]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_]], 0, [[COPY1]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; DAGISEL-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 3
-  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_1]], 0, [[COPY]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_1]], 0, [[COPY]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; DAGISEL-NEXT:   [[V_MOV_B32_dpp:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_]], killed [[V_CNDMASK_B32_e64_1]], 1, 1, 1, 0, implicit $exec
   ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_dpp]]
   ; DAGISEL-NEXT:   [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
-  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $vgpr0
   ;
   ; GISEL-LABEL: name: basic_test
   ; GISEL: bb.1 (%ir-block.0):
@@ -25,16 +25,16 @@ define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
   ; GISEL-NEXT: {{  $}}
   ; GISEL-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
   ; GISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
-  ; GISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; GISEL-NEXT:   [[SI_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:sreg_32_xm0_xexec = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
   ; GISEL-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 5
   ; GISEL-NEXT:   [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_]]
-  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[COPY2]], 0, [[COPY]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[COPY2]], 0, [[COPY]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; GISEL-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 3
   ; GISEL-NEXT:   [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_1]]
-  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[COPY3]], 0, [[COPY1]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[COPY3]], 0, [[COPY1]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; GISEL-NEXT:   [[V_MOV_B32_dpp:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_]], [[V_CNDMASK_B32_e64_1]], 1, 1, 1, 0, implicit $exec
   ; GISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_dpp]]
-  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $vgpr0
   %x = select i1 %active, i32 %a, i32 5
   %y = select i1 %active, i32 %b, i32 3
   %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false)
@@ -45,20 +45,20 @@ define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
 define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-LABEL: name: unused_active
   ; DAGISEL: bb.0 (%ir-block.0):
-  ; DAGISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; DAGISEL-NEXT:   [[SI_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:sreg_32 = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
   ; DAGISEL-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 14, implicit $exec
   ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_e32_]]
   ; DAGISEL-NEXT:   [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
-  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $vgpr0
   ;
   ; GISEL-LABEL: name: unused_active
   ; GISEL: bb.1 (%ir-block.0):
   ; GISEL-NEXT:   liveins: $vgpr0, $vgpr1
   ; GISEL-NEXT: {{  $}}
-  ; GISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; GISEL-NEXT:   [[SI_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:sreg_32_xm0_xexec = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
   ; GISEL-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 14
   ; GISEL-NEXT:   $vgpr0 = COPY [[S_MOV_B32_]]
-  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $vgpr0
   ret i32 14
 }
 
@@ -70,8 +70,8 @@ define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-NEXT: {{  $}}
   ; DAGISEL-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr1
   ; DAGISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr0
-  ; DAGISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
-  ; DAGISEL-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY [[SI_SETUP_WHOLE_WAVE_FUNC]]
+  ; DAGISEL-NEXT:   [[SI_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:sreg_32 = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
+  ; DAGISEL-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY [[SI_WHOLE_WAVE_FUNC_SETUP]]
   ; DAGISEL-NEXT:   [[V_CMP_EQ_U32_e64_:%[0-9]+]]:sreg_32 = V_CMP_EQ_U32_e64 [[COPY1]], [[COPY]], implicit $exec
   ; DAGISEL-NEXT:   [[SI_IF:%[0-9]+]]:sreg_32 = SI_IF killed [[V_CMP_EQ_U32_e64_]], %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
   ; DAGISEL-NEXT:   S_BRANCH %bb.1
@@ -88,7 +88,7 @@ define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[PHI]], 0, [[COPY1]], [[COPY3]], implicit $exec
   ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_CNDMASK_B32_e64_]]
   ; DAGISEL-NEXT:   [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
-  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $vgpr0
   ;
   ; GISEL-LABEL: name: multiple_blocks
   ; GISEL: bb.1 (%ir-block.0):
@@ -97,7 +97,7 @@ define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
   ; GISEL-NEXT: {{  $}}
   ; GISEL-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
   ; GISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
-  ; GISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; GISEL-NEXT:   [[SI_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:sreg_32_xm0_xexec = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
   ; GISEL-NEXT:   [[V_CMP_EQ_U32_e64_:%[0-9]+]]:sreg_32_xm0_xexec = V_CMP_EQ_U32_e64 [[COPY]], [[COPY1]], implicit $exec
   ; GISEL-NEXT:   [[SI_IF:%[0-9]+]]:sreg_32_xm0_xexec = SI_IF [[V_CMP_EQ_U32_e64_]], %bb.3, implicit-def $exec, implicit-def $scc, implicit $exec
   ; GISEL-NEXT:   S_BRANCH %bb.2
@@ -110,9 +110,9 @@ define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
   ; GISEL-NEXT: bb.3.if.end:
   ; GISEL-NEXT:   [[PHI:%[0-9]+]]:vgpr_32 = PHI [[COPY1]], %bb.1, [[V_ADD_U32_e64_]], %bb.2
   ; GISEL-NEXT:   SI_END_CF [[SI_IF]], implicit-def $exec, implicit-def $scc, implicit $exec
-  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[PHI]], 0, [[COPY]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[PHI]], 0, [[COPY]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; GISEL-NEXT:   $vgpr0 = COPY [[V_CNDMASK_B32_e64_]]
-  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0
+  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $vgpr0
   %c = icmp eq i32 %a, %b
   br i1 %c, label %if.then, label %if.end
 
@@ -141,24 +141,24 @@ define amdgpu_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
   ; DAGISEL-NEXT:   [[DEF2:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
   ; DAGISEL-NEXT:   [[DEF3:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
   ; DAGISEL-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY3]], %subreg.sub0, [[COPY2]], %subreg.sub1
-  ; DAGISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; DAGISEL-NEXT:   [[SI_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:sreg_32_xm0_xexec = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
   ; DAGISEL-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE1]].sub1
   ; DAGISEL-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
-  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[S_MOV_B32_]], 0, killed [[COPY4]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[S_MOV_B32_]], 0, killed [[COPY4]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; DAGISEL-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE1]].sub0
   ; DAGISEL-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 5
-  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_1]], 0, killed [[COPY5]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_1]], 0, killed [[COPY5]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; DAGISEL-NEXT:   [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub1
-  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_2:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[S_MOV_B32_]], 0, killed [[COPY6]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_2:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[S_MOV_B32_]], 0, killed [[COPY6]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; DAGISEL-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-NEXT:   [[S_MOV_B32_2:%[0-9]+]]:sreg_32 = S_MOV_B32 3
-  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_3:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_2]], 0, killed [[COPY7]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; DAGISEL-NEXT:   [[V_CNDMASK_B32_e64_3:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, killed [[S_MOV_B32_2]], 0, killed [[COPY7]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; DAGISEL-NEXT:   [[V_MOV_B32_dpp:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_1]], killed [[V_CNDMASK_B32_e64_3]], 1, 1, 1, 0, implicit $exec
   ; DAGISEL-NEXT:   [[V_MOV_B32_dpp1:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_]], killed [[V_CNDMASK_B32_e64_2]], 1, 1, 1, 0, implicit $exec
   ; DAGISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_dpp]]
   ; DAGISEL-NEXT:   $vgpr1 = COPY [[V_MOV_B32_dpp1]]
   ; DAGISEL-NEXT:   [[DEF4:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
-  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0, implicit $vgpr1
+  ; DAGISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN killed [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $vgpr0, implicit $vgpr1
   ;
   ; GISEL-LABEL: name: ret_64
   ; GISEL: bb.1 (%ir-block.0):
@@ -168,20 +168,20 @@ define amdgpu_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
   ; GISEL-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
   ; GISEL-NEXT:   [[COPY2:%[0-9]+]]:vgpr_32 = COPY $vgpr2
   ; GISEL-NEXT:   [[COPY3:%[0-9]+]]:vgpr_32 = COPY $vgpr3
-  ; GISEL-NEXT:   [[SI_SETUP_WHOLE_WAVE_FUNC:%[0-9]+]]:sreg_32_xm0_xexec = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+  ; GISEL-NEXT:   [[SI_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:sreg_32_xm0_xexec = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
   ; GISEL-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 5, implicit $exec
   ; GISEL-NEXT:   [[V_MOV_B32_e32_1:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
-  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_]], 0, [[COPY]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
-  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_1]], 0, [[COPY1]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_]], 0, [[COPY]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_1:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_1]], 0, [[COPY1]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; GISEL-NEXT:   [[V_MOV_B32_e32_2:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 3, implicit $exec
   ; GISEL-NEXT:   [[V_MOV_B32_e32_3:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
-  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_2:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_2]], 0, [[COPY2]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
-  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_3:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_3]], 0, [[COPY3]], [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_2:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_2]], 0, [[COPY2]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
+  ; GISEL-NEXT:   [[V_CNDMASK_B32_e64_3:%[0-9]+]]:vgpr_32 = V_CNDMASK_B32_e64 0, [[V_MOV_B32_e32_3]], 0, [[COPY3]], [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $exec
   ; GISEL-NEXT:   [[V_MOV_B32_dpp:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_]], [[V_CNDMASK_B32_e64_2]], 1, 1, 1, 0, implicit $exec
   ; GISEL-NEXT:   [[V_MOV_B32_dpp1:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_CNDMASK_B32_e64_1]], [[V_CNDMASK_B32_e64_3]], 1, 1, 1, 0, implicit $exec
   ; GISEL-NEXT:   $vgpr0 = COPY [[V_MOV_B32_dpp]]
   ; GISEL-NEXT:   $vgpr1 = COPY [[V_MOV_B32_dpp1]]
-  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_SETUP_WHOLE_WAVE_FUNC]], implicit $vgpr0, implicit $vgpr1
+  ; GISEL-NEXT:   SI_WHOLE_WAVE_FUNC_RETURN [[SI_WHOLE_WAVE_FUNC_SETUP]], implicit $vgpr0, implicit $vgpr1
   %x = select i1 %active, i64 %a, i64 5
   %y = select i1 %active, i64 %b, i64 3
   %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false)
diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir b/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
index a5a35c40b719c..5d6906bacf336 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
@@ -34,7 +34,7 @@ body:             |
     ; CHECK-NEXT: $vgpr0 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit $vgpr0(tied-def 0) :: (load (s32) from %stack.0, addrspace 5)
     ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr0
     ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
-    renamable $sgpr0 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    renamable $sgpr0 = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
     $vgpr0 = V_MOV_B32_e32 14, implicit $exec
     SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
 
@@ -70,7 +70,7 @@ body:             |
     ; CHECK-NEXT: $vgpr40 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.0, addrspace 5)
     ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr0
     ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0
-    renamable $sgpr0 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    renamable $sgpr0 = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
     $vgpr40 = V_MOV_B32_e32 14, implicit $exec
     SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0
 
@@ -112,7 +112,7 @@ body:             |
     ; CHECK-NEXT: $exec_lo = S_MOV_B32 $vcc_lo
     ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
     $vgpr192 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr192
-    renamable $vcc_lo = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    renamable $vcc_lo = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
     $sgpr20 = S_MOV_B32 14, implicit $exec
     $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr192, 0
     SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
@@ -153,7 +153,7 @@ body:             |
     ; CHECK-NEXT: $exec_lo = S_MOV_B32 $vcc_lo
     ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
     $vgpr191 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr191
-    renamable $vcc_lo = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    renamable $vcc_lo = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
     $sgpr20 = S_MOV_B32 14, implicit $exec
     $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr191, 0
     SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
@@ -209,7 +209,7 @@ body:             |
     ; CHECK-NEXT: $exec_lo = S_MOV_B32 $vcc_lo
     ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $vcc_lo
     $vgpr191 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr191
-    renamable $vcc_lo = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    renamable $vcc_lo = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
     S_NOP 0, implicit-def $vgpr40, implicit-def $sgpr20
     S_NOP 0, implicit-def $vgpr49, implicit-def $sgpr40
     $sgpr20 = SI_RESTORE_S32_FROM_VGPR $vgpr191, 0
@@ -267,7 +267,7 @@ body:             |
     ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr3
     ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr3
     $vgpr191 = SI_SPILL_S32_TO_VGPR killed $sgpr20, 0, $vgpr191
-    renamable $vcc_lo = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    renamable $vcc_lo = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
     S_NOP 0, implicit-def $vgpr40, implicit-def $sgpr20
     $sgpr3 = COPY $vcc_lo
     S_NOP 0, implicit-def $vgpr49, implicit-def $sgpr40
@@ -323,7 +323,7 @@ body:             |
     ; CHECK-NEXT: $vgpr5 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 16, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.4, addrspace 5)
     ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr0
     ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
-    renamable $sgpr0 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    renamable $sgpr0 = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
     $vgpr0 = V_MOV_B32_e32 14, implicit $exec
     S_NOP 0, implicit-def $vgpr2_vgpr3_vgpr4_vgpr5, implicit-def $vgpr40_vgpr41_vgpr42
     SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
@@ -364,7 +364,7 @@ body:             |
     ; CHECK-NEXT: S_NOP 0, implicit $vgpr0, implicit $vgpr20, implicit $vgpr40
     ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr0
     ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
-    renamable $sgpr0 = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    renamable $sgpr0 = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
     S_NOP 0, implicit $vgpr0, implicit $vgpr20, implicit $vgpr40
     SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0
 
@@ -427,7 +427,7 @@ body:             |
     successors: %bb.1, %bb.2
     liveins: $vgpr0, $vgpr1
 
-    renamable $vcc_lo = SI_SETUP_WHOLE_WAVE_FUNC implicit-def dead $exec, implicit $exec
+    renamable $vcc_lo = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
     $sgpr1 = S_MOV_B32 $exec_lo
     V_CMPX_EQ_U32_nosdst_e64 $vgpr0, $vgpr1, implicit-def $exec, implicit $exec
     S_CBRANCH_EXECZ %bb.2, implicit $exec

>From 04826ebe7475a374db2120a528b450ee558140c7 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 25 Mar 2025 13:59:15 +0100
Subject: [PATCH 10/24] Rename CC

---
 llvm/include/llvm/AsmParser/LLToken.h              |  2 +-
 llvm/include/llvm/IR/CallingConv.h                 |  2 +-
 llvm/lib/AsmParser/LLLexer.cpp                     |  2 +-
 llvm/lib/AsmParser/LLParser.cpp                    |  4 ++--
 llvm/lib/IR/AsmWriter.cpp                          |  4 ++--
 llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp      |  6 +++---
 llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp      |  4 ++--
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp          |  2 +-
 llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp   |  6 ++++--
 llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp          |  4 ++--
 llvm/lib/Target/AMDGPU/Utils/AMDGPUPALMetadata.cpp |  2 +-
 .../AMDGPU/irtranslator-whole-wave-functions.ll    |  8 ++++----
 .../CodeGen/AMDGPU/isel-whole-wave-functions.ll    |  8 ++++----
 llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll   | 14 +++++++-------
 14 files changed, 35 insertions(+), 33 deletions(-)

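For reference, after this rename a whole wave function is declared with the
amdgpu_gfx_whole_wave calling convention and still takes i1 %active as its
first argument. A minimal sketch (the function name is illustrative; the body
follows the basic_test case updated below):

  define amdgpu_gfx_whole_wave i32 @example(i1 %active, i32 %a, i32 %b) {
    ; %active is true in the lanes that were active at the call site, so the
    ; selects pick the constants in the remaining lanes.
    %x = select i1 %active, i32 %a, i32 5
    %y = select i1 %active, i32 %b, i32 3
    %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false)
    ret i32 %ret
  }
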
diff --git a/llvm/include/llvm/AsmParser/LLToken.h b/llvm/include/llvm/AsmParser/LLToken.h
index 2b23225471944..a2311d2ac285d 100644
--- a/llvm/include/llvm/AsmParser/LLToken.h
+++ b/llvm/include/llvm/AsmParser/LLToken.h
@@ -181,7 +181,7 @@ enum Kind {
   kw_amdgpu_cs_chain_preserve,
   kw_amdgpu_kernel,
   kw_amdgpu_gfx,
-  kw_amdgpu_whole_wave,
+  kw_amdgpu_gfx_whole_wave,
   kw_tailcc,
   kw_m68k_rtdcc,
   kw_graalcc,
diff --git a/llvm/include/llvm/IR/CallingConv.h b/llvm/include/llvm/IR/CallingConv.h
index 417057fc1112e..5d2ff86d60497 100644
--- a/llvm/include/llvm/IR/CallingConv.h
+++ b/llvm/include/llvm/IR/CallingConv.h
@@ -285,7 +285,7 @@ namespace CallingConv {
     RISCV_VLSCall_65536 = 123,
 
     // Calling convention for AMDGPU whole wave functions.
-    AMDGPU_WholeWave = 124,
+    AMDGPU_Gfx_WholeWave = 124,
 
     /// The highest possible ID. Must be some 2^k - 1.
     MaxID = 1023
diff --git a/llvm/lib/AsmParser/LLLexer.cpp b/llvm/lib/AsmParser/LLLexer.cpp
index 158aa1d333c15..520c6a00a9c07 100644
--- a/llvm/lib/AsmParser/LLLexer.cpp
+++ b/llvm/lib/AsmParser/LLLexer.cpp
@@ -679,7 +679,7 @@ lltok::Kind LLLexer::LexIdentifier() {
   KEYWORD(amdgpu_cs_chain_preserve);
   KEYWORD(amdgpu_kernel);
   KEYWORD(amdgpu_gfx);
-  KEYWORD(amdgpu_whole_wave);
+  KEYWORD(amdgpu_gfx_whole_wave);
   KEYWORD(tailcc);
   KEYWORD(m68k_rtdcc);
   KEYWORD(graalcc);
diff --git a/llvm/lib/AsmParser/LLParser.cpp b/llvm/lib/AsmParser/LLParser.cpp
index a2866c551f8fc..b09696497cc4e 100644
--- a/llvm/lib/AsmParser/LLParser.cpp
+++ b/llvm/lib/AsmParser/LLParser.cpp
@@ -2274,8 +2274,8 @@ bool LLParser::parseOptionalCallingConv(unsigned &CC) {
     CC = CallingConv::AMDGPU_CS_ChainPreserve;
     break;
   case lltok::kw_amdgpu_kernel:  CC = CallingConv::AMDGPU_KERNEL; break;
-  case lltok::kw_amdgpu_whole_wave:
-    CC = CallingConv::AMDGPU_WholeWave;
+  case lltok::kw_amdgpu_gfx_whole_wave:
+    CC = CallingConv::AMDGPU_Gfx_WholeWave;
     break;
   case lltok::kw_tailcc:         CC = CallingConv::Tail; break;
   case lltok::kw_m68k_rtdcc:     CC = CallingConv::M68k_RTD; break;
diff --git a/llvm/lib/IR/AsmWriter.cpp b/llvm/lib/IR/AsmWriter.cpp
index 5a9083d8bf888..3ce892ecbff19 100644
--- a/llvm/lib/IR/AsmWriter.cpp
+++ b/llvm/lib/IR/AsmWriter.cpp
@@ -404,8 +404,8 @@ static void PrintCallingConv(unsigned cc, raw_ostream &Out) {
     break;
   case CallingConv::AMDGPU_KERNEL: Out << "amdgpu_kernel"; break;
   case CallingConv::AMDGPU_Gfx:    Out << "amdgpu_gfx"; break;
-  case CallingConv::AMDGPU_WholeWave:
-    Out << "amdgpu_whole_wave";
+  case CallingConv::AMDGPU_Gfx_WholeWave:
+    Out << "amdgpu_gfx_whole_wave";
     break;
   case CallingConv::M68k_RTD:      Out << "m68k_rtdcc"; break;
   case CallingConv::RISCV_VectorCall:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
index 47d9aab81cb35..33bb11c8ce015 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
@@ -1367,8 +1367,8 @@ bool AMDGPUCallLowering::lowerTailCall(
   SmallVector<std::pair<MCRegister, Register>, 12> ImplicitArgRegs;
 
   if (Info.CallConv != CallingConv::AMDGPU_Gfx &&
-      !AMDGPU::isChainCC(Info.CallConv) &&
-      Info.CallConv != CallingConv::AMDGPU_WholeWave) {
+      Info.CallConv != CallingConv::AMDGPU_Gfx_WholeWave &&
+      !AMDGPU::isChainCC(Info.CallConv)) {
     // With a fixed ABI, allocate fixed registers before user arguments.
     if (!passSpecialInputs(MIRBuilder, CCInfo, ImplicitArgRegs, Info))
       return false;
@@ -1546,7 +1546,7 @@ bool AMDGPUCallLowering::lowerCall(MachineIRBuilder &MIRBuilder,
   SmallVector<std::pair<MCRegister, Register>, 12> ImplicitArgRegs;
 
   if (Info.CallConv != CallingConv::AMDGPU_Gfx &&
-      Info.CallConv != CallingConv::AMDGPU_WholeWave) {
+      Info.CallConv != CallingConv::AMDGPU_Gfx_WholeWave) {
     // With a fixed ABI, allocate fixed registers before user arguments.
     if (!passSpecialInputs(MIRBuilder, CCInfo, ImplicitArgRegs, Info))
       return false;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 2eb061e56b45c..0421ed87e61f4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -1138,7 +1138,7 @@ CCAssignFn *AMDGPUCallLowering::CCAssignFnForCall(CallingConv::ID CC,
   case CallingConv::Cold:
     return CC_AMDGPU_Func;
   case CallingConv::AMDGPU_Gfx:
-  case CallingConv::AMDGPU_WholeWave:
+  case CallingConv::AMDGPU_Gfx_WholeWave:
     return CC_SI_Gfx;
   case CallingConv::AMDGPU_KERNEL:
   case CallingConv::SPIR_KERNEL:
@@ -1164,7 +1164,7 @@ CCAssignFn *AMDGPUCallLowering::CCAssignFnForReturn(CallingConv::ID CC,
   case CallingConv::AMDGPU_LS:
     return RetCC_SI_Shader;
   case CallingConv::AMDGPU_Gfx:
-  case CallingConv::AMDGPU_WholeWave:
+  case CallingConv::AMDGPU_Gfx_WholeWave:
     return RetCC_SI_Gfx;
   case CallingConv::C:
   case CallingConv::Fast:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 9be50c89ecff2..adf80108689ea 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -3856,7 +3856,7 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
   CCAssignFn *AssignFn = CCAssignFnForCall(CallConv, IsVarArg);
 
   if (CallConv != CallingConv::AMDGPU_Gfx && !AMDGPU::isChainCC(CallConv) &&
-      CallConv != CallingConv::AMDGPU_WholeWave) {
+      CallConv != CallingConv::AMDGPU_Gfx_WholeWave) {
     // With a fixed ABI, allocate fixed registers before user arguments.
     passSpecialInputs(CLI, CCInfo, *Info, RegsToPass, MemOpChains, Chain);
   }
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index 8ffe3a70041eb..603c3e1e30e3a 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -42,7 +42,8 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
       PrivateSegmentWaveByteOffset(false), WorkItemIDX(false),
       WorkItemIDY(false), WorkItemIDZ(false), ImplicitArgPtr(false),
       GITPtrHigh(0xffffffff), HighBitsOf32BitAddress(0),
-      IsWholeWaveFunction(F.getCallingConv() == CallingConv::AMDGPU_WholeWave) {
+      IsWholeWaveFunction(F.getCallingConv() ==
+                          CallingConv::AMDGPU_Gfx_WholeWave) {
   const GCNSubtarget &ST = *static_cast<const GCNSubtarget *>(STI);
   FlatWorkGroupSizes = ST.getFlatWorkGroupSizes(F);
   WavesPerEU = ST.getWavesPerEU(F);
@@ -90,7 +91,8 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
 
     ImplicitArgPtr = false;
   } else if (!isEntryFunction()) {
-    if (CC != CallingConv::AMDGPU_Gfx && CC != CallingConv::AMDGPU_WholeWave)
+    if (CC != CallingConv::AMDGPU_Gfx &&
+        CC != CallingConv::AMDGPU_Gfx_WholeWave)
       ArgInfo = AMDGPUArgumentUsageInfo::FixedABIFunctionInfo;
 
     FrameOffsetReg = AMDGPU::SGPR33;
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index 02488426df369..3800178a79025 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -407,7 +407,7 @@ const MCPhysReg *SIRegisterInfo::getCalleeSavedRegs(
     return ST.hasGFX90AInsts() ? CSR_AMDGPU_GFX90AInsts_SaveList
                                : CSR_AMDGPU_SaveList;
   case CallingConv::AMDGPU_Gfx:
-  case CallingConv::AMDGPU_WholeWave:
+  case CallingConv::AMDGPU_Gfx_WholeWave:
     return ST.hasGFX90AInsts() ? CSR_AMDGPU_SI_Gfx_GFX90AInsts_SaveList
                                : CSR_AMDGPU_SI_Gfx_SaveList;
   case CallingConv::AMDGPU_CS_ChainPreserve:
@@ -434,7 +434,7 @@ const uint32_t *SIRegisterInfo::getCallPreservedMask(const MachineFunction &MF,
     return ST.hasGFX90AInsts() ? CSR_AMDGPU_GFX90AInsts_RegMask
                                : CSR_AMDGPU_RegMask;
   case CallingConv::AMDGPU_Gfx:
-  case CallingConv::AMDGPU_WholeWave:
+  case CallingConv::AMDGPU_Gfx_WholeWave:
     return ST.hasGFX90AInsts() ? CSR_AMDGPU_SI_Gfx_GFX90AInsts_RegMask
                                : CSR_AMDGPU_SI_Gfx_RegMask;
   case CallingConv::AMDGPU_CS_Chain:
diff --git a/llvm/lib/Target/AMDGPU/Utils/AMDGPUPALMetadata.cpp b/llvm/lib/Target/AMDGPU/Utils/AMDGPUPALMetadata.cpp
index f622c367ae204..fdc43878543e0 100644
--- a/llvm/lib/Target/AMDGPU/Utils/AMDGPUPALMetadata.cpp
+++ b/llvm/lib/Target/AMDGPU/Utils/AMDGPUPALMetadata.cpp
@@ -44,7 +44,7 @@ static const char *getStageName(CallingConv::ID CC) {
   case CallingConv::AMDGPU_LS:
     return ".ls";
   case CallingConv::AMDGPU_Gfx:
-  case CallingConv::AMDGPU_WholeWave:
+  case CallingConv::AMDGPU_Gfx_WholeWave:
     llvm_unreachable("Callable shader has no hardware stage");
   default:
     return ".cs";
diff --git a/llvm/test/CodeGen/AMDGPU/irtranslator-whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/irtranslator-whole-wave-functions.ll
index f18d8128a91ff..b68786b579dd2 100644
--- a/llvm/test/CodeGen/AMDGPU/irtranslator-whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/irtranslator-whole-wave-functions.ll
@@ -1,7 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -stop-after=irtranslator -verify-machineinstrs < %s | FileCheck %s
 
-define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
   ; CHECK-LABEL: name: basic_test
   ; CHECK: bb.1 (%ir-block.0):
   ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1
@@ -23,7 +23,7 @@ define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
 }
 
 ; Make sure we don't crash if %active is not used at all.
-define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
   ; CHECK-LABEL: name: unused_active
   ; CHECK: bb.1 (%ir-block.0):
   ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1
@@ -37,7 +37,7 @@ define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
   ret i32 14
 }
 
-define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
   ; CHECK-LABEL: name: multiple_blocks
   ; CHECK: bb.1 (%ir-block.0):
   ; CHECK-NEXT:   successors: %bb.2(0x40000000), %bb.3(0x40000000)
@@ -75,7 +75,7 @@ if.end:
   ret i32 %e
 }
 
-define amdgpu_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
+define amdgpu_gfx_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
   ; CHECK-LABEL: name: ret_64
   ; CHECK: bb.1 (%ir-block.0):
   ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
diff --git a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
index 851dc5107a8a1..0bd87f493f1ac 100644
--- a/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/isel-whole-wave-functions.ll
@@ -2,7 +2,7 @@
 ; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -stop-after=finalize-isel -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL %s
 ; RUN: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -stop-after=finalize-isel -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL %s
 
-define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-LABEL: name: basic_test
   ; DAGISEL: bb.0 (%ir-block.0):
   ; DAGISEL-NEXT:   liveins: $vgpr0, $vgpr1
@@ -42,7 +42,7 @@ define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
 }
 
 ; Make sure we don't crash if %active is not used at all.
-define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-LABEL: name: unused_active
   ; DAGISEL: bb.0 (%ir-block.0):
   ; DAGISEL-NEXT:   [[SI_WHOLE_WAVE_FUNC_SETUP:%[0-9]+]]:sreg_32 = SI_WHOLE_WAVE_FUNC_SETUP implicit-def dead $exec, implicit $exec
@@ -62,7 +62,7 @@ define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
   ret i32 14
 }
 
-define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
   ; DAGISEL-LABEL: name: multiple_blocks
   ; DAGISEL: bb.0 (%ir-block.0):
   ; DAGISEL-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
@@ -126,7 +126,7 @@ if.end:
   ret i32 %e
 }
 
-define amdgpu_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
+define amdgpu_gfx_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
   ; DAGISEL-LABEL: name: ret_64
   ; DAGISEL: bb.0 (%ir-block.0):
   ; DAGISEL-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
index 715244d39765f..039d68befe299 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
@@ -6,7 +6,7 @@
 ; The EXEC mask should be set to -1 for the duration of the function
 ; and restored to its original value in the epilogue.
 ; We will also need to restore the inactive lanes for any allocated VGPRs.
-define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: basic_test:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -61,7 +61,7 @@ define amdgpu_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
 }
 
 ; Make sure we don't crash if %active is not used at all.
-define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: unused_active:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -102,7 +102,7 @@ define amdgpu_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
 
 ; For any used VGPRs (including those used for SGPR spills), we need to restore the inactive lanes.
 ; For CSR VGPRs, we need to restore all lanes.
-define amdgpu_whole_wave i32 @csr(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave i32 @csr(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: csr:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -189,7 +189,7 @@ define amdgpu_whole_wave i32 @csr(i1 %active, i32 %a, i32 %b) {
 }
 
 ; Save and restore all lanes of v40.
-define amdgpu_whole_wave void @csr_vgpr_only(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave void @csr_vgpr_only(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: csr_vgpr_only:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -229,7 +229,7 @@ define amdgpu_whole_wave void @csr_vgpr_only(i1 %active, i32 %a, i32 %b) {
   ret void
 }
 
-define amdgpu_whole_wave void @sgpr_spill_only(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave void @sgpr_spill_only(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: sgpr_spill_only:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -279,7 +279,7 @@ define amdgpu_whole_wave void @sgpr_spill_only(i1 %active, i32 %a, i32 %b) {
   ret void
 }
 
-define amdgpu_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
+define amdgpu_gfx_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: multiple_blocks:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -352,7 +352,7 @@ if.end:
   ret i32 %e
 }
 
-define amdgpu_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
+define amdgpu_gfx_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
 ; DAGISEL-LABEL: ret_64:
 ; DAGISEL:       ; %bb.0:
 ; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0

>From f15dfec725846767e7b51cfcc8d897d3611f20a1 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 25 Mar 2025 15:00:34 +0100
Subject: [PATCH 11/24] Fix formatting

---
 llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index 702e090b5c93e..a79b1cebc707c 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -1714,7 +1714,8 @@ void SIFrameLowering::determineCalleeSaves(MachineFunction &MF,
   if (MFI->isWholeWaveFunction()) {
     // In practice, all the VGPRs are WWM registers, and we will need to save at
     // least their inactive lanes. Add them to WWMReservedRegs.
-    assert(!NeedExecCopyReservedReg && "Whole wave functions can use the reg mapped for their i1 argument");
+    assert(!NeedExecCopyReservedReg &&
+           "Whole wave functions can use the reg mapped for their i1 argument");
     for (MCRegister Reg : AMDGPU::VGPR_32RegClass)
       if (MF.getRegInfo().isPhysRegModified(Reg)) {
         MFI->reserveWWMRegister(Reg);

>From 101696b24e964f94cabdfbd9f1ba6e64a8b00358 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 6 May 2025 16:25:41 +0200
Subject: [PATCH 12/24] Update tests after merge

---
 llvm/test/CodeGen/MIR/AMDGPU/long-branch-reg-all-sgpr-used.ll | 2 ++
 .../CodeGen/MIR/AMDGPU/machine-function-info-after-pei.ll     | 1 +
 .../MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll | 1 +
 .../MIR/AMDGPU/machine-function-info-long-branch-reg.ll       | 1 +
 llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir  | 4 ++++
 llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll         | 4 ++++
 6 files changed, 13 insertions(+)

diff --git a/llvm/test/CodeGen/MIR/AMDGPU/long-branch-reg-all-sgpr-used.ll b/llvm/test/CodeGen/MIR/AMDGPU/long-branch-reg-all-sgpr-used.ll
index b514c49394d21..278cf0150c2f7 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/long-branch-reg-all-sgpr-used.ll
+++ b/llvm/test/CodeGen/MIR/AMDGPU/long-branch-reg-all-sgpr-used.ll
@@ -46,6 +46,7 @@
 ; CHECK-NEXT:   hasInitWholeWave: false
 ; CHECK-NEXT:   dynamicVGPRBlockSize: 0
 ; CHECK-NEXT:   scratchReservedForDynamicVGPRs: 0
+; CHECK-NEXT:   isWholeWaveFunction: false
 ; CHECK-NEXT: body:
   define amdgpu_kernel void @long_branch_used_all_sgprs(ptr addrspace(1) %arg, i32 %cnd) #0 {
   entry:
@@ -315,6 +316,7 @@
 ; CHECK-NEXT:   hasInitWholeWave: false
 ; CHECK-NEXT:   dynamicVGPRBlockSize: 0
 ; CHECK-NEXT:   scratchReservedForDynamicVGPRs: 0
+; CHECK-NEXT:   isWholeWaveFunction: false
 ; CHECK-NEXT: body:
   define amdgpu_kernel void @long_branch_high_num_sgprs_used(ptr addrspace(1) %arg, i32 %cnd) #0 {
   entry:
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-after-pei.ll b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-after-pei.ll
index fc730f9e88454..890ea44081ce7 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-after-pei.ll
+++ b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-after-pei.ll
@@ -46,6 +46,7 @@
 ; AFTER-PEI-NEXT: hasInitWholeWave: false
 ; AFTER-PEI-NEXT: dynamicVGPRBlockSize: 0
 ; AFTER-PEI-NEXT: scratchReservedForDynamicVGPRs: 0
+; AFTER-PEI-NEXT: isWholeWaveFunction: false
 ; AFTER-PEI-NEXT: body:
 define amdgpu_kernel void @scavenge_fi(ptr addrspace(1) %out, i32 %in) #0 {
   %wide.sgpr0 = call <32 x i32>  asm sideeffect "; def $0", "=s" () #0
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll
index 5adef1433079d..f84ef8a3844dd 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll
+++ b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll
@@ -46,6 +46,7 @@
 ; CHECK-NEXT: hasInitWholeWave: false
 ; CHECK-NEXT: dynamicVGPRBlockSize: 0
 ; CHECK-NEXT: scratchReservedForDynamicVGPRs: 0
+; CHECK-NEXT: isWholeWaveFunction: false
 ; CHECK-NEXT: body:
   define amdgpu_kernel void @uniform_long_forward_branch_debug(ptr addrspace(1) %arg, i32 %arg1) #0 !dbg !5 {
   bb0:
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg.ll b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg.ll
index fa40164aa02f0..cc834d017c149 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg.ll
+++ b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg.ll
@@ -46,6 +46,7 @@
 ; CHECK-NEXT: hasInitWholeWave: false
 ; CHECK-NEXT: dynamicVGPRBlockSize: 0
 ; CHECK-NEXT: scratchReservedForDynamicVGPRs: 0
+; CHECK-NEXT: isWholeWaveFunction: false
 ; CHECK-NEXT: body:
 define amdgpu_kernel void @uniform_long_forward_branch(ptr addrspace(1) %arg, i32 %arg1) #0 {
 bb0:
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir
index 24565e4423d04..06c580ec6f6b4 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir
+++ b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir
@@ -55,6 +55,7 @@
 # FULL-NEXT:  hasInitWholeWave: false
 # FULL-NEXT: dynamicVGPRBlockSize: 0
 # FULL-NEXT: scratchReservedForDynamicVGPRs: 0
+# FULL-NEXT: isWholeWaveFunction: false
 # FULL-NEXT: body:
 
 # SIMPLE: machineFunctionInfo:
@@ -162,6 +163,7 @@ body:             |
 # FULL-NEXT: hasInitWholeWave: false
 # FULL-NEXT: dynamicVGPRBlockSize: 0
 # FULL-NEXT: scratchReservedForDynamicVGPRs: 0
+# FULL-NEXT: isWholeWaveFunction: false
 # FULL-NEXT: body:
 
 # SIMPLE: machineFunctionInfo:
@@ -240,6 +242,7 @@ body:             |
 # FULL-NEXT: hasInitWholeWave: false
 # FULL-NEXT: dynamicVGPRBlockSize: 0
 # FULL-NEXT: scratchReservedForDynamicVGPRs: 0
+# FULL-NEXT: isWholeWaveFunction: false
 # FULL-NEXT: body:
 
 # SIMPLE: machineFunctionInfo:
@@ -319,6 +322,7 @@ body:             |
 # FULL-NEXT: hasInitWholeWave: false
 # FULL-NEXT: dynamicVGPRBlockSize: 0
 # FULL-NEXT: scratchReservedForDynamicVGPRs: 0
+# FULL-NEXT: isWholeWaveFunction: false
 # FULL-NEXT: body:
 
 # SIMPLE: machineFunctionInfo:
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll
index a15271382f37d..427154651a381 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll
+++ b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll
@@ -56,6 +56,7 @@
 ; CHECK-NEXT: hasInitWholeWave: false
 ; CHECK-NEXT: dynamicVGPRBlockSize: 0
 ; CHECK-NEXT: scratchReservedForDynamicVGPRs: 0
+; CHECK-NEXT: isWholeWaveFunction: false
 ; CHECK-NEXT: body:
 define amdgpu_kernel void @kernel(i32 %arg0, i64 %arg1, <16 x i32> %arg2) {
   %gep = getelementptr inbounds [512 x float], ptr addrspace(3) @lds, i32 0, i32 %arg0
@@ -105,6 +106,7 @@ define amdgpu_kernel void @kernel(i32 %arg0, i64 %arg1, <16 x i32> %arg2) {
 ; CHECK-NEXT: hasInitWholeWave: false
 ; CHECK-NEXT: dynamicVGPRBlockSize: 0
 ; CHECK-NEXT: scratchReservedForDynamicVGPRs: 0
+; CHECK-NEXT: isWholeWaveFunction: false
 ; CHECK-NEXT: body:
 define amdgpu_ps void @ps_shader(i32 %arg0, i32 inreg %arg1) {
   %gep = getelementptr inbounds [128 x i32], ptr addrspace(2) @gds, i32 0, i32 %arg0
@@ -178,6 +180,7 @@ define amdgpu_ps void @gds_size_shader(i32 %arg0, i32 inreg %arg1) #5 {
 ; CHECK-NEXT: hasInitWholeWave: false
 ; CHECK-NEXT: dynamicVGPRBlockSize: 0
 ; CHECK-NEXT: scratchReservedForDynamicVGPRs: 0
+; CHECK-NEXT: isWholeWaveFunction: false
 ; CHECK-NEXT: body:
 define void @function() {
   ret void
@@ -233,6 +236,7 @@ define void @function() {
 ; CHECK-NEXT: hasInitWholeWave: false
 ; CHECK-NEXT: dynamicVGPRBlockSize: 0
 ; CHECK-NEXT: scratchReservedForDynamicVGPRs: 0
+; CHECK-NEXT: isWholeWaveFunction: false
 ; CHECK-NEXT: body:
 define void @function_nsz() #0 {
   ret void

>From 69326a6599cde13597ea96012bf498e73dcbe1cf Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 6 May 2025 16:26:01 +0200
Subject: [PATCH 13/24] Fix bug in testcase

---
 llvm/lib/Target/AMDGPU/SIFrameLowering.cpp            | 5 ++++-
 llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir | 2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index a79b1cebc707c..28bb290c103f1 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -1028,7 +1028,10 @@ void SIFrameLowering::emitCSRSpillStores(
     // it now. If we have already saved some WWM CSR registers, then the EXEC is
     // already -1 and we don't need to do anything else. Otherwise, set EXEC to
     // -1 here.
-    if (WWMCalleeSavedRegs.empty())
+    if (!ScratchExecCopy)
+      buildScratchExecCopy(LiveUnits, MF, MBB, MBBI, DL, /*IsProlog*/ true,
+                           /*EnableInactiveLanes*/ true);
+    else if (WWMCalleeSavedRegs.empty())
       EnableAllLanes();
     TII->getWholeWaveFunctionSetup(MF)->eraseFromParent();
   } else if (ScratchExecCopy) {
diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir b/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
index 5d6906bacf336..93f489170cea0 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions-pei.mir
@@ -360,7 +360,7 @@ body:             |
     ; CHECK-LABEL: name: dont_restore_used_vgprs
     ; CHECK: liveins: $vgpr0, $vgpr20, $vgpr40
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: $exec_lo = S_MOV_B32 -1
+    ; CHECK-NEXT: $sgpr0 = S_XOR_SAVEEXEC_B32 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
     ; CHECK-NEXT: S_NOP 0, implicit $vgpr0, implicit $vgpr20, implicit $vgpr40
     ; CHECK-NEXT: $exec_lo = S_MOV_B32 $sgpr0
     ; CHECK-NEXT: SI_WHOLE_WAVE_FUNC_RETURN killed renamable $sgpr0, implicit killed $vgpr0

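For reference, this is the prologue/epilogue shape the whole wave tests
expect, pieced together from the CHECK lines above (register choices vary
per function; this is a sketch, not verbatim output):

  s_xor_saveexec_b32 s0, -1   ; save the original EXEC, enable the inactive lanes
  ; ... spill the inactive lanes of any VGPRs the function uses ...
  s_mov_b32 exec_lo, -1       ; the body runs with all lanes enabled
  ; ... function body ...
  s_xor_b32 exec_lo, s0, -1   ; flip back to the inactive lanes
  ; ... reload the inactive lanes ...
  s_mov_b32 exec_lo, s0       ; restore the original EXEC mask
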
>From f02121695f6ed10286d55e317a9747eed937a2cc Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Mon, 19 May 2025 14:04:21 +0200
Subject: [PATCH 14/24] Test inreg args

---
 .../CodeGen/AMDGPU/whole-wave-functions.ll    | 83 +++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
index 039d68befe299..fe4f67b5daa1b 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
@@ -417,3 +417,86 @@ define amdgpu_gfx_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
   %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false)
   ret i64 %ret
 }
+
+define amdgpu_gfx_whole_wave void @inreg_args(i1 %active, i32 inreg %i32, <4 x i32> inreg %v4i32, float inreg %float, ptr addrspace(5) inreg %ptr, ptr addrspace(5) inreg %ptr2) {
+; DAGISEL-LABEL: inreg_args:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_xor_saveexec_b32 s0, -1
+; DAGISEL-NEXT:    s_clause 0x5
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32
+; DAGISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; DAGISEL-NEXT:    scratch_store_b32 off, v2, s32 offset:8
+; DAGISEL-NEXT:    scratch_store_b32 off, v3, s32 offset:12
+; DAGISEL-NEXT:    scratch_store_b32 off, v4, s32 offset:16
+; DAGISEL-NEXT:    scratch_store_b32 off, v5, s32 offset:20
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
+; DAGISEL-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_mov_b32 v5, s9
+; DAGISEL-NEXT:    v_dual_mov_b32 v0, s5 :: v_dual_mov_b32 v1, s6
+; DAGISEL-NEXT:    v_dual_mov_b32 v2, s7 :: v_dual_mov_b32 v3, s8
+; DAGISEL-NEXT:    scratch_store_b32 off, v4, s10
+; DAGISEL-NEXT:    s_clause 0x1
+; DAGISEL-NEXT:    scratch_store_b128 off, v[0:3], s11
+; DAGISEL-NEXT:    scratch_store_b32 off, v5, s11
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, s0, -1
+; DAGISEL-NEXT:    s_clause 0x5
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32
+; DAGISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; DAGISEL-NEXT:    scratch_load_b32 v2, off, s32 offset:8
+; DAGISEL-NEXT:    scratch_load_b32 v3, off, s32 offset:12
+; DAGISEL-NEXT:    scratch_load_b32 v4, off, s32 offset:16
+; DAGISEL-NEXT:    scratch_load_b32 v5, off, s32 offset:20
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, s0
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-LABEL: inreg_args:
+; GISEL:       ; %bb.0:
+; GISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-NEXT:    s_wait_expcnt 0x0
+; GISEL-NEXT:    s_wait_samplecnt 0x0
+; GISEL-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-NEXT:    s_wait_kmcnt 0x0
+; GISEL-NEXT:    s_xor_saveexec_b32 s34, -1
+; GISEL-NEXT:    s_clause 0x5
+; GISEL-NEXT:    scratch_store_b32 off, v0, s32
+; GISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; GISEL-NEXT:    scratch_store_b32 off, v2, s32 offset:8
+; GISEL-NEXT:    scratch_store_b32 off, v3, s32 offset:12
+; GISEL-NEXT:    scratch_store_b32 off, v4, s32 offset:16
+; GISEL-NEXT:    scratch_store_b32 off, v5, s32 offset:20
+; GISEL-NEXT:    s_mov_b32 exec_lo, -1
+; GISEL-NEXT:    s_mov_b32 s0, s5
+; GISEL-NEXT:    s_mov_b32 s1, s6
+; GISEL-NEXT:    s_mov_b32 s2, s7
+; GISEL-NEXT:    s_mov_b32 s3, s8
+; GISEL-NEXT:    v_mov_b32_e32 v4, s4
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v3, s3
+; GISEL-NEXT:    v_dual_mov_b32 v1, s1 :: v_dual_mov_b32 v2, s2
+; GISEL-NEXT:    v_mov_b32_e32 v5, s9
+; GISEL-NEXT:    scratch_store_b32 off, v4, s10
+; GISEL-NEXT:    s_clause 0x1
+; GISEL-NEXT:    scratch_store_b128 off, v[0:3], s11
+; GISEL-NEXT:    scratch_store_b32 off, v5, s11
+; GISEL-NEXT:    s_xor_b32 exec_lo, s34, -1
+; GISEL-NEXT:    s_clause 0x5
+; GISEL-NEXT:    scratch_load_b32 v0, off, s32
+; GISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; GISEL-NEXT:    scratch_load_b32 v2, off, s32 offset:8
+; GISEL-NEXT:    scratch_load_b32 v3, off, s32 offset:12
+; GISEL-NEXT:    scratch_load_b32 v4, off, s32 offset:16
+; GISEL-NEXT:    scratch_load_b32 v5, off, s32 offset:20
+; GISEL-NEXT:    s_mov_b32 exec_lo, s34
+; GISEL-NEXT:    s_wait_loadcnt 0x0
+; GISEL-NEXT:    s_setpc_b64 s[30:31]
+  store i32 %i32, ptr addrspace(5) %ptr
+  store <4 x i32> %v4i32, ptr addrspace(5) %ptr2
+  store float %float, ptr addrspace(5) %ptr2
+  ret void
+}

>From e14d17a00d35c4de9ab2c3266ead1bacec89e3da Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 20 May 2025 12:26:12 +0200
Subject: [PATCH 15/24] Add docs and fixme

---
 llvm/docs/AMDGPUUsage.rst                  | 14 ++++++++++++++
 llvm/lib/Target/AMDGPU/SIFrameLowering.cpp |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index c5b9bd9de66e1..19357635ecfc1 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1844,6 +1844,20 @@ The AMDGPU backend supports the following calling conventions:
                                      ..TODO::
                                      Describe.
 
+     ``amdgpu_gfx_whole_wave``       Used for AMD graphics targets. Functions with this calling convention
+                                     cannot be used as entry points. They must have an i1 as the first argument,
+                                     which will be mapped to the value of EXEC on entry into the function. Other
+                                     arguments will contain poison in their inactive lanes. Similarly, the return
+                                     value for the inactive lanes is poison.
+
+                                     The function will run with all lanes enabled, i.e. EXEC will be set to -1 in the
+                                     prologue and restored to its original value in the epilogue. The inactive lanes
+                                     will be preserved for all the registers used by the function. Active lanes
+                                     will only be preserved for the callee-saved registers.
+
+                                     In all other respects, functions with this calling convention behave like
+                                     ``amdgpu_gfx`` functions.
+
      ``amdgpu_gs``                   Used for Mesa/AMDPAL geometry shaders.
                                      ..TODO::
                                      Describe.
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index 28bb290c103f1..b88df50c6c999 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -1719,6 +1719,8 @@ void SIFrameLowering::determineCalleeSaves(MachineFunction &MF,
     // least their inactive lanes. Add them to WWMReservedRegs.
     assert(!NeedExecCopyReservedReg &&
            "Whole wave functions can use the reg mapped for their i1 argument");
+
+    // FIXME: Be more efficient!
     for (MCRegister Reg : AMDGPU::VGPR_32RegClass)
       if (MF.getRegInfo().isPhysRegModified(Reg)) {
         MFI->reserveWWMRegister(Reg);

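To make the documented contract concrete, a minimal whole wave function
might look like the one below (modeled on the tests in this PR; the
function name is illustrative). %active reads as true exactly in the
lanes that were enabled at the call site, and the update.dpp intrinsic
can then read neighboring lanes safely because the body runs with
EXEC = -1:

  define amdgpu_gfx_whole_wave i32 @example(i1 %active, i32 %a, i32 %b) {
    %y = select i1 %active, i32 %b, i32 17
    %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %a, i32 %y, i32 1, i32 1, i32 1, i1 false)
    ret i32 %ret
  }
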
>From cc3539e00a1e617a3333efd3689d5a16d8226730 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 17 Jun 2025 14:30:39 +0200
Subject: [PATCH 16/24] Remove kill flags on orig exec mask

---
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index adf80108689ea..f80352c6ff954 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -5876,8 +5876,10 @@ SITargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
     // During ISel, it's difficult to propagate the original EXEC mask to use as
     // an input to SI_WHOLE_WAVE_FUNC_RETURN. Set it up here instead.
     MachineInstr *Setup = TII->getWholeWaveFunctionSetup(*BB->getParent());
    assert(Setup && "Couldn't find SI_SETUP_WHOLE_WAVE_FUNC");
+    Register OriginalExec = Setup->getOperand(0).getReg();
-    MI.getOperand(0).setReg(Setup->getOperand(0).getReg());
+    MF->getRegInfo().clearKillFlags(OriginalExec);
+    MI.getOperand(0).setReg(OriginalExec);
     return BB;
   }
   default:

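The clearKillFlags call is the important part of this change:
SI_WHOLE_WAVE_FUNC_RETURN gains a use of the setup instruction's
destination register, and that use can land after a point where the
register was already marked killed, so a stale kill flag on the original
EXEC copy would trip the machine verifier.
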
>From 225be4f8a6f67a0d83b9c4458b511c37c2971b16 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Mon, 23 Jun 2025 11:08:36 +0200
Subject: [PATCH 17/24] Add helper to add orig exec to return

---
 llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp | 13 +++++++++----
 llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h   |  3 +++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
index 33bb11c8ce015..474b2675c7074 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
@@ -386,10 +386,7 @@ bool AMDGPUCallLowering::lowerReturn(MachineIRBuilder &B, const Value *Val,
     return false;
 
   if (IsWholeWave) {
-    const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
-    const SIInstrInfo *TII = ST.getInstrInfo();
-    const MachineInstr *Setup = TII->getWholeWaveFunctionSetup(MF);
-    Ret.addReg(Setup->getOperand(0).getReg());
+    addOriginalExecToReturn(B.getMF(), Ret);
   }
 
   // TODO: Handle CalleeSavedRegsViaCopy.
@@ -1614,3 +1611,11 @@ bool AMDGPUCallLowering::lowerCall(MachineIRBuilder &MIRBuilder,
 
   return true;
 }
+
+void AMDGPUCallLowering::addOriginalExecToReturn(MachineFunction &MF,
+                                                 MachineInstrBuilder &Ret) const {
+  const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
+  const SIInstrInfo *TII = ST.getInstrInfo();
+  const MachineInstr *Setup = TII->getWholeWaveFunctionSetup(MF);
+  Ret.addReg(Setup->getOperand(0).getReg());
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h
index a6e801f2a547b..e0033d59d10bb 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h
@@ -37,6 +37,9 @@ class AMDGPUCallLowering final : public CallLowering {
   bool lowerReturnVal(MachineIRBuilder &B, const Value *Val,
                       ArrayRef<Register> VRegs, MachineInstrBuilder &Ret) const;
 
+  void addOriginalExecToReturn(MachineFunction &MF,
+                               MachineInstrBuilder &Ret) const;
+
 public:
   AMDGPUCallLowering(const AMDGPUTargetLowering &TLI);
 

>From cc3a039edcbbd28c54eac79dba3a4d491618a294 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Mon, 23 Jun 2025 11:42:58 +0200
Subject: [PATCH 18/24] Test with single use of orig exec

---
 .../CodeGen/AMDGPU/whole-wave-functions.ll    | 54 +++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
index fe4f67b5daa1b..1727823e71728 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
@@ -60,6 +60,60 @@ define amdgpu_gfx_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
   ret i32 %ret
 }
 
+; Make sure we don't crash if there's only one use for %active.
+define amdgpu_gfx_whole_wave i32 @single_use_of_active(i1 %active, i32 %a, i32 %b) {
+; DAGISEL-LABEL: single_use_of_active:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x1
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s32
+; DAGISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    v_cndmask_b32_e32 v1, 17, v1, vcc_lo
+; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; DAGISEL-NEXT:    s_clause 0x1
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s32
+; DAGISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-LABEL: single_use_of_active:
+; GISEL:       ; %bb.0:
+; GISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-NEXT:    s_wait_expcnt 0x0
+; GISEL-NEXT:    s_wait_samplecnt 0x0
+; GISEL-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-NEXT:    s_wait_kmcnt 0x0
+; GISEL-NEXT:    s_xor_saveexec_b32 vcc_lo, -1
+; GISEL-NEXT:    s_clause 0x1
+; GISEL-NEXT:    scratch_store_b32 off, v0, s32
+; GISEL-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; GISEL-NEXT:    s_mov_b32 exec_lo, -1
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    v_cndmask_b32_e32 v1, 17, v1, vcc_lo
+; GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GISEL-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; GISEL-NEXT:    s_xor_b32 exec_lo, vcc_lo, -1
+; GISEL-NEXT:    s_clause 0x1
+; GISEL-NEXT:    scratch_load_b32 v0, off, s32
+; GISEL-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; GISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
+; GISEL-NEXT:    s_wait_loadcnt 0x0
+; GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %y = select i1 %active, i32 %b, i32 17
+  %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %a, i32 %y, i32 1, i32 1, i32 1, i1 false)
+  ret i32 %ret
+}
+
 ; Make sure we don't crash if %active is not used at all.
 define amdgpu_gfx_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
 ; DAGISEL-LABEL: unused_active:

>From 283531a5919be990940e547f2cf4958866519aba Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Mon, 23 Jun 2025 11:43:11 +0200
Subject: [PATCH 19/24] Test calling gfx func from wwf

---
 .../CodeGen/AMDGPU/whole-wave-functions.ll    | 844 ++++++++++++++++++
 1 file changed, 844 insertions(+)

diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
index 1727823e71728..5263f5a46b807 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
@@ -554,3 +554,847 @@ define amdgpu_gfx_whole_wave void @inreg_args(i1 %active, i32 inreg %i32, <4 x i
   store float %float, ptr addrspace(5) %ptr2
   ret void
 }
+
+declare amdgpu_gfx <2 x half> @gfx_callee(<2 x half> %x, <2 x half> %y)
+
+define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2 x half> %x, <2 x half> %y) {
+; DAGISEL-LABEL: call_gfx_from_whole_wave:
+; DAGISEL:       ; %bb.0:
+; DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-NEXT:    s_mov_b32 s35, s33
+; DAGISEL-NEXT:    s_mov_b32 s33, s32
+; DAGISEL-NEXT:    s_xor_saveexec_b32 s34, -1
+; DAGISEL-NEXT:    s_clause 0x1f
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s33 offset:8
+; DAGISEL-NEXT:    scratch_store_b32 off, v1, s33 offset:12
+; DAGISEL-NEXT:    scratch_store_b32 off, v2, s33 offset:16
+; DAGISEL-NEXT:    scratch_store_b32 off, v3, s33 offset:20
+; DAGISEL-NEXT:    scratch_store_b32 off, v4, s33 offset:24
+; DAGISEL-NEXT:    scratch_store_b32 off, v5, s33 offset:28
+; DAGISEL-NEXT:    scratch_store_b32 off, v6, s33 offset:32
+; DAGISEL-NEXT:    scratch_store_b32 off, v7, s33 offset:36
+; DAGISEL-NEXT:    scratch_store_b32 off, v8, s33 offset:40
+; DAGISEL-NEXT:    scratch_store_b32 off, v9, s33 offset:44
+; DAGISEL-NEXT:    scratch_store_b32 off, v10, s33 offset:48
+; DAGISEL-NEXT:    scratch_store_b32 off, v11, s33 offset:52
+; DAGISEL-NEXT:    scratch_store_b32 off, v12, s33 offset:56
+; DAGISEL-NEXT:    scratch_store_b32 off, v13, s33 offset:60
+; DAGISEL-NEXT:    scratch_store_b32 off, v14, s33 offset:64
+; DAGISEL-NEXT:    scratch_store_b32 off, v15, s33 offset:68
+; DAGISEL-NEXT:    scratch_store_b32 off, v16, s33 offset:72
+; DAGISEL-NEXT:    scratch_store_b32 off, v17, s33 offset:76
+; DAGISEL-NEXT:    scratch_store_b32 off, v18, s33 offset:80
+; DAGISEL-NEXT:    scratch_store_b32 off, v19, s33 offset:84
+; DAGISEL-NEXT:    scratch_store_b32 off, v20, s33 offset:88
+; DAGISEL-NEXT:    scratch_store_b32 off, v21, s33 offset:92
+; DAGISEL-NEXT:    scratch_store_b32 off, v22, s33 offset:96
+; DAGISEL-NEXT:    scratch_store_b32 off, v23, s33 offset:100
+; DAGISEL-NEXT:    scratch_store_b32 off, v24, s33 offset:104
+; DAGISEL-NEXT:    scratch_store_b32 off, v25, s33 offset:108
+; DAGISEL-NEXT:    scratch_store_b32 off, v26, s33 offset:112
+; DAGISEL-NEXT:    scratch_store_b32 off, v27, s33 offset:116
+; DAGISEL-NEXT:    scratch_store_b32 off, v28, s33 offset:120
+; DAGISEL-NEXT:    scratch_store_b32 off, v29, s33 offset:124
+; DAGISEL-NEXT:    scratch_store_b32 off, v30, s33 offset:128
+; DAGISEL-NEXT:    scratch_store_b32 off, v31, s33 offset:132
+; DAGISEL-NEXT:    s_clause 0x1f
+; DAGISEL-NEXT:    scratch_store_b32 off, v32, s33 offset:136
+; DAGISEL-NEXT:    scratch_store_b32 off, v33, s33 offset:140
+; DAGISEL-NEXT:    scratch_store_b32 off, v34, s33 offset:144
+; DAGISEL-NEXT:    scratch_store_b32 off, v35, s33 offset:148
+; DAGISEL-NEXT:    scratch_store_b32 off, v36, s33 offset:152
+; DAGISEL-NEXT:    scratch_store_b32 off, v37, s33 offset:156
+; DAGISEL-NEXT:    scratch_store_b32 off, v38, s33 offset:160
+; DAGISEL-NEXT:    scratch_store_b32 off, v39, s33 offset:164
+; DAGISEL-NEXT:    scratch_store_b32 off, v48, s33 offset:168
+; DAGISEL-NEXT:    scratch_store_b32 off, v49, s33 offset:172
+; DAGISEL-NEXT:    scratch_store_b32 off, v50, s33 offset:176
+; DAGISEL-NEXT:    scratch_store_b32 off, v51, s33 offset:180
+; DAGISEL-NEXT:    scratch_store_b32 off, v52, s33 offset:184
+; DAGISEL-NEXT:    scratch_store_b32 off, v53, s33 offset:188
+; DAGISEL-NEXT:    scratch_store_b32 off, v54, s33 offset:192
+; DAGISEL-NEXT:    scratch_store_b32 off, v55, s33 offset:196
+; DAGISEL-NEXT:    scratch_store_b32 off, v64, s33 offset:200
+; DAGISEL-NEXT:    scratch_store_b32 off, v65, s33 offset:204
+; DAGISEL-NEXT:    scratch_store_b32 off, v66, s33 offset:208
+; DAGISEL-NEXT:    scratch_store_b32 off, v67, s33 offset:212
+; DAGISEL-NEXT:    scratch_store_b32 off, v68, s33 offset:216
+; DAGISEL-NEXT:    scratch_store_b32 off, v69, s33 offset:220
+; DAGISEL-NEXT:    scratch_store_b32 off, v70, s33 offset:224
+; DAGISEL-NEXT:    scratch_store_b32 off, v71, s33 offset:228
+; DAGISEL-NEXT:    scratch_store_b32 off, v80, s33 offset:232
+; DAGISEL-NEXT:    scratch_store_b32 off, v81, s33 offset:236
+; DAGISEL-NEXT:    scratch_store_b32 off, v82, s33 offset:240
+; DAGISEL-NEXT:    scratch_store_b32 off, v83, s33 offset:244
+; DAGISEL-NEXT:    scratch_store_b32 off, v84, s33 offset:248
+; DAGISEL-NEXT:    scratch_store_b32 off, v85, s33 offset:252
+; DAGISEL-NEXT:    scratch_store_b32 off, v86, s33 offset:256
+; DAGISEL-NEXT:    scratch_store_b32 off, v87, s33 offset:260
+; DAGISEL-NEXT:    s_clause 0x1f
+; DAGISEL-NEXT:    scratch_store_b32 off, v96, s33 offset:264
+; DAGISEL-NEXT:    scratch_store_b32 off, v97, s33 offset:268
+; DAGISEL-NEXT:    scratch_store_b32 off, v98, s33 offset:272
+; DAGISEL-NEXT:    scratch_store_b32 off, v99, s33 offset:276
+; DAGISEL-NEXT:    scratch_store_b32 off, v100, s33 offset:280
+; DAGISEL-NEXT:    scratch_store_b32 off, v101, s33 offset:284
+; DAGISEL-NEXT:    scratch_store_b32 off, v102, s33 offset:288
+; DAGISEL-NEXT:    scratch_store_b32 off, v103, s33 offset:292
+; DAGISEL-NEXT:    scratch_store_b32 off, v112, s33 offset:296
+; DAGISEL-NEXT:    scratch_store_b32 off, v113, s33 offset:300
+; DAGISEL-NEXT:    scratch_store_b32 off, v114, s33 offset:304
+; DAGISEL-NEXT:    scratch_store_b32 off, v115, s33 offset:308
+; DAGISEL-NEXT:    scratch_store_b32 off, v116, s33 offset:312
+; DAGISEL-NEXT:    scratch_store_b32 off, v117, s33 offset:316
+; DAGISEL-NEXT:    scratch_store_b32 off, v118, s33 offset:320
+; DAGISEL-NEXT:    scratch_store_b32 off, v119, s33 offset:324
+; DAGISEL-NEXT:    scratch_store_b32 off, v128, s33 offset:328
+; DAGISEL-NEXT:    scratch_store_b32 off, v129, s33 offset:332
+; DAGISEL-NEXT:    scratch_store_b32 off, v130, s33 offset:336
+; DAGISEL-NEXT:    scratch_store_b32 off, v131, s33 offset:340
+; DAGISEL-NEXT:    scratch_store_b32 off, v132, s33 offset:344
+; DAGISEL-NEXT:    scratch_store_b32 off, v133, s33 offset:348
+; DAGISEL-NEXT:    scratch_store_b32 off, v134, s33 offset:352
+; DAGISEL-NEXT:    scratch_store_b32 off, v135, s33 offset:356
+; DAGISEL-NEXT:    scratch_store_b32 off, v144, s33 offset:360
+; DAGISEL-NEXT:    scratch_store_b32 off, v145, s33 offset:364
+; DAGISEL-NEXT:    scratch_store_b32 off, v146, s33 offset:368
+; DAGISEL-NEXT:    scratch_store_b32 off, v147, s33 offset:372
+; DAGISEL-NEXT:    scratch_store_b32 off, v148, s33 offset:376
+; DAGISEL-NEXT:    scratch_store_b32 off, v149, s33 offset:380
+; DAGISEL-NEXT:    scratch_store_b32 off, v150, s33 offset:384
+; DAGISEL-NEXT:    scratch_store_b32 off, v151, s33 offset:388
+; DAGISEL-NEXT:    s_clause 0x1f
+; DAGISEL-NEXT:    scratch_store_b32 off, v160, s33 offset:392
+; DAGISEL-NEXT:    scratch_store_b32 off, v161, s33 offset:396
+; DAGISEL-NEXT:    scratch_store_b32 off, v162, s33 offset:400
+; DAGISEL-NEXT:    scratch_store_b32 off, v163, s33 offset:404
+; DAGISEL-NEXT:    scratch_store_b32 off, v164, s33 offset:408
+; DAGISEL-NEXT:    scratch_store_b32 off, v165, s33 offset:412
+; DAGISEL-NEXT:    scratch_store_b32 off, v166, s33 offset:416
+; DAGISEL-NEXT:    scratch_store_b32 off, v167, s33 offset:420
+; DAGISEL-NEXT:    scratch_store_b32 off, v176, s33 offset:424
+; DAGISEL-NEXT:    scratch_store_b32 off, v177, s33 offset:428
+; DAGISEL-NEXT:    scratch_store_b32 off, v178, s33 offset:432
+; DAGISEL-NEXT:    scratch_store_b32 off, v179, s33 offset:436
+; DAGISEL-NEXT:    scratch_store_b32 off, v180, s33 offset:440
+; DAGISEL-NEXT:    scratch_store_b32 off, v181, s33 offset:444
+; DAGISEL-NEXT:    scratch_store_b32 off, v182, s33 offset:448
+; DAGISEL-NEXT:    scratch_store_b32 off, v183, s33 offset:452
+; DAGISEL-NEXT:    scratch_store_b32 off, v192, s33 offset:456
+; DAGISEL-NEXT:    scratch_store_b32 off, v193, s33 offset:460
+; DAGISEL-NEXT:    scratch_store_b32 off, v194, s33 offset:464
+; DAGISEL-NEXT:    scratch_store_b32 off, v195, s33 offset:468
+; DAGISEL-NEXT:    scratch_store_b32 off, v196, s33 offset:472
+; DAGISEL-NEXT:    scratch_store_b32 off, v197, s33 offset:476
+; DAGISEL-NEXT:    scratch_store_b32 off, v198, s33 offset:480
+; DAGISEL-NEXT:    scratch_store_b32 off, v199, s33 offset:484
+; DAGISEL-NEXT:    scratch_store_b32 off, v208, s33 offset:488
+; DAGISEL-NEXT:    scratch_store_b32 off, v209, s33 offset:492
+; DAGISEL-NEXT:    scratch_store_b32 off, v210, s33 offset:496
+; DAGISEL-NEXT:    scratch_store_b32 off, v211, s33 offset:500
+; DAGISEL-NEXT:    scratch_store_b32 off, v212, s33 offset:504
+; DAGISEL-NEXT:    scratch_store_b32 off, v213, s33 offset:508
+; DAGISEL-NEXT:    scratch_store_b32 off, v214, s33 offset:512
+; DAGISEL-NEXT:    scratch_store_b32 off, v215, s33 offset:516
+; DAGISEL-NEXT:    s_clause 0xf
+; DAGISEL-NEXT:    scratch_store_b32 off, v224, s33 offset:520
+; DAGISEL-NEXT:    scratch_store_b32 off, v225, s33 offset:524
+; DAGISEL-NEXT:    scratch_store_b32 off, v226, s33 offset:528
+; DAGISEL-NEXT:    scratch_store_b32 off, v227, s33 offset:532
+; DAGISEL-NEXT:    scratch_store_b32 off, v228, s33 offset:536
+; DAGISEL-NEXT:    scratch_store_b32 off, v229, s33 offset:540
+; DAGISEL-NEXT:    scratch_store_b32 off, v230, s33 offset:544
+; DAGISEL-NEXT:    scratch_store_b32 off, v231, s33 offset:548
+; DAGISEL-NEXT:    scratch_store_b32 off, v240, s33 offset:552
+; DAGISEL-NEXT:    scratch_store_b32 off, v241, s33 offset:556
+; DAGISEL-NEXT:    scratch_store_b32 off, v242, s33 offset:560
+; DAGISEL-NEXT:    scratch_store_b32 off, v243, s33 offset:564
+; DAGISEL-NEXT:    scratch_store_b32 off, v244, s33 offset:568
+; DAGISEL-NEXT:    scratch_store_b32 off, v245, s33 offset:572
+; DAGISEL-NEXT:    scratch_store_b32 off, v246, s33 offset:576
+; DAGISEL-NEXT:    scratch_store_b32 off, v247, s33 offset:580
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
+; DAGISEL-NEXT:    s_clause 0x1
+; DAGISEL-NEXT:    scratch_store_b32 off, v40, s33
+; DAGISEL-NEXT:    scratch_store_b32 off, v41, s33 offset:4
+; DAGISEL-NEXT:    v_writelane_b32 v40, s4, 0
+; DAGISEL-NEXT:    v_writelane_b32 v41, s76, 0
+; DAGISEL-NEXT:    v_mov_b32_e32 v2, v0
+; DAGISEL-NEXT:    v_swap_b32 v0, v1
+; DAGISEL-NEXT:    v_writelane_b32 v40, s5, 1
+; DAGISEL-NEXT:    v_writelane_b32 v41, s77, 1
+; DAGISEL-NEXT:    s_mov_b32 s1, gfx_callee at abs32@hi
+; DAGISEL-NEXT:    s_mov_b32 s0, gfx_callee at abs32@lo
+; DAGISEL-NEXT:    s_addk_co_i32 s32, 0x250
+; DAGISEL-NEXT:    v_writelane_b32 v40, s6, 2
+; DAGISEL-NEXT:    v_writelane_b32 v41, s78, 2
+; DAGISEL-NEXT:    v_writelane_b32 v40, s7, 3
+; DAGISEL-NEXT:    v_writelane_b32 v41, s79, 3
+; DAGISEL-NEXT:    v_writelane_b32 v40, s8, 4
+; DAGISEL-NEXT:    v_writelane_b32 v41, s88, 4
+; DAGISEL-NEXT:    v_writelane_b32 v40, s9, 5
+; DAGISEL-NEXT:    v_writelane_b32 v41, s89, 5
+; DAGISEL-NEXT:    s_mov_b64 s[8:9], 0
+; DAGISEL-NEXT:    v_writelane_b32 v40, s10, 6
+; DAGISEL-NEXT:    v_writelane_b32 v41, s90, 6
+; DAGISEL-NEXT:    v_writelane_b32 v40, s11, 7
+; DAGISEL-NEXT:    v_writelane_b32 v41, s91, 7
+; DAGISEL-NEXT:    v_writelane_b32 v40, s12, 8
+; DAGISEL-NEXT:    v_writelane_b32 v41, s92, 8
+; DAGISEL-NEXT:    v_writelane_b32 v40, s13, 9
+; DAGISEL-NEXT:    v_writelane_b32 v41, s93, 9
+; DAGISEL-NEXT:    v_writelane_b32 v40, s14, 10
+; DAGISEL-NEXT:    v_writelane_b32 v41, s94, 10
+; DAGISEL-NEXT:    v_writelane_b32 v40, s15, 11
+; DAGISEL-NEXT:    v_writelane_b32 v41, s95, 11
+; DAGISEL-NEXT:    v_writelane_b32 v40, s16, 12
+; DAGISEL-NEXT:    v_writelane_b32 v40, s17, 13
+; DAGISEL-NEXT:    v_writelane_b32 v40, s18, 14
+; DAGISEL-NEXT:    v_writelane_b32 v40, s19, 15
+; DAGISEL-NEXT:    v_writelane_b32 v40, s20, 16
+; DAGISEL-NEXT:    v_writelane_b32 v40, s21, 17
+; DAGISEL-NEXT:    v_writelane_b32 v40, s22, 18
+; DAGISEL-NEXT:    v_writelane_b32 v40, s23, 19
+; DAGISEL-NEXT:    v_writelane_b32 v40, s24, 20
+; DAGISEL-NEXT:    v_writelane_b32 v40, s25, 21
+; DAGISEL-NEXT:    v_writelane_b32 v40, s26, 22
+; DAGISEL-NEXT:    v_writelane_b32 v40, s27, 23
+; DAGISEL-NEXT:    v_writelane_b32 v40, s28, 24
+; DAGISEL-NEXT:    v_writelane_b32 v40, s29, 25
+; DAGISEL-NEXT:    v_writelane_b32 v40, s30, 26
+; DAGISEL-NEXT:    v_writelane_b32 v40, s31, 27
+; DAGISEL-NEXT:    v_writelane_b32 v40, s72, 28
+; DAGISEL-NEXT:    v_writelane_b32 v40, s73, 29
+; DAGISEL-NEXT:    v_writelane_b32 v40, s74, 30
+; DAGISEL-NEXT:    v_writelane_b32 v40, s75, 31
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; DAGISEL-NEXT:    v_readlane_b32 s95, v41, 11
+; DAGISEL-NEXT:    v_readlane_b32 s94, v41, 10
+; DAGISEL-NEXT:    v_readlane_b32 s93, v41, 9
+; DAGISEL-NEXT:    v_readlane_b32 s92, v41, 8
+; DAGISEL-NEXT:    v_readlane_b32 s91, v41, 7
+; DAGISEL-NEXT:    v_readlane_b32 s90, v41, 6
+; DAGISEL-NEXT:    v_readlane_b32 s89, v41, 5
+; DAGISEL-NEXT:    v_readlane_b32 s88, v41, 4
+; DAGISEL-NEXT:    v_readlane_b32 s79, v41, 3
+; DAGISEL-NEXT:    v_readlane_b32 s78, v41, 2
+; DAGISEL-NEXT:    v_readlane_b32 s77, v41, 1
+; DAGISEL-NEXT:    v_readlane_b32 s76, v41, 0
+; DAGISEL-NEXT:    v_readlane_b32 s75, v40, 31
+; DAGISEL-NEXT:    v_readlane_b32 s74, v40, 30
+; DAGISEL-NEXT:    v_readlane_b32 s73, v40, 29
+; DAGISEL-NEXT:    v_readlane_b32 s72, v40, 28
+; DAGISEL-NEXT:    v_readlane_b32 s31, v40, 27
+; DAGISEL-NEXT:    v_readlane_b32 s30, v40, 26
+; DAGISEL-NEXT:    v_readlane_b32 s29, v40, 25
+; DAGISEL-NEXT:    v_readlane_b32 s28, v40, 24
+; DAGISEL-NEXT:    v_readlane_b32 s27, v40, 23
+; DAGISEL-NEXT:    v_readlane_b32 s26, v40, 22
+; DAGISEL-NEXT:    v_readlane_b32 s25, v40, 21
+; DAGISEL-NEXT:    v_readlane_b32 s24, v40, 20
+; DAGISEL-NEXT:    v_readlane_b32 s23, v40, 19
+; DAGISEL-NEXT:    v_readlane_b32 s22, v40, 18
+; DAGISEL-NEXT:    v_readlane_b32 s21, v40, 17
+; DAGISEL-NEXT:    v_readlane_b32 s20, v40, 16
+; DAGISEL-NEXT:    v_readlane_b32 s19, v40, 15
+; DAGISEL-NEXT:    v_readlane_b32 s18, v40, 14
+; DAGISEL-NEXT:    v_readlane_b32 s17, v40, 13
+; DAGISEL-NEXT:    v_readlane_b32 s16, v40, 12
+; DAGISEL-NEXT:    v_readlane_b32 s15, v40, 11
+; DAGISEL-NEXT:    v_readlane_b32 s14, v40, 10
+; DAGISEL-NEXT:    v_readlane_b32 s13, v40, 9
+; DAGISEL-NEXT:    v_readlane_b32 s12, v40, 8
+; DAGISEL-NEXT:    v_readlane_b32 s11, v40, 7
+; DAGISEL-NEXT:    v_readlane_b32 s10, v40, 6
+; DAGISEL-NEXT:    v_readlane_b32 s9, v40, 5
+; DAGISEL-NEXT:    v_readlane_b32 s8, v40, 4
+; DAGISEL-NEXT:    v_readlane_b32 s7, v40, 3
+; DAGISEL-NEXT:    v_readlane_b32 s6, v40, 2
+; DAGISEL-NEXT:    v_readlane_b32 s5, v40, 1
+; DAGISEL-NEXT:    v_readlane_b32 s4, v40, 0
+; DAGISEL-NEXT:    s_clause 0x1
+; DAGISEL-NEXT:    scratch_load_b32 v40, off, s33
+; DAGISEL-NEXT:    scratch_load_b32 v41, off, s33 offset:4
+; DAGISEL-NEXT:    s_mov_b32 s32, s33
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, s34, -1
+; DAGISEL-NEXT:    s_clause 0x1f
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s33 offset:8
+; DAGISEL-NEXT:    scratch_load_b32 v1, off, s33 offset:12
+; DAGISEL-NEXT:    scratch_load_b32 v2, off, s33 offset:16
+; DAGISEL-NEXT:    scratch_load_b32 v3, off, s33 offset:20
+; DAGISEL-NEXT:    scratch_load_b32 v4, off, s33 offset:24
+; DAGISEL-NEXT:    scratch_load_b32 v5, off, s33 offset:28
+; DAGISEL-NEXT:    scratch_load_b32 v6, off, s33 offset:32
+; DAGISEL-NEXT:    scratch_load_b32 v7, off, s33 offset:36
+; DAGISEL-NEXT:    scratch_load_b32 v8, off, s33 offset:40
+; DAGISEL-NEXT:    scratch_load_b32 v9, off, s33 offset:44
+; DAGISEL-NEXT:    scratch_load_b32 v10, off, s33 offset:48
+; DAGISEL-NEXT:    scratch_load_b32 v11, off, s33 offset:52
+; DAGISEL-NEXT:    scratch_load_b32 v12, off, s33 offset:56
+; DAGISEL-NEXT:    scratch_load_b32 v13, off, s33 offset:60
+; DAGISEL-NEXT:    scratch_load_b32 v14, off, s33 offset:64
+; DAGISEL-NEXT:    scratch_load_b32 v15, off, s33 offset:68
+; DAGISEL-NEXT:    scratch_load_b32 v16, off, s33 offset:72
+; DAGISEL-NEXT:    scratch_load_b32 v17, off, s33 offset:76
+; DAGISEL-NEXT:    scratch_load_b32 v18, off, s33 offset:80
+; DAGISEL-NEXT:    scratch_load_b32 v19, off, s33 offset:84
+; DAGISEL-NEXT:    scratch_load_b32 v20, off, s33 offset:88
+; DAGISEL-NEXT:    scratch_load_b32 v21, off, s33 offset:92
+; DAGISEL-NEXT:    scratch_load_b32 v22, off, s33 offset:96
+; DAGISEL-NEXT:    scratch_load_b32 v23, off, s33 offset:100
+; DAGISEL-NEXT:    scratch_load_b32 v24, off, s33 offset:104
+; DAGISEL-NEXT:    scratch_load_b32 v25, off, s33 offset:108
+; DAGISEL-NEXT:    scratch_load_b32 v26, off, s33 offset:112
+; DAGISEL-NEXT:    scratch_load_b32 v27, off, s33 offset:116
+; DAGISEL-NEXT:    scratch_load_b32 v28, off, s33 offset:120
+; DAGISEL-NEXT:    scratch_load_b32 v29, off, s33 offset:124
+; DAGISEL-NEXT:    scratch_load_b32 v30, off, s33 offset:128
+; DAGISEL-NEXT:    scratch_load_b32 v31, off, s33 offset:132
+; DAGISEL-NEXT:    s_clause 0x1f
+; DAGISEL-NEXT:    scratch_load_b32 v32, off, s33 offset:136
+; DAGISEL-NEXT:    scratch_load_b32 v33, off, s33 offset:140
+; DAGISEL-NEXT:    scratch_load_b32 v34, off, s33 offset:144
+; DAGISEL-NEXT:    scratch_load_b32 v35, off, s33 offset:148
+; DAGISEL-NEXT:    scratch_load_b32 v36, off, s33 offset:152
+; DAGISEL-NEXT:    scratch_load_b32 v37, off, s33 offset:156
+; DAGISEL-NEXT:    scratch_load_b32 v38, off, s33 offset:160
+; DAGISEL-NEXT:    scratch_load_b32 v39, off, s33 offset:164
+; DAGISEL-NEXT:    scratch_load_b32 v48, off, s33 offset:168
+; DAGISEL-NEXT:    scratch_load_b32 v49, off, s33 offset:172
+; DAGISEL-NEXT:    scratch_load_b32 v50, off, s33 offset:176
+; DAGISEL-NEXT:    scratch_load_b32 v51, off, s33 offset:180
+; DAGISEL-NEXT:    scratch_load_b32 v52, off, s33 offset:184
+; DAGISEL-NEXT:    scratch_load_b32 v53, off, s33 offset:188
+; DAGISEL-NEXT:    scratch_load_b32 v54, off, s33 offset:192
+; DAGISEL-NEXT:    scratch_load_b32 v55, off, s33 offset:196
+; DAGISEL-NEXT:    scratch_load_b32 v64, off, s33 offset:200
+; DAGISEL-NEXT:    scratch_load_b32 v65, off, s33 offset:204
+; DAGISEL-NEXT:    scratch_load_b32 v66, off, s33 offset:208
+; DAGISEL-NEXT:    scratch_load_b32 v67, off, s33 offset:212
+; DAGISEL-NEXT:    scratch_load_b32 v68, off, s33 offset:216
+; DAGISEL-NEXT:    scratch_load_b32 v69, off, s33 offset:220
+; DAGISEL-NEXT:    scratch_load_b32 v70, off, s33 offset:224
+; DAGISEL-NEXT:    scratch_load_b32 v71, off, s33 offset:228
+; DAGISEL-NEXT:    scratch_load_b32 v80, off, s33 offset:232
+; DAGISEL-NEXT:    scratch_load_b32 v81, off, s33 offset:236
+; DAGISEL-NEXT:    scratch_load_b32 v82, off, s33 offset:240
+; DAGISEL-NEXT:    scratch_load_b32 v83, off, s33 offset:244
+; DAGISEL-NEXT:    scratch_load_b32 v84, off, s33 offset:248
+; DAGISEL-NEXT:    scratch_load_b32 v85, off, s33 offset:252
+; DAGISEL-NEXT:    scratch_load_b32 v86, off, s33 offset:256
+; DAGISEL-NEXT:    scratch_load_b32 v87, off, s33 offset:260
+; DAGISEL-NEXT:    s_clause 0x1f
+; DAGISEL-NEXT:    scratch_load_b32 v96, off, s33 offset:264
+; DAGISEL-NEXT:    scratch_load_b32 v97, off, s33 offset:268
+; DAGISEL-NEXT:    scratch_load_b32 v98, off, s33 offset:272
+; DAGISEL-NEXT:    scratch_load_b32 v99, off, s33 offset:276
+; DAGISEL-NEXT:    scratch_load_b32 v100, off, s33 offset:280
+; DAGISEL-NEXT:    scratch_load_b32 v101, off, s33 offset:284
+; DAGISEL-NEXT:    scratch_load_b32 v102, off, s33 offset:288
+; DAGISEL-NEXT:    scratch_load_b32 v103, off, s33 offset:292
+; DAGISEL-NEXT:    scratch_load_b32 v112, off, s33 offset:296
+; DAGISEL-NEXT:    scratch_load_b32 v113, off, s33 offset:300
+; DAGISEL-NEXT:    scratch_load_b32 v114, off, s33 offset:304
+; DAGISEL-NEXT:    scratch_load_b32 v115, off, s33 offset:308
+; DAGISEL-NEXT:    scratch_load_b32 v116, off, s33 offset:312
+; DAGISEL-NEXT:    scratch_load_b32 v117, off, s33 offset:316
+; DAGISEL-NEXT:    scratch_load_b32 v118, off, s33 offset:320
+; DAGISEL-NEXT:    scratch_load_b32 v119, off, s33 offset:324
+; DAGISEL-NEXT:    scratch_load_b32 v128, off, s33 offset:328
+; DAGISEL-NEXT:    scratch_load_b32 v129, off, s33 offset:332
+; DAGISEL-NEXT:    scratch_load_b32 v130, off, s33 offset:336
+; DAGISEL-NEXT:    scratch_load_b32 v131, off, s33 offset:340
+; DAGISEL-NEXT:    scratch_load_b32 v132, off, s33 offset:344
+; DAGISEL-NEXT:    scratch_load_b32 v133, off, s33 offset:348
+; DAGISEL-NEXT:    scratch_load_b32 v134, off, s33 offset:352
+; DAGISEL-NEXT:    scratch_load_b32 v135, off, s33 offset:356
+; DAGISEL-NEXT:    scratch_load_b32 v144, off, s33 offset:360
+; DAGISEL-NEXT:    scratch_load_b32 v145, off, s33 offset:364
+; DAGISEL-NEXT:    scratch_load_b32 v146, off, s33 offset:368
+; DAGISEL-NEXT:    scratch_load_b32 v147, off, s33 offset:372
+; DAGISEL-NEXT:    scratch_load_b32 v148, off, s33 offset:376
+; DAGISEL-NEXT:    scratch_load_b32 v149, off, s33 offset:380
+; DAGISEL-NEXT:    scratch_load_b32 v150, off, s33 offset:384
+; DAGISEL-NEXT:    scratch_load_b32 v151, off, s33 offset:388
+; DAGISEL-NEXT:    s_clause 0x1f
+; DAGISEL-NEXT:    scratch_load_b32 v160, off, s33 offset:392
+; DAGISEL-NEXT:    scratch_load_b32 v161, off, s33 offset:396
+; DAGISEL-NEXT:    scratch_load_b32 v162, off, s33 offset:400
+; DAGISEL-NEXT:    scratch_load_b32 v163, off, s33 offset:404
+; DAGISEL-NEXT:    scratch_load_b32 v164, off, s33 offset:408
+; DAGISEL-NEXT:    scratch_load_b32 v165, off, s33 offset:412
+; DAGISEL-NEXT:    scratch_load_b32 v166, off, s33 offset:416
+; DAGISEL-NEXT:    scratch_load_b32 v167, off, s33 offset:420
+; DAGISEL-NEXT:    scratch_load_b32 v176, off, s33 offset:424
+; DAGISEL-NEXT:    scratch_load_b32 v177, off, s33 offset:428
+; DAGISEL-NEXT:    scratch_load_b32 v178, off, s33 offset:432
+; DAGISEL-NEXT:    scratch_load_b32 v179, off, s33 offset:436
+; DAGISEL-NEXT:    scratch_load_b32 v180, off, s33 offset:440
+; DAGISEL-NEXT:    scratch_load_b32 v181, off, s33 offset:444
+; DAGISEL-NEXT:    scratch_load_b32 v182, off, s33 offset:448
+; DAGISEL-NEXT:    scratch_load_b32 v183, off, s33 offset:452
+; DAGISEL-NEXT:    scratch_load_b32 v192, off, s33 offset:456
+; DAGISEL-NEXT:    scratch_load_b32 v193, off, s33 offset:460
+; DAGISEL-NEXT:    scratch_load_b32 v194, off, s33 offset:464
+; DAGISEL-NEXT:    scratch_load_b32 v195, off, s33 offset:468
+; DAGISEL-NEXT:    scratch_load_b32 v196, off, s33 offset:472
+; DAGISEL-NEXT:    scratch_load_b32 v197, off, s33 offset:476
+; DAGISEL-NEXT:    scratch_load_b32 v198, off, s33 offset:480
+; DAGISEL-NEXT:    scratch_load_b32 v199, off, s33 offset:484
+; DAGISEL-NEXT:    scratch_load_b32 v208, off, s33 offset:488
+; DAGISEL-NEXT:    scratch_load_b32 v209, off, s33 offset:492
+; DAGISEL-NEXT:    scratch_load_b32 v210, off, s33 offset:496
+; DAGISEL-NEXT:    scratch_load_b32 v211, off, s33 offset:500
+; DAGISEL-NEXT:    scratch_load_b32 v212, off, s33 offset:504
+; DAGISEL-NEXT:    scratch_load_b32 v213, off, s33 offset:508
+; DAGISEL-NEXT:    scratch_load_b32 v214, off, s33 offset:512
+; DAGISEL-NEXT:    scratch_load_b32 v215, off, s33 offset:516
+; DAGISEL-NEXT:    s_clause 0xf
+; DAGISEL-NEXT:    scratch_load_b32 v224, off, s33 offset:520
+; DAGISEL-NEXT:    scratch_load_b32 v225, off, s33 offset:524
+; DAGISEL-NEXT:    scratch_load_b32 v226, off, s33 offset:528
+; DAGISEL-NEXT:    scratch_load_b32 v227, off, s33 offset:532
+; DAGISEL-NEXT:    scratch_load_b32 v228, off, s33 offset:536
+; DAGISEL-NEXT:    scratch_load_b32 v229, off, s33 offset:540
+; DAGISEL-NEXT:    scratch_load_b32 v230, off, s33 offset:544
+; DAGISEL-NEXT:    scratch_load_b32 v231, off, s33 offset:548
+; DAGISEL-NEXT:    scratch_load_b32 v240, off, s33 offset:552
+; DAGISEL-NEXT:    scratch_load_b32 v241, off, s33 offset:556
+; DAGISEL-NEXT:    scratch_load_b32 v242, off, s33 offset:560
+; DAGISEL-NEXT:    scratch_load_b32 v243, off, s33 offset:564
+; DAGISEL-NEXT:    scratch_load_b32 v244, off, s33 offset:568
+; DAGISEL-NEXT:    scratch_load_b32 v245, off, s33 offset:572
+; DAGISEL-NEXT:    scratch_load_b32 v246, off, s33 offset:576
+; DAGISEL-NEXT:    scratch_load_b32 v247, off, s33 offset:580
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, s34
+; DAGISEL-NEXT:    s_mov_b32 s33, s35
+; DAGISEL-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-LABEL: call_gfx_from_whole_wave:
+; GISEL:       ; %bb.0:
+; GISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-NEXT:    s_wait_expcnt 0x0
+; GISEL-NEXT:    s_wait_samplecnt 0x0
+; GISEL-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-NEXT:    s_wait_kmcnt 0x0
+; GISEL-NEXT:    s_mov_b32 s35, s33
+; GISEL-NEXT:    s_mov_b32 s33, s32
+; GISEL-NEXT:    s_xor_saveexec_b32 s34, -1
+; GISEL-NEXT:    s_clause 0x1f
+; GISEL-NEXT:    scratch_store_b32 off, v0, s33 offset:8
+; GISEL-NEXT:    scratch_store_b32 off, v1, s33 offset:12
+; GISEL-NEXT:    scratch_store_b32 off, v2, s33 offset:16
+; GISEL-NEXT:    scratch_store_b32 off, v3, s33 offset:20
+; GISEL-NEXT:    scratch_store_b32 off, v4, s33 offset:24
+; GISEL-NEXT:    scratch_store_b32 off, v5, s33 offset:28
+; GISEL-NEXT:    scratch_store_b32 off, v6, s33 offset:32
+; GISEL-NEXT:    scratch_store_b32 off, v7, s33 offset:36
+; GISEL-NEXT:    scratch_store_b32 off, v8, s33 offset:40
+; GISEL-NEXT:    scratch_store_b32 off, v9, s33 offset:44
+; GISEL-NEXT:    scratch_store_b32 off, v10, s33 offset:48
+; GISEL-NEXT:    scratch_store_b32 off, v11, s33 offset:52
+; GISEL-NEXT:    scratch_store_b32 off, v12, s33 offset:56
+; GISEL-NEXT:    scratch_store_b32 off, v13, s33 offset:60
+; GISEL-NEXT:    scratch_store_b32 off, v14, s33 offset:64
+; GISEL-NEXT:    scratch_store_b32 off, v15, s33 offset:68
+; GISEL-NEXT:    scratch_store_b32 off, v16, s33 offset:72
+; GISEL-NEXT:    scratch_store_b32 off, v17, s33 offset:76
+; GISEL-NEXT:    scratch_store_b32 off, v18, s33 offset:80
+; GISEL-NEXT:    scratch_store_b32 off, v19, s33 offset:84
+; GISEL-NEXT:    scratch_store_b32 off, v20, s33 offset:88
+; GISEL-NEXT:    scratch_store_b32 off, v21, s33 offset:92
+; GISEL-NEXT:    scratch_store_b32 off, v22, s33 offset:96
+; GISEL-NEXT:    scratch_store_b32 off, v23, s33 offset:100
+; GISEL-NEXT:    scratch_store_b32 off, v24, s33 offset:104
+; GISEL-NEXT:    scratch_store_b32 off, v25, s33 offset:108
+; GISEL-NEXT:    scratch_store_b32 off, v26, s33 offset:112
+; GISEL-NEXT:    scratch_store_b32 off, v27, s33 offset:116
+; GISEL-NEXT:    scratch_store_b32 off, v28, s33 offset:120
+; GISEL-NEXT:    scratch_store_b32 off, v29, s33 offset:124
+; GISEL-NEXT:    scratch_store_b32 off, v30, s33 offset:128
+; GISEL-NEXT:    scratch_store_b32 off, v31, s33 offset:132
+; GISEL-NEXT:    s_clause 0x1f
+; GISEL-NEXT:    scratch_store_b32 off, v32, s33 offset:136
+; GISEL-NEXT:    scratch_store_b32 off, v33, s33 offset:140
+; GISEL-NEXT:    scratch_store_b32 off, v34, s33 offset:144
+; GISEL-NEXT:    scratch_store_b32 off, v35, s33 offset:148
+; GISEL-NEXT:    scratch_store_b32 off, v36, s33 offset:152
+; GISEL-NEXT:    scratch_store_b32 off, v37, s33 offset:156
+; GISEL-NEXT:    scratch_store_b32 off, v38, s33 offset:160
+; GISEL-NEXT:    scratch_store_b32 off, v39, s33 offset:164
+; GISEL-NEXT:    scratch_store_b32 off, v48, s33 offset:168
+; GISEL-NEXT:    scratch_store_b32 off, v49, s33 offset:172
+; GISEL-NEXT:    scratch_store_b32 off, v50, s33 offset:176
+; GISEL-NEXT:    scratch_store_b32 off, v51, s33 offset:180
+; GISEL-NEXT:    scratch_store_b32 off, v52, s33 offset:184
+; GISEL-NEXT:    scratch_store_b32 off, v53, s33 offset:188
+; GISEL-NEXT:    scratch_store_b32 off, v54, s33 offset:192
+; GISEL-NEXT:    scratch_store_b32 off, v55, s33 offset:196
+; GISEL-NEXT:    scratch_store_b32 off, v64, s33 offset:200
+; GISEL-NEXT:    scratch_store_b32 off, v65, s33 offset:204
+; GISEL-NEXT:    scratch_store_b32 off, v66, s33 offset:208
+; GISEL-NEXT:    scratch_store_b32 off, v67, s33 offset:212
+; GISEL-NEXT:    scratch_store_b32 off, v68, s33 offset:216
+; GISEL-NEXT:    scratch_store_b32 off, v69, s33 offset:220
+; GISEL-NEXT:    scratch_store_b32 off, v70, s33 offset:224
+; GISEL-NEXT:    scratch_store_b32 off, v71, s33 offset:228
+; GISEL-NEXT:    scratch_store_b32 off, v80, s33 offset:232
+; GISEL-NEXT:    scratch_store_b32 off, v81, s33 offset:236
+; GISEL-NEXT:    scratch_store_b32 off, v82, s33 offset:240
+; GISEL-NEXT:    scratch_store_b32 off, v83, s33 offset:244
+; GISEL-NEXT:    scratch_store_b32 off, v84, s33 offset:248
+; GISEL-NEXT:    scratch_store_b32 off, v85, s33 offset:252
+; GISEL-NEXT:    scratch_store_b32 off, v86, s33 offset:256
+; GISEL-NEXT:    scratch_store_b32 off, v87, s33 offset:260
+; GISEL-NEXT:    s_clause 0x1f
+; GISEL-NEXT:    scratch_store_b32 off, v96, s33 offset:264
+; GISEL-NEXT:    scratch_store_b32 off, v97, s33 offset:268
+; GISEL-NEXT:    scratch_store_b32 off, v98, s33 offset:272
+; GISEL-NEXT:    scratch_store_b32 off, v99, s33 offset:276
+; GISEL-NEXT:    scratch_store_b32 off, v100, s33 offset:280
+; GISEL-NEXT:    scratch_store_b32 off, v101, s33 offset:284
+; GISEL-NEXT:    scratch_store_b32 off, v102, s33 offset:288
+; GISEL-NEXT:    scratch_store_b32 off, v103, s33 offset:292
+; GISEL-NEXT:    scratch_store_b32 off, v112, s33 offset:296
+; GISEL-NEXT:    scratch_store_b32 off, v113, s33 offset:300
+; GISEL-NEXT:    scratch_store_b32 off, v114, s33 offset:304
+; GISEL-NEXT:    scratch_store_b32 off, v115, s33 offset:308
+; GISEL-NEXT:    scratch_store_b32 off, v116, s33 offset:312
+; GISEL-NEXT:    scratch_store_b32 off, v117, s33 offset:316
+; GISEL-NEXT:    scratch_store_b32 off, v118, s33 offset:320
+; GISEL-NEXT:    scratch_store_b32 off, v119, s33 offset:324
+; GISEL-NEXT:    scratch_store_b32 off, v128, s33 offset:328
+; GISEL-NEXT:    scratch_store_b32 off, v129, s33 offset:332
+; GISEL-NEXT:    scratch_store_b32 off, v130, s33 offset:336
+; GISEL-NEXT:    scratch_store_b32 off, v131, s33 offset:340
+; GISEL-NEXT:    scratch_store_b32 off, v132, s33 offset:344
+; GISEL-NEXT:    scratch_store_b32 off, v133, s33 offset:348
+; GISEL-NEXT:    scratch_store_b32 off, v134, s33 offset:352
+; GISEL-NEXT:    scratch_store_b32 off, v135, s33 offset:356
+; GISEL-NEXT:    scratch_store_b32 off, v144, s33 offset:360
+; GISEL-NEXT:    scratch_store_b32 off, v145, s33 offset:364
+; GISEL-NEXT:    scratch_store_b32 off, v146, s33 offset:368
+; GISEL-NEXT:    scratch_store_b32 off, v147, s33 offset:372
+; GISEL-NEXT:    scratch_store_b32 off, v148, s33 offset:376
+; GISEL-NEXT:    scratch_store_b32 off, v149, s33 offset:380
+; GISEL-NEXT:    scratch_store_b32 off, v150, s33 offset:384
+; GISEL-NEXT:    scratch_store_b32 off, v151, s33 offset:388
+; GISEL-NEXT:    s_clause 0x1f
+; GISEL-NEXT:    scratch_store_b32 off, v160, s33 offset:392
+; GISEL-NEXT:    scratch_store_b32 off, v161, s33 offset:396
+; GISEL-NEXT:    scratch_store_b32 off, v162, s33 offset:400
+; GISEL-NEXT:    scratch_store_b32 off, v163, s33 offset:404
+; GISEL-NEXT:    scratch_store_b32 off, v164, s33 offset:408
+; GISEL-NEXT:    scratch_store_b32 off, v165, s33 offset:412
+; GISEL-NEXT:    scratch_store_b32 off, v166, s33 offset:416
+; GISEL-NEXT:    scratch_store_b32 off, v167, s33 offset:420
+; GISEL-NEXT:    scratch_store_b32 off, v176, s33 offset:424
+; GISEL-NEXT:    scratch_store_b32 off, v177, s33 offset:428
+; GISEL-NEXT:    scratch_store_b32 off, v178, s33 offset:432
+; GISEL-NEXT:    scratch_store_b32 off, v179, s33 offset:436
+; GISEL-NEXT:    scratch_store_b32 off, v180, s33 offset:440
+; GISEL-NEXT:    scratch_store_b32 off, v181, s33 offset:444
+; GISEL-NEXT:    scratch_store_b32 off, v182, s33 offset:448
+; GISEL-NEXT:    scratch_store_b32 off, v183, s33 offset:452
+; GISEL-NEXT:    scratch_store_b32 off, v192, s33 offset:456
+; GISEL-NEXT:    scratch_store_b32 off, v193, s33 offset:460
+; GISEL-NEXT:    scratch_store_b32 off, v194, s33 offset:464
+; GISEL-NEXT:    scratch_store_b32 off, v195, s33 offset:468
+; GISEL-NEXT:    scratch_store_b32 off, v196, s33 offset:472
+; GISEL-NEXT:    scratch_store_b32 off, v197, s33 offset:476
+; GISEL-NEXT:    scratch_store_b32 off, v198, s33 offset:480
+; GISEL-NEXT:    scratch_store_b32 off, v199, s33 offset:484
+; GISEL-NEXT:    scratch_store_b32 off, v208, s33 offset:488
+; GISEL-NEXT:    scratch_store_b32 off, v209, s33 offset:492
+; GISEL-NEXT:    scratch_store_b32 off, v210, s33 offset:496
+; GISEL-NEXT:    scratch_store_b32 off, v211, s33 offset:500
+; GISEL-NEXT:    scratch_store_b32 off, v212, s33 offset:504
+; GISEL-NEXT:    scratch_store_b32 off, v213, s33 offset:508
+; GISEL-NEXT:    scratch_store_b32 off, v214, s33 offset:512
+; GISEL-NEXT:    scratch_store_b32 off, v215, s33 offset:516
+; GISEL-NEXT:    s_clause 0xf
+; GISEL-NEXT:    scratch_store_b32 off, v224, s33 offset:520
+; GISEL-NEXT:    scratch_store_b32 off, v225, s33 offset:524
+; GISEL-NEXT:    scratch_store_b32 off, v226, s33 offset:528
+; GISEL-NEXT:    scratch_store_b32 off, v227, s33 offset:532
+; GISEL-NEXT:    scratch_store_b32 off, v228, s33 offset:536
+; GISEL-NEXT:    scratch_store_b32 off, v229, s33 offset:540
+; GISEL-NEXT:    scratch_store_b32 off, v230, s33 offset:544
+; GISEL-NEXT:    scratch_store_b32 off, v231, s33 offset:548
+; GISEL-NEXT:    scratch_store_b32 off, v240, s33 offset:552
+; GISEL-NEXT:    scratch_store_b32 off, v241, s33 offset:556
+; GISEL-NEXT:    scratch_store_b32 off, v242, s33 offset:560
+; GISEL-NEXT:    scratch_store_b32 off, v243, s33 offset:564
+; GISEL-NEXT:    scratch_store_b32 off, v244, s33 offset:568
+; GISEL-NEXT:    scratch_store_b32 off, v245, s33 offset:572
+; GISEL-NEXT:    scratch_store_b32 off, v246, s33 offset:576
+; GISEL-NEXT:    scratch_store_b32 off, v247, s33 offset:580
+; GISEL-NEXT:    s_mov_b32 exec_lo, -1
+; GISEL-NEXT:    s_clause 0x1
+; GISEL-NEXT:    scratch_store_b32 off, v40, s33
+; GISEL-NEXT:    scratch_store_b32 off, v41, s33 offset:4
+; GISEL-NEXT:    v_writelane_b32 v40, s4, 0
+; GISEL-NEXT:    v_writelane_b32 v41, s76, 0
+; GISEL-NEXT:    v_mov_b32_e32 v2, v0
+; GISEL-NEXT:    v_swap_b32 v0, v1
+; GISEL-NEXT:    v_writelane_b32 v40, s5, 1
+; GISEL-NEXT:    v_writelane_b32 v41, s77, 1
+; GISEL-NEXT:    s_mov_b32 s0, gfx_callee@abs32@lo
+; GISEL-NEXT:    s_mov_b32 s1, gfx_callee@abs32@hi
+; GISEL-NEXT:    s_addk_co_i32 s32, 0x250
+; GISEL-NEXT:    v_writelane_b32 v40, s6, 2
+; GISEL-NEXT:    v_writelane_b32 v41, s78, 2
+; GISEL-NEXT:    v_writelane_b32 v40, s7, 3
+; GISEL-NEXT:    v_writelane_b32 v41, s79, 3
+; GISEL-NEXT:    v_writelane_b32 v40, s8, 4
+; GISEL-NEXT:    v_writelane_b32 v41, s88, 4
+; GISEL-NEXT:    v_writelane_b32 v40, s9, 5
+; GISEL-NEXT:    v_writelane_b32 v41, s89, 5
+; GISEL-NEXT:    s_mov_b64 s[8:9], 0
+; GISEL-NEXT:    v_writelane_b32 v40, s10, 6
+; GISEL-NEXT:    v_writelane_b32 v41, s90, 6
+; GISEL-NEXT:    v_writelane_b32 v40, s11, 7
+; GISEL-NEXT:    v_writelane_b32 v41, s91, 7
+; GISEL-NEXT:    v_writelane_b32 v40, s12, 8
+; GISEL-NEXT:    v_writelane_b32 v41, s92, 8
+; GISEL-NEXT:    v_writelane_b32 v40, s13, 9
+; GISEL-NEXT:    v_writelane_b32 v41, s93, 9
+; GISEL-NEXT:    v_writelane_b32 v40, s14, 10
+; GISEL-NEXT:    v_writelane_b32 v41, s94, 10
+; GISEL-NEXT:    v_writelane_b32 v40, s15, 11
+; GISEL-NEXT:    v_writelane_b32 v41, s95, 11
+; GISEL-NEXT:    v_writelane_b32 v40, s16, 12
+; GISEL-NEXT:    v_writelane_b32 v40, s17, 13
+; GISEL-NEXT:    v_writelane_b32 v40, s18, 14
+; GISEL-NEXT:    v_writelane_b32 v40, s19, 15
+; GISEL-NEXT:    v_writelane_b32 v40, s20, 16
+; GISEL-NEXT:    v_writelane_b32 v40, s21, 17
+; GISEL-NEXT:    v_writelane_b32 v40, s22, 18
+; GISEL-NEXT:    v_writelane_b32 v40, s23, 19
+; GISEL-NEXT:    v_writelane_b32 v40, s24, 20
+; GISEL-NEXT:    v_writelane_b32 v40, s25, 21
+; GISEL-NEXT:    v_writelane_b32 v40, s26, 22
+; GISEL-NEXT:    v_writelane_b32 v40, s27, 23
+; GISEL-NEXT:    v_writelane_b32 v40, s28, 24
+; GISEL-NEXT:    v_writelane_b32 v40, s29, 25
+; GISEL-NEXT:    v_writelane_b32 v40, s30, 26
+; GISEL-NEXT:    v_writelane_b32 v40, s31, 27
+; GISEL-NEXT:    v_writelane_b32 v40, s72, 28
+; GISEL-NEXT:    v_writelane_b32 v40, s73, 29
+; GISEL-NEXT:    v_writelane_b32 v40, s74, 30
+; GISEL-NEXT:    v_writelane_b32 v40, s75, 31
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; GISEL-NEXT:    v_readlane_b32 s95, v41, 11
+; GISEL-NEXT:    v_readlane_b32 s94, v41, 10
+; GISEL-NEXT:    v_readlane_b32 s93, v41, 9
+; GISEL-NEXT:    v_readlane_b32 s92, v41, 8
+; GISEL-NEXT:    v_readlane_b32 s91, v41, 7
+; GISEL-NEXT:    v_readlane_b32 s90, v41, 6
+; GISEL-NEXT:    v_readlane_b32 s89, v41, 5
+; GISEL-NEXT:    v_readlane_b32 s88, v41, 4
+; GISEL-NEXT:    v_readlane_b32 s79, v41, 3
+; GISEL-NEXT:    v_readlane_b32 s78, v41, 2
+; GISEL-NEXT:    v_readlane_b32 s77, v41, 1
+; GISEL-NEXT:    v_readlane_b32 s76, v41, 0
+; GISEL-NEXT:    v_readlane_b32 s75, v40, 31
+; GISEL-NEXT:    v_readlane_b32 s74, v40, 30
+; GISEL-NEXT:    v_readlane_b32 s73, v40, 29
+; GISEL-NEXT:    v_readlane_b32 s72, v40, 28
+; GISEL-NEXT:    v_readlane_b32 s31, v40, 27
+; GISEL-NEXT:    v_readlane_b32 s30, v40, 26
+; GISEL-NEXT:    v_readlane_b32 s29, v40, 25
+; GISEL-NEXT:    v_readlane_b32 s28, v40, 24
+; GISEL-NEXT:    v_readlane_b32 s27, v40, 23
+; GISEL-NEXT:    v_readlane_b32 s26, v40, 22
+; GISEL-NEXT:    v_readlane_b32 s25, v40, 21
+; GISEL-NEXT:    v_readlane_b32 s24, v40, 20
+; GISEL-NEXT:    v_readlane_b32 s23, v40, 19
+; GISEL-NEXT:    v_readlane_b32 s22, v40, 18
+; GISEL-NEXT:    v_readlane_b32 s21, v40, 17
+; GISEL-NEXT:    v_readlane_b32 s20, v40, 16
+; GISEL-NEXT:    v_readlane_b32 s19, v40, 15
+; GISEL-NEXT:    v_readlane_b32 s18, v40, 14
+; GISEL-NEXT:    v_readlane_b32 s17, v40, 13
+; GISEL-NEXT:    v_readlane_b32 s16, v40, 12
+; GISEL-NEXT:    v_readlane_b32 s15, v40, 11
+; GISEL-NEXT:    v_readlane_b32 s14, v40, 10
+; GISEL-NEXT:    v_readlane_b32 s13, v40, 9
+; GISEL-NEXT:    v_readlane_b32 s12, v40, 8
+; GISEL-NEXT:    v_readlane_b32 s11, v40, 7
+; GISEL-NEXT:    v_readlane_b32 s10, v40, 6
+; GISEL-NEXT:    v_readlane_b32 s9, v40, 5
+; GISEL-NEXT:    v_readlane_b32 s8, v40, 4
+; GISEL-NEXT:    v_readlane_b32 s7, v40, 3
+; GISEL-NEXT:    v_readlane_b32 s6, v40, 2
+; GISEL-NEXT:    v_readlane_b32 s5, v40, 1
+; GISEL-NEXT:    v_readlane_b32 s4, v40, 0
+; GISEL-NEXT:    s_clause 0x1
+; GISEL-NEXT:    scratch_load_b32 v40, off, s33
+; GISEL-NEXT:    scratch_load_b32 v41, off, s33 offset:4
+; GISEL-NEXT:    s_mov_b32 s32, s33
+; GISEL-NEXT:    s_xor_b32 exec_lo, s34, -1
+; GISEL-NEXT:    s_clause 0x1f
+; GISEL-NEXT:    scratch_load_b32 v0, off, s33 offset:8
+; GISEL-NEXT:    scratch_load_b32 v1, off, s33 offset:12
+; GISEL-NEXT:    scratch_load_b32 v2, off, s33 offset:16
+; GISEL-NEXT:    scratch_load_b32 v3, off, s33 offset:20
+; GISEL-NEXT:    scratch_load_b32 v4, off, s33 offset:24
+; GISEL-NEXT:    scratch_load_b32 v5, off, s33 offset:28
+; GISEL-NEXT:    scratch_load_b32 v6, off, s33 offset:32
+; GISEL-NEXT:    scratch_load_b32 v7, off, s33 offset:36
+; GISEL-NEXT:    scratch_load_b32 v8, off, s33 offset:40
+; GISEL-NEXT:    scratch_load_b32 v9, off, s33 offset:44
+; GISEL-NEXT:    scratch_load_b32 v10, off, s33 offset:48
+; GISEL-NEXT:    scratch_load_b32 v11, off, s33 offset:52
+; GISEL-NEXT:    scratch_load_b32 v12, off, s33 offset:56
+; GISEL-NEXT:    scratch_load_b32 v13, off, s33 offset:60
+; GISEL-NEXT:    scratch_load_b32 v14, off, s33 offset:64
+; GISEL-NEXT:    scratch_load_b32 v15, off, s33 offset:68
+; GISEL-NEXT:    scratch_load_b32 v16, off, s33 offset:72
+; GISEL-NEXT:    scratch_load_b32 v17, off, s33 offset:76
+; GISEL-NEXT:    scratch_load_b32 v18, off, s33 offset:80
+; GISEL-NEXT:    scratch_load_b32 v19, off, s33 offset:84
+; GISEL-NEXT:    scratch_load_b32 v20, off, s33 offset:88
+; GISEL-NEXT:    scratch_load_b32 v21, off, s33 offset:92
+; GISEL-NEXT:    scratch_load_b32 v22, off, s33 offset:96
+; GISEL-NEXT:    scratch_load_b32 v23, off, s33 offset:100
+; GISEL-NEXT:    scratch_load_b32 v24, off, s33 offset:104
+; GISEL-NEXT:    scratch_load_b32 v25, off, s33 offset:108
+; GISEL-NEXT:    scratch_load_b32 v26, off, s33 offset:112
+; GISEL-NEXT:    scratch_load_b32 v27, off, s33 offset:116
+; GISEL-NEXT:    scratch_load_b32 v28, off, s33 offset:120
+; GISEL-NEXT:    scratch_load_b32 v29, off, s33 offset:124
+; GISEL-NEXT:    scratch_load_b32 v30, off, s33 offset:128
+; GISEL-NEXT:    scratch_load_b32 v31, off, s33 offset:132
+; GISEL-NEXT:    s_clause 0x1f
+; GISEL-NEXT:    scratch_load_b32 v32, off, s33 offset:136
+; GISEL-NEXT:    scratch_load_b32 v33, off, s33 offset:140
+; GISEL-NEXT:    scratch_load_b32 v34, off, s33 offset:144
+; GISEL-NEXT:    scratch_load_b32 v35, off, s33 offset:148
+; GISEL-NEXT:    scratch_load_b32 v36, off, s33 offset:152
+; GISEL-NEXT:    scratch_load_b32 v37, off, s33 offset:156
+; GISEL-NEXT:    scratch_load_b32 v38, off, s33 offset:160
+; GISEL-NEXT:    scratch_load_b32 v39, off, s33 offset:164
+; GISEL-NEXT:    scratch_load_b32 v48, off, s33 offset:168
+; GISEL-NEXT:    scratch_load_b32 v49, off, s33 offset:172
+; GISEL-NEXT:    scratch_load_b32 v50, off, s33 offset:176
+; GISEL-NEXT:    scratch_load_b32 v51, off, s33 offset:180
+; GISEL-NEXT:    scratch_load_b32 v52, off, s33 offset:184
+; GISEL-NEXT:    scratch_load_b32 v53, off, s33 offset:188
+; GISEL-NEXT:    scratch_load_b32 v54, off, s33 offset:192
+; GISEL-NEXT:    scratch_load_b32 v55, off, s33 offset:196
+; GISEL-NEXT:    scratch_load_b32 v64, off, s33 offset:200
+; GISEL-NEXT:    scratch_load_b32 v65, off, s33 offset:204
+; GISEL-NEXT:    scratch_load_b32 v66, off, s33 offset:208
+; GISEL-NEXT:    scratch_load_b32 v67, off, s33 offset:212
+; GISEL-NEXT:    scratch_load_b32 v68, off, s33 offset:216
+; GISEL-NEXT:    scratch_load_b32 v69, off, s33 offset:220
+; GISEL-NEXT:    scratch_load_b32 v70, off, s33 offset:224
+; GISEL-NEXT:    scratch_load_b32 v71, off, s33 offset:228
+; GISEL-NEXT:    scratch_load_b32 v80, off, s33 offset:232
+; GISEL-NEXT:    scratch_load_b32 v81, off, s33 offset:236
+; GISEL-NEXT:    scratch_load_b32 v82, off, s33 offset:240
+; GISEL-NEXT:    scratch_load_b32 v83, off, s33 offset:244
+; GISEL-NEXT:    scratch_load_b32 v84, off, s33 offset:248
+; GISEL-NEXT:    scratch_load_b32 v85, off, s33 offset:252
+; GISEL-NEXT:    scratch_load_b32 v86, off, s33 offset:256
+; GISEL-NEXT:    scratch_load_b32 v87, off, s33 offset:260
+; GISEL-NEXT:    s_clause 0x1f
+; GISEL-NEXT:    scratch_load_b32 v96, off, s33 offset:264
+; GISEL-NEXT:    scratch_load_b32 v97, off, s33 offset:268
+; GISEL-NEXT:    scratch_load_b32 v98, off, s33 offset:272
+; GISEL-NEXT:    scratch_load_b32 v99, off, s33 offset:276
+; GISEL-NEXT:    scratch_load_b32 v100, off, s33 offset:280
+; GISEL-NEXT:    scratch_load_b32 v101, off, s33 offset:284
+; GISEL-NEXT:    scratch_load_b32 v102, off, s33 offset:288
+; GISEL-NEXT:    scratch_load_b32 v103, off, s33 offset:292
+; GISEL-NEXT:    scratch_load_b32 v112, off, s33 offset:296
+; GISEL-NEXT:    scratch_load_b32 v113, off, s33 offset:300
+; GISEL-NEXT:    scratch_load_b32 v114, off, s33 offset:304
+; GISEL-NEXT:    scratch_load_b32 v115, off, s33 offset:308
+; GISEL-NEXT:    scratch_load_b32 v116, off, s33 offset:312
+; GISEL-NEXT:    scratch_load_b32 v117, off, s33 offset:316
+; GISEL-NEXT:    scratch_load_b32 v118, off, s33 offset:320
+; GISEL-NEXT:    scratch_load_b32 v119, off, s33 offset:324
+; GISEL-NEXT:    scratch_load_b32 v128, off, s33 offset:328
+; GISEL-NEXT:    scratch_load_b32 v129, off, s33 offset:332
+; GISEL-NEXT:    scratch_load_b32 v130, off, s33 offset:336
+; GISEL-NEXT:    scratch_load_b32 v131, off, s33 offset:340
+; GISEL-NEXT:    scratch_load_b32 v132, off, s33 offset:344
+; GISEL-NEXT:    scratch_load_b32 v133, off, s33 offset:348
+; GISEL-NEXT:    scratch_load_b32 v134, off, s33 offset:352
+; GISEL-NEXT:    scratch_load_b32 v135, off, s33 offset:356
+; GISEL-NEXT:    scratch_load_b32 v144, off, s33 offset:360
+; GISEL-NEXT:    scratch_load_b32 v145, off, s33 offset:364
+; GISEL-NEXT:    scratch_load_b32 v146, off, s33 offset:368
+; GISEL-NEXT:    scratch_load_b32 v147, off, s33 offset:372
+; GISEL-NEXT:    scratch_load_b32 v148, off, s33 offset:376
+; GISEL-NEXT:    scratch_load_b32 v149, off, s33 offset:380
+; GISEL-NEXT:    scratch_load_b32 v150, off, s33 offset:384
+; GISEL-NEXT:    scratch_load_b32 v151, off, s33 offset:388
+; GISEL-NEXT:    s_clause 0x1f
+; GISEL-NEXT:    scratch_load_b32 v160, off, s33 offset:392
+; GISEL-NEXT:    scratch_load_b32 v161, off, s33 offset:396
+; GISEL-NEXT:    scratch_load_b32 v162, off, s33 offset:400
+; GISEL-NEXT:    scratch_load_b32 v163, off, s33 offset:404
+; GISEL-NEXT:    scratch_load_b32 v164, off, s33 offset:408
+; GISEL-NEXT:    scratch_load_b32 v165, off, s33 offset:412
+; GISEL-NEXT:    scratch_load_b32 v166, off, s33 offset:416
+; GISEL-NEXT:    scratch_load_b32 v167, off, s33 offset:420
+; GISEL-NEXT:    scratch_load_b32 v176, off, s33 offset:424
+; GISEL-NEXT:    scratch_load_b32 v177, off, s33 offset:428
+; GISEL-NEXT:    scratch_load_b32 v178, off, s33 offset:432
+; GISEL-NEXT:    scratch_load_b32 v179, off, s33 offset:436
+; GISEL-NEXT:    scratch_load_b32 v180, off, s33 offset:440
+; GISEL-NEXT:    scratch_load_b32 v181, off, s33 offset:444
+; GISEL-NEXT:    scratch_load_b32 v182, off, s33 offset:448
+; GISEL-NEXT:    scratch_load_b32 v183, off, s33 offset:452
+; GISEL-NEXT:    scratch_load_b32 v192, off, s33 offset:456
+; GISEL-NEXT:    scratch_load_b32 v193, off, s33 offset:460
+; GISEL-NEXT:    scratch_load_b32 v194, off, s33 offset:464
+; GISEL-NEXT:    scratch_load_b32 v195, off, s33 offset:468
+; GISEL-NEXT:    scratch_load_b32 v196, off, s33 offset:472
+; GISEL-NEXT:    scratch_load_b32 v197, off, s33 offset:476
+; GISEL-NEXT:    scratch_load_b32 v198, off, s33 offset:480
+; GISEL-NEXT:    scratch_load_b32 v199, off, s33 offset:484
+; GISEL-NEXT:    scratch_load_b32 v208, off, s33 offset:488
+; GISEL-NEXT:    scratch_load_b32 v209, off, s33 offset:492
+; GISEL-NEXT:    scratch_load_b32 v210, off, s33 offset:496
+; GISEL-NEXT:    scratch_load_b32 v211, off, s33 offset:500
+; GISEL-NEXT:    scratch_load_b32 v212, off, s33 offset:504
+; GISEL-NEXT:    scratch_load_b32 v213, off, s33 offset:508
+; GISEL-NEXT:    scratch_load_b32 v214, off, s33 offset:512
+; GISEL-NEXT:    scratch_load_b32 v215, off, s33 offset:516
+; GISEL-NEXT:    s_clause 0xf
+; GISEL-NEXT:    scratch_load_b32 v224, off, s33 offset:520
+; GISEL-NEXT:    scratch_load_b32 v225, off, s33 offset:524
+; GISEL-NEXT:    scratch_load_b32 v226, off, s33 offset:528
+; GISEL-NEXT:    scratch_load_b32 v227, off, s33 offset:532
+; GISEL-NEXT:    scratch_load_b32 v228, off, s33 offset:536
+; GISEL-NEXT:    scratch_load_b32 v229, off, s33 offset:540
+; GISEL-NEXT:    scratch_load_b32 v230, off, s33 offset:544
+; GISEL-NEXT:    scratch_load_b32 v231, off, s33 offset:548
+; GISEL-NEXT:    scratch_load_b32 v240, off, s33 offset:552
+; GISEL-NEXT:    scratch_load_b32 v241, off, s33 offset:556
+; GISEL-NEXT:    scratch_load_b32 v242, off, s33 offset:560
+; GISEL-NEXT:    scratch_load_b32 v243, off, s33 offset:564
+; GISEL-NEXT:    scratch_load_b32 v244, off, s33 offset:568
+; GISEL-NEXT:    scratch_load_b32 v245, off, s33 offset:572
+; GISEL-NEXT:    scratch_load_b32 v246, off, s33 offset:576
+; GISEL-NEXT:    scratch_load_b32 v247, off, s33 offset:580
+; GISEL-NEXT:    s_mov_b32 exec_lo, s34
+; GISEL-NEXT:    s_mov_b32 s33, s35
+; GISEL-NEXT:    s_wait_loadcnt 0x0
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %ret = call <2 x half>(<2 x half>, <2 x half>) @gfx_callee(<2 x half> %y, <2 x half> %x) convergent
+  ret <2 x half> %ret
+}
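
For reference, the long spill/restore sequences above all belong to one small test. Reconstructed from the trailing IR lines (the define line sits outside this excerpt, so the parameter names %x and %y are inferred from the call site), it is roughly:

  define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2 x half> %x, <2 x half> %y) {
    %ret = call <2 x half>(<2 x half>, <2 x half>) @gfx_callee(<2 x half> %y, <2 x half> %x) convergent
    ret <2 x half> %ret
  }

The prologue flips exec to the inactive lanes (s_xor_saveexec), spills the non-CSR VGPRs there (the CSRs v40/v41 are spilled separately under the full mask), runs the body with exec set to -1, and the epilogue restores the same VGPRs plus the original exec (kept in s34) before s_setpc_b64.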

From d0501d339bc6493fddb99345c36538720042890a Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 24 Jun 2025 10:48:31 +0200
Subject: [PATCH 20/24] Test wave64

---
 .../CodeGen/AMDGPU/whole-wave-functions.ll    | 1336 +++++++++++++++++
 1 file changed, 1336 insertions(+)
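
Under -mattr=+wavefrontsize64 the EXEC mask is 64 bits wide, so the new DAGISEL64/GISEL64 checks below use SGPR-pair forms (s_xor_saveexec_b64 vcc, -1; s_mov_b64 exec, -1) where the wave32 checks use s_xor_saveexec_b32 and exec_lo. A hypothetical reduced input for trying this by hand (function name mine; the flags are copied from the new RUN lines):

  ; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+wavefrontsize64 < %s
  define amdgpu_gfx_whole_wave i32 @wave64_sketch(i1 %active, i32 %a) {
    %x = select i1 %active, i32 %a, i32 5
    ret i32 %x
  }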

diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
index 5263f5a46b807..4c03b4fa09e11 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
@@ -1,6 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL %s
 ; RUN: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL64 %s
+; RUN: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1200 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL64 %s
 
 ; Make sure the i1 %active is passed through EXEC.
 ; The EXEC mask should be set to -1 for the duration of the function
@@ -54,6 +56,56 @@ define amdgpu_gfx_whole_wave i32 @basic_test(i1 %active, i32 %a, i32 %b) {
 ; GISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL64-LABEL: basic_test:
+; DAGISEL64:       ; %bb.0:
+; DAGISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL64-NEXT:    s_wait_expcnt 0x0
+; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL64-NEXT:    s_xor_saveexec_b64 vcc, -1
+; DAGISEL64-NEXT:    s_clause 0x1
+; DAGISEL64-NEXT:    scratch_store_b32 off, v0, s32
+; DAGISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; DAGISEL64-NEXT:    s_mov_b64 exec, -1
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    v_cndmask_b32_e32 v0, 5, v0, vcc
+; DAGISEL64-NEXT:    v_cndmask_b32_e32 v1, 3, v1, vcc
+; DAGISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL64-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL64-NEXT:    s_xor_b64 exec, vcc, -1
+; DAGISEL64-NEXT:    s_clause 0x1
+; DAGISEL64-NEXT:    scratch_load_b32 v0, off, s32
+; DAGISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; DAGISEL64-NEXT:    s_mov_b64 exec, vcc
+; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL64-LABEL: basic_test:
+; GISEL64:       ; %bb.0:
+; GISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL64-NEXT:    s_wait_expcnt 0x0
+; GISEL64-NEXT:    s_wait_samplecnt 0x0
+; GISEL64-NEXT:    s_wait_bvhcnt 0x0
+; GISEL64-NEXT:    s_wait_kmcnt 0x0
+; GISEL64-NEXT:    s_xor_saveexec_b64 vcc, -1
+; GISEL64-NEXT:    s_clause 0x1
+; GISEL64-NEXT:    scratch_store_b32 off, v0, s32
+; GISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; GISEL64-NEXT:    s_mov_b64 exec, -1
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    v_cndmask_b32_e32 v0, 5, v0, vcc
+; GISEL64-NEXT:    v_cndmask_b32_e32 v1, 3, v1, vcc
+; GISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GISEL64-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; GISEL64-NEXT:    s_xor_b64 exec, vcc, -1
+; GISEL64-NEXT:    s_clause 0x1
+; GISEL64-NEXT:    scratch_load_b32 v0, off, s32
+; GISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; GISEL64-NEXT:    s_mov_b64 exec, vcc
+; GISEL64-NEXT:    s_wait_loadcnt 0x0
+; GISEL64-NEXT:    s_setpc_b64 s[30:31]
   %x = select i1 %active, i32 %a, i32 5
   %y = select i1 %active, i32 %b, i32 3
   %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %x, i32 %y, i32 1, i32 1, i32 1, i1 false)
@@ -109,6 +161,54 @@ define amdgpu_gfx_whole_wave i32 @single_use_of_active(i1 %active, i32 %a, i32 %
 ; GISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL64-LABEL: single_use_of_active:
+; DAGISEL64:       ; %bb.0:
+; DAGISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL64-NEXT:    s_wait_expcnt 0x0
+; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL64-NEXT:    s_xor_saveexec_b64 vcc, -1
+; DAGISEL64-NEXT:    s_clause 0x1
+; DAGISEL64-NEXT:    scratch_store_b32 off, v0, s32
+; DAGISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; DAGISEL64-NEXT:    s_mov_b64 exec, -1
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    v_cndmask_b32_e32 v1, 17, v1, vcc
+; DAGISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL64-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL64-NEXT:    s_xor_b64 exec, vcc, -1
+; DAGISEL64-NEXT:    s_clause 0x1
+; DAGISEL64-NEXT:    scratch_load_b32 v0, off, s32
+; DAGISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; DAGISEL64-NEXT:    s_mov_b64 exec, vcc
+; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL64-LABEL: single_use_of_active:
+; GISEL64:       ; %bb.0:
+; GISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL64-NEXT:    s_wait_expcnt 0x0
+; GISEL64-NEXT:    s_wait_samplecnt 0x0
+; GISEL64-NEXT:    s_wait_bvhcnt 0x0
+; GISEL64-NEXT:    s_wait_kmcnt 0x0
+; GISEL64-NEXT:    s_xor_saveexec_b64 vcc, -1
+; GISEL64-NEXT:    s_clause 0x1
+; GISEL64-NEXT:    scratch_store_b32 off, v0, s32
+; GISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; GISEL64-NEXT:    s_mov_b64 exec, -1
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    v_cndmask_b32_e32 v1, 17, v1, vcc
+; GISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GISEL64-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; GISEL64-NEXT:    s_xor_b64 exec, vcc, -1
+; GISEL64-NEXT:    s_clause 0x1
+; GISEL64-NEXT:    scratch_load_b32 v0, off, s32
+; GISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; GISEL64-NEXT:    s_mov_b64 exec, vcc
+; GISEL64-NEXT:    s_wait_loadcnt 0x0
+; GISEL64-NEXT:    s_setpc_b64 s[30:31]
   %y = select i1 %active, i32 %b, i32 17
   %ret = call i32 @llvm.amdgcn.update.dpp.i32(i32 %a, i32 %y, i32 1, i32 1, i32 1, i1 false)
   ret i32 %ret
@@ -151,6 +251,42 @@ define amdgpu_gfx_whole_wave i32 @unused_active(i1 %active, i32 %a, i32 %b) {
 ; GISEL-NEXT:    s_mov_b32 exec_lo, s0
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL64-LABEL: unused_active:
+; DAGISEL64:       ; %bb.0:
+; DAGISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL64-NEXT:    s_wait_expcnt 0x0
+; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL64-NEXT:    s_xor_saveexec_b64 s[0:1], -1
+; DAGISEL64-NEXT:    scratch_store_b32 off, v0, s32 ; 4-byte Folded Spill
+; DAGISEL64-NEXT:    s_mov_b64 exec, -1
+; DAGISEL64-NEXT:    v_mov_b32_e32 v0, 14
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    s_xor_b64 exec, s[0:1], -1
+; DAGISEL64-NEXT:    scratch_load_b32 v0, off, s32 ; 4-byte Folded Reload
+; DAGISEL64-NEXT:    s_mov_b64 exec, s[0:1]
+; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL64-LABEL: unused_active:
+; GISEL64:       ; %bb.0:
+; GISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL64-NEXT:    s_wait_expcnt 0x0
+; GISEL64-NEXT:    s_wait_samplecnt 0x0
+; GISEL64-NEXT:    s_wait_bvhcnt 0x0
+; GISEL64-NEXT:    s_wait_kmcnt 0x0
+; GISEL64-NEXT:    s_xor_saveexec_b64 s[0:1], -1
+; GISEL64-NEXT:    scratch_store_b32 off, v0, s32 ; 4-byte Folded Spill
+; GISEL64-NEXT:    s_mov_b64 exec, -1
+; GISEL64-NEXT:    v_mov_b32_e32 v0, 14
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    s_xor_b64 exec, s[0:1], -1
+; GISEL64-NEXT:    scratch_load_b32 v0, off, s32 ; 4-byte Folded Reload
+; GISEL64-NEXT:    s_mov_b64 exec, s[0:1]
+; GISEL64-NEXT:    s_wait_loadcnt 0x0
+; GISEL64-NEXT:    s_setpc_b64 s[30:31]
   ret i32 14
 }
 
@@ -234,6 +370,86 @@ define amdgpu_gfx_whole_wave i32 @csr(i1 %active, i32 %a, i32 %b) {
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_wait_alu 0xf1ff
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL64-LABEL: csr:
+; DAGISEL64:       ; %bb.0:
+; DAGISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL64-NEXT:    s_wait_expcnt 0x0
+; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL64-NEXT:    s_xor_saveexec_b64 vcc, -1
+; DAGISEL64-NEXT:    s_clause 0x3
+; DAGISEL64-NEXT:    scratch_store_b32 off, v2, s32
+; DAGISEL64-NEXT:    scratch_store_b32 off, v0, s32 offset:4
+; DAGISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:8
+; DAGISEL64-NEXT:    scratch_store_b32 off, v49, s32 offset:16
+; DAGISEL64-NEXT:    s_mov_b64 exec, -1
+; DAGISEL64-NEXT:    scratch_store_b32 off, v40, s32 offset:12 ; 4-byte Folded Spill
+; DAGISEL64-NEXT:    ;;#ASMSTART
+; DAGISEL64-NEXT:    ; clobber CSR
+; DAGISEL64-NEXT:    ;;#ASMEND
+; DAGISEL64-NEXT:    v_writelane_b32 v2, s20, 0
+; DAGISEL64-NEXT:    ;;#ASMSTART
+; DAGISEL64-NEXT:    ; clobber non-CSR
+; DAGISEL64-NEXT:    ;;#ASMEND
+; DAGISEL64-NEXT:    scratch_load_b32 v40, off, s32 offset:12 ; 4-byte Folded Reload
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    v_cndmask_b32_e32 v0, 5, v0, vcc
+; DAGISEL64-NEXT:    v_cndmask_b32_e32 v1, 3, v1, vcc
+; DAGISEL64-NEXT:    v_readlane_b32 s20, v2, 0
+; DAGISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; DAGISEL64-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL64-NEXT:    s_xor_b64 exec, vcc, -1
+; DAGISEL64-NEXT:    s_clause 0x3
+; DAGISEL64-NEXT:    scratch_load_b32 v2, off, s32
+; DAGISEL64-NEXT:    scratch_load_b32 v0, off, s32 offset:4
+; DAGISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:8
+; DAGISEL64-NEXT:    scratch_load_b32 v49, off, s32 offset:16
+; DAGISEL64-NEXT:    s_mov_b64 exec, vcc
+; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL64-NEXT:    s_wait_alu 0xf1ff
+; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL64-LABEL: csr:
+; GISEL64:       ; %bb.0:
+; GISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL64-NEXT:    s_wait_expcnt 0x0
+; GISEL64-NEXT:    s_wait_samplecnt 0x0
+; GISEL64-NEXT:    s_wait_bvhcnt 0x0
+; GISEL64-NEXT:    s_wait_kmcnt 0x0
+; GISEL64-NEXT:    s_xor_saveexec_b64 vcc, -1
+; GISEL64-NEXT:    s_clause 0x3
+; GISEL64-NEXT:    scratch_store_b32 off, v2, s32
+; GISEL64-NEXT:    scratch_store_b32 off, v0, s32 offset:4
+; GISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:8
+; GISEL64-NEXT:    scratch_store_b32 off, v49, s32 offset:16
+; GISEL64-NEXT:    s_mov_b64 exec, -1
+; GISEL64-NEXT:    scratch_store_b32 off, v40, s32 offset:12 ; 4-byte Folded Spill
+; GISEL64-NEXT:    ;;#ASMSTART
+; GISEL64-NEXT:    ; clobber CSR
+; GISEL64-NEXT:    ;;#ASMEND
+; GISEL64-NEXT:    v_writelane_b32 v2, s20, 0
+; GISEL64-NEXT:    ;;#ASMSTART
+; GISEL64-NEXT:    ; clobber non-CSR
+; GISEL64-NEXT:    ;;#ASMEND
+; GISEL64-NEXT:    scratch_load_b32 v40, off, s32 offset:12 ; 4-byte Folded Reload
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    v_cndmask_b32_e32 v0, 5, v0, vcc
+; GISEL64-NEXT:    v_cndmask_b32_e32 v1, 3, v1, vcc
+; GISEL64-NEXT:    v_readlane_b32 s20, v2, 0
+; GISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GISEL64-NEXT:    v_mov_b32_dpp v0, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; GISEL64-NEXT:    s_xor_b64 exec, vcc, -1
+; GISEL64-NEXT:    s_clause 0x3
+; GISEL64-NEXT:    scratch_load_b32 v2, off, s32
+; GISEL64-NEXT:    scratch_load_b32 v0, off, s32 offset:4
+; GISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:8
+; GISEL64-NEXT:    scratch_load_b32 v49, off, s32 offset:16
+; GISEL64-NEXT:    s_mov_b64 exec, vcc
+; GISEL64-NEXT:    s_wait_loadcnt 0x0
+; GISEL64-NEXT:    s_wait_alu 0xf1ff
+; GISEL64-NEXT:    s_setpc_b64 s[30:31]
   %x = select i1 %active, i32 %a, i32 5
   %y = select i1 %active, i32 %b, i32 3
   call void asm sideeffect "; clobber CSR", "~{v40},~{s48}"()
@@ -279,6 +495,42 @@ define amdgpu_gfx_whole_wave void @csr_vgpr_only(i1 %active, i32 %a, i32 %b) {
 ; GISEL-NEXT:    s_mov_b32 exec_lo, s0
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL64-LABEL: csr_vgpr_only:
+; DAGISEL64:       ; %bb.0:
+; DAGISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL64-NEXT:    s_wait_expcnt 0x0
+; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL64-NEXT:    s_or_saveexec_b64 s[0:1], -1
+; DAGISEL64-NEXT:    scratch_store_b32 off, v40, s32 ; 4-byte Folded Spill
+; DAGISEL64-NEXT:    ;;#ASMSTART
+; DAGISEL64-NEXT:    ; clobber CSR VGPR
+; DAGISEL64-NEXT:    ;;#ASMEND
+; DAGISEL64-NEXT:    scratch_load_b32 v40, off, s32 ; 4-byte Folded Reload
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    s_mov_b64 exec, s[0:1]
+; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL64-LABEL: csr_vgpr_only:
+; GISEL64:       ; %bb.0:
+; GISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL64-NEXT:    s_wait_expcnt 0x0
+; GISEL64-NEXT:    s_wait_samplecnt 0x0
+; GISEL64-NEXT:    s_wait_bvhcnt 0x0
+; GISEL64-NEXT:    s_wait_kmcnt 0x0
+; GISEL64-NEXT:    s_or_saveexec_b64 s[0:1], -1
+; GISEL64-NEXT:    scratch_store_b32 off, v40, s32 ; 4-byte Folded Spill
+; GISEL64-NEXT:    ;;#ASMSTART
+; GISEL64-NEXT:    ; clobber CSR VGPR
+; GISEL64-NEXT:    ;;#ASMEND
+; GISEL64-NEXT:    scratch_load_b32 v40, off, s32 ; 4-byte Folded Reload
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    s_mov_b64 exec, s[0:1]
+; GISEL64-NEXT:    s_wait_loadcnt 0x0
+; GISEL64-NEXT:    s_setpc_b64 s[30:31]
   call void asm sideeffect "; clobber CSR VGPR", "~{v40}"()
   ret void
 }
@@ -329,6 +581,52 @@ define amdgpu_gfx_whole_wave void @sgpr_spill_only(i1 %active, i32 %a, i32 %b) {
 ; GISEL-NEXT:    s_mov_b32 exec_lo, s0
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL64-LABEL: sgpr_spill_only:
+; DAGISEL64:       ; %bb.0:
+; DAGISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL64-NEXT:    s_wait_expcnt 0x0
+; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL64-NEXT:    s_xor_saveexec_b64 s[0:1], -1
+; DAGISEL64-NEXT:    scratch_store_b32 off, v0, s32 ; 4-byte Folded Spill
+; DAGISEL64-NEXT:    s_mov_b64 exec, -1
+; DAGISEL64-NEXT:    v_writelane_b32 v0, s68, 0
+; DAGISEL64-NEXT:    ;;#ASMSTART
+; DAGISEL64-NEXT:    ; clobber CSR SGPR
+; DAGISEL64-NEXT:    ;;#ASMEND
+; DAGISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL64-NEXT:    v_readlane_b32 s68, v0, 0
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    s_xor_b64 exec, s[0:1], -1
+; DAGISEL64-NEXT:    scratch_load_b32 v0, off, s32 ; 4-byte Folded Reload
+; DAGISEL64-NEXT:    s_mov_b64 exec, s[0:1]
+; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL64-LABEL: sgpr_spill_only:
+; GISEL64:       ; %bb.0:
+; GISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL64-NEXT:    s_wait_expcnt 0x0
+; GISEL64-NEXT:    s_wait_samplecnt 0x0
+; GISEL64-NEXT:    s_wait_bvhcnt 0x0
+; GISEL64-NEXT:    s_wait_kmcnt 0x0
+; GISEL64-NEXT:    s_xor_saveexec_b64 s[0:1], -1
+; GISEL64-NEXT:    scratch_store_b32 off, v0, s32 ; 4-byte Folded Spill
+; GISEL64-NEXT:    s_mov_b64 exec, -1
+; GISEL64-NEXT:    v_writelane_b32 v0, s68, 0
+; GISEL64-NEXT:    ;;#ASMSTART
+; GISEL64-NEXT:    ; clobber CSR SGPR
+; GISEL64-NEXT:    ;;#ASMEND
+; GISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GISEL64-NEXT:    v_readlane_b32 s68, v0, 0
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    s_xor_b64 exec, s[0:1], -1
+; GISEL64-NEXT:    scratch_load_b32 v0, off, s32 ; 4-byte Folded Reload
+; GISEL64-NEXT:    s_mov_b64 exec, s[0:1]
+; GISEL64-NEXT:    s_wait_loadcnt 0x0
+; GISEL64-NEXT:    s_setpc_b64 s[30:31]
   call void asm sideeffect "; clobber CSR SGPR", "~{s68}"()
   ret void
 }
@@ -393,6 +691,66 @@ define amdgpu_gfx_whole_wave i32 @multiple_blocks(i1 %active, i32 %a, i32 %b) {
 ; GISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL64-LABEL: multiple_blocks:
+; DAGISEL64:       ; %bb.0:
+; DAGISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL64-NEXT:    s_wait_expcnt 0x0
+; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL64-NEXT:    s_xor_saveexec_b64 vcc, -1
+; DAGISEL64-NEXT:    s_clause 0x1
+; DAGISEL64-NEXT:    scratch_store_b32 off, v0, s32
+; DAGISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; DAGISEL64-NEXT:    s_mov_b64 exec, -1
+; DAGISEL64-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; DAGISEL64-NEXT:    s_mov_b64 s[2:3], exec
+; DAGISEL64-NEXT:    v_cmpx_eq_u32_e64 v0, v1
+; DAGISEL64-NEXT:  ; %bb.1: ; %if.then
+; DAGISEL64-NEXT:    v_add_nc_u32_e32 v1, v0, v1
+; DAGISEL64-NEXT:  ; %bb.2: ; %if.end
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    s_or_b64 exec, exec, s[2:3]
+; DAGISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL64-NEXT:    v_cndmask_b32_e32 v0, v1, v0, vcc
+; DAGISEL64-NEXT:    s_xor_b64 exec, vcc, -1
+; DAGISEL64-NEXT:    s_clause 0x1
+; DAGISEL64-NEXT:    scratch_load_b32 v0, off, s32
+; DAGISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; DAGISEL64-NEXT:    s_mov_b64 exec, vcc
+; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL64-LABEL: multiple_blocks:
+; GISEL64:       ; %bb.0:
+; GISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL64-NEXT:    s_wait_expcnt 0x0
+; GISEL64-NEXT:    s_wait_samplecnt 0x0
+; GISEL64-NEXT:    s_wait_bvhcnt 0x0
+; GISEL64-NEXT:    s_wait_kmcnt 0x0
+; GISEL64-NEXT:    s_xor_saveexec_b64 vcc, -1
+; GISEL64-NEXT:    s_clause 0x1
+; GISEL64-NEXT:    scratch_store_b32 off, v0, s32
+; GISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; GISEL64-NEXT:    s_mov_b64 exec, -1
+; GISEL64-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GISEL64-NEXT:    s_mov_b64 s[2:3], exec
+; GISEL64-NEXT:    v_cmpx_eq_u32_e64 v0, v1
+; GISEL64-NEXT:  ; %bb.1: ; %if.then
+; GISEL64-NEXT:    v_add_nc_u32_e32 v1, v0, v1
+; GISEL64-NEXT:  ; %bb.2: ; %if.end
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GISEL64-NEXT:    v_cndmask_b32_e32 v0, v1, v0, vcc
+; GISEL64-NEXT:    s_xor_b64 exec, vcc, -1
+; GISEL64-NEXT:    s_clause 0x1
+; GISEL64-NEXT:    scratch_load_b32 v0, off, s32
+; GISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; GISEL64-NEXT:    s_mov_b64 exec, vcc
+; GISEL64-NEXT:    s_wait_loadcnt 0x0
+; GISEL64-NEXT:    s_setpc_b64 s[30:31]
   %c = icmp eq i32 %a, %b
   br i1 %c, label %if.then, label %if.end
 
@@ -466,6 +824,70 @@ define amdgpu_gfx_whole_wave i64 @ret_64(i1 %active, i64 %a, i64 %b) {
 ; GISEL-NEXT:    s_mov_b32 exec_lo, vcc_lo
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL64-LABEL: ret_64:
+; DAGISEL64:       ; %bb.0:
+; DAGISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL64-NEXT:    s_wait_expcnt 0x0
+; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL64-NEXT:    s_xor_saveexec_b64 vcc, -1
+; DAGISEL64-NEXT:    s_clause 0x3
+; DAGISEL64-NEXT:    scratch_store_b32 off, v0, s32
+; DAGISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; DAGISEL64-NEXT:    scratch_store_b32 off, v2, s32 offset:8
+; DAGISEL64-NEXT:    scratch_store_b32 off, v3, s32 offset:12
+; DAGISEL64-NEXT:    s_mov_b64 exec, -1
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    v_cndmask_b32_e32 v1, 0, v1, vcc
+; DAGISEL64-NEXT:    v_cndmask_b32_e32 v0, 5, v0, vcc
+; DAGISEL64-NEXT:    v_cndmask_b32_e32 v2, 3, v2, vcc
+; DAGISEL64-NEXT:    v_cndmask_b32_e32 v3, 0, v3, vcc
+; DAGISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; DAGISEL64-NEXT:    v_mov_b32_dpp v0, v2 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL64-NEXT:    v_mov_b32_dpp v1, v3 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; DAGISEL64-NEXT:    s_xor_b64 exec, vcc, -1
+; DAGISEL64-NEXT:    s_clause 0x3
+; DAGISEL64-NEXT:    scratch_load_b32 v0, off, s32
+; DAGISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; DAGISEL64-NEXT:    scratch_load_b32 v2, off, s32 offset:8
+; DAGISEL64-NEXT:    scratch_load_b32 v3, off, s32 offset:12
+; DAGISEL64-NEXT:    s_mov_b64 exec, vcc
+; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL64-LABEL: ret_64:
+; GISEL64:       ; %bb.0:
+; GISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL64-NEXT:    s_wait_expcnt 0x0
+; GISEL64-NEXT:    s_wait_samplecnt 0x0
+; GISEL64-NEXT:    s_wait_bvhcnt 0x0
+; GISEL64-NEXT:    s_wait_kmcnt 0x0
+; GISEL64-NEXT:    s_xor_saveexec_b64 vcc, -1
+; GISEL64-NEXT:    s_clause 0x3
+; GISEL64-NEXT:    scratch_store_b32 off, v0, s32
+; GISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; GISEL64-NEXT:    scratch_store_b32 off, v2, s32 offset:8
+; GISEL64-NEXT:    scratch_store_b32 off, v3, s32 offset:12
+; GISEL64-NEXT:    s_mov_b64 exec, -1
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    v_cndmask_b32_e32 v0, 5, v0, vcc
+; GISEL64-NEXT:    v_cndmask_b32_e32 v1, 0, v1, vcc
+; GISEL64-NEXT:    v_cndmask_b32_e32 v2, 3, v2, vcc
+; GISEL64-NEXT:    v_cndmask_b32_e32 v3, 0, v3, vcc
+; GISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GISEL64-NEXT:    v_mov_b32_dpp v0, v2 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; GISEL64-NEXT:    v_mov_b32_dpp v1, v3 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
+; GISEL64-NEXT:    s_xor_b64 exec, vcc, -1
+; GISEL64-NEXT:    s_clause 0x3
+; GISEL64-NEXT:    scratch_load_b32 v0, off, s32
+; GISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; GISEL64-NEXT:    scratch_load_b32 v2, off, s32 offset:8
+; GISEL64-NEXT:    scratch_load_b32 v3, off, s32 offset:12
+; GISEL64-NEXT:    s_mov_b64 exec, vcc
+; GISEL64-NEXT:    s_wait_loadcnt 0x0
+; GISEL64-NEXT:    s_setpc_b64 s[30:31]
   %x = select i1 %active, i64 %a, i64 5
   %y = select i1 %active, i64 %b, i64 3
   %ret = call i64 @llvm.amdgcn.update.dpp.i64(i64 %x, i64 %y, i32 1, i32 1, i32 1, i1 false)
@@ -549,6 +971,88 @@ define amdgpu_gfx_whole_wave void @inreg_args(i1 %active, i32 inreg %i32, <4 x i
 ; GISEL-NEXT:    s_mov_b32 exec_lo, s34
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL64-LABEL: inreg_args:
+; DAGISEL64:       ; %bb.0:
+; DAGISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL64-NEXT:    s_wait_expcnt 0x0
+; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL64-NEXT:    s_xor_saveexec_b64 s[0:1], -1
+; DAGISEL64-NEXT:    s_clause 0x5
+; DAGISEL64-NEXT:    scratch_store_b32 off, v0, s32
+; DAGISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; DAGISEL64-NEXT:    scratch_store_b32 off, v2, s32 offset:8
+; DAGISEL64-NEXT:    scratch_store_b32 off, v3, s32 offset:12
+; DAGISEL64-NEXT:    scratch_store_b32 off, v4, s32 offset:16
+; DAGISEL64-NEXT:    scratch_store_b32 off, v5, s32 offset:20
+; DAGISEL64-NEXT:    s_mov_b64 exec, -1
+; DAGISEL64-NEXT:    v_mov_b32_e32 v4, s4
+; DAGISEL64-NEXT:    v_mov_b32_e32 v0, s5
+; DAGISEL64-NEXT:    v_mov_b32_e32 v1, s6
+; DAGISEL64-NEXT:    v_mov_b32_e32 v2, s7
+; DAGISEL64-NEXT:    v_mov_b32_e32 v3, s8
+; DAGISEL64-NEXT:    v_mov_b32_e32 v5, s9
+; DAGISEL64-NEXT:    scratch_store_b32 off, v4, s10
+; DAGISEL64-NEXT:    s_clause 0x1
+; DAGISEL64-NEXT:    scratch_store_b128 off, v[0:3], s11
+; DAGISEL64-NEXT:    scratch_store_b32 off, v5, s11
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    s_xor_b64 exec, s[0:1], -1
+; DAGISEL64-NEXT:    s_clause 0x5
+; DAGISEL64-NEXT:    scratch_load_b32 v0, off, s32
+; DAGISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; DAGISEL64-NEXT:    scratch_load_b32 v2, off, s32 offset:8
+; DAGISEL64-NEXT:    scratch_load_b32 v3, off, s32 offset:12
+; DAGISEL64-NEXT:    scratch_load_b32 v4, off, s32 offset:16
+; DAGISEL64-NEXT:    scratch_load_b32 v5, off, s32 offset:20
+; DAGISEL64-NEXT:    s_mov_b64 exec, s[0:1]
+; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL64-LABEL: inreg_args:
+; GISEL64:       ; %bb.0:
+; GISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL64-NEXT:    s_wait_expcnt 0x0
+; GISEL64-NEXT:    s_wait_samplecnt 0x0
+; GISEL64-NEXT:    s_wait_bvhcnt 0x0
+; GISEL64-NEXT:    s_wait_kmcnt 0x0
+; GISEL64-NEXT:    s_xor_saveexec_b64 s[34:35], -1
+; GISEL64-NEXT:    s_clause 0x5
+; GISEL64-NEXT:    scratch_store_b32 off, v0, s32
+; GISEL64-NEXT:    scratch_store_b32 off, v1, s32 offset:4
+; GISEL64-NEXT:    scratch_store_b32 off, v2, s32 offset:8
+; GISEL64-NEXT:    scratch_store_b32 off, v3, s32 offset:12
+; GISEL64-NEXT:    scratch_store_b32 off, v4, s32 offset:16
+; GISEL64-NEXT:    scratch_store_b32 off, v5, s32 offset:20
+; GISEL64-NEXT:    s_mov_b64 exec, -1
+; GISEL64-NEXT:    s_mov_b32 s0, s5
+; GISEL64-NEXT:    s_mov_b32 s1, s6
+; GISEL64-NEXT:    s_mov_b32 s2, s7
+; GISEL64-NEXT:    s_mov_b32 s3, s8
+; GISEL64-NEXT:    v_mov_b32_e32 v4, s4
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    v_mov_b32_e32 v0, s0
+; GISEL64-NEXT:    v_mov_b32_e32 v1, s1
+; GISEL64-NEXT:    v_mov_b32_e32 v2, s2
+; GISEL64-NEXT:    v_mov_b32_e32 v3, s3
+; GISEL64-NEXT:    v_mov_b32_e32 v5, s9
+; GISEL64-NEXT:    scratch_store_b32 off, v4, s10
+; GISEL64-NEXT:    s_clause 0x1
+; GISEL64-NEXT:    scratch_store_b128 off, v[0:3], s11
+; GISEL64-NEXT:    scratch_store_b32 off, v5, s11
+; GISEL64-NEXT:    s_xor_b64 exec, s[34:35], -1
+; GISEL64-NEXT:    s_clause 0x5
+; GISEL64-NEXT:    scratch_load_b32 v0, off, s32
+; GISEL64-NEXT:    scratch_load_b32 v1, off, s32 offset:4
+; GISEL64-NEXT:    scratch_load_b32 v2, off, s32 offset:8
+; GISEL64-NEXT:    scratch_load_b32 v3, off, s32 offset:12
+; GISEL64-NEXT:    scratch_load_b32 v4, off, s32 offset:16
+; GISEL64-NEXT:    scratch_load_b32 v5, off, s32 offset:20
+; GISEL64-NEXT:    s_mov_b64 exec, s[34:35]
+; GISEL64-NEXT:    s_wait_loadcnt 0x0
+; GISEL64-NEXT:    s_setpc_b64 s[30:31]
   store i32 %i32, ptr addrspace(5) %ptr
   store <4 x i32> %v4i32, ptr addrspace(5) %ptr2
   store float %float, ptr addrspace(5) %ptr2
@@ -1395,6 +1899,838 @@ define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_wait_alu 0xfffe
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL64-LABEL: call_gfx_from_whole_wave:
+; DAGISEL64:       ; %bb.0:
+; DAGISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL64-NEXT:    s_wait_expcnt 0x0
+; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL64-NEXT:    s_mov_b32 s36, s33
+; DAGISEL64-NEXT:    s_mov_b32 s33, s32
+; DAGISEL64-NEXT:    s_xor_saveexec_b64 s[34:35], -1
+; DAGISEL64-NEXT:    s_clause 0x1f
+; DAGISEL64-NEXT:    scratch_store_b32 off, v0, s33 offset:4
+; DAGISEL64-NEXT:    scratch_store_b32 off, v1, s33 offset:8
+; DAGISEL64-NEXT:    scratch_store_b32 off, v2, s33 offset:12
+; DAGISEL64-NEXT:    scratch_store_b32 off, v3, s33 offset:16
+; DAGISEL64-NEXT:    scratch_store_b32 off, v4, s33 offset:20
+; DAGISEL64-NEXT:    scratch_store_b32 off, v5, s33 offset:24
+; DAGISEL64-NEXT:    scratch_store_b32 off, v6, s33 offset:28
+; DAGISEL64-NEXT:    scratch_store_b32 off, v7, s33 offset:32
+; DAGISEL64-NEXT:    scratch_store_b32 off, v8, s33 offset:36
+; DAGISEL64-NEXT:    scratch_store_b32 off, v9, s33 offset:40
+; DAGISEL64-NEXT:    scratch_store_b32 off, v10, s33 offset:44
+; DAGISEL64-NEXT:    scratch_store_b32 off, v11, s33 offset:48
+; DAGISEL64-NEXT:    scratch_store_b32 off, v12, s33 offset:52
+; DAGISEL64-NEXT:    scratch_store_b32 off, v13, s33 offset:56
+; DAGISEL64-NEXT:    scratch_store_b32 off, v14, s33 offset:60
+; DAGISEL64-NEXT:    scratch_store_b32 off, v15, s33 offset:64
+; DAGISEL64-NEXT:    scratch_store_b32 off, v16, s33 offset:68
+; DAGISEL64-NEXT:    scratch_store_b32 off, v17, s33 offset:72
+; DAGISEL64-NEXT:    scratch_store_b32 off, v18, s33 offset:76
+; DAGISEL64-NEXT:    scratch_store_b32 off, v19, s33 offset:80
+; DAGISEL64-NEXT:    scratch_store_b32 off, v20, s33 offset:84
+; DAGISEL64-NEXT:    scratch_store_b32 off, v21, s33 offset:88
+; DAGISEL64-NEXT:    scratch_store_b32 off, v22, s33 offset:92
+; DAGISEL64-NEXT:    scratch_store_b32 off, v23, s33 offset:96
+; DAGISEL64-NEXT:    scratch_store_b32 off, v24, s33 offset:100
+; DAGISEL64-NEXT:    scratch_store_b32 off, v25, s33 offset:104
+; DAGISEL64-NEXT:    scratch_store_b32 off, v26, s33 offset:108
+; DAGISEL64-NEXT:    scratch_store_b32 off, v27, s33 offset:112
+; DAGISEL64-NEXT:    scratch_store_b32 off, v28, s33 offset:116
+; DAGISEL64-NEXT:    scratch_store_b32 off, v29, s33 offset:120
+; DAGISEL64-NEXT:    scratch_store_b32 off, v30, s33 offset:124
+; DAGISEL64-NEXT:    scratch_store_b32 off, v31, s33 offset:128
+; DAGISEL64-NEXT:    s_clause 0x1f
+; DAGISEL64-NEXT:    scratch_store_b32 off, v32, s33 offset:132
+; DAGISEL64-NEXT:    scratch_store_b32 off, v33, s33 offset:136
+; DAGISEL64-NEXT:    scratch_store_b32 off, v34, s33 offset:140
+; DAGISEL64-NEXT:    scratch_store_b32 off, v35, s33 offset:144
+; DAGISEL64-NEXT:    scratch_store_b32 off, v36, s33 offset:148
+; DAGISEL64-NEXT:    scratch_store_b32 off, v37, s33 offset:152
+; DAGISEL64-NEXT:    scratch_store_b32 off, v38, s33 offset:156
+; DAGISEL64-NEXT:    scratch_store_b32 off, v39, s33 offset:160
+; DAGISEL64-NEXT:    scratch_store_b32 off, v48, s33 offset:164
+; DAGISEL64-NEXT:    scratch_store_b32 off, v49, s33 offset:168
+; DAGISEL64-NEXT:    scratch_store_b32 off, v50, s33 offset:172
+; DAGISEL64-NEXT:    scratch_store_b32 off, v51, s33 offset:176
+; DAGISEL64-NEXT:    scratch_store_b32 off, v52, s33 offset:180
+; DAGISEL64-NEXT:    scratch_store_b32 off, v53, s33 offset:184
+; DAGISEL64-NEXT:    scratch_store_b32 off, v54, s33 offset:188
+; DAGISEL64-NEXT:    scratch_store_b32 off, v55, s33 offset:192
+; DAGISEL64-NEXT:    scratch_store_b32 off, v64, s33 offset:196
+; DAGISEL64-NEXT:    scratch_store_b32 off, v65, s33 offset:200
+; DAGISEL64-NEXT:    scratch_store_b32 off, v66, s33 offset:204
+; DAGISEL64-NEXT:    scratch_store_b32 off, v67, s33 offset:208
+; DAGISEL64-NEXT:    scratch_store_b32 off, v68, s33 offset:212
+; DAGISEL64-NEXT:    scratch_store_b32 off, v69, s33 offset:216
+; DAGISEL64-NEXT:    scratch_store_b32 off, v70, s33 offset:220
+; DAGISEL64-NEXT:    scratch_store_b32 off, v71, s33 offset:224
+; DAGISEL64-NEXT:    scratch_store_b32 off, v80, s33 offset:228
+; DAGISEL64-NEXT:    scratch_store_b32 off, v81, s33 offset:232
+; DAGISEL64-NEXT:    scratch_store_b32 off, v82, s33 offset:236
+; DAGISEL64-NEXT:    scratch_store_b32 off, v83, s33 offset:240
+; DAGISEL64-NEXT:    scratch_store_b32 off, v84, s33 offset:244
+; DAGISEL64-NEXT:    scratch_store_b32 off, v85, s33 offset:248
+; DAGISEL64-NEXT:    scratch_store_b32 off, v86, s33 offset:252
+; DAGISEL64-NEXT:    scratch_store_b32 off, v87, s33 offset:256
+; DAGISEL64-NEXT:    s_clause 0x1f
+; DAGISEL64-NEXT:    scratch_store_b32 off, v96, s33 offset:260
+; DAGISEL64-NEXT:    scratch_store_b32 off, v97, s33 offset:264
+; DAGISEL64-NEXT:    scratch_store_b32 off, v98, s33 offset:268
+; DAGISEL64-NEXT:    scratch_store_b32 off, v99, s33 offset:272
+; DAGISEL64-NEXT:    scratch_store_b32 off, v100, s33 offset:276
+; DAGISEL64-NEXT:    scratch_store_b32 off, v101, s33 offset:280
+; DAGISEL64-NEXT:    scratch_store_b32 off, v102, s33 offset:284
+; DAGISEL64-NEXT:    scratch_store_b32 off, v103, s33 offset:288
+; DAGISEL64-NEXT:    scratch_store_b32 off, v112, s33 offset:292
+; DAGISEL64-NEXT:    scratch_store_b32 off, v113, s33 offset:296
+; DAGISEL64-NEXT:    scratch_store_b32 off, v114, s33 offset:300
+; DAGISEL64-NEXT:    scratch_store_b32 off, v115, s33 offset:304
+; DAGISEL64-NEXT:    scratch_store_b32 off, v116, s33 offset:308
+; DAGISEL64-NEXT:    scratch_store_b32 off, v117, s33 offset:312
+; DAGISEL64-NEXT:    scratch_store_b32 off, v118, s33 offset:316
+; DAGISEL64-NEXT:    scratch_store_b32 off, v119, s33 offset:320
+; DAGISEL64-NEXT:    scratch_store_b32 off, v128, s33 offset:324
+; DAGISEL64-NEXT:    scratch_store_b32 off, v129, s33 offset:328
+; DAGISEL64-NEXT:    scratch_store_b32 off, v130, s33 offset:332
+; DAGISEL64-NEXT:    scratch_store_b32 off, v131, s33 offset:336
+; DAGISEL64-NEXT:    scratch_store_b32 off, v132, s33 offset:340
+; DAGISEL64-NEXT:    scratch_store_b32 off, v133, s33 offset:344
+; DAGISEL64-NEXT:    scratch_store_b32 off, v134, s33 offset:348
+; DAGISEL64-NEXT:    scratch_store_b32 off, v135, s33 offset:352
+; DAGISEL64-NEXT:    scratch_store_b32 off, v144, s33 offset:356
+; DAGISEL64-NEXT:    scratch_store_b32 off, v145, s33 offset:360
+; DAGISEL64-NEXT:    scratch_store_b32 off, v146, s33 offset:364
+; DAGISEL64-NEXT:    scratch_store_b32 off, v147, s33 offset:368
+; DAGISEL64-NEXT:    scratch_store_b32 off, v148, s33 offset:372
+; DAGISEL64-NEXT:    scratch_store_b32 off, v149, s33 offset:376
+; DAGISEL64-NEXT:    scratch_store_b32 off, v150, s33 offset:380
+; DAGISEL64-NEXT:    scratch_store_b32 off, v151, s33 offset:384
+; DAGISEL64-NEXT:    s_clause 0x1f
+; DAGISEL64-NEXT:    scratch_store_b32 off, v160, s33 offset:388
+; DAGISEL64-NEXT:    scratch_store_b32 off, v161, s33 offset:392
+; DAGISEL64-NEXT:    scratch_store_b32 off, v162, s33 offset:396
+; DAGISEL64-NEXT:    scratch_store_b32 off, v163, s33 offset:400
+; DAGISEL64-NEXT:    scratch_store_b32 off, v164, s33 offset:404
+; DAGISEL64-NEXT:    scratch_store_b32 off, v165, s33 offset:408
+; DAGISEL64-NEXT:    scratch_store_b32 off, v166, s33 offset:412
+; DAGISEL64-NEXT:    scratch_store_b32 off, v167, s33 offset:416
+; DAGISEL64-NEXT:    scratch_store_b32 off, v176, s33 offset:420
+; DAGISEL64-NEXT:    scratch_store_b32 off, v177, s33 offset:424
+; DAGISEL64-NEXT:    scratch_store_b32 off, v178, s33 offset:428
+; DAGISEL64-NEXT:    scratch_store_b32 off, v179, s33 offset:432
+; DAGISEL64-NEXT:    scratch_store_b32 off, v180, s33 offset:436
+; DAGISEL64-NEXT:    scratch_store_b32 off, v181, s33 offset:440
+; DAGISEL64-NEXT:    scratch_store_b32 off, v182, s33 offset:444
+; DAGISEL64-NEXT:    scratch_store_b32 off, v183, s33 offset:448
+; DAGISEL64-NEXT:    scratch_store_b32 off, v192, s33 offset:452
+; DAGISEL64-NEXT:    scratch_store_b32 off, v193, s33 offset:456
+; DAGISEL64-NEXT:    scratch_store_b32 off, v194, s33 offset:460
+; DAGISEL64-NEXT:    scratch_store_b32 off, v195, s33 offset:464
+; DAGISEL64-NEXT:    scratch_store_b32 off, v196, s33 offset:468
+; DAGISEL64-NEXT:    scratch_store_b32 off, v197, s33 offset:472
+; DAGISEL64-NEXT:    scratch_store_b32 off, v198, s33 offset:476
+; DAGISEL64-NEXT:    scratch_store_b32 off, v199, s33 offset:480
+; DAGISEL64-NEXT:    scratch_store_b32 off, v208, s33 offset:484
+; DAGISEL64-NEXT:    scratch_store_b32 off, v209, s33 offset:488
+; DAGISEL64-NEXT:    scratch_store_b32 off, v210, s33 offset:492
+; DAGISEL64-NEXT:    scratch_store_b32 off, v211, s33 offset:496
+; DAGISEL64-NEXT:    scratch_store_b32 off, v212, s33 offset:500
+; DAGISEL64-NEXT:    scratch_store_b32 off, v213, s33 offset:504
+; DAGISEL64-NEXT:    scratch_store_b32 off, v214, s33 offset:508
+; DAGISEL64-NEXT:    scratch_store_b32 off, v215, s33 offset:512
+; DAGISEL64-NEXT:    s_clause 0xf
+; DAGISEL64-NEXT:    scratch_store_b32 off, v224, s33 offset:516
+; DAGISEL64-NEXT:    scratch_store_b32 off, v225, s33 offset:520
+; DAGISEL64-NEXT:    scratch_store_b32 off, v226, s33 offset:524
+; DAGISEL64-NEXT:    scratch_store_b32 off, v227, s33 offset:528
+; DAGISEL64-NEXT:    scratch_store_b32 off, v228, s33 offset:532
+; DAGISEL64-NEXT:    scratch_store_b32 off, v229, s33 offset:536
+; DAGISEL64-NEXT:    scratch_store_b32 off, v230, s33 offset:540
+; DAGISEL64-NEXT:    scratch_store_b32 off, v231, s33 offset:544
+; DAGISEL64-NEXT:    scratch_store_b32 off, v240, s33 offset:548
+; DAGISEL64-NEXT:    scratch_store_b32 off, v241, s33 offset:552
+; DAGISEL64-NEXT:    scratch_store_b32 off, v242, s33 offset:556
+; DAGISEL64-NEXT:    scratch_store_b32 off, v243, s33 offset:560
+; DAGISEL64-NEXT:    scratch_store_b32 off, v244, s33 offset:564
+; DAGISEL64-NEXT:    scratch_store_b32 off, v245, s33 offset:568
+; DAGISEL64-NEXT:    scratch_store_b32 off, v246, s33 offset:572
+; DAGISEL64-NEXT:    scratch_store_b32 off, v247, s33 offset:576
+; DAGISEL64-NEXT:    s_mov_b64 exec, -1
+; DAGISEL64-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s4, 0
+; DAGISEL64-NEXT:    v_mov_b32_e32 v2, v0
+; DAGISEL64-NEXT:    v_swap_b32 v0, v1
+; DAGISEL64-NEXT:    s_mov_b32 s1, gfx_callee@abs32@hi
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s5, 1
+; DAGISEL64-NEXT:    s_mov_b32 s0, gfx_callee@abs32@lo
+; DAGISEL64-NEXT:    s_addk_co_i32 s32, 0x250
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s6, 2
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s7, 3
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s8, 4
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s9, 5
+; DAGISEL64-NEXT:    s_mov_b64 s[8:9], 0
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s10, 6
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s11, 7
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s12, 8
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s13, 9
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s14, 10
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s15, 11
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s16, 12
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s17, 13
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s18, 14
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s19, 15
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s20, 16
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s21, 17
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s22, 18
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s23, 19
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s24, 20
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s25, 21
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s26, 22
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s27, 23
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s28, 24
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s29, 25
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s30, 26
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s31, 27
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s72, 28
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s73, 29
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s74, 30
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s75, 31
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s76, 32
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s77, 33
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s78, 34
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s79, 35
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s88, 36
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s89, 37
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s90, 38
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s91, 39
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s92, 40
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s93, 41
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s94, 42
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s95, 43
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; DAGISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL64-NEXT:    v_readlane_b32 s95, v40, 43
+; DAGISEL64-NEXT:    v_readlane_b32 s94, v40, 42
+; DAGISEL64-NEXT:    v_readlane_b32 s93, v40, 41
+; DAGISEL64-NEXT:    v_readlane_b32 s92, v40, 40
+; DAGISEL64-NEXT:    v_readlane_b32 s91, v40, 39
+; DAGISEL64-NEXT:    v_readlane_b32 s90, v40, 38
+; DAGISEL64-NEXT:    v_readlane_b32 s89, v40, 37
+; DAGISEL64-NEXT:    v_readlane_b32 s88, v40, 36
+; DAGISEL64-NEXT:    v_readlane_b32 s79, v40, 35
+; DAGISEL64-NEXT:    v_readlane_b32 s78, v40, 34
+; DAGISEL64-NEXT:    v_readlane_b32 s77, v40, 33
+; DAGISEL64-NEXT:    v_readlane_b32 s76, v40, 32
+; DAGISEL64-NEXT:    v_readlane_b32 s75, v40, 31
+; DAGISEL64-NEXT:    v_readlane_b32 s74, v40, 30
+; DAGISEL64-NEXT:    v_readlane_b32 s73, v40, 29
+; DAGISEL64-NEXT:    v_readlane_b32 s72, v40, 28
+; DAGISEL64-NEXT:    v_readlane_b32 s31, v40, 27
+; DAGISEL64-NEXT:    v_readlane_b32 s30, v40, 26
+; DAGISEL64-NEXT:    v_readlane_b32 s29, v40, 25
+; DAGISEL64-NEXT:    v_readlane_b32 s28, v40, 24
+; DAGISEL64-NEXT:    v_readlane_b32 s27, v40, 23
+; DAGISEL64-NEXT:    v_readlane_b32 s26, v40, 22
+; DAGISEL64-NEXT:    v_readlane_b32 s25, v40, 21
+; DAGISEL64-NEXT:    v_readlane_b32 s24, v40, 20
+; DAGISEL64-NEXT:    v_readlane_b32 s23, v40, 19
+; DAGISEL64-NEXT:    v_readlane_b32 s22, v40, 18
+; DAGISEL64-NEXT:    v_readlane_b32 s21, v40, 17
+; DAGISEL64-NEXT:    v_readlane_b32 s20, v40, 16
+; DAGISEL64-NEXT:    v_readlane_b32 s19, v40, 15
+; DAGISEL64-NEXT:    v_readlane_b32 s18, v40, 14
+; DAGISEL64-NEXT:    v_readlane_b32 s17, v40, 13
+; DAGISEL64-NEXT:    v_readlane_b32 s16, v40, 12
+; DAGISEL64-NEXT:    v_readlane_b32 s15, v40, 11
+; DAGISEL64-NEXT:    v_readlane_b32 s14, v40, 10
+; DAGISEL64-NEXT:    v_readlane_b32 s13, v40, 9
+; DAGISEL64-NEXT:    v_readlane_b32 s12, v40, 8
+; DAGISEL64-NEXT:    v_readlane_b32 s11, v40, 7
+; DAGISEL64-NEXT:    v_readlane_b32 s10, v40, 6
+; DAGISEL64-NEXT:    v_readlane_b32 s9, v40, 5
+; DAGISEL64-NEXT:    v_readlane_b32 s8, v40, 4
+; DAGISEL64-NEXT:    v_readlane_b32 s7, v40, 3
+; DAGISEL64-NEXT:    v_readlane_b32 s6, v40, 2
+; DAGISEL64-NEXT:    v_readlane_b32 s5, v40, 1
+; DAGISEL64-NEXT:    v_readlane_b32 s4, v40, 0
+; DAGISEL64-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
+; DAGISEL64-NEXT:    s_mov_b32 s32, s33
+; DAGISEL64-NEXT:    s_xor_b64 exec, s[34:35], -1
+; DAGISEL64-NEXT:    s_clause 0x1f
+; DAGISEL64-NEXT:    scratch_load_b32 v0, off, s33 offset:4
+; DAGISEL64-NEXT:    scratch_load_b32 v1, off, s33 offset:8
+; DAGISEL64-NEXT:    scratch_load_b32 v2, off, s33 offset:12
+; DAGISEL64-NEXT:    scratch_load_b32 v3, off, s33 offset:16
+; DAGISEL64-NEXT:    scratch_load_b32 v4, off, s33 offset:20
+; DAGISEL64-NEXT:    scratch_load_b32 v5, off, s33 offset:24
+; DAGISEL64-NEXT:    scratch_load_b32 v6, off, s33 offset:28
+; DAGISEL64-NEXT:    scratch_load_b32 v7, off, s33 offset:32
+; DAGISEL64-NEXT:    scratch_load_b32 v8, off, s33 offset:36
+; DAGISEL64-NEXT:    scratch_load_b32 v9, off, s33 offset:40
+; DAGISEL64-NEXT:    scratch_load_b32 v10, off, s33 offset:44
+; DAGISEL64-NEXT:    scratch_load_b32 v11, off, s33 offset:48
+; DAGISEL64-NEXT:    scratch_load_b32 v12, off, s33 offset:52
+; DAGISEL64-NEXT:    scratch_load_b32 v13, off, s33 offset:56
+; DAGISEL64-NEXT:    scratch_load_b32 v14, off, s33 offset:60
+; DAGISEL64-NEXT:    scratch_load_b32 v15, off, s33 offset:64
+; DAGISEL64-NEXT:    scratch_load_b32 v16, off, s33 offset:68
+; DAGISEL64-NEXT:    scratch_load_b32 v17, off, s33 offset:72
+; DAGISEL64-NEXT:    scratch_load_b32 v18, off, s33 offset:76
+; DAGISEL64-NEXT:    scratch_load_b32 v19, off, s33 offset:80
+; DAGISEL64-NEXT:    scratch_load_b32 v20, off, s33 offset:84
+; DAGISEL64-NEXT:    scratch_load_b32 v21, off, s33 offset:88
+; DAGISEL64-NEXT:    scratch_load_b32 v22, off, s33 offset:92
+; DAGISEL64-NEXT:    scratch_load_b32 v23, off, s33 offset:96
+; DAGISEL64-NEXT:    scratch_load_b32 v24, off, s33 offset:100
+; DAGISEL64-NEXT:    scratch_load_b32 v25, off, s33 offset:104
+; DAGISEL64-NEXT:    scratch_load_b32 v26, off, s33 offset:108
+; DAGISEL64-NEXT:    scratch_load_b32 v27, off, s33 offset:112
+; DAGISEL64-NEXT:    scratch_load_b32 v28, off, s33 offset:116
+; DAGISEL64-NEXT:    scratch_load_b32 v29, off, s33 offset:120
+; DAGISEL64-NEXT:    scratch_load_b32 v30, off, s33 offset:124
+; DAGISEL64-NEXT:    scratch_load_b32 v31, off, s33 offset:128
+; DAGISEL64-NEXT:    s_clause 0x1f
+; DAGISEL64-NEXT:    scratch_load_b32 v32, off, s33 offset:132
+; DAGISEL64-NEXT:    scratch_load_b32 v33, off, s33 offset:136
+; DAGISEL64-NEXT:    scratch_load_b32 v34, off, s33 offset:140
+; DAGISEL64-NEXT:    scratch_load_b32 v35, off, s33 offset:144
+; DAGISEL64-NEXT:    scratch_load_b32 v36, off, s33 offset:148
+; DAGISEL64-NEXT:    scratch_load_b32 v37, off, s33 offset:152
+; DAGISEL64-NEXT:    scratch_load_b32 v38, off, s33 offset:156
+; DAGISEL64-NEXT:    scratch_load_b32 v39, off, s33 offset:160
+; DAGISEL64-NEXT:    scratch_load_b32 v48, off, s33 offset:164
+; DAGISEL64-NEXT:    scratch_load_b32 v49, off, s33 offset:168
+; DAGISEL64-NEXT:    scratch_load_b32 v50, off, s33 offset:172
+; DAGISEL64-NEXT:    scratch_load_b32 v51, off, s33 offset:176
+; DAGISEL64-NEXT:    scratch_load_b32 v52, off, s33 offset:180
+; DAGISEL64-NEXT:    scratch_load_b32 v53, off, s33 offset:184
+; DAGISEL64-NEXT:    scratch_load_b32 v54, off, s33 offset:188
+; DAGISEL64-NEXT:    scratch_load_b32 v55, off, s33 offset:192
+; DAGISEL64-NEXT:    scratch_load_b32 v64, off, s33 offset:196
+; DAGISEL64-NEXT:    scratch_load_b32 v65, off, s33 offset:200
+; DAGISEL64-NEXT:    scratch_load_b32 v66, off, s33 offset:204
+; DAGISEL64-NEXT:    scratch_load_b32 v67, off, s33 offset:208
+; DAGISEL64-NEXT:    scratch_load_b32 v68, off, s33 offset:212
+; DAGISEL64-NEXT:    scratch_load_b32 v69, off, s33 offset:216
+; DAGISEL64-NEXT:    scratch_load_b32 v70, off, s33 offset:220
+; DAGISEL64-NEXT:    scratch_load_b32 v71, off, s33 offset:224
+; DAGISEL64-NEXT:    scratch_load_b32 v80, off, s33 offset:228
+; DAGISEL64-NEXT:    scratch_load_b32 v81, off, s33 offset:232
+; DAGISEL64-NEXT:    scratch_load_b32 v82, off, s33 offset:236
+; DAGISEL64-NEXT:    scratch_load_b32 v83, off, s33 offset:240
+; DAGISEL64-NEXT:    scratch_load_b32 v84, off, s33 offset:244
+; DAGISEL64-NEXT:    scratch_load_b32 v85, off, s33 offset:248
+; DAGISEL64-NEXT:    scratch_load_b32 v86, off, s33 offset:252
+; DAGISEL64-NEXT:    scratch_load_b32 v87, off, s33 offset:256
+; DAGISEL64-NEXT:    s_clause 0x1f
+; DAGISEL64-NEXT:    scratch_load_b32 v96, off, s33 offset:260
+; DAGISEL64-NEXT:    scratch_load_b32 v97, off, s33 offset:264
+; DAGISEL64-NEXT:    scratch_load_b32 v98, off, s33 offset:268
+; DAGISEL64-NEXT:    scratch_load_b32 v99, off, s33 offset:272
+; DAGISEL64-NEXT:    scratch_load_b32 v100, off, s33 offset:276
+; DAGISEL64-NEXT:    scratch_load_b32 v101, off, s33 offset:280
+; DAGISEL64-NEXT:    scratch_load_b32 v102, off, s33 offset:284
+; DAGISEL64-NEXT:    scratch_load_b32 v103, off, s33 offset:288
+; DAGISEL64-NEXT:    scratch_load_b32 v112, off, s33 offset:292
+; DAGISEL64-NEXT:    scratch_load_b32 v113, off, s33 offset:296
+; DAGISEL64-NEXT:    scratch_load_b32 v114, off, s33 offset:300
+; DAGISEL64-NEXT:    scratch_load_b32 v115, off, s33 offset:304
+; DAGISEL64-NEXT:    scratch_load_b32 v116, off, s33 offset:308
+; DAGISEL64-NEXT:    scratch_load_b32 v117, off, s33 offset:312
+; DAGISEL64-NEXT:    scratch_load_b32 v118, off, s33 offset:316
+; DAGISEL64-NEXT:    scratch_load_b32 v119, off, s33 offset:320
+; DAGISEL64-NEXT:    scratch_load_b32 v128, off, s33 offset:324
+; DAGISEL64-NEXT:    scratch_load_b32 v129, off, s33 offset:328
+; DAGISEL64-NEXT:    scratch_load_b32 v130, off, s33 offset:332
+; DAGISEL64-NEXT:    scratch_load_b32 v131, off, s33 offset:336
+; DAGISEL64-NEXT:    scratch_load_b32 v132, off, s33 offset:340
+; DAGISEL64-NEXT:    scratch_load_b32 v133, off, s33 offset:344
+; DAGISEL64-NEXT:    scratch_load_b32 v134, off, s33 offset:348
+; DAGISEL64-NEXT:    scratch_load_b32 v135, off, s33 offset:352
+; DAGISEL64-NEXT:    scratch_load_b32 v144, off, s33 offset:356
+; DAGISEL64-NEXT:    scratch_load_b32 v145, off, s33 offset:360
+; DAGISEL64-NEXT:    scratch_load_b32 v146, off, s33 offset:364
+; DAGISEL64-NEXT:    scratch_load_b32 v147, off, s33 offset:368
+; DAGISEL64-NEXT:    scratch_load_b32 v148, off, s33 offset:372
+; DAGISEL64-NEXT:    scratch_load_b32 v149, off, s33 offset:376
+; DAGISEL64-NEXT:    scratch_load_b32 v150, off, s33 offset:380
+; DAGISEL64-NEXT:    scratch_load_b32 v151, off, s33 offset:384
+; DAGISEL64-NEXT:    s_clause 0x1f
+; DAGISEL64-NEXT:    scratch_load_b32 v160, off, s33 offset:388
+; DAGISEL64-NEXT:    scratch_load_b32 v161, off, s33 offset:392
+; DAGISEL64-NEXT:    scratch_load_b32 v162, off, s33 offset:396
+; DAGISEL64-NEXT:    scratch_load_b32 v163, off, s33 offset:400
+; DAGISEL64-NEXT:    scratch_load_b32 v164, off, s33 offset:404
+; DAGISEL64-NEXT:    scratch_load_b32 v165, off, s33 offset:408
+; DAGISEL64-NEXT:    scratch_load_b32 v166, off, s33 offset:412
+; DAGISEL64-NEXT:    scratch_load_b32 v167, off, s33 offset:416
+; DAGISEL64-NEXT:    scratch_load_b32 v176, off, s33 offset:420
+; DAGISEL64-NEXT:    scratch_load_b32 v177, off, s33 offset:424
+; DAGISEL64-NEXT:    scratch_load_b32 v178, off, s33 offset:428
+; DAGISEL64-NEXT:    scratch_load_b32 v179, off, s33 offset:432
+; DAGISEL64-NEXT:    scratch_load_b32 v180, off, s33 offset:436
+; DAGISEL64-NEXT:    scratch_load_b32 v181, off, s33 offset:440
+; DAGISEL64-NEXT:    scratch_load_b32 v182, off, s33 offset:444
+; DAGISEL64-NEXT:    scratch_load_b32 v183, off, s33 offset:448
+; DAGISEL64-NEXT:    scratch_load_b32 v192, off, s33 offset:452
+; DAGISEL64-NEXT:    scratch_load_b32 v193, off, s33 offset:456
+; DAGISEL64-NEXT:    scratch_load_b32 v194, off, s33 offset:460
+; DAGISEL64-NEXT:    scratch_load_b32 v195, off, s33 offset:464
+; DAGISEL64-NEXT:    scratch_load_b32 v196, off, s33 offset:468
+; DAGISEL64-NEXT:    scratch_load_b32 v197, off, s33 offset:472
+; DAGISEL64-NEXT:    scratch_load_b32 v198, off, s33 offset:476
+; DAGISEL64-NEXT:    scratch_load_b32 v199, off, s33 offset:480
+; DAGISEL64-NEXT:    scratch_load_b32 v208, off, s33 offset:484
+; DAGISEL64-NEXT:    scratch_load_b32 v209, off, s33 offset:488
+; DAGISEL64-NEXT:    scratch_load_b32 v210, off, s33 offset:492
+; DAGISEL64-NEXT:    scratch_load_b32 v211, off, s33 offset:496
+; DAGISEL64-NEXT:    scratch_load_b32 v212, off, s33 offset:500
+; DAGISEL64-NEXT:    scratch_load_b32 v213, off, s33 offset:504
+; DAGISEL64-NEXT:    scratch_load_b32 v214, off, s33 offset:508
+; DAGISEL64-NEXT:    scratch_load_b32 v215, off, s33 offset:512
+; DAGISEL64-NEXT:    s_clause 0xf
+; DAGISEL64-NEXT:    scratch_load_b32 v224, off, s33 offset:516
+; DAGISEL64-NEXT:    scratch_load_b32 v225, off, s33 offset:520
+; DAGISEL64-NEXT:    scratch_load_b32 v226, off, s33 offset:524
+; DAGISEL64-NEXT:    scratch_load_b32 v227, off, s33 offset:528
+; DAGISEL64-NEXT:    scratch_load_b32 v228, off, s33 offset:532
+; DAGISEL64-NEXT:    scratch_load_b32 v229, off, s33 offset:536
+; DAGISEL64-NEXT:    scratch_load_b32 v230, off, s33 offset:540
+; DAGISEL64-NEXT:    scratch_load_b32 v231, off, s33 offset:544
+; DAGISEL64-NEXT:    scratch_load_b32 v240, off, s33 offset:548
+; DAGISEL64-NEXT:    scratch_load_b32 v241, off, s33 offset:552
+; DAGISEL64-NEXT:    scratch_load_b32 v242, off, s33 offset:556
+; DAGISEL64-NEXT:    scratch_load_b32 v243, off, s33 offset:560
+; DAGISEL64-NEXT:    scratch_load_b32 v244, off, s33 offset:564
+; DAGISEL64-NEXT:    scratch_load_b32 v245, off, s33 offset:568
+; DAGISEL64-NEXT:    scratch_load_b32 v246, off, s33 offset:572
+; DAGISEL64-NEXT:    scratch_load_b32 v247, off, s33 offset:576
+; DAGISEL64-NEXT:    s_mov_b64 exec, s[34:35]
+; DAGISEL64-NEXT:    s_mov_b32 s33, s36
+; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL64-LABEL: call_gfx_from_whole_wave:
+; GISEL64:       ; %bb.0:
+; GISEL64-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL64-NEXT:    s_wait_expcnt 0x0
+; GISEL64-NEXT:    s_wait_samplecnt 0x0
+; GISEL64-NEXT:    s_wait_bvhcnt 0x0
+; GISEL64-NEXT:    s_wait_kmcnt 0x0
+; GISEL64-NEXT:    s_mov_b32 s36, s33
+; GISEL64-NEXT:    s_mov_b32 s33, s32
+; GISEL64-NEXT:    s_xor_saveexec_b64 s[34:35], -1
+; GISEL64-NEXT:    s_clause 0x1f
+; GISEL64-NEXT:    scratch_store_b32 off, v0, s33 offset:4
+; GISEL64-NEXT:    scratch_store_b32 off, v1, s33 offset:8
+; GISEL64-NEXT:    scratch_store_b32 off, v2, s33 offset:12
+; GISEL64-NEXT:    scratch_store_b32 off, v3, s33 offset:16
+; GISEL64-NEXT:    scratch_store_b32 off, v4, s33 offset:20
+; GISEL64-NEXT:    scratch_store_b32 off, v5, s33 offset:24
+; GISEL64-NEXT:    scratch_store_b32 off, v6, s33 offset:28
+; GISEL64-NEXT:    scratch_store_b32 off, v7, s33 offset:32
+; GISEL64-NEXT:    scratch_store_b32 off, v8, s33 offset:36
+; GISEL64-NEXT:    scratch_store_b32 off, v9, s33 offset:40
+; GISEL64-NEXT:    scratch_store_b32 off, v10, s33 offset:44
+; GISEL64-NEXT:    scratch_store_b32 off, v11, s33 offset:48
+; GISEL64-NEXT:    scratch_store_b32 off, v12, s33 offset:52
+; GISEL64-NEXT:    scratch_store_b32 off, v13, s33 offset:56
+; GISEL64-NEXT:    scratch_store_b32 off, v14, s33 offset:60
+; GISEL64-NEXT:    scratch_store_b32 off, v15, s33 offset:64
+; GISEL64-NEXT:    scratch_store_b32 off, v16, s33 offset:68
+; GISEL64-NEXT:    scratch_store_b32 off, v17, s33 offset:72
+; GISEL64-NEXT:    scratch_store_b32 off, v18, s33 offset:76
+; GISEL64-NEXT:    scratch_store_b32 off, v19, s33 offset:80
+; GISEL64-NEXT:    scratch_store_b32 off, v20, s33 offset:84
+; GISEL64-NEXT:    scratch_store_b32 off, v21, s33 offset:88
+; GISEL64-NEXT:    scratch_store_b32 off, v22, s33 offset:92
+; GISEL64-NEXT:    scratch_store_b32 off, v23, s33 offset:96
+; GISEL64-NEXT:    scratch_store_b32 off, v24, s33 offset:100
+; GISEL64-NEXT:    scratch_store_b32 off, v25, s33 offset:104
+; GISEL64-NEXT:    scratch_store_b32 off, v26, s33 offset:108
+; GISEL64-NEXT:    scratch_store_b32 off, v27, s33 offset:112
+; GISEL64-NEXT:    scratch_store_b32 off, v28, s33 offset:116
+; GISEL64-NEXT:    scratch_store_b32 off, v29, s33 offset:120
+; GISEL64-NEXT:    scratch_store_b32 off, v30, s33 offset:124
+; GISEL64-NEXT:    scratch_store_b32 off, v31, s33 offset:128
+; GISEL64-NEXT:    s_clause 0x1f
+; GISEL64-NEXT:    scratch_store_b32 off, v32, s33 offset:132
+; GISEL64-NEXT:    scratch_store_b32 off, v33, s33 offset:136
+; GISEL64-NEXT:    scratch_store_b32 off, v34, s33 offset:140
+; GISEL64-NEXT:    scratch_store_b32 off, v35, s33 offset:144
+; GISEL64-NEXT:    scratch_store_b32 off, v36, s33 offset:148
+; GISEL64-NEXT:    scratch_store_b32 off, v37, s33 offset:152
+; GISEL64-NEXT:    scratch_store_b32 off, v38, s33 offset:156
+; GISEL64-NEXT:    scratch_store_b32 off, v39, s33 offset:160
+; GISEL64-NEXT:    scratch_store_b32 off, v48, s33 offset:164
+; GISEL64-NEXT:    scratch_store_b32 off, v49, s33 offset:168
+; GISEL64-NEXT:    scratch_store_b32 off, v50, s33 offset:172
+; GISEL64-NEXT:    scratch_store_b32 off, v51, s33 offset:176
+; GISEL64-NEXT:    scratch_store_b32 off, v52, s33 offset:180
+; GISEL64-NEXT:    scratch_store_b32 off, v53, s33 offset:184
+; GISEL64-NEXT:    scratch_store_b32 off, v54, s33 offset:188
+; GISEL64-NEXT:    scratch_store_b32 off, v55, s33 offset:192
+; GISEL64-NEXT:    scratch_store_b32 off, v64, s33 offset:196
+; GISEL64-NEXT:    scratch_store_b32 off, v65, s33 offset:200
+; GISEL64-NEXT:    scratch_store_b32 off, v66, s33 offset:204
+; GISEL64-NEXT:    scratch_store_b32 off, v67, s33 offset:208
+; GISEL64-NEXT:    scratch_store_b32 off, v68, s33 offset:212
+; GISEL64-NEXT:    scratch_store_b32 off, v69, s33 offset:216
+; GISEL64-NEXT:    scratch_store_b32 off, v70, s33 offset:220
+; GISEL64-NEXT:    scratch_store_b32 off, v71, s33 offset:224
+; GISEL64-NEXT:    scratch_store_b32 off, v80, s33 offset:228
+; GISEL64-NEXT:    scratch_store_b32 off, v81, s33 offset:232
+; GISEL64-NEXT:    scratch_store_b32 off, v82, s33 offset:236
+; GISEL64-NEXT:    scratch_store_b32 off, v83, s33 offset:240
+; GISEL64-NEXT:    scratch_store_b32 off, v84, s33 offset:244
+; GISEL64-NEXT:    scratch_store_b32 off, v85, s33 offset:248
+; GISEL64-NEXT:    scratch_store_b32 off, v86, s33 offset:252
+; GISEL64-NEXT:    scratch_store_b32 off, v87, s33 offset:256
+; GISEL64-NEXT:    s_clause 0x1f
+; GISEL64-NEXT:    scratch_store_b32 off, v96, s33 offset:260
+; GISEL64-NEXT:    scratch_store_b32 off, v97, s33 offset:264
+; GISEL64-NEXT:    scratch_store_b32 off, v98, s33 offset:268
+; GISEL64-NEXT:    scratch_store_b32 off, v99, s33 offset:272
+; GISEL64-NEXT:    scratch_store_b32 off, v100, s33 offset:276
+; GISEL64-NEXT:    scratch_store_b32 off, v101, s33 offset:280
+; GISEL64-NEXT:    scratch_store_b32 off, v102, s33 offset:284
+; GISEL64-NEXT:    scratch_store_b32 off, v103, s33 offset:288
+; GISEL64-NEXT:    scratch_store_b32 off, v112, s33 offset:292
+; GISEL64-NEXT:    scratch_store_b32 off, v113, s33 offset:296
+; GISEL64-NEXT:    scratch_store_b32 off, v114, s33 offset:300
+; GISEL64-NEXT:    scratch_store_b32 off, v115, s33 offset:304
+; GISEL64-NEXT:    scratch_store_b32 off, v116, s33 offset:308
+; GISEL64-NEXT:    scratch_store_b32 off, v117, s33 offset:312
+; GISEL64-NEXT:    scratch_store_b32 off, v118, s33 offset:316
+; GISEL64-NEXT:    scratch_store_b32 off, v119, s33 offset:320
+; GISEL64-NEXT:    scratch_store_b32 off, v128, s33 offset:324
+; GISEL64-NEXT:    scratch_store_b32 off, v129, s33 offset:328
+; GISEL64-NEXT:    scratch_store_b32 off, v130, s33 offset:332
+; GISEL64-NEXT:    scratch_store_b32 off, v131, s33 offset:336
+; GISEL64-NEXT:    scratch_store_b32 off, v132, s33 offset:340
+; GISEL64-NEXT:    scratch_store_b32 off, v133, s33 offset:344
+; GISEL64-NEXT:    scratch_store_b32 off, v134, s33 offset:348
+; GISEL64-NEXT:    scratch_store_b32 off, v135, s33 offset:352
+; GISEL64-NEXT:    scratch_store_b32 off, v144, s33 offset:356
+; GISEL64-NEXT:    scratch_store_b32 off, v145, s33 offset:360
+; GISEL64-NEXT:    scratch_store_b32 off, v146, s33 offset:364
+; GISEL64-NEXT:    scratch_store_b32 off, v147, s33 offset:368
+; GISEL64-NEXT:    scratch_store_b32 off, v148, s33 offset:372
+; GISEL64-NEXT:    scratch_store_b32 off, v149, s33 offset:376
+; GISEL64-NEXT:    scratch_store_b32 off, v150, s33 offset:380
+; GISEL64-NEXT:    scratch_store_b32 off, v151, s33 offset:384
+; GISEL64-NEXT:    s_clause 0x1f
+; GISEL64-NEXT:    scratch_store_b32 off, v160, s33 offset:388
+; GISEL64-NEXT:    scratch_store_b32 off, v161, s33 offset:392
+; GISEL64-NEXT:    scratch_store_b32 off, v162, s33 offset:396
+; GISEL64-NEXT:    scratch_store_b32 off, v163, s33 offset:400
+; GISEL64-NEXT:    scratch_store_b32 off, v164, s33 offset:404
+; GISEL64-NEXT:    scratch_store_b32 off, v165, s33 offset:408
+; GISEL64-NEXT:    scratch_store_b32 off, v166, s33 offset:412
+; GISEL64-NEXT:    scratch_store_b32 off, v167, s33 offset:416
+; GISEL64-NEXT:    scratch_store_b32 off, v176, s33 offset:420
+; GISEL64-NEXT:    scratch_store_b32 off, v177, s33 offset:424
+; GISEL64-NEXT:    scratch_store_b32 off, v178, s33 offset:428
+; GISEL64-NEXT:    scratch_store_b32 off, v179, s33 offset:432
+; GISEL64-NEXT:    scratch_store_b32 off, v180, s33 offset:436
+; GISEL64-NEXT:    scratch_store_b32 off, v181, s33 offset:440
+; GISEL64-NEXT:    scratch_store_b32 off, v182, s33 offset:444
+; GISEL64-NEXT:    scratch_store_b32 off, v183, s33 offset:448
+; GISEL64-NEXT:    scratch_store_b32 off, v192, s33 offset:452
+; GISEL64-NEXT:    scratch_store_b32 off, v193, s33 offset:456
+; GISEL64-NEXT:    scratch_store_b32 off, v194, s33 offset:460
+; GISEL64-NEXT:    scratch_store_b32 off, v195, s33 offset:464
+; GISEL64-NEXT:    scratch_store_b32 off, v196, s33 offset:468
+; GISEL64-NEXT:    scratch_store_b32 off, v197, s33 offset:472
+; GISEL64-NEXT:    scratch_store_b32 off, v198, s33 offset:476
+; GISEL64-NEXT:    scratch_store_b32 off, v199, s33 offset:480
+; GISEL64-NEXT:    scratch_store_b32 off, v208, s33 offset:484
+; GISEL64-NEXT:    scratch_store_b32 off, v209, s33 offset:488
+; GISEL64-NEXT:    scratch_store_b32 off, v210, s33 offset:492
+; GISEL64-NEXT:    scratch_store_b32 off, v211, s33 offset:496
+; GISEL64-NEXT:    scratch_store_b32 off, v212, s33 offset:500
+; GISEL64-NEXT:    scratch_store_b32 off, v213, s33 offset:504
+; GISEL64-NEXT:    scratch_store_b32 off, v214, s33 offset:508
+; GISEL64-NEXT:    scratch_store_b32 off, v215, s33 offset:512
+; GISEL64-NEXT:    s_clause 0xf
+; GISEL64-NEXT:    scratch_store_b32 off, v224, s33 offset:516
+; GISEL64-NEXT:    scratch_store_b32 off, v225, s33 offset:520
+; GISEL64-NEXT:    scratch_store_b32 off, v226, s33 offset:524
+; GISEL64-NEXT:    scratch_store_b32 off, v227, s33 offset:528
+; GISEL64-NEXT:    scratch_store_b32 off, v228, s33 offset:532
+; GISEL64-NEXT:    scratch_store_b32 off, v229, s33 offset:536
+; GISEL64-NEXT:    scratch_store_b32 off, v230, s33 offset:540
+; GISEL64-NEXT:    scratch_store_b32 off, v231, s33 offset:544
+; GISEL64-NEXT:    scratch_store_b32 off, v240, s33 offset:548
+; GISEL64-NEXT:    scratch_store_b32 off, v241, s33 offset:552
+; GISEL64-NEXT:    scratch_store_b32 off, v242, s33 offset:556
+; GISEL64-NEXT:    scratch_store_b32 off, v243, s33 offset:560
+; GISEL64-NEXT:    scratch_store_b32 off, v244, s33 offset:564
+; GISEL64-NEXT:    scratch_store_b32 off, v245, s33 offset:568
+; GISEL64-NEXT:    scratch_store_b32 off, v246, s33 offset:572
+; GISEL64-NEXT:    scratch_store_b32 off, v247, s33 offset:576
+; GISEL64-NEXT:    s_mov_b64 exec, -1
+; GISEL64-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
+; GISEL64-NEXT:    v_writelane_b32 v40, s4, 0
+; GISEL64-NEXT:    v_mov_b32_e32 v2, v0
+; GISEL64-NEXT:    v_swap_b32 v0, v1
+; GISEL64-NEXT:    s_mov_b32 s0, gfx_callee@abs32@lo
+; GISEL64-NEXT:    v_writelane_b32 v40, s5, 1
+; GISEL64-NEXT:    s_mov_b32 s1, gfx_callee@abs32@hi
+; GISEL64-NEXT:    s_addk_co_i32 s32, 0x250
+; GISEL64-NEXT:    v_writelane_b32 v40, s6, 2
+; GISEL64-NEXT:    v_writelane_b32 v40, s7, 3
+; GISEL64-NEXT:    v_writelane_b32 v40, s8, 4
+; GISEL64-NEXT:    v_writelane_b32 v40, s9, 5
+; GISEL64-NEXT:    s_mov_b64 s[8:9], 0
+; GISEL64-NEXT:    v_writelane_b32 v40, s10, 6
+; GISEL64-NEXT:    v_writelane_b32 v40, s11, 7
+; GISEL64-NEXT:    v_writelane_b32 v40, s12, 8
+; GISEL64-NEXT:    v_writelane_b32 v40, s13, 9
+; GISEL64-NEXT:    v_writelane_b32 v40, s14, 10
+; GISEL64-NEXT:    v_writelane_b32 v40, s15, 11
+; GISEL64-NEXT:    v_writelane_b32 v40, s16, 12
+; GISEL64-NEXT:    v_writelane_b32 v40, s17, 13
+; GISEL64-NEXT:    v_writelane_b32 v40, s18, 14
+; GISEL64-NEXT:    v_writelane_b32 v40, s19, 15
+; GISEL64-NEXT:    v_writelane_b32 v40, s20, 16
+; GISEL64-NEXT:    v_writelane_b32 v40, s21, 17
+; GISEL64-NEXT:    v_writelane_b32 v40, s22, 18
+; GISEL64-NEXT:    v_writelane_b32 v40, s23, 19
+; GISEL64-NEXT:    v_writelane_b32 v40, s24, 20
+; GISEL64-NEXT:    v_writelane_b32 v40, s25, 21
+; GISEL64-NEXT:    v_writelane_b32 v40, s26, 22
+; GISEL64-NEXT:    v_writelane_b32 v40, s27, 23
+; GISEL64-NEXT:    v_writelane_b32 v40, s28, 24
+; GISEL64-NEXT:    v_writelane_b32 v40, s29, 25
+; GISEL64-NEXT:    v_writelane_b32 v40, s30, 26
+; GISEL64-NEXT:    v_writelane_b32 v40, s31, 27
+; GISEL64-NEXT:    v_writelane_b32 v40, s72, 28
+; GISEL64-NEXT:    v_writelane_b32 v40, s73, 29
+; GISEL64-NEXT:    v_writelane_b32 v40, s74, 30
+; GISEL64-NEXT:    v_writelane_b32 v40, s75, 31
+; GISEL64-NEXT:    v_writelane_b32 v40, s76, 32
+; GISEL64-NEXT:    v_writelane_b32 v40, s77, 33
+; GISEL64-NEXT:    v_writelane_b32 v40, s78, 34
+; GISEL64-NEXT:    v_writelane_b32 v40, s79, 35
+; GISEL64-NEXT:    v_writelane_b32 v40, s88, 36
+; GISEL64-NEXT:    v_writelane_b32 v40, s89, 37
+; GISEL64-NEXT:    v_writelane_b32 v40, s90, 38
+; GISEL64-NEXT:    v_writelane_b32 v40, s91, 39
+; GISEL64-NEXT:    v_writelane_b32 v40, s92, 40
+; GISEL64-NEXT:    v_writelane_b32 v40, s93, 41
+; GISEL64-NEXT:    v_writelane_b32 v40, s94, 42
+; GISEL64-NEXT:    v_writelane_b32 v40, s95, 43
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; GISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GISEL64-NEXT:    v_readlane_b32 s95, v40, 43
+; GISEL64-NEXT:    v_readlane_b32 s94, v40, 42
+; GISEL64-NEXT:    v_readlane_b32 s93, v40, 41
+; GISEL64-NEXT:    v_readlane_b32 s92, v40, 40
+; GISEL64-NEXT:    v_readlane_b32 s91, v40, 39
+; GISEL64-NEXT:    v_readlane_b32 s90, v40, 38
+; GISEL64-NEXT:    v_readlane_b32 s89, v40, 37
+; GISEL64-NEXT:    v_readlane_b32 s88, v40, 36
+; GISEL64-NEXT:    v_readlane_b32 s79, v40, 35
+; GISEL64-NEXT:    v_readlane_b32 s78, v40, 34
+; GISEL64-NEXT:    v_readlane_b32 s77, v40, 33
+; GISEL64-NEXT:    v_readlane_b32 s76, v40, 32
+; GISEL64-NEXT:    v_readlane_b32 s75, v40, 31
+; GISEL64-NEXT:    v_readlane_b32 s74, v40, 30
+; GISEL64-NEXT:    v_readlane_b32 s73, v40, 29
+; GISEL64-NEXT:    v_readlane_b32 s72, v40, 28
+; GISEL64-NEXT:    v_readlane_b32 s31, v40, 27
+; GISEL64-NEXT:    v_readlane_b32 s30, v40, 26
+; GISEL64-NEXT:    v_readlane_b32 s29, v40, 25
+; GISEL64-NEXT:    v_readlane_b32 s28, v40, 24
+; GISEL64-NEXT:    v_readlane_b32 s27, v40, 23
+; GISEL64-NEXT:    v_readlane_b32 s26, v40, 22
+; GISEL64-NEXT:    v_readlane_b32 s25, v40, 21
+; GISEL64-NEXT:    v_readlane_b32 s24, v40, 20
+; GISEL64-NEXT:    v_readlane_b32 s23, v40, 19
+; GISEL64-NEXT:    v_readlane_b32 s22, v40, 18
+; GISEL64-NEXT:    v_readlane_b32 s21, v40, 17
+; GISEL64-NEXT:    v_readlane_b32 s20, v40, 16
+; GISEL64-NEXT:    v_readlane_b32 s19, v40, 15
+; GISEL64-NEXT:    v_readlane_b32 s18, v40, 14
+; GISEL64-NEXT:    v_readlane_b32 s17, v40, 13
+; GISEL64-NEXT:    v_readlane_b32 s16, v40, 12
+; GISEL64-NEXT:    v_readlane_b32 s15, v40, 11
+; GISEL64-NEXT:    v_readlane_b32 s14, v40, 10
+; GISEL64-NEXT:    v_readlane_b32 s13, v40, 9
+; GISEL64-NEXT:    v_readlane_b32 s12, v40, 8
+; GISEL64-NEXT:    v_readlane_b32 s11, v40, 7
+; GISEL64-NEXT:    v_readlane_b32 s10, v40, 6
+; GISEL64-NEXT:    v_readlane_b32 s9, v40, 5
+; GISEL64-NEXT:    v_readlane_b32 s8, v40, 4
+; GISEL64-NEXT:    v_readlane_b32 s7, v40, 3
+; GISEL64-NEXT:    v_readlane_b32 s6, v40, 2
+; GISEL64-NEXT:    v_readlane_b32 s5, v40, 1
+; GISEL64-NEXT:    v_readlane_b32 s4, v40, 0
+; GISEL64-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
+; GISEL64-NEXT:    s_mov_b32 s32, s33
+; GISEL64-NEXT:    s_xor_b64 exec, s[34:35], -1
+; GISEL64-NEXT:    s_clause 0x1f
+; GISEL64-NEXT:    scratch_load_b32 v0, off, s33 offset:4
+; GISEL64-NEXT:    scratch_load_b32 v1, off, s33 offset:8
+; GISEL64-NEXT:    scratch_load_b32 v2, off, s33 offset:12
+; GISEL64-NEXT:    scratch_load_b32 v3, off, s33 offset:16
+; GISEL64-NEXT:    scratch_load_b32 v4, off, s33 offset:20
+; GISEL64-NEXT:    scratch_load_b32 v5, off, s33 offset:24
+; GISEL64-NEXT:    scratch_load_b32 v6, off, s33 offset:28
+; GISEL64-NEXT:    scratch_load_b32 v7, off, s33 offset:32
+; GISEL64-NEXT:    scratch_load_b32 v8, off, s33 offset:36
+; GISEL64-NEXT:    scratch_load_b32 v9, off, s33 offset:40
+; GISEL64-NEXT:    scratch_load_b32 v10, off, s33 offset:44
+; GISEL64-NEXT:    scratch_load_b32 v11, off, s33 offset:48
+; GISEL64-NEXT:    scratch_load_b32 v12, off, s33 offset:52
+; GISEL64-NEXT:    scratch_load_b32 v13, off, s33 offset:56
+; GISEL64-NEXT:    scratch_load_b32 v14, off, s33 offset:60
+; GISEL64-NEXT:    scratch_load_b32 v15, off, s33 offset:64
+; GISEL64-NEXT:    scratch_load_b32 v16, off, s33 offset:68
+; GISEL64-NEXT:    scratch_load_b32 v17, off, s33 offset:72
+; GISEL64-NEXT:    scratch_load_b32 v18, off, s33 offset:76
+; GISEL64-NEXT:    scratch_load_b32 v19, off, s33 offset:80
+; GISEL64-NEXT:    scratch_load_b32 v20, off, s33 offset:84
+; GISEL64-NEXT:    scratch_load_b32 v21, off, s33 offset:88
+; GISEL64-NEXT:    scratch_load_b32 v22, off, s33 offset:92
+; GISEL64-NEXT:    scratch_load_b32 v23, off, s33 offset:96
+; GISEL64-NEXT:    scratch_load_b32 v24, off, s33 offset:100
+; GISEL64-NEXT:    scratch_load_b32 v25, off, s33 offset:104
+; GISEL64-NEXT:    scratch_load_b32 v26, off, s33 offset:108
+; GISEL64-NEXT:    scratch_load_b32 v27, off, s33 offset:112
+; GISEL64-NEXT:    scratch_load_b32 v28, off, s33 offset:116
+; GISEL64-NEXT:    scratch_load_b32 v29, off, s33 offset:120
+; GISEL64-NEXT:    scratch_load_b32 v30, off, s33 offset:124
+; GISEL64-NEXT:    scratch_load_b32 v31, off, s33 offset:128
+; GISEL64-NEXT:    s_clause 0x1f
+; GISEL64-NEXT:    scratch_load_b32 v32, off, s33 offset:132
+; GISEL64-NEXT:    scratch_load_b32 v33, off, s33 offset:136
+; GISEL64-NEXT:    scratch_load_b32 v34, off, s33 offset:140
+; GISEL64-NEXT:    scratch_load_b32 v35, off, s33 offset:144
+; GISEL64-NEXT:    scratch_load_b32 v36, off, s33 offset:148
+; GISEL64-NEXT:    scratch_load_b32 v37, off, s33 offset:152
+; GISEL64-NEXT:    scratch_load_b32 v38, off, s33 offset:156
+; GISEL64-NEXT:    scratch_load_b32 v39, off, s33 offset:160
+; GISEL64-NEXT:    scratch_load_b32 v48, off, s33 offset:164
+; GISEL64-NEXT:    scratch_load_b32 v49, off, s33 offset:168
+; GISEL64-NEXT:    scratch_load_b32 v50, off, s33 offset:172
+; GISEL64-NEXT:    scratch_load_b32 v51, off, s33 offset:176
+; GISEL64-NEXT:    scratch_load_b32 v52, off, s33 offset:180
+; GISEL64-NEXT:    scratch_load_b32 v53, off, s33 offset:184
+; GISEL64-NEXT:    scratch_load_b32 v54, off, s33 offset:188
+; GISEL64-NEXT:    scratch_load_b32 v55, off, s33 offset:192
+; GISEL64-NEXT:    scratch_load_b32 v64, off, s33 offset:196
+; GISEL64-NEXT:    scratch_load_b32 v65, off, s33 offset:200
+; GISEL64-NEXT:    scratch_load_b32 v66, off, s33 offset:204
+; GISEL64-NEXT:    scratch_load_b32 v67, off, s33 offset:208
+; GISEL64-NEXT:    scratch_load_b32 v68, off, s33 offset:212
+; GISEL64-NEXT:    scratch_load_b32 v69, off, s33 offset:216
+; GISEL64-NEXT:    scratch_load_b32 v70, off, s33 offset:220
+; GISEL64-NEXT:    scratch_load_b32 v71, off, s33 offset:224
+; GISEL64-NEXT:    scratch_load_b32 v80, off, s33 offset:228
+; GISEL64-NEXT:    scratch_load_b32 v81, off, s33 offset:232
+; GISEL64-NEXT:    scratch_load_b32 v82, off, s33 offset:236
+; GISEL64-NEXT:    scratch_load_b32 v83, off, s33 offset:240
+; GISEL64-NEXT:    scratch_load_b32 v84, off, s33 offset:244
+; GISEL64-NEXT:    scratch_load_b32 v85, off, s33 offset:248
+; GISEL64-NEXT:    scratch_load_b32 v86, off, s33 offset:252
+; GISEL64-NEXT:    scratch_load_b32 v87, off, s33 offset:256
+; GISEL64-NEXT:    s_clause 0x1f
+; GISEL64-NEXT:    scratch_load_b32 v96, off, s33 offset:260
+; GISEL64-NEXT:    scratch_load_b32 v97, off, s33 offset:264
+; GISEL64-NEXT:    scratch_load_b32 v98, off, s33 offset:268
+; GISEL64-NEXT:    scratch_load_b32 v99, off, s33 offset:272
+; GISEL64-NEXT:    scratch_load_b32 v100, off, s33 offset:276
+; GISEL64-NEXT:    scratch_load_b32 v101, off, s33 offset:280
+; GISEL64-NEXT:    scratch_load_b32 v102, off, s33 offset:284
+; GISEL64-NEXT:    scratch_load_b32 v103, off, s33 offset:288
+; GISEL64-NEXT:    scratch_load_b32 v112, off, s33 offset:292
+; GISEL64-NEXT:    scratch_load_b32 v113, off, s33 offset:296
+; GISEL64-NEXT:    scratch_load_b32 v114, off, s33 offset:300
+; GISEL64-NEXT:    scratch_load_b32 v115, off, s33 offset:304
+; GISEL64-NEXT:    scratch_load_b32 v116, off, s33 offset:308
+; GISEL64-NEXT:    scratch_load_b32 v117, off, s33 offset:312
+; GISEL64-NEXT:    scratch_load_b32 v118, off, s33 offset:316
+; GISEL64-NEXT:    scratch_load_b32 v119, off, s33 offset:320
+; GISEL64-NEXT:    scratch_load_b32 v128, off, s33 offset:324
+; GISEL64-NEXT:    scratch_load_b32 v129, off, s33 offset:328
+; GISEL64-NEXT:    scratch_load_b32 v130, off, s33 offset:332
+; GISEL64-NEXT:    scratch_load_b32 v131, off, s33 offset:336
+; GISEL64-NEXT:    scratch_load_b32 v132, off, s33 offset:340
+; GISEL64-NEXT:    scratch_load_b32 v133, off, s33 offset:344
+; GISEL64-NEXT:    scratch_load_b32 v134, off, s33 offset:348
+; GISEL64-NEXT:    scratch_load_b32 v135, off, s33 offset:352
+; GISEL64-NEXT:    scratch_load_b32 v144, off, s33 offset:356
+; GISEL64-NEXT:    scratch_load_b32 v145, off, s33 offset:360
+; GISEL64-NEXT:    scratch_load_b32 v146, off, s33 offset:364
+; GISEL64-NEXT:    scratch_load_b32 v147, off, s33 offset:368
+; GISEL64-NEXT:    scratch_load_b32 v148, off, s33 offset:372
+; GISEL64-NEXT:    scratch_load_b32 v149, off, s33 offset:376
+; GISEL64-NEXT:    scratch_load_b32 v150, off, s33 offset:380
+; GISEL64-NEXT:    scratch_load_b32 v151, off, s33 offset:384
+; GISEL64-NEXT:    s_clause 0x1f
+; GISEL64-NEXT:    scratch_load_b32 v160, off, s33 offset:388
+; GISEL64-NEXT:    scratch_load_b32 v161, off, s33 offset:392
+; GISEL64-NEXT:    scratch_load_b32 v162, off, s33 offset:396
+; GISEL64-NEXT:    scratch_load_b32 v163, off, s33 offset:400
+; GISEL64-NEXT:    scratch_load_b32 v164, off, s33 offset:404
+; GISEL64-NEXT:    scratch_load_b32 v165, off, s33 offset:408
+; GISEL64-NEXT:    scratch_load_b32 v166, off, s33 offset:412
+; GISEL64-NEXT:    scratch_load_b32 v167, off, s33 offset:416
+; GISEL64-NEXT:    scratch_load_b32 v176, off, s33 offset:420
+; GISEL64-NEXT:    scratch_load_b32 v177, off, s33 offset:424
+; GISEL64-NEXT:    scratch_load_b32 v178, off, s33 offset:428
+; GISEL64-NEXT:    scratch_load_b32 v179, off, s33 offset:432
+; GISEL64-NEXT:    scratch_load_b32 v180, off, s33 offset:436
+; GISEL64-NEXT:    scratch_load_b32 v181, off, s33 offset:440
+; GISEL64-NEXT:    scratch_load_b32 v182, off, s33 offset:444
+; GISEL64-NEXT:    scratch_load_b32 v183, off, s33 offset:448
+; GISEL64-NEXT:    scratch_load_b32 v192, off, s33 offset:452
+; GISEL64-NEXT:    scratch_load_b32 v193, off, s33 offset:456
+; GISEL64-NEXT:    scratch_load_b32 v194, off, s33 offset:460
+; GISEL64-NEXT:    scratch_load_b32 v195, off, s33 offset:464
+; GISEL64-NEXT:    scratch_load_b32 v196, off, s33 offset:468
+; GISEL64-NEXT:    scratch_load_b32 v197, off, s33 offset:472
+; GISEL64-NEXT:    scratch_load_b32 v198, off, s33 offset:476
+; GISEL64-NEXT:    scratch_load_b32 v199, off, s33 offset:480
+; GISEL64-NEXT:    scratch_load_b32 v208, off, s33 offset:484
+; GISEL64-NEXT:    scratch_load_b32 v209, off, s33 offset:488
+; GISEL64-NEXT:    scratch_load_b32 v210, off, s33 offset:492
+; GISEL64-NEXT:    scratch_load_b32 v211, off, s33 offset:496
+; GISEL64-NEXT:    scratch_load_b32 v212, off, s33 offset:500
+; GISEL64-NEXT:    scratch_load_b32 v213, off, s33 offset:504
+; GISEL64-NEXT:    scratch_load_b32 v214, off, s33 offset:508
+; GISEL64-NEXT:    scratch_load_b32 v215, off, s33 offset:512
+; GISEL64-NEXT:    s_clause 0xf
+; GISEL64-NEXT:    scratch_load_b32 v224, off, s33 offset:516
+; GISEL64-NEXT:    scratch_load_b32 v225, off, s33 offset:520
+; GISEL64-NEXT:    scratch_load_b32 v226, off, s33 offset:524
+; GISEL64-NEXT:    scratch_load_b32 v227, off, s33 offset:528
+; GISEL64-NEXT:    scratch_load_b32 v228, off, s33 offset:532
+; GISEL64-NEXT:    scratch_load_b32 v229, off, s33 offset:536
+; GISEL64-NEXT:    scratch_load_b32 v230, off, s33 offset:540
+; GISEL64-NEXT:    scratch_load_b32 v231, off, s33 offset:544
+; GISEL64-NEXT:    scratch_load_b32 v240, off, s33 offset:548
+; GISEL64-NEXT:    scratch_load_b32 v241, off, s33 offset:552
+; GISEL64-NEXT:    scratch_load_b32 v242, off, s33 offset:556
+; GISEL64-NEXT:    scratch_load_b32 v243, off, s33 offset:560
+; GISEL64-NEXT:    scratch_load_b32 v244, off, s33 offset:564
+; GISEL64-NEXT:    scratch_load_b32 v245, off, s33 offset:568
+; GISEL64-NEXT:    scratch_load_b32 v246, off, s33 offset:572
+; GISEL64-NEXT:    scratch_load_b32 v247, off, s33 offset:576
+; GISEL64-NEXT:    s_mov_b64 exec, s[34:35]
+; GISEL64-NEXT:    s_mov_b32 s33, s36
+; GISEL64-NEXT:    s_wait_loadcnt 0x0
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    s_setpc_b64 s[30:31]
   %ret = call <2 x half>(<2 x half>, <2 x half>) @gfx_callee(<2 x half> %y, <2 x half> %x) convergent
   ret <2 x half> %ret
 }
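
For readers skimming the test diff: a minimal IR sketch of the shape the call_gfx_from_whole_wave test above exercises, assuming only what the checked function itself shows (the amdgpu_gfx_whole_wave calling convention with a leading i1 %active argument). The @example name is illustrative; gfx_callee is the callee the test already uses:

    declare amdgpu_gfx <2 x half> @gfx_callee(<2 x half>, <2 x half>)

    ; Whole wave function that forwards its (swapped) arguments to an
    ; amdgpu_gfx callee, mirroring the test checked above.
    define amdgpu_gfx_whole_wave <2 x half> @example(i1 %active, <2 x half> %x, <2 x half> %y) {
      %ret = call <2 x half> @gfx_callee(<2 x half> %y, <2 x half> %x) convergent
      ret <2 x half> %ret
    }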

>From c56f918aca17bb9053e8123b9442b7bab5ad7018 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Tue, 24 Jun 2025 13:14:19 +0200
Subject: [PATCH 21/24] Fix a few missed spots

---
 llvm/lib/IR/Function.cpp                       | 1 +
 llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp | 2 +-
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp      | 6 ++++--
 llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h  | 3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/llvm/lib/IR/Function.cpp b/llvm/lib/IR/Function.cpp
index 3e7fcbb983738..b1e8dd716063f 100644
--- a/llvm/lib/IR/Function.cpp
+++ b/llvm/lib/IR/Function.cpp
@@ -1226,6 +1226,7 @@ bool llvm::CallingConv::supportsNonVoidReturnType(CallingConv::ID CC) {
   case CallingConv::AArch64_SVE_VectorCall:
   case CallingConv::WASM_EmscriptenInvoke:
   case CallingConv::AMDGPU_Gfx:
+  case CallingConv::AMDGPU_Gfx_WholeWave:
   case CallingConv::M68k_INTR:
   case CallingConv::AArch64_SME_ABI_Support_Routines_PreserveMost_From_X0:
   case CallingConv::AArch64_SME_ABI_Support_Routines_PreserveMost_From_X2:
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index bc95d3f040e1d..098c2dc2405df 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -3155,7 +3155,7 @@ bool GCNHazardRecognizer::fixRequiredExportPriority(MachineInstr *MI) {
   // Check entry priority at each export (as there will only be a few).
   // Note: amdgpu_gfx can only be a callee, so defer to caller setprio.
   bool Changed = false;
-  if (CC != CallingConv::AMDGPU_Gfx)
+  if (CC != CallingConv::AMDGPU_Gfx && CC != CallingConv::AMDGPU_Gfx_WholeWave)
     Changed = ensureEntrySetPrio(MF, NormalPriority, TII);
 
   auto NextMI = std::next(It);
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index f80352c6ff954..eec2dbcd2dd4a 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -2226,7 +2226,8 @@ SDValue SITargetLowering::getPreloadedValue(
   const ArgDescriptor WorkGroupIDZ =
       ArgDescriptor::createRegister(AMDGPU::TTMP7, 0xFFFF0000u);
   if (Subtarget->hasArchitectedSGPRs() &&
-      (AMDGPU::isCompute(CC) || CC == CallingConv::AMDGPU_Gfx)) {
+      (AMDGPU::isCompute(CC) || CC == CallingConv::AMDGPU_Gfx ||
+       CC == CallingConv::AMDGPU_Gfx_WholeWave)) {
     switch (PVID) {
     case AMDGPUFunctionArgInfo::WORKGROUP_ID_X:
       Reg = &WorkGroupIDX;
@@ -2908,7 +2909,8 @@ SDValue SITargetLowering::LowerFormalArguments(
     if (!Subtarget->enableFlatScratch())
       assert(!UserSGPRInfo.hasFlatScratchInit());
     if ((CallConv != CallingConv::AMDGPU_CS &&
-         CallConv != CallingConv::AMDGPU_Gfx) ||
+         CallConv != CallingConv::AMDGPU_Gfx &&
+         CallConv != CallingConv::AMDGPU_Gfx_WholeWave) ||
         !Subtarget->hasArchitectedSGPRs())
       assert(!Info->hasWorkGroupIDX() && !Info->hasWorkGroupIDY() &&
              !Info->hasWorkGroupIDZ());
diff --git a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
index e6078d6918ac2..e6af1ecc8db77 100644
--- a/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
+++ b/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h
@@ -1341,7 +1341,8 @@ constexpr bool isShader(CallingConv::ID CC) {
 
 LLVM_READNONE
 constexpr bool isGraphics(CallingConv::ID CC) {
-  return isShader(CC) || CC == CallingConv::AMDGPU_Gfx;
+  return isShader(CC) || CC == CallingConv::AMDGPU_Gfx ||
+         CC == CallingConv::AMDGPU_Gfx_WholeWave;
 }
 
 LLVM_READNONE
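
A note on the hunks in this patch: each one extends an existing AMDGPU_Gfx calling-convention check with the new AMDGPU_Gfx_WholeWave case. As a sketch only (this helper is hypothetical and not part of the patch), the predicate repeated at each site is:

    // Hypothetical convenience predicate, shown for clarity; the patch
    // spells the two-way check out inline at each call site.
    static bool isGfxLikeCC(CallingConv::ID CC) {
      return CC == CallingConv::AMDGPU_Gfx ||
             CC == CallingConv::AMDGPU_Gfx_WholeWave;
    }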

>From 5d51c1275fc462fcec573f6d8bf65426faeab905 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Wed, 25 Jun 2025 12:53:56 +0200
Subject: [PATCH 22/24] clang-format

---
 llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
index 474b2675c7074..b4ea3c81b3b6e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
@@ -1612,8 +1612,8 @@ bool AMDGPUCallLowering::lowerCall(MachineIRBuilder &MIRBuilder,
   return true;
 }
 
-void AMDGPUCallLowering::addOriginalExecToReturn(MachineFunction &MF,
-                                                 MachineInstrBuilder &Ret) const {
+void AMDGPUCallLowering::addOriginalExecToReturn(
+    MachineFunction &MF, MachineInstrBuilder &Ret) const {
   const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
   const SIInstrInfo *TII = ST.getInstrInfo();
   const MachineInstr *Setup = TII->getWholeWaveFunctionSetup(MF);

>From 5470f1d46f80697785b533e1456837ea8baf6482 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Fri, 27 Jun 2025 10:16:34 +0200
Subject: [PATCH 23/24] Fix CC in test

---
 .../CodeGen/AMDGPU/whole-wave-functions.ll    | 1596 +++++++----------
 1 file changed, 637 insertions(+), 959 deletions(-)

diff --git a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
index 4c03b4fa09e11..53d02925fb1c2 100644
--- a/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
+++ b/llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll
@@ -1069,414 +1069,331 @@ define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2
 ; DAGISEL-NEXT:    s_wait_samplecnt 0x0
 ; DAGISEL-NEXT:    s_wait_bvhcnt 0x0
 ; DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; DAGISEL-NEXT:    s_mov_b32 s35, s33
+; DAGISEL-NEXT:    s_mov_b32 s0, s33
 ; DAGISEL-NEXT:    s_mov_b32 s33, s32
-; DAGISEL-NEXT:    s_xor_saveexec_b32 s34, -1
+; DAGISEL-NEXT:    s_xor_saveexec_b32 s4, -1
 ; DAGISEL-NEXT:    s_clause 0x1f
-; DAGISEL-NEXT:    scratch_store_b32 off, v0, s33 offset:8
-; DAGISEL-NEXT:    scratch_store_b32 off, v1, s33 offset:12
-; DAGISEL-NEXT:    scratch_store_b32 off, v2, s33 offset:16
-; DAGISEL-NEXT:    scratch_store_b32 off, v3, s33 offset:20
-; DAGISEL-NEXT:    scratch_store_b32 off, v4, s33 offset:24
-; DAGISEL-NEXT:    scratch_store_b32 off, v5, s33 offset:28
-; DAGISEL-NEXT:    scratch_store_b32 off, v6, s33 offset:32
-; DAGISEL-NEXT:    scratch_store_b32 off, v7, s33 offset:36
-; DAGISEL-NEXT:    scratch_store_b32 off, v8, s33 offset:40
-; DAGISEL-NEXT:    scratch_store_b32 off, v9, s33 offset:44
-; DAGISEL-NEXT:    scratch_store_b32 off, v10, s33 offset:48
-; DAGISEL-NEXT:    scratch_store_b32 off, v11, s33 offset:52
-; DAGISEL-NEXT:    scratch_store_b32 off, v12, s33 offset:56
-; DAGISEL-NEXT:    scratch_store_b32 off, v13, s33 offset:60
-; DAGISEL-NEXT:    scratch_store_b32 off, v14, s33 offset:64
-; DAGISEL-NEXT:    scratch_store_b32 off, v15, s33 offset:68
-; DAGISEL-NEXT:    scratch_store_b32 off, v16, s33 offset:72
-; DAGISEL-NEXT:    scratch_store_b32 off, v17, s33 offset:76
-; DAGISEL-NEXT:    scratch_store_b32 off, v18, s33 offset:80
-; DAGISEL-NEXT:    scratch_store_b32 off, v19, s33 offset:84
-; DAGISEL-NEXT:    scratch_store_b32 off, v20, s33 offset:88
-; DAGISEL-NEXT:    scratch_store_b32 off, v21, s33 offset:92
-; DAGISEL-NEXT:    scratch_store_b32 off, v22, s33 offset:96
-; DAGISEL-NEXT:    scratch_store_b32 off, v23, s33 offset:100
-; DAGISEL-NEXT:    scratch_store_b32 off, v24, s33 offset:104
-; DAGISEL-NEXT:    scratch_store_b32 off, v25, s33 offset:108
-; DAGISEL-NEXT:    scratch_store_b32 off, v26, s33 offset:112
-; DAGISEL-NEXT:    scratch_store_b32 off, v27, s33 offset:116
-; DAGISEL-NEXT:    scratch_store_b32 off, v28, s33 offset:120
-; DAGISEL-NEXT:    scratch_store_b32 off, v29, s33 offset:124
-; DAGISEL-NEXT:    scratch_store_b32 off, v30, s33 offset:128
-; DAGISEL-NEXT:    scratch_store_b32 off, v31, s33 offset:132
+; DAGISEL-NEXT:    scratch_store_b32 off, v0, s33 offset:4
+; DAGISEL-NEXT:    scratch_store_b32 off, v1, s33 offset:8
+; DAGISEL-NEXT:    scratch_store_b32 off, v2, s33 offset:12
+; DAGISEL-NEXT:    scratch_store_b32 off, v3, s33 offset:16
+; DAGISEL-NEXT:    scratch_store_b32 off, v4, s33 offset:20
+; DAGISEL-NEXT:    scratch_store_b32 off, v5, s33 offset:24
+; DAGISEL-NEXT:    scratch_store_b32 off, v6, s33 offset:28
+; DAGISEL-NEXT:    scratch_store_b32 off, v7, s33 offset:32
+; DAGISEL-NEXT:    scratch_store_b32 off, v8, s33 offset:36
+; DAGISEL-NEXT:    scratch_store_b32 off, v9, s33 offset:40
+; DAGISEL-NEXT:    scratch_store_b32 off, v10, s33 offset:44
+; DAGISEL-NEXT:    scratch_store_b32 off, v11, s33 offset:48
+; DAGISEL-NEXT:    scratch_store_b32 off, v12, s33 offset:52
+; DAGISEL-NEXT:    scratch_store_b32 off, v13, s33 offset:56
+; DAGISEL-NEXT:    scratch_store_b32 off, v14, s33 offset:60
+; DAGISEL-NEXT:    scratch_store_b32 off, v15, s33 offset:64
+; DAGISEL-NEXT:    scratch_store_b32 off, v16, s33 offset:68
+; DAGISEL-NEXT:    scratch_store_b32 off, v17, s33 offset:72
+; DAGISEL-NEXT:    scratch_store_b32 off, v18, s33 offset:76
+; DAGISEL-NEXT:    scratch_store_b32 off, v19, s33 offset:80
+; DAGISEL-NEXT:    scratch_store_b32 off, v20, s33 offset:84
+; DAGISEL-NEXT:    scratch_store_b32 off, v21, s33 offset:88
+; DAGISEL-NEXT:    scratch_store_b32 off, v22, s33 offset:92
+; DAGISEL-NEXT:    scratch_store_b32 off, v23, s33 offset:96
+; DAGISEL-NEXT:    scratch_store_b32 off, v24, s33 offset:100
+; DAGISEL-NEXT:    scratch_store_b32 off, v25, s33 offset:104
+; DAGISEL-NEXT:    scratch_store_b32 off, v26, s33 offset:108
+; DAGISEL-NEXT:    scratch_store_b32 off, v27, s33 offset:112
+; DAGISEL-NEXT:    scratch_store_b32 off, v28, s33 offset:116
+; DAGISEL-NEXT:    scratch_store_b32 off, v29, s33 offset:120
+; DAGISEL-NEXT:    scratch_store_b32 off, v30, s33 offset:124
+; DAGISEL-NEXT:    scratch_store_b32 off, v31, s33 offset:128
 ; DAGISEL-NEXT:    s_clause 0x1f
-; DAGISEL-NEXT:    scratch_store_b32 off, v32, s33 offset:136
-; DAGISEL-NEXT:    scratch_store_b32 off, v33, s33 offset:140
-; DAGISEL-NEXT:    scratch_store_b32 off, v34, s33 offset:144
-; DAGISEL-NEXT:    scratch_store_b32 off, v35, s33 offset:148
-; DAGISEL-NEXT:    scratch_store_b32 off, v36, s33 offset:152
-; DAGISEL-NEXT:    scratch_store_b32 off, v37, s33 offset:156
-; DAGISEL-NEXT:    scratch_store_b32 off, v38, s33 offset:160
-; DAGISEL-NEXT:    scratch_store_b32 off, v39, s33 offset:164
-; DAGISEL-NEXT:    scratch_store_b32 off, v48, s33 offset:168
-; DAGISEL-NEXT:    scratch_store_b32 off, v49, s33 offset:172
-; DAGISEL-NEXT:    scratch_store_b32 off, v50, s33 offset:176
-; DAGISEL-NEXT:    scratch_store_b32 off, v51, s33 offset:180
-; DAGISEL-NEXT:    scratch_store_b32 off, v52, s33 offset:184
-; DAGISEL-NEXT:    scratch_store_b32 off, v53, s33 offset:188
-; DAGISEL-NEXT:    scratch_store_b32 off, v54, s33 offset:192
-; DAGISEL-NEXT:    scratch_store_b32 off, v55, s33 offset:196
-; DAGISEL-NEXT:    scratch_store_b32 off, v64, s33 offset:200
-; DAGISEL-NEXT:    scratch_store_b32 off, v65, s33 offset:204
-; DAGISEL-NEXT:    scratch_store_b32 off, v66, s33 offset:208
-; DAGISEL-NEXT:    scratch_store_b32 off, v67, s33 offset:212
-; DAGISEL-NEXT:    scratch_store_b32 off, v68, s33 offset:216
-; DAGISEL-NEXT:    scratch_store_b32 off, v69, s33 offset:220
-; DAGISEL-NEXT:    scratch_store_b32 off, v70, s33 offset:224
-; DAGISEL-NEXT:    scratch_store_b32 off, v71, s33 offset:228
-; DAGISEL-NEXT:    scratch_store_b32 off, v80, s33 offset:232
-; DAGISEL-NEXT:    scratch_store_b32 off, v81, s33 offset:236
-; DAGISEL-NEXT:    scratch_store_b32 off, v82, s33 offset:240
-; DAGISEL-NEXT:    scratch_store_b32 off, v83, s33 offset:244
-; DAGISEL-NEXT:    scratch_store_b32 off, v84, s33 offset:248
-; DAGISEL-NEXT:    scratch_store_b32 off, v85, s33 offset:252
-; DAGISEL-NEXT:    scratch_store_b32 off, v86, s33 offset:256
-; DAGISEL-NEXT:    scratch_store_b32 off, v87, s33 offset:260
+; DAGISEL-NEXT:    scratch_store_b32 off, v32, s33 offset:132
+; DAGISEL-NEXT:    scratch_store_b32 off, v33, s33 offset:136
+; DAGISEL-NEXT:    scratch_store_b32 off, v34, s33 offset:140
+; DAGISEL-NEXT:    scratch_store_b32 off, v35, s33 offset:144
+; DAGISEL-NEXT:    scratch_store_b32 off, v36, s33 offset:148
+; DAGISEL-NEXT:    scratch_store_b32 off, v37, s33 offset:152
+; DAGISEL-NEXT:    scratch_store_b32 off, v38, s33 offset:156
+; DAGISEL-NEXT:    scratch_store_b32 off, v39, s33 offset:160
+; DAGISEL-NEXT:    scratch_store_b32 off, v48, s33 offset:164
+; DAGISEL-NEXT:    scratch_store_b32 off, v49, s33 offset:168
+; DAGISEL-NEXT:    scratch_store_b32 off, v50, s33 offset:172
+; DAGISEL-NEXT:    scratch_store_b32 off, v51, s33 offset:176
+; DAGISEL-NEXT:    scratch_store_b32 off, v52, s33 offset:180
+; DAGISEL-NEXT:    scratch_store_b32 off, v53, s33 offset:184
+; DAGISEL-NEXT:    scratch_store_b32 off, v54, s33 offset:188
+; DAGISEL-NEXT:    scratch_store_b32 off, v55, s33 offset:192
+; DAGISEL-NEXT:    scratch_store_b32 off, v64, s33 offset:196
+; DAGISEL-NEXT:    scratch_store_b32 off, v65, s33 offset:200
+; DAGISEL-NEXT:    scratch_store_b32 off, v66, s33 offset:204
+; DAGISEL-NEXT:    scratch_store_b32 off, v67, s33 offset:208
+; DAGISEL-NEXT:    scratch_store_b32 off, v68, s33 offset:212
+; DAGISEL-NEXT:    scratch_store_b32 off, v69, s33 offset:216
+; DAGISEL-NEXT:    scratch_store_b32 off, v70, s33 offset:220
+; DAGISEL-NEXT:    scratch_store_b32 off, v71, s33 offset:224
+; DAGISEL-NEXT:    scratch_store_b32 off, v80, s33 offset:228
+; DAGISEL-NEXT:    scratch_store_b32 off, v81, s33 offset:232
+; DAGISEL-NEXT:    scratch_store_b32 off, v82, s33 offset:236
+; DAGISEL-NEXT:    scratch_store_b32 off, v83, s33 offset:240
+; DAGISEL-NEXT:    scratch_store_b32 off, v84, s33 offset:244
+; DAGISEL-NEXT:    scratch_store_b32 off, v85, s33 offset:248
+; DAGISEL-NEXT:    scratch_store_b32 off, v86, s33 offset:252
+; DAGISEL-NEXT:    scratch_store_b32 off, v87, s33 offset:256
 ; DAGISEL-NEXT:    s_clause 0x1f
-; DAGISEL-NEXT:    scratch_store_b32 off, v96, s33 offset:264
-; DAGISEL-NEXT:    scratch_store_b32 off, v97, s33 offset:268
-; DAGISEL-NEXT:    scratch_store_b32 off, v98, s33 offset:272
-; DAGISEL-NEXT:    scratch_store_b32 off, v99, s33 offset:276
-; DAGISEL-NEXT:    scratch_store_b32 off, v100, s33 offset:280
-; DAGISEL-NEXT:    scratch_store_b32 off, v101, s33 offset:284
-; DAGISEL-NEXT:    scratch_store_b32 off, v102, s33 offset:288
-; DAGISEL-NEXT:    scratch_store_b32 off, v103, s33 offset:292
-; DAGISEL-NEXT:    scratch_store_b32 off, v112, s33 offset:296
-; DAGISEL-NEXT:    scratch_store_b32 off, v113, s33 offset:300
-; DAGISEL-NEXT:    scratch_store_b32 off, v114, s33 offset:304
-; DAGISEL-NEXT:    scratch_store_b32 off, v115, s33 offset:308
-; DAGISEL-NEXT:    scratch_store_b32 off, v116, s33 offset:312
-; DAGISEL-NEXT:    scratch_store_b32 off, v117, s33 offset:316
-; DAGISEL-NEXT:    scratch_store_b32 off, v118, s33 offset:320
-; DAGISEL-NEXT:    scratch_store_b32 off, v119, s33 offset:324
-; DAGISEL-NEXT:    scratch_store_b32 off, v128, s33 offset:328
-; DAGISEL-NEXT:    scratch_store_b32 off, v129, s33 offset:332
-; DAGISEL-NEXT:    scratch_store_b32 off, v130, s33 offset:336
-; DAGISEL-NEXT:    scratch_store_b32 off, v131, s33 offset:340
-; DAGISEL-NEXT:    scratch_store_b32 off, v132, s33 offset:344
-; DAGISEL-NEXT:    scratch_store_b32 off, v133, s33 offset:348
-; DAGISEL-NEXT:    scratch_store_b32 off, v134, s33 offset:352
-; DAGISEL-NEXT:    scratch_store_b32 off, v135, s33 offset:356
-; DAGISEL-NEXT:    scratch_store_b32 off, v144, s33 offset:360
-; DAGISEL-NEXT:    scratch_store_b32 off, v145, s33 offset:364
-; DAGISEL-NEXT:    scratch_store_b32 off, v146, s33 offset:368
-; DAGISEL-NEXT:    scratch_store_b32 off, v147, s33 offset:372
-; DAGISEL-NEXT:    scratch_store_b32 off, v148, s33 offset:376
-; DAGISEL-NEXT:    scratch_store_b32 off, v149, s33 offset:380
-; DAGISEL-NEXT:    scratch_store_b32 off, v150, s33 offset:384
-; DAGISEL-NEXT:    scratch_store_b32 off, v151, s33 offset:388
+; DAGISEL-NEXT:    scratch_store_b32 off, v96, s33 offset:260
+; DAGISEL-NEXT:    scratch_store_b32 off, v97, s33 offset:264
+; DAGISEL-NEXT:    scratch_store_b32 off, v98, s33 offset:268
+; DAGISEL-NEXT:    scratch_store_b32 off, v99, s33 offset:272
+; DAGISEL-NEXT:    scratch_store_b32 off, v100, s33 offset:276
+; DAGISEL-NEXT:    scratch_store_b32 off, v101, s33 offset:280
+; DAGISEL-NEXT:    scratch_store_b32 off, v102, s33 offset:284
+; DAGISEL-NEXT:    scratch_store_b32 off, v103, s33 offset:288
+; DAGISEL-NEXT:    scratch_store_b32 off, v112, s33 offset:292
+; DAGISEL-NEXT:    scratch_store_b32 off, v113, s33 offset:296
+; DAGISEL-NEXT:    scratch_store_b32 off, v114, s33 offset:300
+; DAGISEL-NEXT:    scratch_store_b32 off, v115, s33 offset:304
+; DAGISEL-NEXT:    scratch_store_b32 off, v116, s33 offset:308
+; DAGISEL-NEXT:    scratch_store_b32 off, v117, s33 offset:312
+; DAGISEL-NEXT:    scratch_store_b32 off, v118, s33 offset:316
+; DAGISEL-NEXT:    scratch_store_b32 off, v119, s33 offset:320
+; DAGISEL-NEXT:    scratch_store_b32 off, v128, s33 offset:324
+; DAGISEL-NEXT:    scratch_store_b32 off, v129, s33 offset:328
+; DAGISEL-NEXT:    scratch_store_b32 off, v130, s33 offset:332
+; DAGISEL-NEXT:    scratch_store_b32 off, v131, s33 offset:336
+; DAGISEL-NEXT:    scratch_store_b32 off, v132, s33 offset:340
+; DAGISEL-NEXT:    scratch_store_b32 off, v133, s33 offset:344
+; DAGISEL-NEXT:    scratch_store_b32 off, v134, s33 offset:348
+; DAGISEL-NEXT:    scratch_store_b32 off, v135, s33 offset:352
+; DAGISEL-NEXT:    scratch_store_b32 off, v144, s33 offset:356
+; DAGISEL-NEXT:    scratch_store_b32 off, v145, s33 offset:360
+; DAGISEL-NEXT:    scratch_store_b32 off, v146, s33 offset:364
+; DAGISEL-NEXT:    scratch_store_b32 off, v147, s33 offset:368
+; DAGISEL-NEXT:    scratch_store_b32 off, v148, s33 offset:372
+; DAGISEL-NEXT:    scratch_store_b32 off, v149, s33 offset:376
+; DAGISEL-NEXT:    scratch_store_b32 off, v150, s33 offset:380
+; DAGISEL-NEXT:    scratch_store_b32 off, v151, s33 offset:384
 ; DAGISEL-NEXT:    s_clause 0x1f
-; DAGISEL-NEXT:    scratch_store_b32 off, v160, s33 offset:392
-; DAGISEL-NEXT:    scratch_store_b32 off, v161, s33 offset:396
-; DAGISEL-NEXT:    scratch_store_b32 off, v162, s33 offset:400
-; DAGISEL-NEXT:    scratch_store_b32 off, v163, s33 offset:404
-; DAGISEL-NEXT:    scratch_store_b32 off, v164, s33 offset:408
-; DAGISEL-NEXT:    scratch_store_b32 off, v165, s33 offset:412
-; DAGISEL-NEXT:    scratch_store_b32 off, v166, s33 offset:416
-; DAGISEL-NEXT:    scratch_store_b32 off, v167, s33 offset:420
-; DAGISEL-NEXT:    scratch_store_b32 off, v176, s33 offset:424
-; DAGISEL-NEXT:    scratch_store_b32 off, v177, s33 offset:428
-; DAGISEL-NEXT:    scratch_store_b32 off, v178, s33 offset:432
-; DAGISEL-NEXT:    scratch_store_b32 off, v179, s33 offset:436
-; DAGISEL-NEXT:    scratch_store_b32 off, v180, s33 offset:440
-; DAGISEL-NEXT:    scratch_store_b32 off, v181, s33 offset:444
-; DAGISEL-NEXT:    scratch_store_b32 off, v182, s33 offset:448
-; DAGISEL-NEXT:    scratch_store_b32 off, v183, s33 offset:452
-; DAGISEL-NEXT:    scratch_store_b32 off, v192, s33 offset:456
-; DAGISEL-NEXT:    scratch_store_b32 off, v193, s33 offset:460
-; DAGISEL-NEXT:    scratch_store_b32 off, v194, s33 offset:464
-; DAGISEL-NEXT:    scratch_store_b32 off, v195, s33 offset:468
-; DAGISEL-NEXT:    scratch_store_b32 off, v196, s33 offset:472
-; DAGISEL-NEXT:    scratch_store_b32 off, v197, s33 offset:476
-; DAGISEL-NEXT:    scratch_store_b32 off, v198, s33 offset:480
-; DAGISEL-NEXT:    scratch_store_b32 off, v199, s33 offset:484
-; DAGISEL-NEXT:    scratch_store_b32 off, v208, s33 offset:488
-; DAGISEL-NEXT:    scratch_store_b32 off, v209, s33 offset:492
-; DAGISEL-NEXT:    scratch_store_b32 off, v210, s33 offset:496
-; DAGISEL-NEXT:    scratch_store_b32 off, v211, s33 offset:500
-; DAGISEL-NEXT:    scratch_store_b32 off, v212, s33 offset:504
-; DAGISEL-NEXT:    scratch_store_b32 off, v213, s33 offset:508
-; DAGISEL-NEXT:    scratch_store_b32 off, v214, s33 offset:512
-; DAGISEL-NEXT:    scratch_store_b32 off, v215, s33 offset:516
+; DAGISEL-NEXT:    scratch_store_b32 off, v160, s33 offset:388
+; DAGISEL-NEXT:    scratch_store_b32 off, v161, s33 offset:392
+; DAGISEL-NEXT:    scratch_store_b32 off, v162, s33 offset:396
+; DAGISEL-NEXT:    scratch_store_b32 off, v163, s33 offset:400
+; DAGISEL-NEXT:    scratch_store_b32 off, v164, s33 offset:404
+; DAGISEL-NEXT:    scratch_store_b32 off, v165, s33 offset:408
+; DAGISEL-NEXT:    scratch_store_b32 off, v166, s33 offset:412
+; DAGISEL-NEXT:    scratch_store_b32 off, v167, s33 offset:416
+; DAGISEL-NEXT:    scratch_store_b32 off, v176, s33 offset:420
+; DAGISEL-NEXT:    scratch_store_b32 off, v177, s33 offset:424
+; DAGISEL-NEXT:    scratch_store_b32 off, v178, s33 offset:428
+; DAGISEL-NEXT:    scratch_store_b32 off, v179, s33 offset:432
+; DAGISEL-NEXT:    scratch_store_b32 off, v180, s33 offset:436
+; DAGISEL-NEXT:    scratch_store_b32 off, v181, s33 offset:440
+; DAGISEL-NEXT:    scratch_store_b32 off, v182, s33 offset:444
+; DAGISEL-NEXT:    scratch_store_b32 off, v183, s33 offset:448
+; DAGISEL-NEXT:    scratch_store_b32 off, v192, s33 offset:452
+; DAGISEL-NEXT:    scratch_store_b32 off, v193, s33 offset:456
+; DAGISEL-NEXT:    scratch_store_b32 off, v194, s33 offset:460
+; DAGISEL-NEXT:    scratch_store_b32 off, v195, s33 offset:464
+; DAGISEL-NEXT:    scratch_store_b32 off, v196, s33 offset:468
+; DAGISEL-NEXT:    scratch_store_b32 off, v197, s33 offset:472
+; DAGISEL-NEXT:    scratch_store_b32 off, v198, s33 offset:476
+; DAGISEL-NEXT:    scratch_store_b32 off, v199, s33 offset:480
+; DAGISEL-NEXT:    scratch_store_b32 off, v208, s33 offset:484
+; DAGISEL-NEXT:    scratch_store_b32 off, v209, s33 offset:488
+; DAGISEL-NEXT:    scratch_store_b32 off, v210, s33 offset:492
+; DAGISEL-NEXT:    scratch_store_b32 off, v211, s33 offset:496
+; DAGISEL-NEXT:    scratch_store_b32 off, v212, s33 offset:500
+; DAGISEL-NEXT:    scratch_store_b32 off, v213, s33 offset:504
+; DAGISEL-NEXT:    scratch_store_b32 off, v214, s33 offset:508
+; DAGISEL-NEXT:    scratch_store_b32 off, v215, s33 offset:512
 ; DAGISEL-NEXT:    s_clause 0xf
-; DAGISEL-NEXT:    scratch_store_b32 off, v224, s33 offset:520
-; DAGISEL-NEXT:    scratch_store_b32 off, v225, s33 offset:524
-; DAGISEL-NEXT:    scratch_store_b32 off, v226, s33 offset:528
-; DAGISEL-NEXT:    scratch_store_b32 off, v227, s33 offset:532
-; DAGISEL-NEXT:    scratch_store_b32 off, v228, s33 offset:536
-; DAGISEL-NEXT:    scratch_store_b32 off, v229, s33 offset:540
-; DAGISEL-NEXT:    scratch_store_b32 off, v230, s33 offset:544
-; DAGISEL-NEXT:    scratch_store_b32 off, v231, s33 offset:548
-; DAGISEL-NEXT:    scratch_store_b32 off, v240, s33 offset:552
-; DAGISEL-NEXT:    scratch_store_b32 off, v241, s33 offset:556
-; DAGISEL-NEXT:    scratch_store_b32 off, v242, s33 offset:560
-; DAGISEL-NEXT:    scratch_store_b32 off, v243, s33 offset:564
-; DAGISEL-NEXT:    scratch_store_b32 off, v244, s33 offset:568
-; DAGISEL-NEXT:    scratch_store_b32 off, v245, s33 offset:572
-; DAGISEL-NEXT:    scratch_store_b32 off, v246, s33 offset:576
-; DAGISEL-NEXT:    scratch_store_b32 off, v247, s33 offset:580
+; DAGISEL-NEXT:    scratch_store_b32 off, v224, s33 offset:516
+; DAGISEL-NEXT:    scratch_store_b32 off, v225, s33 offset:520
+; DAGISEL-NEXT:    scratch_store_b32 off, v226, s33 offset:524
+; DAGISEL-NEXT:    scratch_store_b32 off, v227, s33 offset:528
+; DAGISEL-NEXT:    scratch_store_b32 off, v228, s33 offset:532
+; DAGISEL-NEXT:    scratch_store_b32 off, v229, s33 offset:536
+; DAGISEL-NEXT:    scratch_store_b32 off, v230, s33 offset:540
+; DAGISEL-NEXT:    scratch_store_b32 off, v231, s33 offset:544
+; DAGISEL-NEXT:    scratch_store_b32 off, v240, s33 offset:548
+; DAGISEL-NEXT:    scratch_store_b32 off, v241, s33 offset:552
+; DAGISEL-NEXT:    scratch_store_b32 off, v242, s33 offset:556
+; DAGISEL-NEXT:    scratch_store_b32 off, v243, s33 offset:560
+; DAGISEL-NEXT:    scratch_store_b32 off, v244, s33 offset:564
+; DAGISEL-NEXT:    scratch_store_b32 off, v245, s33 offset:568
+; DAGISEL-NEXT:    scratch_store_b32 off, v246, s33 offset:572
+; DAGISEL-NEXT:    scratch_store_b32 off, v247, s33 offset:576
 ; DAGISEL-NEXT:    s_mov_b32 exec_lo, -1
-; DAGISEL-NEXT:    s_clause 0x1
-; DAGISEL-NEXT:    scratch_store_b32 off, v40, s33
-; DAGISEL-NEXT:    scratch_store_b32 off, v41, s33 offset:4
-; DAGISEL-NEXT:    v_writelane_b32 v40, s4, 0
-; DAGISEL-NEXT:    v_writelane_b32 v41, s76, 0
+; DAGISEL-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
+; DAGISEL-NEXT:    s_wait_alu 0xfffe
+; DAGISEL-NEXT:    v_writelane_b32 v40, s0, 3
 ; DAGISEL-NEXT:    v_mov_b32_e32 v2, v0
 ; DAGISEL-NEXT:    v_swap_b32 v0, v1
-; DAGISEL-NEXT:    v_writelane_b32 v40, s5, 1
-; DAGISEL-NEXT:    v_writelane_b32 v41, s77, 1
 ; DAGISEL-NEXT:    s_mov_b32 s1, gfx_callee@abs32@hi
+; DAGISEL-NEXT:    v_writelane_b32 v40, s4, 0
 ; DAGISEL-NEXT:    s_mov_b32 s0, gfx_callee@abs32@lo
 ; DAGISEL-NEXT:    s_addk_co_i32 s32, 0x250
-; DAGISEL-NEXT:    v_writelane_b32 v40, s6, 2
-; DAGISEL-NEXT:    v_writelane_b32 v41, s78, 2
-; DAGISEL-NEXT:    v_writelane_b32 v40, s7, 3
-; DAGISEL-NEXT:    v_writelane_b32 v41, s79, 3
-; DAGISEL-NEXT:    v_writelane_b32 v40, s8, 4
-; DAGISEL-NEXT:    v_writelane_b32 v41, s88, 4
-; DAGISEL-NEXT:    v_writelane_b32 v40, s9, 5
-; DAGISEL-NEXT:    v_writelane_b32 v41, s89, 5
-; DAGISEL-NEXT:    s_mov_b64 s[8:9], 0
-; DAGISEL-NEXT:    v_writelane_b32 v40, s10, 6
-; DAGISEL-NEXT:    v_writelane_b32 v41, s90, 6
-; DAGISEL-NEXT:    v_writelane_b32 v40, s11, 7
-; DAGISEL-NEXT:    v_writelane_b32 v41, s91, 7
-; DAGISEL-NEXT:    v_writelane_b32 v40, s12, 8
-; DAGISEL-NEXT:    v_writelane_b32 v41, s92, 8
-; DAGISEL-NEXT:    v_writelane_b32 v40, s13, 9
-; DAGISEL-NEXT:    v_writelane_b32 v41, s93, 9
-; DAGISEL-NEXT:    v_writelane_b32 v40, s14, 10
-; DAGISEL-NEXT:    v_writelane_b32 v41, s94, 10
-; DAGISEL-NEXT:    v_writelane_b32 v40, s15, 11
-; DAGISEL-NEXT:    v_writelane_b32 v41, s95, 11
-; DAGISEL-NEXT:    v_writelane_b32 v40, s16, 12
-; DAGISEL-NEXT:    v_writelane_b32 v40, s17, 13
-; DAGISEL-NEXT:    v_writelane_b32 v40, s18, 14
-; DAGISEL-NEXT:    v_writelane_b32 v40, s19, 15
-; DAGISEL-NEXT:    v_writelane_b32 v40, s20, 16
-; DAGISEL-NEXT:    v_writelane_b32 v40, s21, 17
-; DAGISEL-NEXT:    v_writelane_b32 v40, s22, 18
-; DAGISEL-NEXT:    v_writelane_b32 v40, s23, 19
-; DAGISEL-NEXT:    v_writelane_b32 v40, s24, 20
-; DAGISEL-NEXT:    v_writelane_b32 v40, s25, 21
-; DAGISEL-NEXT:    v_writelane_b32 v40, s26, 22
-; DAGISEL-NEXT:    v_writelane_b32 v40, s27, 23
-; DAGISEL-NEXT:    v_writelane_b32 v40, s28, 24
-; DAGISEL-NEXT:    v_writelane_b32 v40, s29, 25
-; DAGISEL-NEXT:    v_writelane_b32 v40, s30, 26
-; DAGISEL-NEXT:    v_writelane_b32 v40, s31, 27
-; DAGISEL-NEXT:    v_writelane_b32 v40, s72, 28
-; DAGISEL-NEXT:    v_writelane_b32 v40, s73, 29
-; DAGISEL-NEXT:    v_writelane_b32 v40, s74, 30
-; DAGISEL-NEXT:    v_writelane_b32 v40, s75, 31
+; DAGISEL-NEXT:    v_writelane_b32 v40, s30, 1
+; DAGISEL-NEXT:    v_writelane_b32 v40, s31, 2
 ; DAGISEL-NEXT:    s_wait_alu 0xfffe
 ; DAGISEL-NEXT:    s_swappc_b64 s[30:31], s[0:1]
-; DAGISEL-NEXT:    v_readlane_b32 s95, v41, 11
-; DAGISEL-NEXT:    v_readlane_b32 s94, v41, 10
-; DAGISEL-NEXT:    v_readlane_b32 s93, v41, 9
-; DAGISEL-NEXT:    v_readlane_b32 s92, v41, 8
-; DAGISEL-NEXT:    v_readlane_b32 s91, v41, 7
-; DAGISEL-NEXT:    v_readlane_b32 s90, v41, 6
-; DAGISEL-NEXT:    v_readlane_b32 s89, v41, 5
-; DAGISEL-NEXT:    v_readlane_b32 s88, v41, 4
-; DAGISEL-NEXT:    v_readlane_b32 s79, v41, 3
-; DAGISEL-NEXT:    v_readlane_b32 s78, v41, 2
-; DAGISEL-NEXT:    v_readlane_b32 s77, v41, 1
-; DAGISEL-NEXT:    v_readlane_b32 s76, v41, 0
-; DAGISEL-NEXT:    v_readlane_b32 s75, v40, 31
-; DAGISEL-NEXT:    v_readlane_b32 s74, v40, 30
-; DAGISEL-NEXT:    v_readlane_b32 s73, v40, 29
-; DAGISEL-NEXT:    v_readlane_b32 s72, v40, 28
-; DAGISEL-NEXT:    v_readlane_b32 s31, v40, 27
-; DAGISEL-NEXT:    v_readlane_b32 s30, v40, 26
-; DAGISEL-NEXT:    v_readlane_b32 s29, v40, 25
-; DAGISEL-NEXT:    v_readlane_b32 s28, v40, 24
-; DAGISEL-NEXT:    v_readlane_b32 s27, v40, 23
-; DAGISEL-NEXT:    v_readlane_b32 s26, v40, 22
-; DAGISEL-NEXT:    v_readlane_b32 s25, v40, 21
-; DAGISEL-NEXT:    v_readlane_b32 s24, v40, 20
-; DAGISEL-NEXT:    v_readlane_b32 s23, v40, 19
-; DAGISEL-NEXT:    v_readlane_b32 s22, v40, 18
-; DAGISEL-NEXT:    v_readlane_b32 s21, v40, 17
-; DAGISEL-NEXT:    v_readlane_b32 s20, v40, 16
-; DAGISEL-NEXT:    v_readlane_b32 s19, v40, 15
-; DAGISEL-NEXT:    v_readlane_b32 s18, v40, 14
-; DAGISEL-NEXT:    v_readlane_b32 s17, v40, 13
-; DAGISEL-NEXT:    v_readlane_b32 s16, v40, 12
-; DAGISEL-NEXT:    v_readlane_b32 s15, v40, 11
-; DAGISEL-NEXT:    v_readlane_b32 s14, v40, 10
-; DAGISEL-NEXT:    v_readlane_b32 s13, v40, 9
-; DAGISEL-NEXT:    v_readlane_b32 s12, v40, 8
-; DAGISEL-NEXT:    v_readlane_b32 s11, v40, 7
-; DAGISEL-NEXT:    v_readlane_b32 s10, v40, 6
-; DAGISEL-NEXT:    v_readlane_b32 s9, v40, 5
-; DAGISEL-NEXT:    v_readlane_b32 s8, v40, 4
-; DAGISEL-NEXT:    v_readlane_b32 s7, v40, 3
-; DAGISEL-NEXT:    v_readlane_b32 s6, v40, 2
-; DAGISEL-NEXT:    v_readlane_b32 s5, v40, 1
+; DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL-NEXT:    v_readlane_b32 s31, v40, 2
+; DAGISEL-NEXT:    v_readlane_b32 s30, v40, 1
 ; DAGISEL-NEXT:    v_readlane_b32 s4, v40, 0
-; DAGISEL-NEXT:    s_clause 0x1
-; DAGISEL-NEXT:    scratch_load_b32 v40, off, s33
-; DAGISEL-NEXT:    scratch_load_b32 v41, off, s33 offset:4
+; DAGISEL-NEXT:    v_readlane_b32 s0, v40, 3
+; DAGISEL-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
 ; DAGISEL-NEXT:    s_mov_b32 s32, s33
-; DAGISEL-NEXT:    s_xor_b32 exec_lo, s34, -1
+; DAGISEL-NEXT:    s_xor_b32 exec_lo, s4, -1
 ; DAGISEL-NEXT:    s_clause 0x1f
-; DAGISEL-NEXT:    scratch_load_b32 v0, off, s33 offset:8
-; DAGISEL-NEXT:    scratch_load_b32 v1, off, s33 offset:12
-; DAGISEL-NEXT:    scratch_load_b32 v2, off, s33 offset:16
-; DAGISEL-NEXT:    scratch_load_b32 v3, off, s33 offset:20
-; DAGISEL-NEXT:    scratch_load_b32 v4, off, s33 offset:24
-; DAGISEL-NEXT:    scratch_load_b32 v5, off, s33 offset:28
-; DAGISEL-NEXT:    scratch_load_b32 v6, off, s33 offset:32
-; DAGISEL-NEXT:    scratch_load_b32 v7, off, s33 offset:36
-; DAGISEL-NEXT:    scratch_load_b32 v8, off, s33 offset:40
-; DAGISEL-NEXT:    scratch_load_b32 v9, off, s33 offset:44
-; DAGISEL-NEXT:    scratch_load_b32 v10, off, s33 offset:48
-; DAGISEL-NEXT:    scratch_load_b32 v11, off, s33 offset:52
-; DAGISEL-NEXT:    scratch_load_b32 v12, off, s33 offset:56
-; DAGISEL-NEXT:    scratch_load_b32 v13, off, s33 offset:60
-; DAGISEL-NEXT:    scratch_load_b32 v14, off, s33 offset:64
-; DAGISEL-NEXT:    scratch_load_b32 v15, off, s33 offset:68
-; DAGISEL-NEXT:    scratch_load_b32 v16, off, s33 offset:72
-; DAGISEL-NEXT:    scratch_load_b32 v17, off, s33 offset:76
-; DAGISEL-NEXT:    scratch_load_b32 v18, off, s33 offset:80
-; DAGISEL-NEXT:    scratch_load_b32 v19, off, s33 offset:84
-; DAGISEL-NEXT:    scratch_load_b32 v20, off, s33 offset:88
-; DAGISEL-NEXT:    scratch_load_b32 v21, off, s33 offset:92
-; DAGISEL-NEXT:    scratch_load_b32 v22, off, s33 offset:96
-; DAGISEL-NEXT:    scratch_load_b32 v23, off, s33 offset:100
-; DAGISEL-NEXT:    scratch_load_b32 v24, off, s33 offset:104
-; DAGISEL-NEXT:    scratch_load_b32 v25, off, s33 offset:108
-; DAGISEL-NEXT:    scratch_load_b32 v26, off, s33 offset:112
-; DAGISEL-NEXT:    scratch_load_b32 v27, off, s33 offset:116
-; DAGISEL-NEXT:    scratch_load_b32 v28, off, s33 offset:120
-; DAGISEL-NEXT:    scratch_load_b32 v29, off, s33 offset:124
-; DAGISEL-NEXT:    scratch_load_b32 v30, off, s33 offset:128
-; DAGISEL-NEXT:    scratch_load_b32 v31, off, s33 offset:132
+; DAGISEL-NEXT:    scratch_load_b32 v0, off, s33 offset:4
+; DAGISEL-NEXT:    scratch_load_b32 v1, off, s33 offset:8
+; DAGISEL-NEXT:    scratch_load_b32 v2, off, s33 offset:12
+; DAGISEL-NEXT:    scratch_load_b32 v3, off, s33 offset:16
+; DAGISEL-NEXT:    scratch_load_b32 v4, off, s33 offset:20
+; DAGISEL-NEXT:    scratch_load_b32 v5, off, s33 offset:24
+; DAGISEL-NEXT:    scratch_load_b32 v6, off, s33 offset:28
+; DAGISEL-NEXT:    scratch_load_b32 v7, off, s33 offset:32
+; DAGISEL-NEXT:    scratch_load_b32 v8, off, s33 offset:36
+; DAGISEL-NEXT:    scratch_load_b32 v9, off, s33 offset:40
+; DAGISEL-NEXT:    scratch_load_b32 v10, off, s33 offset:44
+; DAGISEL-NEXT:    scratch_load_b32 v11, off, s33 offset:48
+; DAGISEL-NEXT:    scratch_load_b32 v12, off, s33 offset:52
+; DAGISEL-NEXT:    scratch_load_b32 v13, off, s33 offset:56
+; DAGISEL-NEXT:    scratch_load_b32 v14, off, s33 offset:60
+; DAGISEL-NEXT:    scratch_load_b32 v15, off, s33 offset:64
+; DAGISEL-NEXT:    scratch_load_b32 v16, off, s33 offset:68
+; DAGISEL-NEXT:    scratch_load_b32 v17, off, s33 offset:72
+; DAGISEL-NEXT:    scratch_load_b32 v18, off, s33 offset:76
+; DAGISEL-NEXT:    scratch_load_b32 v19, off, s33 offset:80
+; DAGISEL-NEXT:    scratch_load_b32 v20, off, s33 offset:84
+; DAGISEL-NEXT:    scratch_load_b32 v21, off, s33 offset:88
+; DAGISEL-NEXT:    scratch_load_b32 v22, off, s33 offset:92
+; DAGISEL-NEXT:    scratch_load_b32 v23, off, s33 offset:96
+; DAGISEL-NEXT:    scratch_load_b32 v24, off, s33 offset:100
+; DAGISEL-NEXT:    scratch_load_b32 v25, off, s33 offset:104
+; DAGISEL-NEXT:    scratch_load_b32 v26, off, s33 offset:108
+; DAGISEL-NEXT:    scratch_load_b32 v27, off, s33 offset:112
+; DAGISEL-NEXT:    scratch_load_b32 v28, off, s33 offset:116
+; DAGISEL-NEXT:    scratch_load_b32 v29, off, s33 offset:120
+; DAGISEL-NEXT:    scratch_load_b32 v30, off, s33 offset:124
+; DAGISEL-NEXT:    scratch_load_b32 v31, off, s33 offset:128
 ; DAGISEL-NEXT:    s_clause 0x1f
-; DAGISEL-NEXT:    scratch_load_b32 v32, off, s33 offset:136
-; DAGISEL-NEXT:    scratch_load_b32 v33, off, s33 offset:140
-; DAGISEL-NEXT:    scratch_load_b32 v34, off, s33 offset:144
-; DAGISEL-NEXT:    scratch_load_b32 v35, off, s33 offset:148
-; DAGISEL-NEXT:    scratch_load_b32 v36, off, s33 offset:152
-; DAGISEL-NEXT:    scratch_load_b32 v37, off, s33 offset:156
-; DAGISEL-NEXT:    scratch_load_b32 v38, off, s33 offset:160
-; DAGISEL-NEXT:    scratch_load_b32 v39, off, s33 offset:164
-; DAGISEL-NEXT:    scratch_load_b32 v48, off, s33 offset:168
-; DAGISEL-NEXT:    scratch_load_b32 v49, off, s33 offset:172
-; DAGISEL-NEXT:    scratch_load_b32 v50, off, s33 offset:176
-; DAGISEL-NEXT:    scratch_load_b32 v51, off, s33 offset:180
-; DAGISEL-NEXT:    scratch_load_b32 v52, off, s33 offset:184
-; DAGISEL-NEXT:    scratch_load_b32 v53, off, s33 offset:188
-; DAGISEL-NEXT:    scratch_load_b32 v54, off, s33 offset:192
-; DAGISEL-NEXT:    scratch_load_b32 v55, off, s33 offset:196
-; DAGISEL-NEXT:    scratch_load_b32 v64, off, s33 offset:200
-; DAGISEL-NEXT:    scratch_load_b32 v65, off, s33 offset:204
-; DAGISEL-NEXT:    scratch_load_b32 v66, off, s33 offset:208
-; DAGISEL-NEXT:    scratch_load_b32 v67, off, s33 offset:212
-; DAGISEL-NEXT:    scratch_load_b32 v68, off, s33 offset:216
-; DAGISEL-NEXT:    scratch_load_b32 v69, off, s33 offset:220
-; DAGISEL-NEXT:    scratch_load_b32 v70, off, s33 offset:224
-; DAGISEL-NEXT:    scratch_load_b32 v71, off, s33 offset:228
-; DAGISEL-NEXT:    scratch_load_b32 v80, off, s33 offset:232
-; DAGISEL-NEXT:    scratch_load_b32 v81, off, s33 offset:236
-; DAGISEL-NEXT:    scratch_load_b32 v82, off, s33 offset:240
-; DAGISEL-NEXT:    scratch_load_b32 v83, off, s33 offset:244
-; DAGISEL-NEXT:    scratch_load_b32 v84, off, s33 offset:248
-; DAGISEL-NEXT:    scratch_load_b32 v85, off, s33 offset:252
-; DAGISEL-NEXT:    scratch_load_b32 v86, off, s33 offset:256
-; DAGISEL-NEXT:    scratch_load_b32 v87, off, s33 offset:260
+; DAGISEL-NEXT:    scratch_load_b32 v32, off, s33 offset:132
+; DAGISEL-NEXT:    scratch_load_b32 v33, off, s33 offset:136
+; DAGISEL-NEXT:    scratch_load_b32 v34, off, s33 offset:140
+; DAGISEL-NEXT:    scratch_load_b32 v35, off, s33 offset:144
+; DAGISEL-NEXT:    scratch_load_b32 v36, off, s33 offset:148
+; DAGISEL-NEXT:    scratch_load_b32 v37, off, s33 offset:152
+; DAGISEL-NEXT:    scratch_load_b32 v38, off, s33 offset:156
+; DAGISEL-NEXT:    scratch_load_b32 v39, off, s33 offset:160
+; DAGISEL-NEXT:    scratch_load_b32 v48, off, s33 offset:164
+; DAGISEL-NEXT:    scratch_load_b32 v49, off, s33 offset:168
+; DAGISEL-NEXT:    scratch_load_b32 v50, off, s33 offset:172
+; DAGISEL-NEXT:    scratch_load_b32 v51, off, s33 offset:176
+; DAGISEL-NEXT:    scratch_load_b32 v52, off, s33 offset:180
+; DAGISEL-NEXT:    scratch_load_b32 v53, off, s33 offset:184
+; DAGISEL-NEXT:    scratch_load_b32 v54, off, s33 offset:188
+; DAGISEL-NEXT:    scratch_load_b32 v55, off, s33 offset:192
+; DAGISEL-NEXT:    scratch_load_b32 v64, off, s33 offset:196
+; DAGISEL-NEXT:    scratch_load_b32 v65, off, s33 offset:200
+; DAGISEL-NEXT:    scratch_load_b32 v66, off, s33 offset:204
+; DAGISEL-NEXT:    scratch_load_b32 v67, off, s33 offset:208
+; DAGISEL-NEXT:    scratch_load_b32 v68, off, s33 offset:212
+; DAGISEL-NEXT:    scratch_load_b32 v69, off, s33 offset:216
+; DAGISEL-NEXT:    scratch_load_b32 v70, off, s33 offset:220
+; DAGISEL-NEXT:    scratch_load_b32 v71, off, s33 offset:224
+; DAGISEL-NEXT:    scratch_load_b32 v80, off, s33 offset:228
+; DAGISEL-NEXT:    scratch_load_b32 v81, off, s33 offset:232
+; DAGISEL-NEXT:    scratch_load_b32 v82, off, s33 offset:236
+; DAGISEL-NEXT:    scratch_load_b32 v83, off, s33 offset:240
+; DAGISEL-NEXT:    scratch_load_b32 v84, off, s33 offset:244
+; DAGISEL-NEXT:    scratch_load_b32 v85, off, s33 offset:248
+; DAGISEL-NEXT:    scratch_load_b32 v86, off, s33 offset:252
+; DAGISEL-NEXT:    scratch_load_b32 v87, off, s33 offset:256
 ; DAGISEL-NEXT:    s_clause 0x1f
-; DAGISEL-NEXT:    scratch_load_b32 v96, off, s33 offset:264
-; DAGISEL-NEXT:    scratch_load_b32 v97, off, s33 offset:268
-; DAGISEL-NEXT:    scratch_load_b32 v98, off, s33 offset:272
-; DAGISEL-NEXT:    scratch_load_b32 v99, off, s33 offset:276
-; DAGISEL-NEXT:    scratch_load_b32 v100, off, s33 offset:280
-; DAGISEL-NEXT:    scratch_load_b32 v101, off, s33 offset:284
-; DAGISEL-NEXT:    scratch_load_b32 v102, off, s33 offset:288
-; DAGISEL-NEXT:    scratch_load_b32 v103, off, s33 offset:292
-; DAGISEL-NEXT:    scratch_load_b32 v112, off, s33 offset:296
-; DAGISEL-NEXT:    scratch_load_b32 v113, off, s33 offset:300
-; DAGISEL-NEXT:    scratch_load_b32 v114, off, s33 offset:304
-; DAGISEL-NEXT:    scratch_load_b32 v115, off, s33 offset:308
-; DAGISEL-NEXT:    scratch_load_b32 v116, off, s33 offset:312
-; DAGISEL-NEXT:    scratch_load_b32 v117, off, s33 offset:316
-; DAGISEL-NEXT:    scratch_load_b32 v118, off, s33 offset:320
-; DAGISEL-NEXT:    scratch_load_b32 v119, off, s33 offset:324
-; DAGISEL-NEXT:    scratch_load_b32 v128, off, s33 offset:328
-; DAGISEL-NEXT:    scratch_load_b32 v129, off, s33 offset:332
-; DAGISEL-NEXT:    scratch_load_b32 v130, off, s33 offset:336
-; DAGISEL-NEXT:    scratch_load_b32 v131, off, s33 offset:340
-; DAGISEL-NEXT:    scratch_load_b32 v132, off, s33 offset:344
-; DAGISEL-NEXT:    scratch_load_b32 v133, off, s33 offset:348
-; DAGISEL-NEXT:    scratch_load_b32 v134, off, s33 offset:352
-; DAGISEL-NEXT:    scratch_load_b32 v135, off, s33 offset:356
-; DAGISEL-NEXT:    scratch_load_b32 v144, off, s33 offset:360
-; DAGISEL-NEXT:    scratch_load_b32 v145, off, s33 offset:364
-; DAGISEL-NEXT:    scratch_load_b32 v146, off, s33 offset:368
-; DAGISEL-NEXT:    scratch_load_b32 v147, off, s33 offset:372
-; DAGISEL-NEXT:    scratch_load_b32 v148, off, s33 offset:376
-; DAGISEL-NEXT:    scratch_load_b32 v149, off, s33 offset:380
-; DAGISEL-NEXT:    scratch_load_b32 v150, off, s33 offset:384
-; DAGISEL-NEXT:    scratch_load_b32 v151, off, s33 offset:388
+; DAGISEL-NEXT:    scratch_load_b32 v96, off, s33 offset:260
+; DAGISEL-NEXT:    scratch_load_b32 v97, off, s33 offset:264
+; DAGISEL-NEXT:    scratch_load_b32 v98, off, s33 offset:268
+; DAGISEL-NEXT:    scratch_load_b32 v99, off, s33 offset:272
+; DAGISEL-NEXT:    scratch_load_b32 v100, off, s33 offset:276
+; DAGISEL-NEXT:    scratch_load_b32 v101, off, s33 offset:280
+; DAGISEL-NEXT:    scratch_load_b32 v102, off, s33 offset:284
+; DAGISEL-NEXT:    scratch_load_b32 v103, off, s33 offset:288
+; DAGISEL-NEXT:    scratch_load_b32 v112, off, s33 offset:292
+; DAGISEL-NEXT:    scratch_load_b32 v113, off, s33 offset:296
+; DAGISEL-NEXT:    scratch_load_b32 v114, off, s33 offset:300
+; DAGISEL-NEXT:    scratch_load_b32 v115, off, s33 offset:304
+; DAGISEL-NEXT:    scratch_load_b32 v116, off, s33 offset:308
+; DAGISEL-NEXT:    scratch_load_b32 v117, off, s33 offset:312
+; DAGISEL-NEXT:    scratch_load_b32 v118, off, s33 offset:316
+; DAGISEL-NEXT:    scratch_load_b32 v119, off, s33 offset:320
+; DAGISEL-NEXT:    scratch_load_b32 v128, off, s33 offset:324
+; DAGISEL-NEXT:    scratch_load_b32 v129, off, s33 offset:328
+; DAGISEL-NEXT:    scratch_load_b32 v130, off, s33 offset:332
+; DAGISEL-NEXT:    scratch_load_b32 v131, off, s33 offset:336
+; DAGISEL-NEXT:    scratch_load_b32 v132, off, s33 offset:340
+; DAGISEL-NEXT:    scratch_load_b32 v133, off, s33 offset:344
+; DAGISEL-NEXT:    scratch_load_b32 v134, off, s33 offset:348
+; DAGISEL-NEXT:    scratch_load_b32 v135, off, s33 offset:352
+; DAGISEL-NEXT:    scratch_load_b32 v144, off, s33 offset:356
+; DAGISEL-NEXT:    scratch_load_b32 v145, off, s33 offset:360
+; DAGISEL-NEXT:    scratch_load_b32 v146, off, s33 offset:364
+; DAGISEL-NEXT:    scratch_load_b32 v147, off, s33 offset:368
+; DAGISEL-NEXT:    scratch_load_b32 v148, off, s33 offset:372
+; DAGISEL-NEXT:    scratch_load_b32 v149, off, s33 offset:376
+; DAGISEL-NEXT:    scratch_load_b32 v150, off, s33 offset:380
+; DAGISEL-NEXT:    scratch_load_b32 v151, off, s33 offset:384
 ; DAGISEL-NEXT:    s_clause 0x1f
-; DAGISEL-NEXT:    scratch_load_b32 v160, off, s33 offset:392
-; DAGISEL-NEXT:    scratch_load_b32 v161, off, s33 offset:396
-; DAGISEL-NEXT:    scratch_load_b32 v162, off, s33 offset:400
-; DAGISEL-NEXT:    scratch_load_b32 v163, off, s33 offset:404
-; DAGISEL-NEXT:    scratch_load_b32 v164, off, s33 offset:408
-; DAGISEL-NEXT:    scratch_load_b32 v165, off, s33 offset:412
-; DAGISEL-NEXT:    scratch_load_b32 v166, off, s33 offset:416
-; DAGISEL-NEXT:    scratch_load_b32 v167, off, s33 offset:420
-; DAGISEL-NEXT:    scratch_load_b32 v176, off, s33 offset:424
-; DAGISEL-NEXT:    scratch_load_b32 v177, off, s33 offset:428
-; DAGISEL-NEXT:    scratch_load_b32 v178, off, s33 offset:432
-; DAGISEL-NEXT:    scratch_load_b32 v179, off, s33 offset:436
-; DAGISEL-NEXT:    scratch_load_b32 v180, off, s33 offset:440
-; DAGISEL-NEXT:    scratch_load_b32 v181, off, s33 offset:444
-; DAGISEL-NEXT:    scratch_load_b32 v182, off, s33 offset:448
-; DAGISEL-NEXT:    scratch_load_b32 v183, off, s33 offset:452
-; DAGISEL-NEXT:    scratch_load_b32 v192, off, s33 offset:456
-; DAGISEL-NEXT:    scratch_load_b32 v193, off, s33 offset:460
-; DAGISEL-NEXT:    scratch_load_b32 v194, off, s33 offset:464
-; DAGISEL-NEXT:    scratch_load_b32 v195, off, s33 offset:468
-; DAGISEL-NEXT:    scratch_load_b32 v196, off, s33 offset:472
-; DAGISEL-NEXT:    scratch_load_b32 v197, off, s33 offset:476
-; DAGISEL-NEXT:    scratch_load_b32 v198, off, s33 offset:480
-; DAGISEL-NEXT:    scratch_load_b32 v199, off, s33 offset:484
-; DAGISEL-NEXT:    scratch_load_b32 v208, off, s33 offset:488
-; DAGISEL-NEXT:    scratch_load_b32 v209, off, s33 offset:492
-; DAGISEL-NEXT:    scratch_load_b32 v210, off, s33 offset:496
-; DAGISEL-NEXT:    scratch_load_b32 v211, off, s33 offset:500
-; DAGISEL-NEXT:    scratch_load_b32 v212, off, s33 offset:504
-; DAGISEL-NEXT:    scratch_load_b32 v213, off, s33 offset:508
-; DAGISEL-NEXT:    scratch_load_b32 v214, off, s33 offset:512
-; DAGISEL-NEXT:    scratch_load_b32 v215, off, s33 offset:516
+; DAGISEL-NEXT:    scratch_load_b32 v160, off, s33 offset:388
+; DAGISEL-NEXT:    scratch_load_b32 v161, off, s33 offset:392
+; DAGISEL-NEXT:    scratch_load_b32 v162, off, s33 offset:396
+; DAGISEL-NEXT:    scratch_load_b32 v163, off, s33 offset:400
+; DAGISEL-NEXT:    scratch_load_b32 v164, off, s33 offset:404
+; DAGISEL-NEXT:    scratch_load_b32 v165, off, s33 offset:408
+; DAGISEL-NEXT:    scratch_load_b32 v166, off, s33 offset:412
+; DAGISEL-NEXT:    scratch_load_b32 v167, off, s33 offset:416
+; DAGISEL-NEXT:    scratch_load_b32 v176, off, s33 offset:420
+; DAGISEL-NEXT:    scratch_load_b32 v177, off, s33 offset:424
+; DAGISEL-NEXT:    scratch_load_b32 v178, off, s33 offset:428
+; DAGISEL-NEXT:    scratch_load_b32 v179, off, s33 offset:432
+; DAGISEL-NEXT:    scratch_load_b32 v180, off, s33 offset:436
+; DAGISEL-NEXT:    scratch_load_b32 v181, off, s33 offset:440
+; DAGISEL-NEXT:    scratch_load_b32 v182, off, s33 offset:444
+; DAGISEL-NEXT:    scratch_load_b32 v183, off, s33 offset:448
+; DAGISEL-NEXT:    scratch_load_b32 v192, off, s33 offset:452
+; DAGISEL-NEXT:    scratch_load_b32 v193, off, s33 offset:456
+; DAGISEL-NEXT:    scratch_load_b32 v194, off, s33 offset:460
+; DAGISEL-NEXT:    scratch_load_b32 v195, off, s33 offset:464
+; DAGISEL-NEXT:    scratch_load_b32 v196, off, s33 offset:468
+; DAGISEL-NEXT:    scratch_load_b32 v197, off, s33 offset:472
+; DAGISEL-NEXT:    scratch_load_b32 v198, off, s33 offset:476
+; DAGISEL-NEXT:    scratch_load_b32 v199, off, s33 offset:480
+; DAGISEL-NEXT:    scratch_load_b32 v208, off, s33 offset:484
+; DAGISEL-NEXT:    scratch_load_b32 v209, off, s33 offset:488
+; DAGISEL-NEXT:    scratch_load_b32 v210, off, s33 offset:492
+; DAGISEL-NEXT:    scratch_load_b32 v211, off, s33 offset:496
+; DAGISEL-NEXT:    scratch_load_b32 v212, off, s33 offset:500
+; DAGISEL-NEXT:    scratch_load_b32 v213, off, s33 offset:504
+; DAGISEL-NEXT:    scratch_load_b32 v214, off, s33 offset:508
+; DAGISEL-NEXT:    scratch_load_b32 v215, off, s33 offset:512
 ; DAGISEL-NEXT:    s_clause 0xf
-; DAGISEL-NEXT:    scratch_load_b32 v224, off, s33 offset:520
-; DAGISEL-NEXT:    scratch_load_b32 v225, off, s33 offset:524
-; DAGISEL-NEXT:    scratch_load_b32 v226, off, s33 offset:528
-; DAGISEL-NEXT:    scratch_load_b32 v227, off, s33 offset:532
-; DAGISEL-NEXT:    scratch_load_b32 v228, off, s33 offset:536
-; DAGISEL-NEXT:    scratch_load_b32 v229, off, s33 offset:540
-; DAGISEL-NEXT:    scratch_load_b32 v230, off, s33 offset:544
-; DAGISEL-NEXT:    scratch_load_b32 v231, off, s33 offset:548
-; DAGISEL-NEXT:    scratch_load_b32 v240, off, s33 offset:552
-; DAGISEL-NEXT:    scratch_load_b32 v241, off, s33 offset:556
-; DAGISEL-NEXT:    scratch_load_b32 v242, off, s33 offset:560
-; DAGISEL-NEXT:    scratch_load_b32 v243, off, s33 offset:564
-; DAGISEL-NEXT:    scratch_load_b32 v244, off, s33 offset:568
-; DAGISEL-NEXT:    scratch_load_b32 v245, off, s33 offset:572
-; DAGISEL-NEXT:    scratch_load_b32 v246, off, s33 offset:576
-; DAGISEL-NEXT:    scratch_load_b32 v247, off, s33 offset:580
-; DAGISEL-NEXT:    s_mov_b32 exec_lo, s34
-; DAGISEL-NEXT:    s_mov_b32 s33, s35
+; DAGISEL-NEXT:    scratch_load_b32 v224, off, s33 offset:516
+; DAGISEL-NEXT:    scratch_load_b32 v225, off, s33 offset:520
+; DAGISEL-NEXT:    scratch_load_b32 v226, off, s33 offset:524
+; DAGISEL-NEXT:    scratch_load_b32 v227, off, s33 offset:528
+; DAGISEL-NEXT:    scratch_load_b32 v228, off, s33 offset:532
+; DAGISEL-NEXT:    scratch_load_b32 v229, off, s33 offset:536
+; DAGISEL-NEXT:    scratch_load_b32 v230, off, s33 offset:540
+; DAGISEL-NEXT:    scratch_load_b32 v231, off, s33 offset:544
+; DAGISEL-NEXT:    scratch_load_b32 v240, off, s33 offset:548
+; DAGISEL-NEXT:    scratch_load_b32 v241, off, s33 offset:552
+; DAGISEL-NEXT:    scratch_load_b32 v242, off, s33 offset:556
+; DAGISEL-NEXT:    scratch_load_b32 v243, off, s33 offset:560
+; DAGISEL-NEXT:    scratch_load_b32 v244, off, s33 offset:564
+; DAGISEL-NEXT:    scratch_load_b32 v245, off, s33 offset:568
+; DAGISEL-NEXT:    scratch_load_b32 v246, off, s33 offset:572
+; DAGISEL-NEXT:    scratch_load_b32 v247, off, s33 offset:576
+; DAGISEL-NEXT:    s_mov_b32 exec_lo, s4
+; DAGISEL-NEXT:    s_mov_b32 s33, s0
 ; DAGISEL-NEXT:    s_wait_loadcnt 0x0
 ; DAGISEL-NEXT:    s_wait_alu 0xfffe
 ; DAGISEL-NEXT:    s_setpc_b64 s[30:31]
@@ -1488,414 +1405,331 @@ define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2
 ; GISEL-NEXT:    s_wait_samplecnt 0x0
 ; GISEL-NEXT:    s_wait_bvhcnt 0x0
 ; GISEL-NEXT:    s_wait_kmcnt 0x0
-; GISEL-NEXT:    s_mov_b32 s35, s33
+; GISEL-NEXT:    s_mov_b32 s0, s33
 ; GISEL-NEXT:    s_mov_b32 s33, s32
-; GISEL-NEXT:    s_xor_saveexec_b32 s34, -1
+; GISEL-NEXT:    s_xor_saveexec_b32 s4, -1
 ; GISEL-NEXT:    s_clause 0x1f
-; GISEL-NEXT:    scratch_store_b32 off, v0, s33 offset:8
-; GISEL-NEXT:    scratch_store_b32 off, v1, s33 offset:12
-; GISEL-NEXT:    scratch_store_b32 off, v2, s33 offset:16
-; GISEL-NEXT:    scratch_store_b32 off, v3, s33 offset:20
-; GISEL-NEXT:    scratch_store_b32 off, v4, s33 offset:24
-; GISEL-NEXT:    scratch_store_b32 off, v5, s33 offset:28
-; GISEL-NEXT:    scratch_store_b32 off, v6, s33 offset:32
-; GISEL-NEXT:    scratch_store_b32 off, v7, s33 offset:36
-; GISEL-NEXT:    scratch_store_b32 off, v8, s33 offset:40
-; GISEL-NEXT:    scratch_store_b32 off, v9, s33 offset:44
-; GISEL-NEXT:    scratch_store_b32 off, v10, s33 offset:48
-; GISEL-NEXT:    scratch_store_b32 off, v11, s33 offset:52
-; GISEL-NEXT:    scratch_store_b32 off, v12, s33 offset:56
-; GISEL-NEXT:    scratch_store_b32 off, v13, s33 offset:60
-; GISEL-NEXT:    scratch_store_b32 off, v14, s33 offset:64
-; GISEL-NEXT:    scratch_store_b32 off, v15, s33 offset:68
-; GISEL-NEXT:    scratch_store_b32 off, v16, s33 offset:72
-; GISEL-NEXT:    scratch_store_b32 off, v17, s33 offset:76
-; GISEL-NEXT:    scratch_store_b32 off, v18, s33 offset:80
-; GISEL-NEXT:    scratch_store_b32 off, v19, s33 offset:84
-; GISEL-NEXT:    scratch_store_b32 off, v20, s33 offset:88
-; GISEL-NEXT:    scratch_store_b32 off, v21, s33 offset:92
-; GISEL-NEXT:    scratch_store_b32 off, v22, s33 offset:96
-; GISEL-NEXT:    scratch_store_b32 off, v23, s33 offset:100
-; GISEL-NEXT:    scratch_store_b32 off, v24, s33 offset:104
-; GISEL-NEXT:    scratch_store_b32 off, v25, s33 offset:108
-; GISEL-NEXT:    scratch_store_b32 off, v26, s33 offset:112
-; GISEL-NEXT:    scratch_store_b32 off, v27, s33 offset:116
-; GISEL-NEXT:    scratch_store_b32 off, v28, s33 offset:120
-; GISEL-NEXT:    scratch_store_b32 off, v29, s33 offset:124
-; GISEL-NEXT:    scratch_store_b32 off, v30, s33 offset:128
-; GISEL-NEXT:    scratch_store_b32 off, v31, s33 offset:132
+; GISEL-NEXT:    scratch_store_b32 off, v0, s33 offset:4
+; GISEL-NEXT:    scratch_store_b32 off, v1, s33 offset:8
+; GISEL-NEXT:    scratch_store_b32 off, v2, s33 offset:12
+; GISEL-NEXT:    scratch_store_b32 off, v3, s33 offset:16
+; GISEL-NEXT:    scratch_store_b32 off, v4, s33 offset:20
+; GISEL-NEXT:    scratch_store_b32 off, v5, s33 offset:24
+; GISEL-NEXT:    scratch_store_b32 off, v6, s33 offset:28
+; GISEL-NEXT:    scratch_store_b32 off, v7, s33 offset:32
+; GISEL-NEXT:    scratch_store_b32 off, v8, s33 offset:36
+; GISEL-NEXT:    scratch_store_b32 off, v9, s33 offset:40
+; GISEL-NEXT:    scratch_store_b32 off, v10, s33 offset:44
+; GISEL-NEXT:    scratch_store_b32 off, v11, s33 offset:48
+; GISEL-NEXT:    scratch_store_b32 off, v12, s33 offset:52
+; GISEL-NEXT:    scratch_store_b32 off, v13, s33 offset:56
+; GISEL-NEXT:    scratch_store_b32 off, v14, s33 offset:60
+; GISEL-NEXT:    scratch_store_b32 off, v15, s33 offset:64
+; GISEL-NEXT:    scratch_store_b32 off, v16, s33 offset:68
+; GISEL-NEXT:    scratch_store_b32 off, v17, s33 offset:72
+; GISEL-NEXT:    scratch_store_b32 off, v18, s33 offset:76
+; GISEL-NEXT:    scratch_store_b32 off, v19, s33 offset:80
+; GISEL-NEXT:    scratch_store_b32 off, v20, s33 offset:84
+; GISEL-NEXT:    scratch_store_b32 off, v21, s33 offset:88
+; GISEL-NEXT:    scratch_store_b32 off, v22, s33 offset:92
+; GISEL-NEXT:    scratch_store_b32 off, v23, s33 offset:96
+; GISEL-NEXT:    scratch_store_b32 off, v24, s33 offset:100
+; GISEL-NEXT:    scratch_store_b32 off, v25, s33 offset:104
+; GISEL-NEXT:    scratch_store_b32 off, v26, s33 offset:108
+; GISEL-NEXT:    scratch_store_b32 off, v27, s33 offset:112
+; GISEL-NEXT:    scratch_store_b32 off, v28, s33 offset:116
+; GISEL-NEXT:    scratch_store_b32 off, v29, s33 offset:120
+; GISEL-NEXT:    scratch_store_b32 off, v30, s33 offset:124
+; GISEL-NEXT:    scratch_store_b32 off, v31, s33 offset:128
 ; GISEL-NEXT:    s_clause 0x1f
-; GISEL-NEXT:    scratch_store_b32 off, v32, s33 offset:136
-; GISEL-NEXT:    scratch_store_b32 off, v33, s33 offset:140
-; GISEL-NEXT:    scratch_store_b32 off, v34, s33 offset:144
-; GISEL-NEXT:    scratch_store_b32 off, v35, s33 offset:148
-; GISEL-NEXT:    scratch_store_b32 off, v36, s33 offset:152
-; GISEL-NEXT:    scratch_store_b32 off, v37, s33 offset:156
-; GISEL-NEXT:    scratch_store_b32 off, v38, s33 offset:160
-; GISEL-NEXT:    scratch_store_b32 off, v39, s33 offset:164
-; GISEL-NEXT:    scratch_store_b32 off, v48, s33 offset:168
-; GISEL-NEXT:    scratch_store_b32 off, v49, s33 offset:172
-; GISEL-NEXT:    scratch_store_b32 off, v50, s33 offset:176
-; GISEL-NEXT:    scratch_store_b32 off, v51, s33 offset:180
-; GISEL-NEXT:    scratch_store_b32 off, v52, s33 offset:184
-; GISEL-NEXT:    scratch_store_b32 off, v53, s33 offset:188
-; GISEL-NEXT:    scratch_store_b32 off, v54, s33 offset:192
-; GISEL-NEXT:    scratch_store_b32 off, v55, s33 offset:196
-; GISEL-NEXT:    scratch_store_b32 off, v64, s33 offset:200
-; GISEL-NEXT:    scratch_store_b32 off, v65, s33 offset:204
-; GISEL-NEXT:    scratch_store_b32 off, v66, s33 offset:208
-; GISEL-NEXT:    scratch_store_b32 off, v67, s33 offset:212
-; GISEL-NEXT:    scratch_store_b32 off, v68, s33 offset:216
-; GISEL-NEXT:    scratch_store_b32 off, v69, s33 offset:220
-; GISEL-NEXT:    scratch_store_b32 off, v70, s33 offset:224
-; GISEL-NEXT:    scratch_store_b32 off, v71, s33 offset:228
-; GISEL-NEXT:    scratch_store_b32 off, v80, s33 offset:232
-; GISEL-NEXT:    scratch_store_b32 off, v81, s33 offset:236
-; GISEL-NEXT:    scratch_store_b32 off, v82, s33 offset:240
-; GISEL-NEXT:    scratch_store_b32 off, v83, s33 offset:244
-; GISEL-NEXT:    scratch_store_b32 off, v84, s33 offset:248
-; GISEL-NEXT:    scratch_store_b32 off, v85, s33 offset:252
-; GISEL-NEXT:    scratch_store_b32 off, v86, s33 offset:256
-; GISEL-NEXT:    scratch_store_b32 off, v87, s33 offset:260
+; GISEL-NEXT:    scratch_store_b32 off, v32, s33 offset:132
+; GISEL-NEXT:    scratch_store_b32 off, v33, s33 offset:136
+; GISEL-NEXT:    scratch_store_b32 off, v34, s33 offset:140
+; GISEL-NEXT:    scratch_store_b32 off, v35, s33 offset:144
+; GISEL-NEXT:    scratch_store_b32 off, v36, s33 offset:148
+; GISEL-NEXT:    scratch_store_b32 off, v37, s33 offset:152
+; GISEL-NEXT:    scratch_store_b32 off, v38, s33 offset:156
+; GISEL-NEXT:    scratch_store_b32 off, v39, s33 offset:160
+; GISEL-NEXT:    scratch_store_b32 off, v48, s33 offset:164
+; GISEL-NEXT:    scratch_store_b32 off, v49, s33 offset:168
+; GISEL-NEXT:    scratch_store_b32 off, v50, s33 offset:172
+; GISEL-NEXT:    scratch_store_b32 off, v51, s33 offset:176
+; GISEL-NEXT:    scratch_store_b32 off, v52, s33 offset:180
+; GISEL-NEXT:    scratch_store_b32 off, v53, s33 offset:184
+; GISEL-NEXT:    scratch_store_b32 off, v54, s33 offset:188
+; GISEL-NEXT:    scratch_store_b32 off, v55, s33 offset:192
+; GISEL-NEXT:    scratch_store_b32 off, v64, s33 offset:196
+; GISEL-NEXT:    scratch_store_b32 off, v65, s33 offset:200
+; GISEL-NEXT:    scratch_store_b32 off, v66, s33 offset:204
+; GISEL-NEXT:    scratch_store_b32 off, v67, s33 offset:208
+; GISEL-NEXT:    scratch_store_b32 off, v68, s33 offset:212
+; GISEL-NEXT:    scratch_store_b32 off, v69, s33 offset:216
+; GISEL-NEXT:    scratch_store_b32 off, v70, s33 offset:220
+; GISEL-NEXT:    scratch_store_b32 off, v71, s33 offset:224
+; GISEL-NEXT:    scratch_store_b32 off, v80, s33 offset:228
+; GISEL-NEXT:    scratch_store_b32 off, v81, s33 offset:232
+; GISEL-NEXT:    scratch_store_b32 off, v82, s33 offset:236
+; GISEL-NEXT:    scratch_store_b32 off, v83, s33 offset:240
+; GISEL-NEXT:    scratch_store_b32 off, v84, s33 offset:244
+; GISEL-NEXT:    scratch_store_b32 off, v85, s33 offset:248
+; GISEL-NEXT:    scratch_store_b32 off, v86, s33 offset:252
+; GISEL-NEXT:    scratch_store_b32 off, v87, s33 offset:256
 ; GISEL-NEXT:    s_clause 0x1f
-; GISEL-NEXT:    scratch_store_b32 off, v96, s33 offset:264
-; GISEL-NEXT:    scratch_store_b32 off, v97, s33 offset:268
-; GISEL-NEXT:    scratch_store_b32 off, v98, s33 offset:272
-; GISEL-NEXT:    scratch_store_b32 off, v99, s33 offset:276
-; GISEL-NEXT:    scratch_store_b32 off, v100, s33 offset:280
-; GISEL-NEXT:    scratch_store_b32 off, v101, s33 offset:284
-; GISEL-NEXT:    scratch_store_b32 off, v102, s33 offset:288
-; GISEL-NEXT:    scratch_store_b32 off, v103, s33 offset:292
-; GISEL-NEXT:    scratch_store_b32 off, v112, s33 offset:296
-; GISEL-NEXT:    scratch_store_b32 off, v113, s33 offset:300
-; GISEL-NEXT:    scratch_store_b32 off, v114, s33 offset:304
-; GISEL-NEXT:    scratch_store_b32 off, v115, s33 offset:308
-; GISEL-NEXT:    scratch_store_b32 off, v116, s33 offset:312
-; GISEL-NEXT:    scratch_store_b32 off, v117, s33 offset:316
-; GISEL-NEXT:    scratch_store_b32 off, v118, s33 offset:320
-; GISEL-NEXT:    scratch_store_b32 off, v119, s33 offset:324
-; GISEL-NEXT:    scratch_store_b32 off, v128, s33 offset:328
-; GISEL-NEXT:    scratch_store_b32 off, v129, s33 offset:332
-; GISEL-NEXT:    scratch_store_b32 off, v130, s33 offset:336
-; GISEL-NEXT:    scratch_store_b32 off, v131, s33 offset:340
-; GISEL-NEXT:    scratch_store_b32 off, v132, s33 offset:344
-; GISEL-NEXT:    scratch_store_b32 off, v133, s33 offset:348
-; GISEL-NEXT:    scratch_store_b32 off, v134, s33 offset:352
-; GISEL-NEXT:    scratch_store_b32 off, v135, s33 offset:356
-; GISEL-NEXT:    scratch_store_b32 off, v144, s33 offset:360
-; GISEL-NEXT:    scratch_store_b32 off, v145, s33 offset:364
-; GISEL-NEXT:    scratch_store_b32 off, v146, s33 offset:368
-; GISEL-NEXT:    scratch_store_b32 off, v147, s33 offset:372
-; GISEL-NEXT:    scratch_store_b32 off, v148, s33 offset:376
-; GISEL-NEXT:    scratch_store_b32 off, v149, s33 offset:380
-; GISEL-NEXT:    scratch_store_b32 off, v150, s33 offset:384
-; GISEL-NEXT:    scratch_store_b32 off, v151, s33 offset:388
+; GISEL-NEXT:    scratch_store_b32 off, v96, s33 offset:260
+; GISEL-NEXT:    scratch_store_b32 off, v97, s33 offset:264
+; GISEL-NEXT:    scratch_store_b32 off, v98, s33 offset:268
+; GISEL-NEXT:    scratch_store_b32 off, v99, s33 offset:272
+; GISEL-NEXT:    scratch_store_b32 off, v100, s33 offset:276
+; GISEL-NEXT:    scratch_store_b32 off, v101, s33 offset:280
+; GISEL-NEXT:    scratch_store_b32 off, v102, s33 offset:284
+; GISEL-NEXT:    scratch_store_b32 off, v103, s33 offset:288
+; GISEL-NEXT:    scratch_store_b32 off, v112, s33 offset:292
+; GISEL-NEXT:    scratch_store_b32 off, v113, s33 offset:296
+; GISEL-NEXT:    scratch_store_b32 off, v114, s33 offset:300
+; GISEL-NEXT:    scratch_store_b32 off, v115, s33 offset:304
+; GISEL-NEXT:    scratch_store_b32 off, v116, s33 offset:308
+; GISEL-NEXT:    scratch_store_b32 off, v117, s33 offset:312
+; GISEL-NEXT:    scratch_store_b32 off, v118, s33 offset:316
+; GISEL-NEXT:    scratch_store_b32 off, v119, s33 offset:320
+; GISEL-NEXT:    scratch_store_b32 off, v128, s33 offset:324
+; GISEL-NEXT:    scratch_store_b32 off, v129, s33 offset:328
+; GISEL-NEXT:    scratch_store_b32 off, v130, s33 offset:332
+; GISEL-NEXT:    scratch_store_b32 off, v131, s33 offset:336
+; GISEL-NEXT:    scratch_store_b32 off, v132, s33 offset:340
+; GISEL-NEXT:    scratch_store_b32 off, v133, s33 offset:344
+; GISEL-NEXT:    scratch_store_b32 off, v134, s33 offset:348
+; GISEL-NEXT:    scratch_store_b32 off, v135, s33 offset:352
+; GISEL-NEXT:    scratch_store_b32 off, v144, s33 offset:356
+; GISEL-NEXT:    scratch_store_b32 off, v145, s33 offset:360
+; GISEL-NEXT:    scratch_store_b32 off, v146, s33 offset:364
+; GISEL-NEXT:    scratch_store_b32 off, v147, s33 offset:368
+; GISEL-NEXT:    scratch_store_b32 off, v148, s33 offset:372
+; GISEL-NEXT:    scratch_store_b32 off, v149, s33 offset:376
+; GISEL-NEXT:    scratch_store_b32 off, v150, s33 offset:380
+; GISEL-NEXT:    scratch_store_b32 off, v151, s33 offset:384
 ; GISEL-NEXT:    s_clause 0x1f
-; GISEL-NEXT:    scratch_store_b32 off, v160, s33 offset:392
-; GISEL-NEXT:    scratch_store_b32 off, v161, s33 offset:396
-; GISEL-NEXT:    scratch_store_b32 off, v162, s33 offset:400
-; GISEL-NEXT:    scratch_store_b32 off, v163, s33 offset:404
-; GISEL-NEXT:    scratch_store_b32 off, v164, s33 offset:408
-; GISEL-NEXT:    scratch_store_b32 off, v165, s33 offset:412
-; GISEL-NEXT:    scratch_store_b32 off, v166, s33 offset:416
-; GISEL-NEXT:    scratch_store_b32 off, v167, s33 offset:420
-; GISEL-NEXT:    scratch_store_b32 off, v176, s33 offset:424
-; GISEL-NEXT:    scratch_store_b32 off, v177, s33 offset:428
-; GISEL-NEXT:    scratch_store_b32 off, v178, s33 offset:432
-; GISEL-NEXT:    scratch_store_b32 off, v179, s33 offset:436
-; GISEL-NEXT:    scratch_store_b32 off, v180, s33 offset:440
-; GISEL-NEXT:    scratch_store_b32 off, v181, s33 offset:444
-; GISEL-NEXT:    scratch_store_b32 off, v182, s33 offset:448
-; GISEL-NEXT:    scratch_store_b32 off, v183, s33 offset:452
-; GISEL-NEXT:    scratch_store_b32 off, v192, s33 offset:456
-; GISEL-NEXT:    scratch_store_b32 off, v193, s33 offset:460
-; GISEL-NEXT:    scratch_store_b32 off, v194, s33 offset:464
-; GISEL-NEXT:    scratch_store_b32 off, v195, s33 offset:468
-; GISEL-NEXT:    scratch_store_b32 off, v196, s33 offset:472
-; GISEL-NEXT:    scratch_store_b32 off, v197, s33 offset:476
-; GISEL-NEXT:    scratch_store_b32 off, v198, s33 offset:480
-; GISEL-NEXT:    scratch_store_b32 off, v199, s33 offset:484
-; GISEL-NEXT:    scratch_store_b32 off, v208, s33 offset:488
-; GISEL-NEXT:    scratch_store_b32 off, v209, s33 offset:492
-; GISEL-NEXT:    scratch_store_b32 off, v210, s33 offset:496
-; GISEL-NEXT:    scratch_store_b32 off, v211, s33 offset:500
-; GISEL-NEXT:    scratch_store_b32 off, v212, s33 offset:504
-; GISEL-NEXT:    scratch_store_b32 off, v213, s33 offset:508
-; GISEL-NEXT:    scratch_store_b32 off, v214, s33 offset:512
-; GISEL-NEXT:    scratch_store_b32 off, v215, s33 offset:516
+; GISEL-NEXT:    scratch_store_b32 off, v160, s33 offset:388
+; GISEL-NEXT:    scratch_store_b32 off, v161, s33 offset:392
+; GISEL-NEXT:    scratch_store_b32 off, v162, s33 offset:396
+; GISEL-NEXT:    scratch_store_b32 off, v163, s33 offset:400
+; GISEL-NEXT:    scratch_store_b32 off, v164, s33 offset:404
+; GISEL-NEXT:    scratch_store_b32 off, v165, s33 offset:408
+; GISEL-NEXT:    scratch_store_b32 off, v166, s33 offset:412
+; GISEL-NEXT:    scratch_store_b32 off, v167, s33 offset:416
+; GISEL-NEXT:    scratch_store_b32 off, v176, s33 offset:420
+; GISEL-NEXT:    scratch_store_b32 off, v177, s33 offset:424
+; GISEL-NEXT:    scratch_store_b32 off, v178, s33 offset:428
+; GISEL-NEXT:    scratch_store_b32 off, v179, s33 offset:432
+; GISEL-NEXT:    scratch_store_b32 off, v180, s33 offset:436
+; GISEL-NEXT:    scratch_store_b32 off, v181, s33 offset:440
+; GISEL-NEXT:    scratch_store_b32 off, v182, s33 offset:444
+; GISEL-NEXT:    scratch_store_b32 off, v183, s33 offset:448
+; GISEL-NEXT:    scratch_store_b32 off, v192, s33 offset:452
+; GISEL-NEXT:    scratch_store_b32 off, v193, s33 offset:456
+; GISEL-NEXT:    scratch_store_b32 off, v194, s33 offset:460
+; GISEL-NEXT:    scratch_store_b32 off, v195, s33 offset:464
+; GISEL-NEXT:    scratch_store_b32 off, v196, s33 offset:468
+; GISEL-NEXT:    scratch_store_b32 off, v197, s33 offset:472
+; GISEL-NEXT:    scratch_store_b32 off, v198, s33 offset:476
+; GISEL-NEXT:    scratch_store_b32 off, v199, s33 offset:480
+; GISEL-NEXT:    scratch_store_b32 off, v208, s33 offset:484
+; GISEL-NEXT:    scratch_store_b32 off, v209, s33 offset:488
+; GISEL-NEXT:    scratch_store_b32 off, v210, s33 offset:492
+; GISEL-NEXT:    scratch_store_b32 off, v211, s33 offset:496
+; GISEL-NEXT:    scratch_store_b32 off, v212, s33 offset:500
+; GISEL-NEXT:    scratch_store_b32 off, v213, s33 offset:504
+; GISEL-NEXT:    scratch_store_b32 off, v214, s33 offset:508
+; GISEL-NEXT:    scratch_store_b32 off, v215, s33 offset:512
 ; GISEL-NEXT:    s_clause 0xf
-; GISEL-NEXT:    scratch_store_b32 off, v224, s33 offset:520
-; GISEL-NEXT:    scratch_store_b32 off, v225, s33 offset:524
-; GISEL-NEXT:    scratch_store_b32 off, v226, s33 offset:528
-; GISEL-NEXT:    scratch_store_b32 off, v227, s33 offset:532
-; GISEL-NEXT:    scratch_store_b32 off, v228, s33 offset:536
-; GISEL-NEXT:    scratch_store_b32 off, v229, s33 offset:540
-; GISEL-NEXT:    scratch_store_b32 off, v230, s33 offset:544
-; GISEL-NEXT:    scratch_store_b32 off, v231, s33 offset:548
-; GISEL-NEXT:    scratch_store_b32 off, v240, s33 offset:552
-; GISEL-NEXT:    scratch_store_b32 off, v241, s33 offset:556
-; GISEL-NEXT:    scratch_store_b32 off, v242, s33 offset:560
-; GISEL-NEXT:    scratch_store_b32 off, v243, s33 offset:564
-; GISEL-NEXT:    scratch_store_b32 off, v244, s33 offset:568
-; GISEL-NEXT:    scratch_store_b32 off, v245, s33 offset:572
-; GISEL-NEXT:    scratch_store_b32 off, v246, s33 offset:576
-; GISEL-NEXT:    scratch_store_b32 off, v247, s33 offset:580
+; GISEL-NEXT:    scratch_store_b32 off, v224, s33 offset:516
+; GISEL-NEXT:    scratch_store_b32 off, v225, s33 offset:520
+; GISEL-NEXT:    scratch_store_b32 off, v226, s33 offset:524
+; GISEL-NEXT:    scratch_store_b32 off, v227, s33 offset:528
+; GISEL-NEXT:    scratch_store_b32 off, v228, s33 offset:532
+; GISEL-NEXT:    scratch_store_b32 off, v229, s33 offset:536
+; GISEL-NEXT:    scratch_store_b32 off, v230, s33 offset:540
+; GISEL-NEXT:    scratch_store_b32 off, v231, s33 offset:544
+; GISEL-NEXT:    scratch_store_b32 off, v240, s33 offset:548
+; GISEL-NEXT:    scratch_store_b32 off, v241, s33 offset:552
+; GISEL-NEXT:    scratch_store_b32 off, v242, s33 offset:556
+; GISEL-NEXT:    scratch_store_b32 off, v243, s33 offset:560
+; GISEL-NEXT:    scratch_store_b32 off, v244, s33 offset:564
+; GISEL-NEXT:    scratch_store_b32 off, v245, s33 offset:568
+; GISEL-NEXT:    scratch_store_b32 off, v246, s33 offset:572
+; GISEL-NEXT:    scratch_store_b32 off, v247, s33 offset:576
 ; GISEL-NEXT:    s_mov_b32 exec_lo, -1
-; GISEL-NEXT:    s_clause 0x1
-; GISEL-NEXT:    scratch_store_b32 off, v40, s33
-; GISEL-NEXT:    scratch_store_b32 off, v41, s33 offset:4
-; GISEL-NEXT:    v_writelane_b32 v40, s4, 0
-; GISEL-NEXT:    v_writelane_b32 v41, s76, 0
+; GISEL-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
+; GISEL-NEXT:    s_wait_alu 0xfffe
+; GISEL-NEXT:    v_writelane_b32 v40, s0, 3
 ; GISEL-NEXT:    v_mov_b32_e32 v2, v0
 ; GISEL-NEXT:    v_swap_b32 v0, v1
-; GISEL-NEXT:    v_writelane_b32 v40, s5, 1
-; GISEL-NEXT:    v_writelane_b32 v41, s77, 1
 ; GISEL-NEXT:    s_mov_b32 s0, gfx_callee@abs32@lo
+; GISEL-NEXT:    v_writelane_b32 v40, s4, 0
 ; GISEL-NEXT:    s_mov_b32 s1, gfx_callee@abs32@hi
 ; GISEL-NEXT:    s_addk_co_i32 s32, 0x250
-; GISEL-NEXT:    v_writelane_b32 v40, s6, 2
-; GISEL-NEXT:    v_writelane_b32 v41, s78, 2
-; GISEL-NEXT:    v_writelane_b32 v40, s7, 3
-; GISEL-NEXT:    v_writelane_b32 v41, s79, 3
-; GISEL-NEXT:    v_writelane_b32 v40, s8, 4
-; GISEL-NEXT:    v_writelane_b32 v41, s88, 4
-; GISEL-NEXT:    v_writelane_b32 v40, s9, 5
-; GISEL-NEXT:    v_writelane_b32 v41, s89, 5
-; GISEL-NEXT:    s_mov_b64 s[8:9], 0
-; GISEL-NEXT:    v_writelane_b32 v40, s10, 6
-; GISEL-NEXT:    v_writelane_b32 v41, s90, 6
-; GISEL-NEXT:    v_writelane_b32 v40, s11, 7
-; GISEL-NEXT:    v_writelane_b32 v41, s91, 7
-; GISEL-NEXT:    v_writelane_b32 v40, s12, 8
-; GISEL-NEXT:    v_writelane_b32 v41, s92, 8
-; GISEL-NEXT:    v_writelane_b32 v40, s13, 9
-; GISEL-NEXT:    v_writelane_b32 v41, s93, 9
-; GISEL-NEXT:    v_writelane_b32 v40, s14, 10
-; GISEL-NEXT:    v_writelane_b32 v41, s94, 10
-; GISEL-NEXT:    v_writelane_b32 v40, s15, 11
-; GISEL-NEXT:    v_writelane_b32 v41, s95, 11
-; GISEL-NEXT:    v_writelane_b32 v40, s16, 12
-; GISEL-NEXT:    v_writelane_b32 v40, s17, 13
-; GISEL-NEXT:    v_writelane_b32 v40, s18, 14
-; GISEL-NEXT:    v_writelane_b32 v40, s19, 15
-; GISEL-NEXT:    v_writelane_b32 v40, s20, 16
-; GISEL-NEXT:    v_writelane_b32 v40, s21, 17
-; GISEL-NEXT:    v_writelane_b32 v40, s22, 18
-; GISEL-NEXT:    v_writelane_b32 v40, s23, 19
-; GISEL-NEXT:    v_writelane_b32 v40, s24, 20
-; GISEL-NEXT:    v_writelane_b32 v40, s25, 21
-; GISEL-NEXT:    v_writelane_b32 v40, s26, 22
-; GISEL-NEXT:    v_writelane_b32 v40, s27, 23
-; GISEL-NEXT:    v_writelane_b32 v40, s28, 24
-; GISEL-NEXT:    v_writelane_b32 v40, s29, 25
-; GISEL-NEXT:    v_writelane_b32 v40, s30, 26
-; GISEL-NEXT:    v_writelane_b32 v40, s31, 27
-; GISEL-NEXT:    v_writelane_b32 v40, s72, 28
-; GISEL-NEXT:    v_writelane_b32 v40, s73, 29
-; GISEL-NEXT:    v_writelane_b32 v40, s74, 30
-; GISEL-NEXT:    v_writelane_b32 v40, s75, 31
+; GISEL-NEXT:    v_writelane_b32 v40, s30, 1
+; GISEL-NEXT:    v_writelane_b32 v40, s31, 2
 ; GISEL-NEXT:    s_wait_alu 0xfffe
 ; GISEL-NEXT:    s_swappc_b64 s[30:31], s[0:1]
-; GISEL-NEXT:    v_readlane_b32 s95, v41, 11
-; GISEL-NEXT:    v_readlane_b32 s94, v41, 10
-; GISEL-NEXT:    v_readlane_b32 s93, v41, 9
-; GISEL-NEXT:    v_readlane_b32 s92, v41, 8
-; GISEL-NEXT:    v_readlane_b32 s91, v41, 7
-; GISEL-NEXT:    v_readlane_b32 s90, v41, 6
-; GISEL-NEXT:    v_readlane_b32 s89, v41, 5
-; GISEL-NEXT:    v_readlane_b32 s88, v41, 4
-; GISEL-NEXT:    v_readlane_b32 s79, v41, 3
-; GISEL-NEXT:    v_readlane_b32 s78, v41, 2
-; GISEL-NEXT:    v_readlane_b32 s77, v41, 1
-; GISEL-NEXT:    v_readlane_b32 s76, v41, 0
-; GISEL-NEXT:    v_readlane_b32 s75, v40, 31
-; GISEL-NEXT:    v_readlane_b32 s74, v40, 30
-; GISEL-NEXT:    v_readlane_b32 s73, v40, 29
-; GISEL-NEXT:    v_readlane_b32 s72, v40, 28
-; GISEL-NEXT:    v_readlane_b32 s31, v40, 27
-; GISEL-NEXT:    v_readlane_b32 s30, v40, 26
-; GISEL-NEXT:    v_readlane_b32 s29, v40, 25
-; GISEL-NEXT:    v_readlane_b32 s28, v40, 24
-; GISEL-NEXT:    v_readlane_b32 s27, v40, 23
-; GISEL-NEXT:    v_readlane_b32 s26, v40, 22
-; GISEL-NEXT:    v_readlane_b32 s25, v40, 21
-; GISEL-NEXT:    v_readlane_b32 s24, v40, 20
-; GISEL-NEXT:    v_readlane_b32 s23, v40, 19
-; GISEL-NEXT:    v_readlane_b32 s22, v40, 18
-; GISEL-NEXT:    v_readlane_b32 s21, v40, 17
-; GISEL-NEXT:    v_readlane_b32 s20, v40, 16
-; GISEL-NEXT:    v_readlane_b32 s19, v40, 15
-; GISEL-NEXT:    v_readlane_b32 s18, v40, 14
-; GISEL-NEXT:    v_readlane_b32 s17, v40, 13
-; GISEL-NEXT:    v_readlane_b32 s16, v40, 12
-; GISEL-NEXT:    v_readlane_b32 s15, v40, 11
-; GISEL-NEXT:    v_readlane_b32 s14, v40, 10
-; GISEL-NEXT:    v_readlane_b32 s13, v40, 9
-; GISEL-NEXT:    v_readlane_b32 s12, v40, 8
-; GISEL-NEXT:    v_readlane_b32 s11, v40, 7
-; GISEL-NEXT:    v_readlane_b32 s10, v40, 6
-; GISEL-NEXT:    v_readlane_b32 s9, v40, 5
-; GISEL-NEXT:    v_readlane_b32 s8, v40, 4
-; GISEL-NEXT:    v_readlane_b32 s7, v40, 3
-; GISEL-NEXT:    v_readlane_b32 s6, v40, 2
-; GISEL-NEXT:    v_readlane_b32 s5, v40, 1
+; GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GISEL-NEXT:    v_readlane_b32 s31, v40, 2
+; GISEL-NEXT:    v_readlane_b32 s30, v40, 1
 ; GISEL-NEXT:    v_readlane_b32 s4, v40, 0
-; GISEL-NEXT:    s_clause 0x1
-; GISEL-NEXT:    scratch_load_b32 v40, off, s33
-; GISEL-NEXT:    scratch_load_b32 v41, off, s33 offset:4
+; GISEL-NEXT:    v_readlane_b32 s0, v40, 3
+; GISEL-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
 ; GISEL-NEXT:    s_mov_b32 s32, s33
-; GISEL-NEXT:    s_xor_b32 exec_lo, s34, -1
+; GISEL-NEXT:    s_xor_b32 exec_lo, s4, -1
 ; GISEL-NEXT:    s_clause 0x1f
-; GISEL-NEXT:    scratch_load_b32 v0, off, s33 offset:8
-; GISEL-NEXT:    scratch_load_b32 v1, off, s33 offset:12
-; GISEL-NEXT:    scratch_load_b32 v2, off, s33 offset:16
-; GISEL-NEXT:    scratch_load_b32 v3, off, s33 offset:20
-; GISEL-NEXT:    scratch_load_b32 v4, off, s33 offset:24
-; GISEL-NEXT:    scratch_load_b32 v5, off, s33 offset:28
-; GISEL-NEXT:    scratch_load_b32 v6, off, s33 offset:32
-; GISEL-NEXT:    scratch_load_b32 v7, off, s33 offset:36
-; GISEL-NEXT:    scratch_load_b32 v8, off, s33 offset:40
-; GISEL-NEXT:    scratch_load_b32 v9, off, s33 offset:44
-; GISEL-NEXT:    scratch_load_b32 v10, off, s33 offset:48
-; GISEL-NEXT:    scratch_load_b32 v11, off, s33 offset:52
-; GISEL-NEXT:    scratch_load_b32 v12, off, s33 offset:56
-; GISEL-NEXT:    scratch_load_b32 v13, off, s33 offset:60
-; GISEL-NEXT:    scratch_load_b32 v14, off, s33 offset:64
-; GISEL-NEXT:    scratch_load_b32 v15, off, s33 offset:68
-; GISEL-NEXT:    scratch_load_b32 v16, off, s33 offset:72
-; GISEL-NEXT:    scratch_load_b32 v17, off, s33 offset:76
-; GISEL-NEXT:    scratch_load_b32 v18, off, s33 offset:80
-; GISEL-NEXT:    scratch_load_b32 v19, off, s33 offset:84
-; GISEL-NEXT:    scratch_load_b32 v20, off, s33 offset:88
-; GISEL-NEXT:    scratch_load_b32 v21, off, s33 offset:92
-; GISEL-NEXT:    scratch_load_b32 v22, off, s33 offset:96
-; GISEL-NEXT:    scratch_load_b32 v23, off, s33 offset:100
-; GISEL-NEXT:    scratch_load_b32 v24, off, s33 offset:104
-; GISEL-NEXT:    scratch_load_b32 v25, off, s33 offset:108
-; GISEL-NEXT:    scratch_load_b32 v26, off, s33 offset:112
-; GISEL-NEXT:    scratch_load_b32 v27, off, s33 offset:116
-; GISEL-NEXT:    scratch_load_b32 v28, off, s33 offset:120
-; GISEL-NEXT:    scratch_load_b32 v29, off, s33 offset:124
-; GISEL-NEXT:    scratch_load_b32 v30, off, s33 offset:128
-; GISEL-NEXT:    scratch_load_b32 v31, off, s33 offset:132
+; GISEL-NEXT:    scratch_load_b32 v0, off, s33 offset:4
+; GISEL-NEXT:    scratch_load_b32 v1, off, s33 offset:8
+; GISEL-NEXT:    scratch_load_b32 v2, off, s33 offset:12
+; GISEL-NEXT:    scratch_load_b32 v3, off, s33 offset:16
+; GISEL-NEXT:    scratch_load_b32 v4, off, s33 offset:20
+; GISEL-NEXT:    scratch_load_b32 v5, off, s33 offset:24
+; GISEL-NEXT:    scratch_load_b32 v6, off, s33 offset:28
+; GISEL-NEXT:    scratch_load_b32 v7, off, s33 offset:32
+; GISEL-NEXT:    scratch_load_b32 v8, off, s33 offset:36
+; GISEL-NEXT:    scratch_load_b32 v9, off, s33 offset:40
+; GISEL-NEXT:    scratch_load_b32 v10, off, s33 offset:44
+; GISEL-NEXT:    scratch_load_b32 v11, off, s33 offset:48
+; GISEL-NEXT:    scratch_load_b32 v12, off, s33 offset:52
+; GISEL-NEXT:    scratch_load_b32 v13, off, s33 offset:56
+; GISEL-NEXT:    scratch_load_b32 v14, off, s33 offset:60
+; GISEL-NEXT:    scratch_load_b32 v15, off, s33 offset:64
+; GISEL-NEXT:    scratch_load_b32 v16, off, s33 offset:68
+; GISEL-NEXT:    scratch_load_b32 v17, off, s33 offset:72
+; GISEL-NEXT:    scratch_load_b32 v18, off, s33 offset:76
+; GISEL-NEXT:    scratch_load_b32 v19, off, s33 offset:80
+; GISEL-NEXT:    scratch_load_b32 v20, off, s33 offset:84
+; GISEL-NEXT:    scratch_load_b32 v21, off, s33 offset:88
+; GISEL-NEXT:    scratch_load_b32 v22, off, s33 offset:92
+; GISEL-NEXT:    scratch_load_b32 v23, off, s33 offset:96
+; GISEL-NEXT:    scratch_load_b32 v24, off, s33 offset:100
+; GISEL-NEXT:    scratch_load_b32 v25, off, s33 offset:104
+; GISEL-NEXT:    scratch_load_b32 v26, off, s33 offset:108
+; GISEL-NEXT:    scratch_load_b32 v27, off, s33 offset:112
+; GISEL-NEXT:    scratch_load_b32 v28, off, s33 offset:116
+; GISEL-NEXT:    scratch_load_b32 v29, off, s33 offset:120
+; GISEL-NEXT:    scratch_load_b32 v30, off, s33 offset:124
+; GISEL-NEXT:    scratch_load_b32 v31, off, s33 offset:128
 ; GISEL-NEXT:    s_clause 0x1f
-; GISEL-NEXT:    scratch_load_b32 v32, off, s33 offset:136
-; GISEL-NEXT:    scratch_load_b32 v33, off, s33 offset:140
-; GISEL-NEXT:    scratch_load_b32 v34, off, s33 offset:144
-; GISEL-NEXT:    scratch_load_b32 v35, off, s33 offset:148
-; GISEL-NEXT:    scratch_load_b32 v36, off, s33 offset:152
-; GISEL-NEXT:    scratch_load_b32 v37, off, s33 offset:156
-; GISEL-NEXT:    scratch_load_b32 v38, off, s33 offset:160
-; GISEL-NEXT:    scratch_load_b32 v39, off, s33 offset:164
-; GISEL-NEXT:    scratch_load_b32 v48, off, s33 offset:168
-; GISEL-NEXT:    scratch_load_b32 v49, off, s33 offset:172
-; GISEL-NEXT:    scratch_load_b32 v50, off, s33 offset:176
-; GISEL-NEXT:    scratch_load_b32 v51, off, s33 offset:180
-; GISEL-NEXT:    scratch_load_b32 v52, off, s33 offset:184
-; GISEL-NEXT:    scratch_load_b32 v53, off, s33 offset:188
-; GISEL-NEXT:    scratch_load_b32 v54, off, s33 offset:192
-; GISEL-NEXT:    scratch_load_b32 v55, off, s33 offset:196
-; GISEL-NEXT:    scratch_load_b32 v64, off, s33 offset:200
-; GISEL-NEXT:    scratch_load_b32 v65, off, s33 offset:204
-; GISEL-NEXT:    scratch_load_b32 v66, off, s33 offset:208
-; GISEL-NEXT:    scratch_load_b32 v67, off, s33 offset:212
-; GISEL-NEXT:    scratch_load_b32 v68, off, s33 offset:216
-; GISEL-NEXT:    scratch_load_b32 v69, off, s33 offset:220
-; GISEL-NEXT:    scratch_load_b32 v70, off, s33 offset:224
-; GISEL-NEXT:    scratch_load_b32 v71, off, s33 offset:228
-; GISEL-NEXT:    scratch_load_b32 v80, off, s33 offset:232
-; GISEL-NEXT:    scratch_load_b32 v81, off, s33 offset:236
-; GISEL-NEXT:    scratch_load_b32 v82, off, s33 offset:240
-; GISEL-NEXT:    scratch_load_b32 v83, off, s33 offset:244
-; GISEL-NEXT:    scratch_load_b32 v84, off, s33 offset:248
-; GISEL-NEXT:    scratch_load_b32 v85, off, s33 offset:252
-; GISEL-NEXT:    scratch_load_b32 v86, off, s33 offset:256
-; GISEL-NEXT:    scratch_load_b32 v87, off, s33 offset:260
+; GISEL-NEXT:    scratch_load_b32 v32, off, s33 offset:132
+; GISEL-NEXT:    scratch_load_b32 v33, off, s33 offset:136
+; GISEL-NEXT:    scratch_load_b32 v34, off, s33 offset:140
+; GISEL-NEXT:    scratch_load_b32 v35, off, s33 offset:144
+; GISEL-NEXT:    scratch_load_b32 v36, off, s33 offset:148
+; GISEL-NEXT:    scratch_load_b32 v37, off, s33 offset:152
+; GISEL-NEXT:    scratch_load_b32 v38, off, s33 offset:156
+; GISEL-NEXT:    scratch_load_b32 v39, off, s33 offset:160
+; GISEL-NEXT:    scratch_load_b32 v48, off, s33 offset:164
+; GISEL-NEXT:    scratch_load_b32 v49, off, s33 offset:168
+; GISEL-NEXT:    scratch_load_b32 v50, off, s33 offset:172
+; GISEL-NEXT:    scratch_load_b32 v51, off, s33 offset:176
+; GISEL-NEXT:    scratch_load_b32 v52, off, s33 offset:180
+; GISEL-NEXT:    scratch_load_b32 v53, off, s33 offset:184
+; GISEL-NEXT:    scratch_load_b32 v54, off, s33 offset:188
+; GISEL-NEXT:    scratch_load_b32 v55, off, s33 offset:192
+; GISEL-NEXT:    scratch_load_b32 v64, off, s33 offset:196
+; GISEL-NEXT:    scratch_load_b32 v65, off, s33 offset:200
+; GISEL-NEXT:    scratch_load_b32 v66, off, s33 offset:204
+; GISEL-NEXT:    scratch_load_b32 v67, off, s33 offset:208
+; GISEL-NEXT:    scratch_load_b32 v68, off, s33 offset:212
+; GISEL-NEXT:    scratch_load_b32 v69, off, s33 offset:216
+; GISEL-NEXT:    scratch_load_b32 v70, off, s33 offset:220
+; GISEL-NEXT:    scratch_load_b32 v71, off, s33 offset:224
+; GISEL-NEXT:    scratch_load_b32 v80, off, s33 offset:228
+; GISEL-NEXT:    scratch_load_b32 v81, off, s33 offset:232
+; GISEL-NEXT:    scratch_load_b32 v82, off, s33 offset:236
+; GISEL-NEXT:    scratch_load_b32 v83, off, s33 offset:240
+; GISEL-NEXT:    scratch_load_b32 v84, off, s33 offset:244
+; GISEL-NEXT:    scratch_load_b32 v85, off, s33 offset:248
+; GISEL-NEXT:    scratch_load_b32 v86, off, s33 offset:252
+; GISEL-NEXT:    scratch_load_b32 v87, off, s33 offset:256
 ; GISEL-NEXT:    s_clause 0x1f
-; GISEL-NEXT:    scratch_load_b32 v96, off, s33 offset:264
-; GISEL-NEXT:    scratch_load_b32 v97, off, s33 offset:268
-; GISEL-NEXT:    scratch_load_b32 v98, off, s33 offset:272
-; GISEL-NEXT:    scratch_load_b32 v99, off, s33 offset:276
-; GISEL-NEXT:    scratch_load_b32 v100, off, s33 offset:280
-; GISEL-NEXT:    scratch_load_b32 v101, off, s33 offset:284
-; GISEL-NEXT:    scratch_load_b32 v102, off, s33 offset:288
-; GISEL-NEXT:    scratch_load_b32 v103, off, s33 offset:292
-; GISEL-NEXT:    scratch_load_b32 v112, off, s33 offset:296
-; GISEL-NEXT:    scratch_load_b32 v113, off, s33 offset:300
-; GISEL-NEXT:    scratch_load_b32 v114, off, s33 offset:304
-; GISEL-NEXT:    scratch_load_b32 v115, off, s33 offset:308
-; GISEL-NEXT:    scratch_load_b32 v116, off, s33 offset:312
-; GISEL-NEXT:    scratch_load_b32 v117, off, s33 offset:316
-; GISEL-NEXT:    scratch_load_b32 v118, off, s33 offset:320
-; GISEL-NEXT:    scratch_load_b32 v119, off, s33 offset:324
-; GISEL-NEXT:    scratch_load_b32 v128, off, s33 offset:328
-; GISEL-NEXT:    scratch_load_b32 v129, off, s33 offset:332
-; GISEL-NEXT:    scratch_load_b32 v130, off, s33 offset:336
-; GISEL-NEXT:    scratch_load_b32 v131, off, s33 offset:340
-; GISEL-NEXT:    scratch_load_b32 v132, off, s33 offset:344
-; GISEL-NEXT:    scratch_load_b32 v133, off, s33 offset:348
-; GISEL-NEXT:    scratch_load_b32 v134, off, s33 offset:352
-; GISEL-NEXT:    scratch_load_b32 v135, off, s33 offset:356
-; GISEL-NEXT:    scratch_load_b32 v144, off, s33 offset:360
-; GISEL-NEXT:    scratch_load_b32 v145, off, s33 offset:364
-; GISEL-NEXT:    scratch_load_b32 v146, off, s33 offset:368
-; GISEL-NEXT:    scratch_load_b32 v147, off, s33 offset:372
-; GISEL-NEXT:    scratch_load_b32 v148, off, s33 offset:376
-; GISEL-NEXT:    scratch_load_b32 v149, off, s33 offset:380
-; GISEL-NEXT:    scratch_load_b32 v150, off, s33 offset:384
-; GISEL-NEXT:    scratch_load_b32 v151, off, s33 offset:388
+; GISEL-NEXT:    scratch_load_b32 v96, off, s33 offset:260
+; GISEL-NEXT:    scratch_load_b32 v97, off, s33 offset:264
+; GISEL-NEXT:    scratch_load_b32 v98, off, s33 offset:268
+; GISEL-NEXT:    scratch_load_b32 v99, off, s33 offset:272
+; GISEL-NEXT:    scratch_load_b32 v100, off, s33 offset:276
+; GISEL-NEXT:    scratch_load_b32 v101, off, s33 offset:280
+; GISEL-NEXT:    scratch_load_b32 v102, off, s33 offset:284
+; GISEL-NEXT:    scratch_load_b32 v103, off, s33 offset:288
+; GISEL-NEXT:    scratch_load_b32 v112, off, s33 offset:292
+; GISEL-NEXT:    scratch_load_b32 v113, off, s33 offset:296
+; GISEL-NEXT:    scratch_load_b32 v114, off, s33 offset:300
+; GISEL-NEXT:    scratch_load_b32 v115, off, s33 offset:304
+; GISEL-NEXT:    scratch_load_b32 v116, off, s33 offset:308
+; GISEL-NEXT:    scratch_load_b32 v117, off, s33 offset:312
+; GISEL-NEXT:    scratch_load_b32 v118, off, s33 offset:316
+; GISEL-NEXT:    scratch_load_b32 v119, off, s33 offset:320
+; GISEL-NEXT:    scratch_load_b32 v128, off, s33 offset:324
+; GISEL-NEXT:    scratch_load_b32 v129, off, s33 offset:328
+; GISEL-NEXT:    scratch_load_b32 v130, off, s33 offset:332
+; GISEL-NEXT:    scratch_load_b32 v131, off, s33 offset:336
+; GISEL-NEXT:    scratch_load_b32 v132, off, s33 offset:340
+; GISEL-NEXT:    scratch_load_b32 v133, off, s33 offset:344
+; GISEL-NEXT:    scratch_load_b32 v134, off, s33 offset:348
+; GISEL-NEXT:    scratch_load_b32 v135, off, s33 offset:352
+; GISEL-NEXT:    scratch_load_b32 v144, off, s33 offset:356
+; GISEL-NEXT:    scratch_load_b32 v145, off, s33 offset:360
+; GISEL-NEXT:    scratch_load_b32 v146, off, s33 offset:364
+; GISEL-NEXT:    scratch_load_b32 v147, off, s33 offset:368
+; GISEL-NEXT:    scratch_load_b32 v148, off, s33 offset:372
+; GISEL-NEXT:    scratch_load_b32 v149, off, s33 offset:376
+; GISEL-NEXT:    scratch_load_b32 v150, off, s33 offset:380
+; GISEL-NEXT:    scratch_load_b32 v151, off, s33 offset:384
 ; GISEL-NEXT:    s_clause 0x1f
-; GISEL-NEXT:    scratch_load_b32 v160, off, s33 offset:392
-; GISEL-NEXT:    scratch_load_b32 v161, off, s33 offset:396
-; GISEL-NEXT:    scratch_load_b32 v162, off, s33 offset:400
-; GISEL-NEXT:    scratch_load_b32 v163, off, s33 offset:404
-; GISEL-NEXT:    scratch_load_b32 v164, off, s33 offset:408
-; GISEL-NEXT:    scratch_load_b32 v165, off, s33 offset:412
-; GISEL-NEXT:    scratch_load_b32 v166, off, s33 offset:416
-; GISEL-NEXT:    scratch_load_b32 v167, off, s33 offset:420
-; GISEL-NEXT:    scratch_load_b32 v176, off, s33 offset:424
-; GISEL-NEXT:    scratch_load_b32 v177, off, s33 offset:428
-; GISEL-NEXT:    scratch_load_b32 v178, off, s33 offset:432
-; GISEL-NEXT:    scratch_load_b32 v179, off, s33 offset:436
-; GISEL-NEXT:    scratch_load_b32 v180, off, s33 offset:440
-; GISEL-NEXT:    scratch_load_b32 v181, off, s33 offset:444
-; GISEL-NEXT:    scratch_load_b32 v182, off, s33 offset:448
-; GISEL-NEXT:    scratch_load_b32 v183, off, s33 offset:452
-; GISEL-NEXT:    scratch_load_b32 v192, off, s33 offset:456
-; GISEL-NEXT:    scratch_load_b32 v193, off, s33 offset:460
-; GISEL-NEXT:    scratch_load_b32 v194, off, s33 offset:464
-; GISEL-NEXT:    scratch_load_b32 v195, off, s33 offset:468
-; GISEL-NEXT:    scratch_load_b32 v196, off, s33 offset:472
-; GISEL-NEXT:    scratch_load_b32 v197, off, s33 offset:476
-; GISEL-NEXT:    scratch_load_b32 v198, off, s33 offset:480
-; GISEL-NEXT:    scratch_load_b32 v199, off, s33 offset:484
-; GISEL-NEXT:    scratch_load_b32 v208, off, s33 offset:488
-; GISEL-NEXT:    scratch_load_b32 v209, off, s33 offset:492
-; GISEL-NEXT:    scratch_load_b32 v210, off, s33 offset:496
-; GISEL-NEXT:    scratch_load_b32 v211, off, s33 offset:500
-; GISEL-NEXT:    scratch_load_b32 v212, off, s33 offset:504
-; GISEL-NEXT:    scratch_load_b32 v213, off, s33 offset:508
-; GISEL-NEXT:    scratch_load_b32 v214, off, s33 offset:512
-; GISEL-NEXT:    scratch_load_b32 v215, off, s33 offset:516
+; GISEL-NEXT:    scratch_load_b32 v160, off, s33 offset:388
+; GISEL-NEXT:    scratch_load_b32 v161, off, s33 offset:392
+; GISEL-NEXT:    scratch_load_b32 v162, off, s33 offset:396
+; GISEL-NEXT:    scratch_load_b32 v163, off, s33 offset:400
+; GISEL-NEXT:    scratch_load_b32 v164, off, s33 offset:404
+; GISEL-NEXT:    scratch_load_b32 v165, off, s33 offset:408
+; GISEL-NEXT:    scratch_load_b32 v166, off, s33 offset:412
+; GISEL-NEXT:    scratch_load_b32 v167, off, s33 offset:416
+; GISEL-NEXT:    scratch_load_b32 v176, off, s33 offset:420
+; GISEL-NEXT:    scratch_load_b32 v177, off, s33 offset:424
+; GISEL-NEXT:    scratch_load_b32 v178, off, s33 offset:428
+; GISEL-NEXT:    scratch_load_b32 v179, off, s33 offset:432
+; GISEL-NEXT:    scratch_load_b32 v180, off, s33 offset:436
+; GISEL-NEXT:    scratch_load_b32 v181, off, s33 offset:440
+; GISEL-NEXT:    scratch_load_b32 v182, off, s33 offset:444
+; GISEL-NEXT:    scratch_load_b32 v183, off, s33 offset:448
+; GISEL-NEXT:    scratch_load_b32 v192, off, s33 offset:452
+; GISEL-NEXT:    scratch_load_b32 v193, off, s33 offset:456
+; GISEL-NEXT:    scratch_load_b32 v194, off, s33 offset:460
+; GISEL-NEXT:    scratch_load_b32 v195, off, s33 offset:464
+; GISEL-NEXT:    scratch_load_b32 v196, off, s33 offset:468
+; GISEL-NEXT:    scratch_load_b32 v197, off, s33 offset:472
+; GISEL-NEXT:    scratch_load_b32 v198, off, s33 offset:476
+; GISEL-NEXT:    scratch_load_b32 v199, off, s33 offset:480
+; GISEL-NEXT:    scratch_load_b32 v208, off, s33 offset:484
+; GISEL-NEXT:    scratch_load_b32 v209, off, s33 offset:488
+; GISEL-NEXT:    scratch_load_b32 v210, off, s33 offset:492
+; GISEL-NEXT:    scratch_load_b32 v211, off, s33 offset:496
+; GISEL-NEXT:    scratch_load_b32 v212, off, s33 offset:500
+; GISEL-NEXT:    scratch_load_b32 v213, off, s33 offset:504
+; GISEL-NEXT:    scratch_load_b32 v214, off, s33 offset:508
+; GISEL-NEXT:    scratch_load_b32 v215, off, s33 offset:512
 ; GISEL-NEXT:    s_clause 0xf
-; GISEL-NEXT:    scratch_load_b32 v224, off, s33 offset:520
-; GISEL-NEXT:    scratch_load_b32 v225, off, s33 offset:524
-; GISEL-NEXT:    scratch_load_b32 v226, off, s33 offset:528
-; GISEL-NEXT:    scratch_load_b32 v227, off, s33 offset:532
-; GISEL-NEXT:    scratch_load_b32 v228, off, s33 offset:536
-; GISEL-NEXT:    scratch_load_b32 v229, off, s33 offset:540
-; GISEL-NEXT:    scratch_load_b32 v230, off, s33 offset:544
-; GISEL-NEXT:    scratch_load_b32 v231, off, s33 offset:548
-; GISEL-NEXT:    scratch_load_b32 v240, off, s33 offset:552
-; GISEL-NEXT:    scratch_load_b32 v241, off, s33 offset:556
-; GISEL-NEXT:    scratch_load_b32 v242, off, s33 offset:560
-; GISEL-NEXT:    scratch_load_b32 v243, off, s33 offset:564
-; GISEL-NEXT:    scratch_load_b32 v244, off, s33 offset:568
-; GISEL-NEXT:    scratch_load_b32 v245, off, s33 offset:572
-; GISEL-NEXT:    scratch_load_b32 v246, off, s33 offset:576
-; GISEL-NEXT:    scratch_load_b32 v247, off, s33 offset:580
-; GISEL-NEXT:    s_mov_b32 exec_lo, s34
-; GISEL-NEXT:    s_mov_b32 s33, s35
+; GISEL-NEXT:    scratch_load_b32 v224, off, s33 offset:516
+; GISEL-NEXT:    scratch_load_b32 v225, off, s33 offset:520
+; GISEL-NEXT:    scratch_load_b32 v226, off, s33 offset:524
+; GISEL-NEXT:    scratch_load_b32 v227, off, s33 offset:528
+; GISEL-NEXT:    scratch_load_b32 v228, off, s33 offset:532
+; GISEL-NEXT:    scratch_load_b32 v229, off, s33 offset:536
+; GISEL-NEXT:    scratch_load_b32 v230, off, s33 offset:540
+; GISEL-NEXT:    scratch_load_b32 v231, off, s33 offset:544
+; GISEL-NEXT:    scratch_load_b32 v240, off, s33 offset:548
+; GISEL-NEXT:    scratch_load_b32 v241, off, s33 offset:552
+; GISEL-NEXT:    scratch_load_b32 v242, off, s33 offset:556
+; GISEL-NEXT:    scratch_load_b32 v243, off, s33 offset:560
+; GISEL-NEXT:    scratch_load_b32 v244, off, s33 offset:564
+; GISEL-NEXT:    scratch_load_b32 v245, off, s33 offset:568
+; GISEL-NEXT:    scratch_load_b32 v246, off, s33 offset:572
+; GISEL-NEXT:    scratch_load_b32 v247, off, s33 offset:576
+; GISEL-NEXT:    s_mov_b32 exec_lo, s4
+; GISEL-NEXT:    s_mov_b32 s33, s0
 ; GISEL-NEXT:    s_wait_loadcnt 0x0
 ; GISEL-NEXT:    s_wait_alu 0xfffe
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
@@ -1907,9 +1741,9 @@ define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2
 ; DAGISEL64-NEXT:    s_wait_samplecnt 0x0
 ; DAGISEL64-NEXT:    s_wait_bvhcnt 0x0
 ; DAGISEL64-NEXT:    s_wait_kmcnt 0x0
-; DAGISEL64-NEXT:    s_mov_b32 s36, s33
+; DAGISEL64-NEXT:    s_mov_b32 s0, s33
 ; DAGISEL64-NEXT:    s_mov_b32 s33, s32
-; DAGISEL64-NEXT:    s_xor_saveexec_b64 s[34:35], -1
+; DAGISEL64-NEXT:    s_xor_saveexec_b64 s[4:5], -1
 ; DAGISEL64-NEXT:    s_clause 0x1f
 ; DAGISEL64-NEXT:    scratch_store_b32 off, v0, s33 offset:4
 ; DAGISEL64-NEXT:    scratch_store_b32 off, v1, s33 offset:8
@@ -2061,106 +1895,28 @@ define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2
 ; DAGISEL64-NEXT:    scratch_store_b32 off, v247, s33 offset:576
 ; DAGISEL64-NEXT:    s_mov_b64 exec, -1
 ; DAGISEL64-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s4, 0
+; DAGISEL64-NEXT:    s_wait_alu 0xfffe
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s0, 4
 ; DAGISEL64-NEXT:    v_mov_b32_e32 v2, v0
 ; DAGISEL64-NEXT:    v_swap_b32 v0, v1
 ; DAGISEL64-NEXT:    s_mov_b32 s1, gfx_callee@abs32@hi
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s5, 1
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s4, 0
 ; DAGISEL64-NEXT:    s_mov_b32 s0, gfx_callee@abs32@lo
 ; DAGISEL64-NEXT:    s_addk_co_i32 s32, 0x250
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s6, 2
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s7, 3
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s8, 4
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s9, 5
-; DAGISEL64-NEXT:    s_mov_b64 s[8:9], 0
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s10, 6
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s11, 7
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s12, 8
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s13, 9
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s14, 10
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s15, 11
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s16, 12
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s17, 13
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s18, 14
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s19, 15
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s20, 16
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s21, 17
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s22, 18
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s23, 19
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s24, 20
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s25, 21
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s26, 22
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s27, 23
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s28, 24
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s29, 25
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s30, 26
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s31, 27
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s72, 28
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s73, 29
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s74, 30
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s75, 31
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s76, 32
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s77, 33
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s78, 34
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s79, 35
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s88, 36
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s89, 37
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s90, 38
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s91, 39
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s92, 40
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s93, 41
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s94, 42
-; DAGISEL64-NEXT:    v_writelane_b32 v40, s95, 43
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s5, 1
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s30, 2
+; DAGISEL64-NEXT:    v_writelane_b32 v40, s31, 3
 ; DAGISEL64-NEXT:    s_wait_alu 0xfffe
 ; DAGISEL64-NEXT:    s_swappc_b64 s[30:31], s[0:1]
 ; DAGISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; DAGISEL64-NEXT:    v_readlane_b32 s95, v40, 43
-; DAGISEL64-NEXT:    v_readlane_b32 s94, v40, 42
-; DAGISEL64-NEXT:    v_readlane_b32 s93, v40, 41
-; DAGISEL64-NEXT:    v_readlane_b32 s92, v40, 40
-; DAGISEL64-NEXT:    v_readlane_b32 s91, v40, 39
-; DAGISEL64-NEXT:    v_readlane_b32 s90, v40, 38
-; DAGISEL64-NEXT:    v_readlane_b32 s89, v40, 37
-; DAGISEL64-NEXT:    v_readlane_b32 s88, v40, 36
-; DAGISEL64-NEXT:    v_readlane_b32 s79, v40, 35
-; DAGISEL64-NEXT:    v_readlane_b32 s78, v40, 34
-; DAGISEL64-NEXT:    v_readlane_b32 s77, v40, 33
-; DAGISEL64-NEXT:    v_readlane_b32 s76, v40, 32
-; DAGISEL64-NEXT:    v_readlane_b32 s75, v40, 31
-; DAGISEL64-NEXT:    v_readlane_b32 s74, v40, 30
-; DAGISEL64-NEXT:    v_readlane_b32 s73, v40, 29
-; DAGISEL64-NEXT:    v_readlane_b32 s72, v40, 28
-; DAGISEL64-NEXT:    v_readlane_b32 s31, v40, 27
-; DAGISEL64-NEXT:    v_readlane_b32 s30, v40, 26
-; DAGISEL64-NEXT:    v_readlane_b32 s29, v40, 25
-; DAGISEL64-NEXT:    v_readlane_b32 s28, v40, 24
-; DAGISEL64-NEXT:    v_readlane_b32 s27, v40, 23
-; DAGISEL64-NEXT:    v_readlane_b32 s26, v40, 22
-; DAGISEL64-NEXT:    v_readlane_b32 s25, v40, 21
-; DAGISEL64-NEXT:    v_readlane_b32 s24, v40, 20
-; DAGISEL64-NEXT:    v_readlane_b32 s23, v40, 19
-; DAGISEL64-NEXT:    v_readlane_b32 s22, v40, 18
-; DAGISEL64-NEXT:    v_readlane_b32 s21, v40, 17
-; DAGISEL64-NEXT:    v_readlane_b32 s20, v40, 16
-; DAGISEL64-NEXT:    v_readlane_b32 s19, v40, 15
-; DAGISEL64-NEXT:    v_readlane_b32 s18, v40, 14
-; DAGISEL64-NEXT:    v_readlane_b32 s17, v40, 13
-; DAGISEL64-NEXT:    v_readlane_b32 s16, v40, 12
-; DAGISEL64-NEXT:    v_readlane_b32 s15, v40, 11
-; DAGISEL64-NEXT:    v_readlane_b32 s14, v40, 10
-; DAGISEL64-NEXT:    v_readlane_b32 s13, v40, 9
-; DAGISEL64-NEXT:    v_readlane_b32 s12, v40, 8
-; DAGISEL64-NEXT:    v_readlane_b32 s11, v40, 7
-; DAGISEL64-NEXT:    v_readlane_b32 s10, v40, 6
-; DAGISEL64-NEXT:    v_readlane_b32 s9, v40, 5
-; DAGISEL64-NEXT:    v_readlane_b32 s8, v40, 4
-; DAGISEL64-NEXT:    v_readlane_b32 s7, v40, 3
-; DAGISEL64-NEXT:    v_readlane_b32 s6, v40, 2
+; DAGISEL64-NEXT:    v_readlane_b32 s31, v40, 3
+; DAGISEL64-NEXT:    v_readlane_b32 s30, v40, 2
 ; DAGISEL64-NEXT:    v_readlane_b32 s5, v40, 1
 ; DAGISEL64-NEXT:    v_readlane_b32 s4, v40, 0
+; DAGISEL64-NEXT:    v_readlane_b32 s0, v40, 4
 ; DAGISEL64-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
 ; DAGISEL64-NEXT:    s_mov_b32 s32, s33
-; DAGISEL64-NEXT:    s_xor_b64 exec, s[34:35], -1
+; DAGISEL64-NEXT:    s_xor_b64 exec, s[4:5], -1
 ; DAGISEL64-NEXT:    s_clause 0x1f
 ; DAGISEL64-NEXT:    scratch_load_b32 v0, off, s33 offset:4
 ; DAGISEL64-NEXT:    scratch_load_b32 v1, off, s33 offset:8
@@ -2310,8 +2066,8 @@ define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2
 ; DAGISEL64-NEXT:    scratch_load_b32 v245, off, s33 offset:568
 ; DAGISEL64-NEXT:    scratch_load_b32 v246, off, s33 offset:572
 ; DAGISEL64-NEXT:    scratch_load_b32 v247, off, s33 offset:576
-; DAGISEL64-NEXT:    s_mov_b64 exec, s[34:35]
-; DAGISEL64-NEXT:    s_mov_b32 s33, s36
+; DAGISEL64-NEXT:    s_mov_b64 exec, s[4:5]
+; DAGISEL64-NEXT:    s_mov_b32 s33, s0
 ; DAGISEL64-NEXT:    s_wait_loadcnt 0x0
 ; DAGISEL64-NEXT:    s_wait_alu 0xfffe
 ; DAGISEL64-NEXT:    s_setpc_b64 s[30:31]
@@ -2323,9 +2079,9 @@ define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2
 ; GISEL64-NEXT:    s_wait_samplecnt 0x0
 ; GISEL64-NEXT:    s_wait_bvhcnt 0x0
 ; GISEL64-NEXT:    s_wait_kmcnt 0x0
-; GISEL64-NEXT:    s_mov_b32 s36, s33
+; GISEL64-NEXT:    s_mov_b32 s0, s33
 ; GISEL64-NEXT:    s_mov_b32 s33, s32
-; GISEL64-NEXT:    s_xor_saveexec_b64 s[34:35], -1
+; GISEL64-NEXT:    s_xor_saveexec_b64 s[4:5], -1
 ; GISEL64-NEXT:    s_clause 0x1f
 ; GISEL64-NEXT:    scratch_store_b32 off, v0, s33 offset:4
 ; GISEL64-NEXT:    scratch_store_b32 off, v1, s33 offset:8
@@ -2477,106 +2233,28 @@ define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2
 ; GISEL64-NEXT:    scratch_store_b32 off, v247, s33 offset:576
 ; GISEL64-NEXT:    s_mov_b64 exec, -1
 ; GISEL64-NEXT:    scratch_store_b32 off, v40, s33 ; 4-byte Folded Spill
-; GISEL64-NEXT:    v_writelane_b32 v40, s4, 0
+; GISEL64-NEXT:    s_wait_alu 0xfffe
+; GISEL64-NEXT:    v_writelane_b32 v40, s0, 4
 ; GISEL64-NEXT:    v_mov_b32_e32 v2, v0
 ; GISEL64-NEXT:    v_swap_b32 v0, v1
 ; GISEL64-NEXT:    s_mov_b32 s0, gfx_callee@abs32@lo
-; GISEL64-NEXT:    v_writelane_b32 v40, s5, 1
+; GISEL64-NEXT:    v_writelane_b32 v40, s4, 0
 ; GISEL64-NEXT:    s_mov_b32 s1, gfx_callee@abs32@hi
 ; GISEL64-NEXT:    s_addk_co_i32 s32, 0x250
-; GISEL64-NEXT:    v_writelane_b32 v40, s6, 2
-; GISEL64-NEXT:    v_writelane_b32 v40, s7, 3
-; GISEL64-NEXT:    v_writelane_b32 v40, s8, 4
-; GISEL64-NEXT:    v_writelane_b32 v40, s9, 5
-; GISEL64-NEXT:    s_mov_b64 s[8:9], 0
-; GISEL64-NEXT:    v_writelane_b32 v40, s10, 6
-; GISEL64-NEXT:    v_writelane_b32 v40, s11, 7
-; GISEL64-NEXT:    v_writelane_b32 v40, s12, 8
-; GISEL64-NEXT:    v_writelane_b32 v40, s13, 9
-; GISEL64-NEXT:    v_writelane_b32 v40, s14, 10
-; GISEL64-NEXT:    v_writelane_b32 v40, s15, 11
-; GISEL64-NEXT:    v_writelane_b32 v40, s16, 12
-; GISEL64-NEXT:    v_writelane_b32 v40, s17, 13
-; GISEL64-NEXT:    v_writelane_b32 v40, s18, 14
-; GISEL64-NEXT:    v_writelane_b32 v40, s19, 15
-; GISEL64-NEXT:    v_writelane_b32 v40, s20, 16
-; GISEL64-NEXT:    v_writelane_b32 v40, s21, 17
-; GISEL64-NEXT:    v_writelane_b32 v40, s22, 18
-; GISEL64-NEXT:    v_writelane_b32 v40, s23, 19
-; GISEL64-NEXT:    v_writelane_b32 v40, s24, 20
-; GISEL64-NEXT:    v_writelane_b32 v40, s25, 21
-; GISEL64-NEXT:    v_writelane_b32 v40, s26, 22
-; GISEL64-NEXT:    v_writelane_b32 v40, s27, 23
-; GISEL64-NEXT:    v_writelane_b32 v40, s28, 24
-; GISEL64-NEXT:    v_writelane_b32 v40, s29, 25
-; GISEL64-NEXT:    v_writelane_b32 v40, s30, 26
-; GISEL64-NEXT:    v_writelane_b32 v40, s31, 27
-; GISEL64-NEXT:    v_writelane_b32 v40, s72, 28
-; GISEL64-NEXT:    v_writelane_b32 v40, s73, 29
-; GISEL64-NEXT:    v_writelane_b32 v40, s74, 30
-; GISEL64-NEXT:    v_writelane_b32 v40, s75, 31
-; GISEL64-NEXT:    v_writelane_b32 v40, s76, 32
-; GISEL64-NEXT:    v_writelane_b32 v40, s77, 33
-; GISEL64-NEXT:    v_writelane_b32 v40, s78, 34
-; GISEL64-NEXT:    v_writelane_b32 v40, s79, 35
-; GISEL64-NEXT:    v_writelane_b32 v40, s88, 36
-; GISEL64-NEXT:    v_writelane_b32 v40, s89, 37
-; GISEL64-NEXT:    v_writelane_b32 v40, s90, 38
-; GISEL64-NEXT:    v_writelane_b32 v40, s91, 39
-; GISEL64-NEXT:    v_writelane_b32 v40, s92, 40
-; GISEL64-NEXT:    v_writelane_b32 v40, s93, 41
-; GISEL64-NEXT:    v_writelane_b32 v40, s94, 42
-; GISEL64-NEXT:    v_writelane_b32 v40, s95, 43
+; GISEL64-NEXT:    v_writelane_b32 v40, s5, 1
+; GISEL64-NEXT:    v_writelane_b32 v40, s30, 2
+; GISEL64-NEXT:    v_writelane_b32 v40, s31, 3
 ; GISEL64-NEXT:    s_wait_alu 0xfffe
 ; GISEL64-NEXT:    s_swappc_b64 s[30:31], s[0:1]
 ; GISEL64-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GISEL64-NEXT:    v_readlane_b32 s95, v40, 43
-; GISEL64-NEXT:    v_readlane_b32 s94, v40, 42
-; GISEL64-NEXT:    v_readlane_b32 s93, v40, 41
-; GISEL64-NEXT:    v_readlane_b32 s92, v40, 40
-; GISEL64-NEXT:    v_readlane_b32 s91, v40, 39
-; GISEL64-NEXT:    v_readlane_b32 s90, v40, 38
-; GISEL64-NEXT:    v_readlane_b32 s89, v40, 37
-; GISEL64-NEXT:    v_readlane_b32 s88, v40, 36
-; GISEL64-NEXT:    v_readlane_b32 s79, v40, 35
-; GISEL64-NEXT:    v_readlane_b32 s78, v40, 34
-; GISEL64-NEXT:    v_readlane_b32 s77, v40, 33
-; GISEL64-NEXT:    v_readlane_b32 s76, v40, 32
-; GISEL64-NEXT:    v_readlane_b32 s75, v40, 31
-; GISEL64-NEXT:    v_readlane_b32 s74, v40, 30
-; GISEL64-NEXT:    v_readlane_b32 s73, v40, 29
-; GISEL64-NEXT:    v_readlane_b32 s72, v40, 28
-; GISEL64-NEXT:    v_readlane_b32 s31, v40, 27
-; GISEL64-NEXT:    v_readlane_b32 s30, v40, 26
-; GISEL64-NEXT:    v_readlane_b32 s29, v40, 25
-; GISEL64-NEXT:    v_readlane_b32 s28, v40, 24
-; GISEL64-NEXT:    v_readlane_b32 s27, v40, 23
-; GISEL64-NEXT:    v_readlane_b32 s26, v40, 22
-; GISEL64-NEXT:    v_readlane_b32 s25, v40, 21
-; GISEL64-NEXT:    v_readlane_b32 s24, v40, 20
-; GISEL64-NEXT:    v_readlane_b32 s23, v40, 19
-; GISEL64-NEXT:    v_readlane_b32 s22, v40, 18
-; GISEL64-NEXT:    v_readlane_b32 s21, v40, 17
-; GISEL64-NEXT:    v_readlane_b32 s20, v40, 16
-; GISEL64-NEXT:    v_readlane_b32 s19, v40, 15
-; GISEL64-NEXT:    v_readlane_b32 s18, v40, 14
-; GISEL64-NEXT:    v_readlane_b32 s17, v40, 13
-; GISEL64-NEXT:    v_readlane_b32 s16, v40, 12
-; GISEL64-NEXT:    v_readlane_b32 s15, v40, 11
-; GISEL64-NEXT:    v_readlane_b32 s14, v40, 10
-; GISEL64-NEXT:    v_readlane_b32 s13, v40, 9
-; GISEL64-NEXT:    v_readlane_b32 s12, v40, 8
-; GISEL64-NEXT:    v_readlane_b32 s11, v40, 7
-; GISEL64-NEXT:    v_readlane_b32 s10, v40, 6
-; GISEL64-NEXT:    v_readlane_b32 s9, v40, 5
-; GISEL64-NEXT:    v_readlane_b32 s8, v40, 4
-; GISEL64-NEXT:    v_readlane_b32 s7, v40, 3
-; GISEL64-NEXT:    v_readlane_b32 s6, v40, 2
+; GISEL64-NEXT:    v_readlane_b32 s31, v40, 3
+; GISEL64-NEXT:    v_readlane_b32 s30, v40, 2
 ; GISEL64-NEXT:    v_readlane_b32 s5, v40, 1
 ; GISEL64-NEXT:    v_readlane_b32 s4, v40, 0
+; GISEL64-NEXT:    v_readlane_b32 s0, v40, 4
 ; GISEL64-NEXT:    scratch_load_b32 v40, off, s33 ; 4-byte Folded Reload
 ; GISEL64-NEXT:    s_mov_b32 s32, s33
-; GISEL64-NEXT:    s_xor_b64 exec, s[34:35], -1
+; GISEL64-NEXT:    s_xor_b64 exec, s[4:5], -1
 ; GISEL64-NEXT:    s_clause 0x1f
 ; GISEL64-NEXT:    scratch_load_b32 v0, off, s33 offset:4
 ; GISEL64-NEXT:    scratch_load_b32 v1, off, s33 offset:8
@@ -2726,11 +2404,11 @@ define amdgpu_gfx_whole_wave <2 x half> @call_gfx_from_whole_wave(i1 %active, <2
 ; GISEL64-NEXT:    scratch_load_b32 v245, off, s33 offset:568
 ; GISEL64-NEXT:    scratch_load_b32 v246, off, s33 offset:572
 ; GISEL64-NEXT:    scratch_load_b32 v247, off, s33 offset:576
-; GISEL64-NEXT:    s_mov_b64 exec, s[34:35]
-; GISEL64-NEXT:    s_mov_b32 s33, s36
+; GISEL64-NEXT:    s_mov_b64 exec, s[4:5]
+; GISEL64-NEXT:    s_mov_b32 s33, s0
 ; GISEL64-NEXT:    s_wait_loadcnt 0x0
 ; GISEL64-NEXT:    s_wait_alu 0xfffe
 ; GISEL64-NEXT:    s_setpc_b64 s[30:31]
-  %ret = call <2 x half>(<2 x half>, <2 x half>) @gfx_callee(<2 x half> %y, <2 x half> %x) convergent
+  %ret = call amdgpu_gfx <2 x half>(<2 x half>, <2 x half>) @gfx_callee(<2 x half> %y, <2 x half> %x) convergent
   ret <2 x half> %ret
 }
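The regenerated checks above encode the whole wave prologue/epilogue scheme
with a much slimmer SGPR save area: only the saved EXEC, the return address
(s30:s31) and the caller's frame pointer are written to lanes of v40, rather
than the full s4-s95 range. As a condensed sketch (wave32 flavor; the s0/s4
register assignments follow the generated code above, the per-VGPR spill and
reload lists are elided, and this is not literal compiler output):

  ; prologue
  s_mov_b32 s0, s33                ; save the caller's frame pointer
  s_mov_b32 s33, s32               ; set up the frame pointer
  s_xor_saveexec_b32 s4, -1        ; save EXEC, flip to the inactive lanes
  <spill inactive lanes of all used VGPRs>
  s_mov_b32 exec_lo, -1            ; body runs with all lanes enabled

  ; epilogue
  s_mov_b32 s32, s33               ; pop the stack
  s_xor_b32 exec_lo, s4, -1        ; flip back to the inactive lanes
  <reload inactive lanes of all used VGPRs>
  s_mov_b32 exec_lo, s4            ; restore the entry EXEC
  s_mov_b32 s33, s0                ; restore the caller's frame pointer
  s_setpc_b64 s[30:31]             ; return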

>From 8ea4ac92ef7760ca0930c77dd1c3c25116b2ca50 Mon Sep 17 00:00:00 2001
From: Diana Picus <diana-magda.picus at amd.com>
Date: Fri, 27 Jun 2025 12:51:42 +0200
Subject: [PATCH 24/24] Verifier checks for whole wave functions

---
 llvm/include/llvm/IR/CallingConv.h |  5 +++++
 llvm/lib/IR/Verifier.cpp           | 10 +++++++++
 llvm/test/Bitcode/compatibility.ll |  4 ++++
 llvm/test/Verifier/amdgpu-cc.ll    | 33 ++++++++++++++++++++++++++++++
 4 files changed, 52 insertions(+)

diff --git a/llvm/include/llvm/IR/CallingConv.h b/llvm/include/llvm/IR/CallingConv.h
index 5d2ff86d60497..ef761eb1aed73 100644
--- a/llvm/include/llvm/IR/CallingConv.h
+++ b/llvm/include/llvm/IR/CallingConv.h
@@ -297,8 +297,13 @@ namespace CallingConv {
 /// directly or indirectly via a call-like instruction.
 constexpr bool isCallableCC(CallingConv::ID CC) {
   switch (CC) {
+  // Called with special intrinsics:
+  // llvm.amdgcn.cs.chain
   case CallingConv::AMDGPU_CS_Chain:
   case CallingConv::AMDGPU_CS_ChainPreserve:
+  // llvm.amdgcn.call.whole.wave
+  case CallingConv::AMDGPU_Gfx_WholeWave:
+  // Hardware entry points:
   case CallingConv::AMDGPU_CS:
   case CallingConv::AMDGPU_ES:
   case CallingConv::AMDGPU_GS:
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index 9cab88b09779a..32ce1880f2fdd 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -2975,6 +2975,16 @@ void Verifier::visitFunction(const Function &F) {
           "perfect forwarding!",
           &F);
     break;
+  case CallingConv::AMDGPU_Gfx_WholeWave:
+    Check(F.arg_size() != 0 && F.arg_begin()->getType()->isIntegerTy(1),
+          "Calling convention requires first argument to be i1", &F);
+    Check(!F.arg_begin()->hasInRegAttr(),
+          "Calling convention requires first argument to not be inreg", &F);
+    Check(!F.isVarArg(),
+          "Calling convention does not support varargs or "
+          "perfect forwarding!",
+          &F);
+    break;
   }
 
   // Check that the argument values match the function type for this function...
diff --git a/llvm/test/Bitcode/compatibility.ll b/llvm/test/Bitcode/compatibility.ll
index 9cf3fdbe550b4..0b5ce08c00a23 100644
--- a/llvm/test/Bitcode/compatibility.ll
+++ b/llvm/test/Bitcode/compatibility.ll
@@ -564,6 +564,10 @@ declare riscv_vls_cc(32768) void @riscv_vls_cc_32768()
 ; CHECK: declare riscv_vls_cc(32768) void @riscv_vls_cc_32768()
 declare riscv_vls_cc(65536) void @riscv_vls_cc_65536()
 ; CHECK: declare riscv_vls_cc(65536) void @riscv_vls_cc_65536()
+declare cc124 void @f.cc124(i1)
+; CHECK: declare amdgpu_gfx_whole_wave void @f.cc124(i1)
+declare amdgpu_gfx_whole_wave void @f.amdgpu_gfx_whole_wave(i1)
+; CHECK: declare amdgpu_gfx_whole_wave void @f.amdgpu_gfx_whole_wave(i1)
 declare cc1023 void @f.cc1023()
 ; CHECK: declare cc1023 void @f.cc1023()
 
diff --git a/llvm/test/Verifier/amdgpu-cc.ll b/llvm/test/Verifier/amdgpu-cc.ll
index aec09771d2e4f..e86825e088753 100644
--- a/llvm/test/Verifier/amdgpu-cc.ll
+++ b/llvm/test/Verifier/amdgpu-cc.ll
@@ -217,3 +217,36 @@ define amdgpu_cs_chain_preserve void @preallocated_cc_amdgpu_cs_chain_preserve(p
 define amdgpu_cs_chain_preserve void @inalloca_cc_amdgpu_cs_chain_preserve(ptr inalloca(i32) %ptr) {
   ret void
 }
+
+; CHECK: Calling convention requires first argument to be i1
+; CHECK-NEXT: ptr @whole_wave_no_args
+define amdgpu_gfx_whole_wave void @whole_wave_no_args() {
+  ret void
+}
+
+; CHECK: Calling convention requires first argument to be i1
+; CHECK-NEXT: ptr @whole_wave_must_have_i1_active
+define amdgpu_gfx_whole_wave void @whole_wave_must_have_i1_active(i32 %x) {
+  ret void
+}
+
+; CHECK: Calling convention requires first argument to not be inreg
+; CHECK-NEXT: ptr @whole_wave_i1_active_inreg
+define amdgpu_gfx_whole_wave void @whole_wave_i1_active_inreg(i1 inreg %active) {
+  ret void
+}
+
+; CHECK: Calling convention does not support varargs
+; CHECK-NEXT: ptr @whole_wave_varargs
+define amdgpu_gfx_whole_wave void @whole_wave_varargs(i1 %active, i32 %x, ...) {
+  ret void
+}
+
+declare amdgpu_gfx_whole_wave void @whole_wave_callee(i1 %active)
+
+; CHECK: calling convention does not permit calls
+; CHECK-NEXT: call amdgpu_gfx_whole_wave void @whole_wave_callee(i1 true)
+define amdgpu_cs void @cant_call_whole_wave_func() {
+  call amdgpu_gfx_whole_wave void @whole_wave_callee(i1 true)
+  ret void
+}
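For reference, a minimal well-formed counterpart (not part of the patch) that
passes every check introduced here: the first argument is a plain, non-inreg
i1 and the signature has no varargs.

  define amdgpu_gfx_whole_wave i32 @valid_whole_wave(i1 %active, i32 %x) {
    ret i32 %x
  }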


