[llvm] [amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic (PR #105822)
Diana Picus via llvm-commits
llvm-commits at lists.llvm.org
Fri Aug 23 05:29:28 PDT 2024
https://github.com/rovka created https://github.com/llvm/llvm-project/pull/105822
This intrinsic is meant to be used in functions that have a "tail" that needs to run with all the lanes enabled. The "tail" may contain complex control flow that makes it unsuitable for the existing WWM intrinsics. Instead, we pretend that the function starts with all lanes enabled, branch into the actual body of the function for the lanes that were meant to run it, and finally have all the lanes rejoin to run the tail.
As such, the intrinsic will return the EXEC mask for the body of the function, and is meant to be used only as part of a very limited pattern (for now only in amdgpu_cs_chain functions):
```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```
It's an error to use the result of this intrinsic for anything other than a branch. Unfortunately, checking that in the verifier is non-trivial, because SIAnnotateControlFlow will introduce an `llvm.amdgcn.if` between the intrinsic and the branch, as sketched below.
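For reference, after SIAnnotateControlFlow the pattern looks roughly like this - a simplified, hand-written wave64 sketch (not verbatim compiler output), where the intrinsic's i1 result now feeds `llvm.amdgcn.if` instead of the branch directly:
```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %func_exec)
  %cond = extractvalue { i1, i64 } %if, 0
  %mask = extractvalue { i1, i64 } %if, 1
  br i1 %cond, label %func, label %tail

func:
  ; ...
  br label %tail

tail:
  ; all the lanes rejoin here
  call void @llvm.amdgcn.end.cf.i64(i64 %mask)
```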
Since the usage pattern is very strict, the backend can optimize away the intrinsic and the branch following it (in practice EXEC will already contain the correct value when entering the function, because it will be set by the llvm.amdgcn.cs.chain intrinsic before jumping in). The removal is done early on, in `finalizeLowering`; however, the information that the function conceptually started in whole wave mode is stored in the machine function info (hasInitWholeWave). This will be useful in prolog/epilog insertion, where we can skip saving the inactive lanes for CSRs (if the function started with all the lanes active, there are no inactive lanes to preserve).
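Concretely, the cleanup in `finalizeLowering` rewrites the machine-level pattern roughly as follows - a hand-written sketch of what `removeInitWholeWaveBranch` in this patch does, with illustrative virtual register names and the wave64 opcode shown:
```
; Before:
bb.0:
  %tok = SI_INIT_WHOLE_WAVE
  %mask = SI_IF %tok, %bb.2
bb.1:                                 ; the function body
  ...
bb.2:                                 ; the tail
  %v = PHI %inactive, %bb.0, %active, %bb.1
  SI_END_CF %mask

; After: the branch is gone and bb.0 falls through into the body.
bb.0:
bb.1:
  ...
bb.2:
  %t = PHI %active, %bb.1
  %v = V_SET_INACTIVE_B32 %t, %inactive ; keep %inactive in the inactive lanes
  $exec = S_MOV_B64 -1
```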
Some of the generated code could use some more optimization (#98864 might help with some of that). One important thing for front-ends to note is that for now it's recommended to avoid phis in `tail` over large structs/vectors where not all elements are modified by `shader` - prefer small phis and build the aggregate in `tail` (see the `basic` vs `phi_whole_struct` test cases, and the sketch below).
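As an illustration, a simplified sketch based on those tests - prefer small phis over the values actually produced in `shader`, rebuilding the aggregate afterwards:
```
tail:
  %a = phi i32 [ %x, %entry ], [ %nonwwm, %shader ]
  %b = phi i32 [ %y, %entry ], [ %wwm, %shader ]
  %vgpr.1 = insertvalue { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 %a, 2
  %vgpr.2 = insertvalue { i32, ptr addrspace(5), i32, i32 } %vgpr.1, i32 %b, 3
```
rather than a single phi over the whole aggregate:
```
tail:
  %vgpr.args = phi { i32, ptr addrspace(5), i32, i32 } [ %vgpr, %entry ], [ %vgpr.2, %shader ]
```
The latter forces the backend to emit a V_SET_INACTIVE for every 32-bit element of the aggregate.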
>From 8fda21992b69c2d87d6e0de89443ee611b5d873a Mon Sep 17 00:00:00 2001
From: Diana Picus <Diana-Magda.Picus at amd.com>
Date: Thu, 15 Aug 2024 14:23:39 +0200
Subject: [PATCH] [amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic
This intrinsic is meant to be used in functions that have a "tail" that
needs to run with all the lanes enabled. The "tail" may contain complex
control flow that makes it unsuitable for the existing WWM intrinsics.
Instead, we pretend that the function starts with all lanes enabled,
branch into the actual body of the function for the lanes that were
meant to run it, and finally have all the lanes rejoin to run the tail.
As such, the intrinsic will return the EXEC mask for the body of the
function, and is meant to be used only as part of a very limited pattern
(for now only in amdgpu_cs_chain functions):
```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```
It's an error to use the result of this intrinsic for anything other
than a branch. Unfortunately, checking that in the verifier is
non-trivial, because SIAnnotateControlFlow will introduce an
llvm.amdgcn.if between the intrinsic and the branch.
Since the usage pattern is very strict, the backend can optimize away
the intrinsic and the branch following it (in practice EXEC will already
contain the correct value when entering the function, because it will be
set by the llvm.amdgcn.cs.chain intrinsic before jumping in). The
removal is done early on, in `finalizeLowering`; however, the
information that the function conceptually started in whole wave mode is
stored in the machine function info (hasInitWholeWave). This will be
useful in prolog/epilog insertion, where we can skip saving the inactive
lanes for CSRs (if the function started with all the lanes active, there
are no inactive lanes to preserve).
Some of the generated code could use some more optimization. One
important thing for front-ends to note is that for now it's recommended
to avoid phis in `tail` over large structs/vectors where not all
elements are modified by `shader` - prefer small phis and build the
aggregate in `tail` (see the `basic` vs `phi_whole_struct` test
cases).
---
llvm/include/llvm/IR/IntrinsicsAMDGPU.td | 14 +
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp | 5 +
.../AMDGPU/AMDGPUInstructionSelector.cpp | 10 +
.../Target/AMDGPU/AMDGPUInstructionSelector.h | 1 +
.../lib/Target/AMDGPU/AMDGPUMachineFunction.h | 5 +
.../Target/AMDGPU/AMDGPURegisterBankInfo.cpp | 1 +
.../Target/AMDGPU/AMDGPUSearchableTables.td | 1 +
.../lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | 3 +
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | 12 +-
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | 133 ++
llvm/lib/Target/AMDGPU/SIInstructions.td | 8 +
.../lib/Target/AMDGPU/SIMachineFunctionInfo.h | 3 +
.../AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll | 1545 +++++++++++++++++
.../AMDGPU/llvm.amdgcn.init.whole.wave-w64.ll | 254 +++
.../CodeGen/AMDGPU/pei-amdgpu-cs-chain.mir | 29 +
.../AMDGPU/long-branch-reg-all-sgpr-used.ll | 2 +
.../AMDGPU/machine-function-info-after-pei.ll | 1 +
...ine-function-info-long-branch-reg-debug.ll | 1 +
.../machine-function-info-long-branch-reg.ll | 1 +
.../AMDGPU/machine-function-info-no-ir.mir | 4 +
.../MIR/AMDGPU/machine-function-info.ll | 4 +
21 files changed, 2033 insertions(+), 4 deletions(-)
create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w64.ll
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index dc13a35c66f9ab..a552b758a1ace7 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -208,6 +208,20 @@ def int_amdgcn_init_exec_from_input : Intrinsic<[],
[IntrConvergent, IntrHasSideEffects, IntrNoMem, IntrNoCallback,
IntrNoFree, IntrWillReturn, ImmArg<ArgIndex<1>>]>;
+// Sets the function into whole-wave-mode and returns whether the lane was
+// active when entering the function. A branch depending on this return will
+// revert the EXEC mask to what it was when entering the function, thus
+// resulting in a no-op. This pattern is used to optimize branches when function
+// tails need to be run in whole-wave-mode. It may also have other consequences
+// (mostly related to WWM CSR handling) that differentiate it from using
+// a plain `amdgcn.init.exec -1`.
+//
+// Can only be used in functions with the `amdgpu_cs_chain` calling convention.
+// Using this intrinsic without immediately branching on its return value is an
+// error.
+def int_amdgcn_init_whole_wave : Intrinsic<[llvm_i1_ty], [], [
+ IntrHasSideEffects, IntrNoMem, IntrNoDuplicate, IntrConvergent]>;
+
def int_amdgcn_wavefrontsize :
ClangBuiltin<"__builtin_amdgcn_wavefrontsize">,
DefaultAttrsIntrinsic<[llvm_i32_ty], [], [NoUndef<RetIndex>, IntrNoMem, IntrSpeculatable]>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 0daaf6b6576030..380dc7d3312f32 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -2738,6 +2738,11 @@ void AMDGPUDAGToDAGISel::SelectINTRINSIC_W_CHAIN(SDNode *N) {
case Intrinsic::amdgcn_ds_bvh_stack_rtn:
SelectDSBvhStackIntrinsic(N);
return;
+ case Intrinsic::amdgcn_init_whole_wave:
+ CurDAG->getMachineFunction()
+ .getInfo<SIMachineFunctionInfo>()
+ ->setInitWholeWave();
+ break;
}
SelectCode(N);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 17071970ca4bfe..06192f74e1e0ed 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1772,6 +1772,14 @@ bool AMDGPUInstructionSelector::selectDSAppendConsume(MachineInstr &MI,
return constrainSelectedInstRegOperands(*MIB, TII, TRI, RBI);
}
+bool AMDGPUInstructionSelector::selectInitWholeWave(MachineInstr &MI) const {
+ MachineFunction *MF = MI.getParent()->getParent();
+ SIMachineFunctionInfo *MFInfo = MF->getInfo<SIMachineFunctionInfo>();
+
+ MFInfo->setInitWholeWave();
+ return selectImpl(MI, *CoverageInfo);
+}
+
bool AMDGPUInstructionSelector::selectSBarrier(MachineInstr &MI) const {
if (TM.getOptLevel() > CodeGenOptLevel::None) {
unsigned WGSize = STI.getFlatWorkGroupSizes(MF->getFunction()).second;
@@ -2099,6 +2107,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
return selectDSAppendConsume(I, true);
case Intrinsic::amdgcn_ds_consume:
return selectDSAppendConsume(I, false);
+ case Intrinsic::amdgcn_init_whole_wave:
+ return selectInitWholeWave(I);
case Intrinsic::amdgcn_s_barrier:
return selectSBarrier(I);
case Intrinsic::amdgcn_raw_buffer_load_lds:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index 207cd67f0eda0e..2ddee05d096b23 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -120,6 +120,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
+ bool selectInitWholeWave(MachineInstr &MI) const;
bool selectSBarrier(MachineInstr &MI) const;
bool selectDSBvhStackIntrinsic(MachineInstr &MI) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h b/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
index 7efb7f825348e3..b1022e48b8d34f 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
@@ -67,6 +67,8 @@ class AMDGPUMachineFunction : public MachineFunctionInfo {
// Kernel may need limited waves per EU for better performance.
bool WaveLimiter = false;
+ bool HasInitWholeWave = false;
+
public:
AMDGPUMachineFunction(const Function &F, const AMDGPUSubtarget &ST);
@@ -109,6 +111,9 @@ class AMDGPUMachineFunction : public MachineFunctionInfo {
return WaveLimiter;
}
+ bool hasInitWholeWave() const { return HasInitWholeWave; }
+ void setInitWholeWave() { HasInitWholeWave = true; }
+
unsigned allocateLDSGlobal(const DataLayout &DL, const GlobalVariable &GV) {
return allocateLDSGlobal(DL, GV, DynLDSAlign);
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 69a1936a11fe05..5e7986734871cf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4979,6 +4979,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, WaveSize);
break;
}
+ case Intrinsic::amdgcn_init_whole_wave:
case Intrinsic::amdgcn_live_mask: {
OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1);
break;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
index 95c4859674ecc4..2cd5fb2b94285c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
@@ -329,6 +329,7 @@ def : SourceOfDivergence<int_amdgcn_mov_dpp>;
def : SourceOfDivergence<int_amdgcn_mov_dpp8>;
def : SourceOfDivergence<int_amdgcn_update_dpp>;
def : SourceOfDivergence<int_amdgcn_writelane>;
+def : SourceOfDivergence<int_amdgcn_init_whole_wave>;
foreach intr = AMDGPUMFMAIntrinsics908 in
def : SourceOfDivergence<intr>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 7a9735790371a1..3fcc750646d6ab 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -1731,6 +1731,9 @@ bool GCNTargetMachine::parseMachineFunctionInfo(
? DenormalMode::IEEE
: DenormalMode::PreserveSign;
+ if (YamlMFI.HasInitWholeWave)
+ MFI->setInitWholeWave();
+
return false;
}
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index 8c951105101d96..dfdc7ad32b00c7 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -1343,10 +1343,14 @@ void SIFrameLowering::processFunctionBeforeFrameFinalized(
// Allocate spill slots for WWM reserved VGPRs.
// For chain functions, we only need to do this if we have calls to
- // llvm.amdgcn.cs.chain.
- bool IsChainWithoutCalls =
- FuncInfo->isChainFunction() && !MF.getFrameInfo().hasTailCall();
- if (!FuncInfo->isEntryFunction() && !IsChainWithoutCalls) {
+ // llvm.amdgcn.cs.chain (otherwise there's no one to save them for, since
+ // chain functions do not return) and the function did not contain a call to
+ // llvm.amdgcn.init.whole.wave (since in that case there are no inactive lanes
+ // when entering the function).
+ bool IsChainWithoutRestores =
+ FuncInfo->isChainFunction() &&
+ (!MF.getFrameInfo().hasTailCall() || FuncInfo->hasInitWholeWave());
+ if (!FuncInfo->isEntryFunction() && !IsChainWithoutRestores) {
for (Register Reg : FuncInfo->getWWMReservedRegs()) {
const TargetRegisterClass *RC = TRI->getPhysRegBaseClass(Reg);
FuncInfo->allocateWWMSpill(MF, Reg, TRI->getSpillSize(*RC),
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index ecd4451c504727..b50c92df5479fc 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -15671,6 +15671,133 @@ static int getAlignedAGPRClassID(unsigned UnalignedClassID) {
}
}
+static void removeInitWholeWaveBranch(MachineFunction &MF,
+ MachineRegisterInfo &MRI,
+ const SIInstrInfo *TII) {
+ // Remove SI_INIT_WHOLE_WAVE and the following SI_IF/END_CF and instead set
+ // EXEC to -1 at SI_END_CF.
+ auto IWWIt = find_if(MF.begin()->instrs(), [](const MachineInstr &MI) {
+ return MI.getOpcode() == AMDGPU::SI_INIT_WHOLE_WAVE;
+ });
+ if (IWWIt == MF.begin()->instr_end())
+ return; // We've been here before (GISel runs finalizeLowering twice).
+
+ MachineInstr &If = *MRI.use_begin(IWWIt->getOperand(0).getReg())->getParent();
+ assert(If.getOpcode() == AMDGPU::SI_IF &&
+ "Unexpected user for init.whole.wave result");
+ assert(MRI.hasOneUse(IWWIt->getOperand(0).getReg()) &&
+ "Expected simple control flow");
+
+ MachineInstr &EndCf = *MRI.use_begin(If.getOperand(0).getReg())->getParent();
+ MachineBasicBlock *EndBB = EndCf.getParent();
+
+ // Update all the Phis: since we're removing a predecessor, we need to remove
+ // the corresponding pair of operands. However, we can't just drop the value
+ // coming from the 'if' block - that's going to be the value of the inactive
+ // lanes.
+ // %v = phi (%inactive, %if), (%active1, %shader1), ... (%activeN, %shaderN)
+ // should become
+ // %t = phi (%active1, %shader1), ... (%activeN, %shaderN)
+ // %v = v_set_inactive %t, %inactive
+ // Note that usually EndCf will be the first instruction after the phis and as
+ // such will serve as the end of the range when iterating over phis.
+ // Therefore, we shouldn't introduce any new instructions before it.
+ const SIRegisterInfo &TRI = TII->getRegisterInfo();
+ auto AfterEndCf = std::next(EndCf.getIterator());
+ for (auto &Phi : EndBB->phis()) {
+ Register PhiDest = Phi.getOperand(0).getReg();
+ const TargetRegisterClass *PhiRC = MRI.getRegClass(PhiDest);
+
+ Register NewPhiDest = MRI.createVirtualRegister(PhiRC);
+ Phi.getOperand(0).setReg(NewPhiDest);
+
+ unsigned InactiveOpIdx = 0;
+ for (unsigned I = 1; I < Phi.getNumOperands(); I += 2) {
+ if (Phi.getOperand(I + 1).getMBB() == If.getParent()) {
+ InactiveOpIdx = I;
+ break;
+ }
+ }
+ assert(InactiveOpIdx != 0 && "Broken phi?");
+
+ // At this point, the register class could be larger than 32 or 64, so we
+ // might have to use more than one V_SET_INACTIVE instruction.
+ unsigned Size = TRI.getRegSizeInBits(*PhiRC);
+ switch (Size) {
+ case 32:
+ BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(),
+ TII->get(AMDGPU::V_SET_INACTIVE_B32), PhiDest)
+ .addReg(NewPhiDest)
+ .add(Phi.getOperand(InactiveOpIdx));
+ break;
+ case 64:
+ BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(),
+ TII->get(AMDGPU::V_SET_INACTIVE_B64), PhiDest)
+ .addReg(NewPhiDest)
+ .add(Phi.getOperand(InactiveOpIdx));
+ break;
+ default: {
+ // For each 32-bit subregister of the register at InactiveOpIdx, insert
+ // a COPY to a new register, and a V_SET_INACTIVE_B32 using the
+ // corresponding subregisters of PhiDest and NewPhiDest.
+ // FIXME: There has to be a better way to iterate over this...
+ llvm::SmallVector<Register, 16> PhiSubRegs;
+ const unsigned SubRegIndices[] = {
+ AMDGPU::sub0, AMDGPU::sub1, AMDGPU::sub2, AMDGPU::sub3,
+ AMDGPU::sub4, AMDGPU::sub5, AMDGPU::sub6, AMDGPU::sub7,
+ AMDGPU::sub8, AMDGPU::sub9, AMDGPU::sub10, AMDGPU::sub11,
+ AMDGPU::sub12, AMDGPU::sub13, AMDGPU::sub14, AMDGPU::sub15,
+ AMDGPU::sub16, AMDGPU::sub17, AMDGPU::sub18, AMDGPU::sub19,
+ AMDGPU::sub20, AMDGPU::sub21, AMDGPU::sub22, AMDGPU::sub23,
+ AMDGPU::sub24, AMDGPU::sub25, AMDGPU::sub26, AMDGPU::sub27,
+ AMDGPU::sub28, AMDGPU::sub29, AMDGPU::sub30, AMDGPU::sub31};
+ const unsigned NumSubRegs = Size / 32;
+ assert(sizeof(SubRegIndices) / sizeof(SubRegIndices[0]) >= NumSubRegs &&
+ "Not enough subregister indices");
+ for (unsigned I = 0; I != NumSubRegs; ++I) {
+ unsigned SubRegIdx = SubRegIndices[I];
+ Register InactiveSubReg =
+ MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
+ BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(), TII->get(AMDGPU::COPY),
+ InactiveSubReg)
+ .addReg(Phi.getOperand(InactiveOpIdx).getReg(), 0, SubRegIdx);
+
+ Register AllLanesSubReg =
+ MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
+ BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(),
+ TII->get(AMDGPU::V_SET_INACTIVE_B32), AllLanesSubReg)
+ .addReg(NewPhiDest, 0, SubRegIdx)
+ .addReg(InactiveSubReg);
+ PhiSubRegs.push_back(AllLanesSubReg);
+ }
+ // Now we need to combine the subregisters into the original register.
+ auto RegSequence = BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(),
+ TII->get(AMDGPU::REG_SEQUENCE), PhiDest);
+ for (unsigned I = 0; I < NumSubRegs; ++I) {
+ RegSequence.addReg(PhiSubRegs[I]);
+ RegSequence.addImm(SubRegIndices[I]);
+ }
+ break;
+ }
+ }
+
+ Phi.removeOperand(InactiveOpIdx + 1);
+ Phi.removeOperand(InactiveOpIdx);
+ }
+ If.getParent()->removeSuccessor(EndBB);
+
+ BuildMI(*EndBB, AfterEndCf, IWWIt->getDebugLoc(),
+ TII->get(MF.getSubtarget<GCNSubtarget>().isWave32()
+ ? AMDGPU::S_MOV_B32
+ : AMDGPU::S_MOV_B64),
+ TII->getRegisterInfo().getExec())
+ .addImm(-1);
+
+ EndCf.eraseFromParent();
+ If.eraseFromParent();
+ IWWIt->eraseFromParent();
+}
+
// Figure out which registers should be reserved for stack access. Only after
// the function is legalized do we know all of the non-spill stack objects or if
// calls are present.
@@ -15681,6 +15808,12 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
const SIRegisterInfo *TRI = Subtarget->getRegisterInfo();
const SIInstrInfo *TII = ST.getInstrInfo();
+ if (Info->hasInitWholeWave()) {
+ assert(Info->isChainFunction() &&
+ "init.whole.wave may only be used in chain functions");
+ removeInitWholeWaveBranch(MF, MRI, TII);
+ }
+
if (Info->isEntryFunction()) {
// Callable functions have fixed registers used for stack access.
reservePrivateMemoryRegs(getTargetMachine(), MF, *TRI, *Info);
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 814d3182fb5df8..ffffa18e13c30d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -575,6 +575,14 @@ def SI_INIT_EXEC_FROM_INPUT : SPseudoInstSI <
let Defs = [EXEC];
}
+// Sets EXEC to all lanes and returns the previous EXEC.
+def SI_INIT_WHOLE_WAVE : SPseudoInstSI <
+ (outs SReg_1:$dst), (ins),
+ [(set i1:$dst, (int_amdgcn_init_whole_wave))]> {
+ let Defs = [EXEC];
+ let Uses = [EXEC];
+}
+
// Return for returning shaders to a shader variant epilog.
def SI_RETURN_TO_EPILOG : SPseudoInstSI <
(outs), (ins variable_ops), [(AMDGPUreturn_to_epilog)]> {
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
index 7af5e7388f841e..7cebfa29fe7b8d 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
@@ -289,6 +289,8 @@ struct SIMachineFunctionInfo final : public yaml::MachineFunctionInfo {
StringValue SGPRForEXECCopy;
StringValue LongBranchReservedReg;
+ bool HasInitWholeWave = false;
+
SIMachineFunctionInfo() = default;
SIMachineFunctionInfo(const llvm::SIMachineFunctionInfo &,
const TargetRegisterInfo &TRI,
@@ -336,6 +338,7 @@ template <> struct MappingTraits<SIMachineFunctionInfo> {
StringValue()); // Don't print out when it's empty.
YamlIO.mapOptional("longBranchReservedReg", MFI.LongBranchReservedReg,
StringValue());
+ YamlIO.mapOptional("hasInitWholeWave", MFI.HasInitWholeWave, false);
}
};
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
new file mode 100644
index 00000000000000..6ad9e684273b28
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
@@ -0,0 +1,1545 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -global-isel=1 -O2 -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL12 %s
+; RUN: llc -global-isel=0 -O2 -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL12 %s
+; RUN: llc -global-isel=1 -O2 -mtriple=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL10 %s
+; RUN: llc -global-isel=0 -O2 -mtriple=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL10 %s
+
+define amdgpu_cs_chain void @basic(<3 x i32> inreg %sgpr, ptr inreg %callee, i32 inreg %exec, { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 %x, i32 %y) {
+; GISEL12-LABEL: basic:
+; GISEL12: ; %bb.0: ; %entry
+; GISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GISEL12-NEXT: s_wait_expcnt 0x0
+; GISEL12-NEXT: s_wait_samplecnt 0x0
+; GISEL12-NEXT: s_wait_bvhcnt 0x0
+; GISEL12-NEXT: s_wait_kmcnt 0x0
+; GISEL12-NEXT: s_mov_b32 s6, s3
+; GISEL12-NEXT: s_or_saveexec_b32 s3, -1
+; GISEL12-NEXT: v_dual_mov_b32 v0, v12 :: v_dual_mov_b32 v1, v13
+; GISEL12-NEXT: s_mov_b32 exec_lo, s3
+; GISEL12-NEXT: s_mov_b32 s7, s4
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v2, v0
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v2, 0x47
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_or_saveexec_b32 s3, -1
+; GISEL12-NEXT: v_cmp_ne_u32_e64 s4, 0, v2
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v2, s4
+; GISEL12-NEXT: s_mov_b32 exec_lo, s3
+; GISEL12-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_add_nc_u32 v3, 42, v0
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v0, v3
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v0, v0
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v1, v4
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v1, v1
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_mov_b32 exec_lo, -1
+; GISEL12-NEXT: v_dual_mov_b32 v10, v0 :: v_dual_mov_b32 v11, v1
+; GISEL12-NEXT: s_mov_b32 exec_lo, s5
+; GISEL12-NEXT: s_setpc_b64 s[6:7]
+;
+; DAGISEL12-LABEL: basic:
+; DAGISEL12: ; %bb.0: ; %entry
+; DAGISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; DAGISEL12-NEXT: s_wait_expcnt 0x0
+; DAGISEL12-NEXT: s_wait_samplecnt 0x0
+; DAGISEL12-NEXT: s_wait_bvhcnt 0x0
+; DAGISEL12-NEXT: s_wait_kmcnt 0x0
+; DAGISEL12-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL12-NEXT: v_dual_mov_b32 v0, v13 :: v_dual_mov_b32 v1, v12
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL12-NEXT: s_mov_b32 s7, s4
+; DAGISEL12-NEXT: s_mov_b32 s6, s3
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, v1
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, 0x47
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_or_saveexec_b32 s3, -1
+; DAGISEL12-NEXT: v_cmp_ne_u32_e64 s4, 0, v2
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s3
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_dual_mov_b32 v4, s4 :: v_dual_add_nc_u32 v3, 42, v1
+; DAGISEL12-NEXT: v_mov_b32_e32 v1, v3
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v1, v1
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v0, v4
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v0, v0
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, -1
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; DAGISEL12-NEXT: v_dual_mov_b32 v10, v1 :: v_dual_mov_b32 v11, v0
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s5
+; DAGISEL12-NEXT: s_setpc_b64 s[6:7]
+;
+; GISEL10-LABEL: basic:
+; GISEL10: ; %bb.0: ; %entry
+; GISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GISEL10-NEXT: s_mov_b32 s6, s3
+; GISEL10-NEXT: s_or_saveexec_b32 s3, -1
+; GISEL10-NEXT: v_mov_b32_e32 v0, v12
+; GISEL10-NEXT: v_mov_b32_e32 v1, v13
+; GISEL10-NEXT: s_mov_b32 exec_lo, s3
+; GISEL10-NEXT: s_mov_b32 s7, s4
+; GISEL10-NEXT: v_mov_b32_e32 v2, v0
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v2, 0x47
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: s_or_saveexec_b32 s3, -1
+; GISEL10-NEXT: v_cmp_ne_u32_e64 s4, 0, v2
+; GISEL10-NEXT: v_mov_b32_e32 v2, s4
+; GISEL10-NEXT: s_mov_b32 exec_lo, s3
+; GISEL10-NEXT: v_add_nc_u32_e32 v3, 42, v0
+; GISEL10-NEXT: v_mov_b32_e32 v4, v2
+; GISEL10-NEXT: v_mov_b32_e32 v0, v3
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v0, v0
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v1, v4
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v1, v1
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: s_mov_b32 exec_lo, -1
+; GISEL10-NEXT: v_mov_b32_e32 v10, v0
+; GISEL10-NEXT: v_mov_b32_e32 v11, v1
+; GISEL10-NEXT: s_mov_b32 exec_lo, s5
+; GISEL10-NEXT: s_setpc_b64 s[6:7]
+;
+; DAGISEL10-LABEL: basic:
+; DAGISEL10: ; %bb.0: ; %entry
+; DAGISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; DAGISEL10-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v13
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v12
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL10-NEXT: s_mov_b32 s7, s4
+; DAGISEL10-NEXT: s_mov_b32 s6, s3
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v1
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, 0x47
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: s_or_saveexec_b32 s3, -1
+; DAGISEL10-NEXT: v_cmp_ne_u32_e64 s4, 0, v2
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s3
+; DAGISEL10-NEXT: v_add_nc_u32_e32 v3, 42, v1
+; DAGISEL10-NEXT: v_mov_b32_e32 v4, s4
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v3
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v1
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v4
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v0
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v10, v1
+; DAGISEL10-NEXT: v_mov_b32_e32 v11, v0
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s5
+; DAGISEL10-NEXT: s_setpc_b64 s[6:7]
+entry:
+ %entry_exec = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %entry_exec, label %shader, label %tail
+
+shader:
+ %nonwwm = add i32 %x, 42
+
+ %full.vgpr = call i32 @llvm.amdgcn.set.inactive.i32(i32 %x, i32 71)
+ %non.zero = icmp ne i32 %full.vgpr, 0
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %non.zero)
+ %wwm = call i32 @llvm.amdgcn.strict.wwm.i32(i32 %ballot)
+
+ br label %tail
+
+tail:
+ %full.nonwwm = phi i32 [%x, %entry], [%nonwwm, %shader]
+ %full.wwm = phi i32 [%y, %entry], [%wwm, %shader]
+ %vgpr.1 = insertvalue { i32, ptr addrspace(5), i32, i32} %vgpr, i32 %full.nonwwm, 2
+ %vgpr.2 = insertvalue { i32, ptr addrspace(5), i32, i32} %vgpr.1, i32 %full.wwm, 3
+ call void(ptr, i32, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain(ptr %callee, i32 %exec, <3 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %vgpr.2, i32 0)
+ unreachable
+}
+
+define amdgpu_cs_chain void @phi_whole_struct(<3 x i32> inreg %sgpr, ptr inreg %callee, i32 inreg %exec, { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 %x, i32 %y) {
+; GISEL12-LABEL: phi_whole_struct:
+; GISEL12: ; %bb.0: ; %entry
+; GISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GISEL12-NEXT: s_wait_expcnt 0x0
+; GISEL12-NEXT: s_wait_samplecnt 0x0
+; GISEL12-NEXT: s_wait_bvhcnt 0x0
+; GISEL12-NEXT: s_wait_kmcnt 0x0
+; GISEL12-NEXT: s_or_saveexec_b32 s8, -1
+; GISEL12-NEXT: v_dual_mov_b32 v0, v8 :: v_dual_mov_b32 v1, v9
+; GISEL12-NEXT: v_dual_mov_b32 v2, v10 :: v_dual_mov_b32 v3, v11
+; GISEL12-NEXT: s_mov_b32 exec_lo, s8
+; GISEL12-NEXT: s_mov_b32 s6, s3
+; GISEL12-NEXT: s_mov_b32 s7, s4
+; GISEL12-NEXT: v_mov_b32_e32 v4, v12
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v4, 0x47
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_or_saveexec_b32 s3, -1
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_cmp_ne_u32_e64 s4, 0, v4
+; GISEL12-NEXT: v_mov_b32_e32 v4, s4
+; GISEL12-NEXT: s_mov_b32 exec_lo, s3
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_dual_mov_b32 v6, v4 :: v_dual_add_nc_u32 v5, 42, v12
+; GISEL12-NEXT: v_mov_b32_e32 v0, v0
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v0, v0
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v1, v1
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v1, v1
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v2, v5
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v2, v2
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v3, v6
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_2)
+; GISEL12-NEXT: v_mov_b32_e32 v3, v3
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_mov_b32 exec_lo, -1
+; GISEL12-NEXT: v_dual_mov_b32 v8, v0 :: v_dual_mov_b32 v9, v1
+; GISEL12-NEXT: v_dual_mov_b32 v10, v2 :: v_dual_mov_b32 v11, v3
+; GISEL12-NEXT: s_mov_b32 exec_lo, s5
+; GISEL12-NEXT: s_setpc_b64 s[6:7]
+;
+; DAGISEL12-LABEL: phi_whole_struct:
+; DAGISEL12: ; %bb.0: ; %entry
+; DAGISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; DAGISEL12-NEXT: s_wait_expcnt 0x0
+; DAGISEL12-NEXT: s_wait_samplecnt 0x0
+; DAGISEL12-NEXT: s_wait_bvhcnt 0x0
+; DAGISEL12-NEXT: s_wait_kmcnt 0x0
+; DAGISEL12-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL12-NEXT: v_dual_mov_b32 v0, v11 :: v_dual_mov_b32 v1, v10
+; DAGISEL12-NEXT: v_dual_mov_b32 v2, v9 :: v_dual_mov_b32 v3, v8
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL12-NEXT: s_mov_b32 s7, s4
+; DAGISEL12-NEXT: s_mov_b32 s6, s3
+; DAGISEL12-NEXT: v_mov_b32_e32 v4, v12
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v4, 0x47
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_or_saveexec_b32 s3, -1
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_cmp_ne_u32_e64 s4, 0, v4
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s3
+; DAGISEL12-NEXT: v_dual_mov_b32 v6, s4 :: v_dual_add_nc_u32 v5, 42, v12
+; DAGISEL12-NEXT: v_mov_b32_e32 v3, v3
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v3, v3
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, v2
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, v2
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v1, v5
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v1, v1
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v0, v6
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v0, v0
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, -1
+; DAGISEL12-NEXT: v_dual_mov_b32 v8, v3 :: v_dual_mov_b32 v9, v2
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; DAGISEL12-NEXT: v_dual_mov_b32 v10, v1 :: v_dual_mov_b32 v11, v0
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s5
+; DAGISEL12-NEXT: s_setpc_b64 s[6:7]
+;
+; GISEL10-LABEL: phi_whole_struct:
+; GISEL10: ; %bb.0: ; %entry
+; GISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GISEL10-NEXT: s_or_saveexec_b32 s8, -1
+; GISEL10-NEXT: v_mov_b32_e32 v0, v8
+; GISEL10-NEXT: v_mov_b32_e32 v1, v9
+; GISEL10-NEXT: v_mov_b32_e32 v2, v10
+; GISEL10-NEXT: v_mov_b32_e32 v3, v11
+; GISEL10-NEXT: s_mov_b32 exec_lo, s8
+; GISEL10-NEXT: s_mov_b32 s6, s3
+; GISEL10-NEXT: s_mov_b32 s7, s4
+; GISEL10-NEXT: v_mov_b32_e32 v4, v12
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v4, 0x47
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: s_or_saveexec_b32 s3, -1
+; GISEL10-NEXT: v_cmp_ne_u32_e64 s4, 0, v4
+; GISEL10-NEXT: v_mov_b32_e32 v4, s4
+; GISEL10-NEXT: s_mov_b32 exec_lo, s3
+; GISEL10-NEXT: v_add_nc_u32_e32 v5, 42, v12
+; GISEL10-NEXT: v_mov_b32_e32 v6, v4
+; GISEL10-NEXT: v_mov_b32_e32 v0, v0
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v0, v0
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v1, v1
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v1, v1
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v2, v5
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v2, v2
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v3, v6
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v3, v3
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: s_mov_b32 exec_lo, -1
+; GISEL10-NEXT: v_mov_b32_e32 v8, v0
+; GISEL10-NEXT: v_mov_b32_e32 v9, v1
+; GISEL10-NEXT: v_mov_b32_e32 v10, v2
+; GISEL10-NEXT: v_mov_b32_e32 v11, v3
+; GISEL10-NEXT: s_mov_b32 exec_lo, s5
+; GISEL10-NEXT: s_setpc_b64 s[6:7]
+;
+; DAGISEL10-LABEL: phi_whole_struct:
+; DAGISEL10: ; %bb.0: ; %entry
+; DAGISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; DAGISEL10-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v11
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v10
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v9
+; DAGISEL10-NEXT: v_mov_b32_e32 v3, v8
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL10-NEXT: s_mov_b32 s7, s4
+; DAGISEL10-NEXT: s_mov_b32 s6, s3
+; DAGISEL10-NEXT: v_mov_b32_e32 v4, v12
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v4, 0x47
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: s_or_saveexec_b32 s3, -1
+; DAGISEL10-NEXT: v_cmp_ne_u32_e64 s4, 0, v4
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s3
+; DAGISEL10-NEXT: v_add_nc_u32_e32 v5, 42, v12
+; DAGISEL10-NEXT: v_mov_b32_e32 v6, s4
+; DAGISEL10-NEXT: v_mov_b32_e32 v3, v3
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v3, v3
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v2
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v2
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v5
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v1
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v6
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v0
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v8, v3
+; DAGISEL10-NEXT: v_mov_b32_e32 v9, v2
+; DAGISEL10-NEXT: v_mov_b32_e32 v10, v1
+; DAGISEL10-NEXT: v_mov_b32_e32 v11, v0
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s5
+; DAGISEL10-NEXT: s_setpc_b64 s[6:7]
+entry:
+ %entry_exec = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %entry_exec, label %shader, label %tail
+
+shader:
+ %nonwwm = add i32 %x, 42
+ %vgpr.1 = insertvalue { i32, ptr addrspace(5), i32, i32} %vgpr, i32 %nonwwm, 2
+
+ %full.vgpr = call i32 @llvm.amdgcn.set.inactive.i32(i32 %x, i32 71)
+ %non.zero = icmp ne i32 %full.vgpr, 0
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %non.zero)
+ %wwm = call i32 @llvm.amdgcn.strict.wwm.i32(i32 %ballot)
+ %vgpr.2 = insertvalue { i32, ptr addrspace(5), i32, i32} %vgpr.1, i32 %wwm, 3
+
+ br label %tail
+
+tail:
+ %vgpr.args = phi { i32, ptr addrspace(5), i32, i32} [%vgpr, %entry], [%vgpr.2, %shader]
+ call void(ptr, i32, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain(ptr %callee, i32 %exec, <3 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %vgpr.args, i32 0)
+ unreachable
+}
+
+; Introduce more complex control flow - %shader contains a simple loop, and %tail contains an if.
+define amdgpu_cs_chain void @control_flow(<3 x i32> inreg %sgpr, ptr inreg %callee, i32 inreg %exec, { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 %x, i32 %y) {
+; GISEL12-LABEL: control_flow:
+; GISEL12: ; %bb.0: ; %entry
+; GISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GISEL12-NEXT: s_wait_expcnt 0x0
+; GISEL12-NEXT: s_wait_samplecnt 0x0
+; GISEL12-NEXT: s_wait_bvhcnt 0x0
+; GISEL12-NEXT: s_wait_kmcnt 0x0
+; GISEL12-NEXT: s_or_saveexec_b32 s8, -1
+; GISEL12-NEXT: v_dual_mov_b32 v0, v9 :: v_dual_mov_b32 v1, v10
+; GISEL12-NEXT: v_mov_b32_e32 v2, v11
+; GISEL12-NEXT: s_mov_b32 exec_lo, s8
+; GISEL12-NEXT: v_add_nc_u32_e32 v4, -1, v12
+; GISEL12-NEXT: s_mov_b32 s6, s3
+; GISEL12-NEXT: s_mov_b32 s7, s4
+; GISEL12-NEXT: s_mov_b32 s3, 0
+; GISEL12-NEXT: .LBB2_1: ; %shader
+; GISEL12-NEXT: ; =>This Inner Loop Header: Depth=1
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_add_nc_u32_e32 v4, 1, v4
+; GISEL12-NEXT: v_mov_b32_e32 v3, v4
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v3, 0x47
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_or_saveexec_b32 s4, -1
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_cmp_ne_u32_e64 s8, 0, v3
+; GISEL12-NEXT: v_mov_b32_e32 v3, s8
+; GISEL12-NEXT: s_mov_b32 exec_lo, s4
+; GISEL12-NEXT: v_cmp_eq_u32_e32 vcc_lo, v13, v4
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GISEL12-NEXT: v_mov_b32_e32 v5, v3
+; GISEL12-NEXT: s_or_b32 s3, vcc_lo, s3
+; GISEL12-NEXT: s_and_not1_b32 exec_lo, exec_lo, s3
+; GISEL12-NEXT: s_cbranch_execnz .LBB2_1
+; GISEL12-NEXT: ; %bb.2: ; %tail.loopexit
+; GISEL12-NEXT: s_or_b32 exec_lo, exec_lo, s3
+; GISEL12-NEXT: v_add_nc_u32_e32 v4, 42, v4
+; GISEL12-NEXT: v_mov_b32_e32 v0, v0
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v0, v0
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v1, v4
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v1, v1
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v2, v5
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(SALU_CYCLE_1)
+; GISEL12-NEXT: v_mov_b32_e32 v2, v2
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_mov_b32 exec_lo, -1
+; GISEL12-NEXT: s_mov_b32 s3, exec_lo
+; GISEL12-NEXT: ; implicit-def: $vgpr8
+; GISEL12-NEXT: v_cmpx_lt_i32_e64 v12, v13
+; GISEL12-NEXT: s_xor_b32 s3, exec_lo, s3
+; GISEL12-NEXT: ; %bb.3: ; %tail.else
+; GISEL12-NEXT: s_or_saveexec_b32 s4, -1
+; GISEL12-NEXT: v_mov_b32_e32 v3, 15
+; GISEL12-NEXT: s_mov_b32 exec_lo, s4
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v8, v3
+; GISEL12-NEXT: ; %bb.4: ; %Flow
+; GISEL12-NEXT: s_and_not1_saveexec_b32 s3, s3
+; GISEL12-NEXT: ; %bb.5: ; %tail.then
+; GISEL12-NEXT: s_mov_b32 s4, 44
+; GISEL12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GISEL12-NEXT: v_mov_b32_e32 v8, s4
+; GISEL12-NEXT: ; %bb.6: ; %tail.end
+; GISEL12-NEXT: s_or_b32 exec_lo, exec_lo, s3
+; GISEL12-NEXT: v_dual_mov_b32 v9, v0 :: v_dual_mov_b32 v10, v1
+; GISEL12-NEXT: v_mov_b32_e32 v11, v2
+; GISEL12-NEXT: s_mov_b32 exec_lo, s5
+; GISEL12-NEXT: s_setpc_b64 s[6:7]
+;
+; DAGISEL12-LABEL: control_flow:
+; DAGISEL12: ; %bb.0: ; %entry
+; DAGISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; DAGISEL12-NEXT: s_wait_expcnt 0x0
+; DAGISEL12-NEXT: s_wait_samplecnt 0x0
+; DAGISEL12-NEXT: s_wait_bvhcnt 0x0
+; DAGISEL12-NEXT: s_wait_kmcnt 0x0
+; DAGISEL12-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL12-NEXT: v_dual_mov_b32 v0, v11 :: v_dual_mov_b32 v1, v10
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, v9
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL12-NEXT: v_add_nc_u32_e32 v4, -1, v12
+; DAGISEL12-NEXT: s_mov_b32 s7, s4
+; DAGISEL12-NEXT: s_mov_b32 s6, s3
+; DAGISEL12-NEXT: s_mov_b32 s3, 0
+; DAGISEL12-NEXT: .LBB2_1: ; %shader
+; DAGISEL12-NEXT: ; =>This Inner Loop Header: Depth=1
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_add_nc_u32_e32 v4, 1, v4
+; DAGISEL12-NEXT: v_mov_b32_e32 v3, v4
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v3, 0x47
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_or_saveexec_b32 s4, -1
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_2)
+; DAGISEL12-NEXT: v_cmp_ne_u32_e64 s8, 0, v3
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s4
+; DAGISEL12-NEXT: v_cmp_eq_u32_e32 vcc_lo, v13, v4
+; DAGISEL12-NEXT: v_mov_b32_e32 v5, s8
+; DAGISEL12-NEXT: s_or_b32 s3, vcc_lo, s3
+; DAGISEL12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; DAGISEL12-NEXT: s_and_not1_b32 exec_lo, exec_lo, s3
+; DAGISEL12-NEXT: s_cbranch_execnz .LBB2_1
+; DAGISEL12-NEXT: ; %bb.2: ; %tail.loopexit
+; DAGISEL12-NEXT: s_or_b32 exec_lo, exec_lo, s3
+; DAGISEL12-NEXT: v_add_nc_u32_e32 v4, 42, v4
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, v2
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, v2
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v1, v4
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v1, v1
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v0, v5
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(SALU_CYCLE_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v0, v0
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, -1
+; DAGISEL12-NEXT: s_mov_b32 s3, exec_lo
+; DAGISEL12-NEXT: ; implicit-def: $vgpr8
+; DAGISEL12-NEXT: v_cmpx_lt_i32_e64 v12, v13
+; DAGISEL12-NEXT: s_xor_b32 s3, exec_lo, s3
+; DAGISEL12-NEXT: ; %bb.3: ; %tail.else
+; DAGISEL12-NEXT: s_mov_b32 s4, 15
+; DAGISEL12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v8, s4
+; DAGISEL12-NEXT: ; %bb.4: ; %Flow
+; DAGISEL12-NEXT: s_and_not1_saveexec_b32 s3, s3
+; DAGISEL12-NEXT: ; %bb.5: ; %tail.then
+; DAGISEL12-NEXT: v_mov_b32_e32 v8, 44
+; DAGISEL12-NEXT: ; %bb.6: ; %tail.end
+; DAGISEL12-NEXT: s_or_b32 exec_lo, exec_lo, s3
+; DAGISEL12-NEXT: v_dual_mov_b32 v9, v2 :: v_dual_mov_b32 v10, v1
+; DAGISEL12-NEXT: v_mov_b32_e32 v11, v0
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s5
+; DAGISEL12-NEXT: s_setpc_b64 s[6:7]
+;
+; GISEL10-LABEL: control_flow:
+; GISEL10: ; %bb.0: ; %entry
+; GISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GISEL10-NEXT: s_or_saveexec_b32 s8, -1
+; GISEL10-NEXT: v_mov_b32_e32 v0, v9
+; GISEL10-NEXT: v_mov_b32_e32 v1, v10
+; GISEL10-NEXT: v_mov_b32_e32 v2, v11
+; GISEL10-NEXT: s_mov_b32 exec_lo, s8
+; GISEL10-NEXT: v_add_nc_u32_e32 v4, -1, v12
+; GISEL10-NEXT: s_mov_b32 s6, s3
+; GISEL10-NEXT: s_mov_b32 s7, s4
+; GISEL10-NEXT: s_mov_b32 s3, 0
+; GISEL10-NEXT: .LBB2_1: ; %shader
+; GISEL10-NEXT: ; =>This Inner Loop Header: Depth=1
+; GISEL10-NEXT: v_add_nc_u32_e32 v4, 1, v4
+; GISEL10-NEXT: v_mov_b32_e32 v3, v4
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v3, 0x47
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: s_or_saveexec_b32 s4, -1
+; GISEL10-NEXT: v_cmp_ne_u32_e64 s8, 0, v3
+; GISEL10-NEXT: v_mov_b32_e32 v3, s8
+; GISEL10-NEXT: s_mov_b32 exec_lo, s4
+; GISEL10-NEXT: v_cmp_eq_u32_e32 vcc_lo, v13, v4
+; GISEL10-NEXT: v_mov_b32_e32 v5, v3
+; GISEL10-NEXT: s_or_b32 s3, vcc_lo, s3
+; GISEL10-NEXT: s_andn2_b32 exec_lo, exec_lo, s3
+; GISEL10-NEXT: s_cbranch_execnz .LBB2_1
+; GISEL10-NEXT: ; %bb.2: ; %tail.loopexit
+; GISEL10-NEXT: s_or_b32 exec_lo, exec_lo, s3
+; GISEL10-NEXT: v_add_nc_u32_e32 v4, 42, v4
+; GISEL10-NEXT: v_mov_b32_e32 v0, v0
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v0, v0
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v1, v4
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v1, v1
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v2, v5
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v2, v2
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: s_mov_b32 exec_lo, -1
+; GISEL10-NEXT: s_mov_b32 s3, exec_lo
+; GISEL10-NEXT: ; implicit-def: $vgpr8
+; GISEL10-NEXT: v_cmpx_lt_i32_e64 v12, v13
+; GISEL10-NEXT: s_xor_b32 s3, exec_lo, s3
+; GISEL10-NEXT: ; %bb.3: ; %tail.else
+; GISEL10-NEXT: s_or_saveexec_b32 s4, -1
+; GISEL10-NEXT: v_mov_b32_e32 v3, 15
+; GISEL10-NEXT: s_mov_b32 exec_lo, s4
+; GISEL10-NEXT: v_mov_b32_e32 v8, v3
+; GISEL10-NEXT: ; %bb.4: ; %Flow
+; GISEL10-NEXT: s_andn2_saveexec_b32 s3, s3
+; GISEL10-NEXT: ; %bb.5: ; %tail.then
+; GISEL10-NEXT: s_mov_b32 s4, 44
+; GISEL10-NEXT: v_mov_b32_e32 v8, s4
+; GISEL10-NEXT: ; %bb.6: ; %tail.end
+; GISEL10-NEXT: s_or_b32 exec_lo, exec_lo, s3
+; GISEL10-NEXT: v_mov_b32_e32 v9, v0
+; GISEL10-NEXT: v_mov_b32_e32 v10, v1
+; GISEL10-NEXT: v_mov_b32_e32 v11, v2
+; GISEL10-NEXT: s_mov_b32 exec_lo, s5
+; GISEL10-NEXT: s_setpc_b64 s[6:7]
+;
+; DAGISEL10-LABEL: control_flow:
+; DAGISEL10: ; %bb.0: ; %entry
+; DAGISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; DAGISEL10-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v11
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v10
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v9
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL10-NEXT: v_add_nc_u32_e32 v4, -1, v12
+; DAGISEL10-NEXT: s_mov_b32 s7, s4
+; DAGISEL10-NEXT: s_mov_b32 s6, s3
+; DAGISEL10-NEXT: s_mov_b32 s3, 0
+; DAGISEL10-NEXT: .LBB2_1: ; %shader
+; DAGISEL10-NEXT: ; =>This Inner Loop Header: Depth=1
+; DAGISEL10-NEXT: v_add_nc_u32_e32 v4, 1, v4
+; DAGISEL10-NEXT: v_mov_b32_e32 v3, v4
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v3, 0x47
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: s_or_saveexec_b32 s4, -1
+; DAGISEL10-NEXT: v_cmp_ne_u32_e64 s8, 0, v3
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s4
+; DAGISEL10-NEXT: v_cmp_eq_u32_e32 vcc_lo, v13, v4
+; DAGISEL10-NEXT: v_mov_b32_e32 v5, s8
+; DAGISEL10-NEXT: s_or_b32 s3, vcc_lo, s3
+; DAGISEL10-NEXT: s_andn2_b32 exec_lo, exec_lo, s3
+; DAGISEL10-NEXT: s_cbranch_execnz .LBB2_1
+; DAGISEL10-NEXT: ; %bb.2: ; %tail.loopexit
+; DAGISEL10-NEXT: s_or_b32 exec_lo, exec_lo, s3
+; DAGISEL10-NEXT: v_add_nc_u32_e32 v4, 42, v4
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v2
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v2
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v4
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v1
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v5
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v0
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, -1
+; DAGISEL10-NEXT: s_mov_b32 s3, exec_lo
+; DAGISEL10-NEXT: ; implicit-def: $vgpr8
+; DAGISEL10-NEXT: v_cmpx_lt_i32_e64 v12, v13
+; DAGISEL10-NEXT: s_xor_b32 s3, exec_lo, s3
+; DAGISEL10-NEXT: ; %bb.3: ; %tail.else
+; DAGISEL10-NEXT: s_mov_b32 s4, 15
+; DAGISEL10-NEXT: v_mov_b32_e32 v8, s4
+; DAGISEL10-NEXT: ; %bb.4: ; %Flow
+; DAGISEL10-NEXT: s_andn2_saveexec_b32 s3, s3
+; DAGISEL10-NEXT: ; %bb.5: ; %tail.then
+; DAGISEL10-NEXT: v_mov_b32_e32 v8, 44
+; DAGISEL10-NEXT: ; %bb.6: ; %tail.end
+; DAGISEL10-NEXT: s_or_b32 exec_lo, exec_lo, s3
+; DAGISEL10-NEXT: v_mov_b32_e32 v9, v2
+; DAGISEL10-NEXT: v_mov_b32_e32 v10, v1
+; DAGISEL10-NEXT: v_mov_b32_e32 v11, v0
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s5
+; DAGISEL10-NEXT: s_setpc_b64 s[6:7]
+entry:
+ %entry_exec = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %entry_exec, label %shader, label %tail
+
+shader:
+ %i = phi i32 [%x, %entry], [%i.inc, %shader]
+
+ %nonwwm = add i32 %i, 42
+ %vgpr.1 = insertvalue { i32, ptr addrspace(5), i32, i32} %vgpr, i32 %nonwwm, 2
+
+ %full.vgpr = call i32 @llvm.amdgcn.set.inactive.i32(i32 %i, i32 71)
+ %non.zero = icmp ne i32 %full.vgpr, 0
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %non.zero)
+ %wwm = call i32 @llvm.amdgcn.strict.wwm.i32(i32 %ballot)
+ %vgpr.2 = insertvalue { i32, ptr addrspace(5), i32, i32} %vgpr.1, i32 %wwm, 3
+
+ %i.inc = add i32 %i, 1
+ %loop.cond = icmp ne i32 %i, %y
+ br i1 %loop.cond, label %shader, label %tail
+
+tail:
+ %vgpr.tail = phi { i32, ptr addrspace(5), i32, i32} [%vgpr, %entry], [%vgpr.2, %shader]
+
+ %if.cond = icmp sge i32 %x, %y
+ br i1 %if.cond, label %tail.then, label %tail.else
+
+tail.then:
+ %vgpr.then = insertvalue { i32, ptr addrspace(5), i32, i32} %vgpr.tail, i32 44, 0
+ br label %tail.end
+
+tail.else:
+ %wwm.tail = call i32 @llvm.amdgcn.strict.wwm.i32(i32 15)
+ %vgpr.else = insertvalue { i32, ptr addrspace(5), i32, i32} %vgpr.tail, i32 %wwm.tail, 0
+ br label %tail.end
+
+tail.end:
+ %vgpr.args = phi { i32, ptr addrspace(5), i32, i32 } [%vgpr.then, %tail.then], [%vgpr.else, %tail.else]
+ call void(ptr, i32, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain(ptr %callee, i32 %exec, <3 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %vgpr.args, i32 0)
+ unreachable
+}
+
+; Try with v0-v7 occupied - this will force us to use higher registers for temporaries. Make sure we don't preserve them.
+define amdgpu_cs_chain void @control_flow_use_v0_7(<3 x i32> inreg %sgpr, ptr inreg %callee, i32 inreg %exec, { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 %x, i32 %y) {
+; GISEL12-LABEL: control_flow_use_v0_7:
+; GISEL12: ; %bb.0: ; %entry
+; GISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GISEL12-NEXT: s_wait_expcnt 0x0
+; GISEL12-NEXT: s_wait_samplecnt 0x0
+; GISEL12-NEXT: s_wait_bvhcnt 0x0
+; GISEL12-NEXT: s_wait_kmcnt 0x0
+; GISEL12-NEXT: s_or_saveexec_b32 s8, -1
+; GISEL12-NEXT: v_dual_mov_b32 v13, v8 :: v_dual_mov_b32 v14, v9
+; GISEL12-NEXT: v_dual_mov_b32 v15, v10 :: v_dual_mov_b32 v16, v11
+; GISEL12-NEXT: s_mov_b32 exec_lo, s8
+; GISEL12-NEXT: s_mov_b32 s6, s3
+; GISEL12-NEXT: s_mov_b32 s7, s4
+; GISEL12-NEXT: v_mov_b32_e32 v17, v12
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v17, 0x47
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_or_saveexec_b32 s3, -1
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_cmp_ne_u32_e64 s4, 0, v17
+; GISEL12-NEXT: v_mov_b32_e32 v17, s4
+; GISEL12-NEXT: s_mov_b32 exec_lo, s3
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_dual_mov_b32 v9, v17 :: v_dual_add_nc_u32 v8, 42, v12
+; GISEL12-NEXT: ;;#ASMSTART
+; GISEL12-NEXT: ; use v0-7
+; GISEL12-NEXT: ;;#ASMEND
+; GISEL12-NEXT: v_mov_b32_e32 v13, v13
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v13, v13
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v14, v14
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v14, v14
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v15, v8
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v15, v15
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v16, v9
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_2)
+; GISEL12-NEXT: v_mov_b32_e32 v16, v16
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_mov_b32 exec_lo, -1
+; GISEL12-NEXT: v_dual_mov_b32 v8, v13 :: v_dual_mov_b32 v9, v14
+; GISEL12-NEXT: v_dual_mov_b32 v10, v15 :: v_dual_mov_b32 v11, v16
+; GISEL12-NEXT: s_mov_b32 exec_lo, s5
+; GISEL12-NEXT: s_setpc_b64 s[6:7]
+;
+; DAGISEL12-LABEL: control_flow_use_v0_7:
+; DAGISEL12: ; %bb.0: ; %entry
+; DAGISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; DAGISEL12-NEXT: s_wait_expcnt 0x0
+; DAGISEL12-NEXT: s_wait_samplecnt 0x0
+; DAGISEL12-NEXT: s_wait_bvhcnt 0x0
+; DAGISEL12-NEXT: s_wait_kmcnt 0x0
+; DAGISEL12-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL12-NEXT: v_dual_mov_b32 v13, v11 :: v_dual_mov_b32 v14, v10
+; DAGISEL12-NEXT: v_dual_mov_b32 v15, v9 :: v_dual_mov_b32 v16, v8
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL12-NEXT: s_mov_b32 s7, s4
+; DAGISEL12-NEXT: s_mov_b32 s6, s3
+; DAGISEL12-NEXT: v_mov_b32_e32 v17, v12
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v17, 0x47
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_or_saveexec_b32 s3, -1
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_cmp_ne_u32_e64 s4, 0, v17
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s3
+; DAGISEL12-NEXT: v_dual_mov_b32 v9, s4 :: v_dual_add_nc_u32 v8, 42, v12
+; DAGISEL12-NEXT: ;;#ASMSTART
+; DAGISEL12-NEXT: ; use v0-7
+; DAGISEL12-NEXT: ;;#ASMEND
+; DAGISEL12-NEXT: v_mov_b32_e32 v16, v16
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v16, v16
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v15, v15
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v15, v15
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v14, v8
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v14, v14
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v13, v9
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v13, v13
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, -1
+; DAGISEL12-NEXT: v_dual_mov_b32 v8, v16 :: v_dual_mov_b32 v9, v15
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; DAGISEL12-NEXT: v_dual_mov_b32 v10, v14 :: v_dual_mov_b32 v11, v13
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s5
+; DAGISEL12-NEXT: s_setpc_b64 s[6:7]
+;
+; GISEL10-LABEL: control_flow_use_v0_7:
+; GISEL10: ; %bb.0: ; %entry
+; GISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GISEL10-NEXT: s_or_saveexec_b32 s8, -1
+; GISEL10-NEXT: v_mov_b32_e32 v13, v8
+; GISEL10-NEXT: v_mov_b32_e32 v14, v9
+; GISEL10-NEXT: v_mov_b32_e32 v15, v10
+; GISEL10-NEXT: v_mov_b32_e32 v16, v11
+; GISEL10-NEXT: s_mov_b32 exec_lo, s8
+; GISEL10-NEXT: s_mov_b32 s6, s3
+; GISEL10-NEXT: s_mov_b32 s7, s4
+; GISEL10-NEXT: v_mov_b32_e32 v17, v12
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v17, 0x47
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: s_or_saveexec_b32 s3, -1
+; GISEL10-NEXT: v_cmp_ne_u32_e64 s4, 0, v17
+; GISEL10-NEXT: v_mov_b32_e32 v17, s4
+; GISEL10-NEXT: s_mov_b32 exec_lo, s3
+; GISEL10-NEXT: v_add_nc_u32_e32 v8, 42, v12
+; GISEL10-NEXT: v_mov_b32_e32 v9, v17
+; GISEL10-NEXT: ;;#ASMSTART
+; GISEL10-NEXT: ; use v0-7
+; GISEL10-NEXT: ;;#ASMEND
+; GISEL10-NEXT: v_mov_b32_e32 v13, v13
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v13, v13
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v14, v14
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v14, v14
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v15, v8
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v15, v15
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v16, v9
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v16, v16
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: s_mov_b32 exec_lo, -1
+; GISEL10-NEXT: v_mov_b32_e32 v8, v13
+; GISEL10-NEXT: v_mov_b32_e32 v9, v14
+; GISEL10-NEXT: v_mov_b32_e32 v10, v15
+; GISEL10-NEXT: v_mov_b32_e32 v11, v16
+; GISEL10-NEXT: s_mov_b32 exec_lo, s5
+; GISEL10-NEXT: s_setpc_b64 s[6:7]
+;
+; DAGISEL10-LABEL: control_flow_use_v0_7:
+; DAGISEL10: ; %bb.0: ; %entry
+; DAGISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; DAGISEL10-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v13, v11
+; DAGISEL10-NEXT: v_mov_b32_e32 v14, v10
+; DAGISEL10-NEXT: v_mov_b32_e32 v15, v9
+; DAGISEL10-NEXT: v_mov_b32_e32 v16, v8
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL10-NEXT: s_mov_b32 s7, s4
+; DAGISEL10-NEXT: s_mov_b32 s6, s3
+; DAGISEL10-NEXT: v_mov_b32_e32 v17, v12
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v17, 0x47
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: s_or_saveexec_b32 s3, -1
+; DAGISEL10-NEXT: v_cmp_ne_u32_e64 s4, 0, v17
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s3
+; DAGISEL10-NEXT: v_add_nc_u32_e32 v8, 42, v12
+; DAGISEL10-NEXT: ;;#ASMSTART
+; DAGISEL10-NEXT: ; use v0-7
+; DAGISEL10-NEXT: ;;#ASMEND
+; DAGISEL10-NEXT: v_mov_b32_e32 v9, s4
+; DAGISEL10-NEXT: v_mov_b32_e32 v16, v16
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v16, v16
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v15, v15
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v15, v15
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v14, v8
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v14, v14
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v13, v9
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v13, v13
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v8, v16
+; DAGISEL10-NEXT: v_mov_b32_e32 v9, v15
+; DAGISEL10-NEXT: v_mov_b32_e32 v10, v14
+; DAGISEL10-NEXT: v_mov_b32_e32 v11, v13
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s5
+; DAGISEL10-NEXT: s_setpc_b64 s[6:7]
+entry:
+ %entry_exec = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %entry_exec, label %shader, label %tail
+
+shader:
+ call void asm sideeffect "; use v0-7", "~{v0},~{v1},~{v2},~{v3},~{v4},~{v5},~{v6},~{v7}"()
+
+ %nonwwm = add i32 %x, 42
+ %vgpr.1 = insertvalue { i32, ptr addrspace(5), i32, i32} %vgpr, i32 %nonwwm, 2
+
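+ ; Whole-wave computation: set.inactive writes 71 into the inactive lanes,
+ ; then the ballot is wrapped in strict.wwm so it is computed with all
+ ; lanes enabled.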
+ %full.vgpr = call i32 @llvm.amdgcn.set.inactive.i32(i32 %x, i32 71)
+ %non.zero = icmp ne i32 %full.vgpr, 0
+ %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 %non.zero)
+ %wwm = call i32 @llvm.amdgcn.strict.wwm.i32(i32 %ballot)
+ %vgpr.2 = insertvalue { i32, ptr addrspace(5), i32, i32} %vgpr.1, i32 %wwm, 3
+
+ br label %tail
+
+tail:
+ %vgpr.args = phi { i32, ptr addrspace(5), i32, i32} [%vgpr, %entry], [%vgpr.2, %shader]
+ call void(ptr, i32, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain(ptr %callee, i32 %exec, <3 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %vgpr.args, i32 0)
+ unreachable
+}
+
+
+; Check that the inactive lanes of v8:15 are correctly preserved even across a
+; WWM call that reads and writes them.
+define amdgpu_cs_chain void @wwm_write_to_arg_reg(<3 x i32> inreg %sgpr, ptr inreg %callee, i32 inreg %exec, <16 x i32> %vgpr, i32 %x, i32 %y) {
+; GISEL12-LABEL: wwm_write_to_arg_reg:
+; GISEL12: ; %bb.0: ; %entry
+; GISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GISEL12-NEXT: s_wait_expcnt 0x0
+; GISEL12-NEXT: s_wait_samplecnt 0x0
+; GISEL12-NEXT: s_wait_bvhcnt 0x0
+; GISEL12-NEXT: s_wait_kmcnt 0x0
+; GISEL12-NEXT: s_mov_b32 s32, 0
+; GISEL12-NEXT: s_or_saveexec_b32 s9, -1
+; GISEL12-NEXT: s_mov_b32 s11, s4
+; GISEL12-NEXT: v_dual_mov_b32 v40, v8 :: v_dual_mov_b32 v41, v9
+; GISEL12-NEXT: v_dual_mov_b32 v42, v10 :: v_dual_mov_b32 v43, v11
+; GISEL12-NEXT: v_dual_mov_b32 v44, v12 :: v_dual_mov_b32 v45, v13
+; GISEL12-NEXT: v_dual_mov_b32 v46, v14 :: v_dual_mov_b32 v47, v15
+; GISEL12-NEXT: v_dual_mov_b32 v56, v16 :: v_dual_mov_b32 v57, v17
+; GISEL12-NEXT: v_dual_mov_b32 v58, v18 :: v_dual_mov_b32 v59, v19
+; GISEL12-NEXT: v_dual_mov_b32 v60, v20 :: v_dual_mov_b32 v61, v21
+; GISEL12-NEXT: v_dual_mov_b32 v62, v22 :: v_dual_mov_b32 v63, v23
+; GISEL12-NEXT: s_mov_b32 exec_lo, s9
+; GISEL12-NEXT: s_or_saveexec_b32 s4, -1
+; GISEL12-NEXT: s_mov_b32 s6, s0
+; GISEL12-NEXT: s_mov_b32 s7, s1
+; GISEL12-NEXT: s_getpc_b64 s[0:1]
+; GISEL12-NEXT: s_sext_i32_i16 s1, s1
+; GISEL12-NEXT: s_add_co_u32 s0, s0, write_v0_v15@gotpcrel32@lo+8
+; GISEL12-NEXT: s_add_co_ci_u32 s1, s1, write_v0_v15@gotpcrel32@hi+16
+; GISEL12-NEXT: v_dual_mov_b32 v0, v40 :: v_dual_mov_b32 v1, v41
+; GISEL12-NEXT: s_load_b64 s[0:1], s[0:1], 0x0
+; GISEL12-NEXT: v_dual_mov_b32 v2, v42 :: v_dual_mov_b32 v3, v43
+; GISEL12-NEXT: v_dual_mov_b32 v4, v44 :: v_dual_mov_b32 v5, v45
+; GISEL12-NEXT: v_dual_mov_b32 v6, v46 :: v_dual_mov_b32 v7, v47
+; GISEL12-NEXT: v_dual_mov_b32 v8, v56 :: v_dual_mov_b32 v9, v57
+; GISEL12-NEXT: v_dual_mov_b32 v10, v58 :: v_dual_mov_b32 v11, v59
+; GISEL12-NEXT: v_dual_mov_b32 v12, v60 :: v_dual_mov_b32 v13, v61
+; GISEL12-NEXT: v_dual_mov_b32 v14, v62 :: v_dual_mov_b32 v15, v63
+; GISEL12-NEXT: s_mov_b32 s8, s2
+; GISEL12-NEXT: s_mov_b32 s10, s3
+; GISEL12-NEXT: s_wait_kmcnt 0x0
+; GISEL12-NEXT: s_swappc_b64 s[30:31], s[0:1]
+; GISEL12-NEXT: v_dual_mov_b32 v24, v0 :: v_dual_mov_b32 v25, v1
+; GISEL12-NEXT: v_dual_mov_b32 v26, v2 :: v_dual_mov_b32 v27, v3
+; GISEL12-NEXT: v_dual_mov_b32 v28, v4 :: v_dual_mov_b32 v29, v5
+; GISEL12-NEXT: v_dual_mov_b32 v30, v6 :: v_dual_mov_b32 v31, v7
+; GISEL12-NEXT: v_dual_mov_b32 v32, v8 :: v_dual_mov_b32 v33, v9
+; GISEL12-NEXT: v_dual_mov_b32 v34, v10 :: v_dual_mov_b32 v35, v11
+; GISEL12-NEXT: v_dual_mov_b32 v36, v12 :: v_dual_mov_b32 v37, v13
+; GISEL12-NEXT: v_dual_mov_b32 v38, v14 :: v_dual_mov_b32 v39, v15
+; GISEL12-NEXT: s_mov_b32 exec_lo, s4
+; GISEL12-NEXT: v_dual_mov_b32 v0, v24 :: v_dual_mov_b32 v1, v25
+; GISEL12-NEXT: v_dual_mov_b32 v2, v26 :: v_dual_mov_b32 v3, v27
+; GISEL12-NEXT: v_dual_mov_b32 v4, v28 :: v_dual_mov_b32 v5, v29
+; GISEL12-NEXT: v_dual_mov_b32 v6, v30 :: v_dual_mov_b32 v7, v31
+; GISEL12-NEXT: v_dual_mov_b32 v8, v32 :: v_dual_mov_b32 v9, v33
+; GISEL12-NEXT: v_dual_mov_b32 v10, v34 :: v_dual_mov_b32 v11, v35
+; GISEL12-NEXT: v_dual_mov_b32 v12, v36 :: v_dual_mov_b32 v13, v37
+; GISEL12-NEXT: v_dual_mov_b32 v14, v38 :: v_dual_mov_b32 v15, v39
+; GISEL12-NEXT: v_mov_b32_e32 v24, v0
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v24, v40
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v25, v1
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v25, v41
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v26, v2
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v26, v42
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v27, v3
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v27, v43
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v28, v4
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v28, v44
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v29, v5
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v29, v45
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v30, v6
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v30, v46
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v31, v7
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v31, v47
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v32, v8
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v32, v56
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v33, v9
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v33, v57
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v34, v10
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v34, v58
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v35, v11
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v35, v59
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v36, v12
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v36, v60
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v37, v13
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v37, v61
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v38, v14
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v38, v62
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v39, v15
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v39, v63
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_mov_b32 exec_lo, -1
+; GISEL12-NEXT: v_dual_mov_b32 v8, v24 :: v_dual_mov_b32 v9, v25
+; GISEL12-NEXT: v_dual_mov_b32 v10, v26 :: v_dual_mov_b32 v11, v27
+; GISEL12-NEXT: v_dual_mov_b32 v12, v28 :: v_dual_mov_b32 v13, v29
+; GISEL12-NEXT: v_dual_mov_b32 v14, v30 :: v_dual_mov_b32 v15, v31
+; GISEL12-NEXT: v_dual_mov_b32 v16, v32 :: v_dual_mov_b32 v17, v33
+; GISEL12-NEXT: v_dual_mov_b32 v18, v34 :: v_dual_mov_b32 v19, v35
+; GISEL12-NEXT: v_dual_mov_b32 v20, v36 :: v_dual_mov_b32 v21, v37
+; GISEL12-NEXT: v_dual_mov_b32 v22, v38 :: v_dual_mov_b32 v23, v39
+; GISEL12-NEXT: s_mov_b32 s0, s6
+; GISEL12-NEXT: s_mov_b32 s1, s7
+; GISEL12-NEXT: s_mov_b32 s2, s8
+; GISEL12-NEXT: s_mov_b32 exec_lo, s5
+; GISEL12-NEXT: s_setpc_b64 s[10:11]
+;
+; DAGISEL12-LABEL: wwm_write_to_arg_reg:
+; DAGISEL12: ; %bb.0: ; %entry
+; DAGISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; DAGISEL12-NEXT: s_wait_expcnt 0x0
+; DAGISEL12-NEXT: s_wait_samplecnt 0x0
+; DAGISEL12-NEXT: s_wait_bvhcnt 0x0
+; DAGISEL12-NEXT: s_wait_kmcnt 0x0
+; DAGISEL12-NEXT: s_mov_b32 s32, 0
+; DAGISEL12-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL12-NEXT: v_dual_mov_b32 v40, v23 :: v_dual_mov_b32 v41, v22
+; DAGISEL12-NEXT: v_dual_mov_b32 v42, v21 :: v_dual_mov_b32 v43, v20
+; DAGISEL12-NEXT: v_dual_mov_b32 v44, v19 :: v_dual_mov_b32 v45, v18
+; DAGISEL12-NEXT: v_dual_mov_b32 v46, v17 :: v_dual_mov_b32 v47, v16
+; DAGISEL12-NEXT: v_dual_mov_b32 v56, v15 :: v_dual_mov_b32 v57, v14
+; DAGISEL12-NEXT: v_dual_mov_b32 v58, v13 :: v_dual_mov_b32 v59, v12
+; DAGISEL12-NEXT: v_dual_mov_b32 v60, v11 :: v_dual_mov_b32 v61, v10
+; DAGISEL12-NEXT: v_dual_mov_b32 v62, v9 :: v_dual_mov_b32 v63, v8
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL12-NEXT: s_or_saveexec_b32 s10, -1
+; DAGISEL12-NEXT: s_mov_b32 s6, s1
+; DAGISEL12-NEXT: s_mov_b32 s7, s0
+; DAGISEL12-NEXT: s_getpc_b64 s[0:1]
+; DAGISEL12-NEXT: s_sext_i32_i16 s1, s1
+; DAGISEL12-NEXT: s_add_co_u32 s0, s0, write_v0_v15@gotpcrel32@lo+8
+; DAGISEL12-NEXT: s_add_co_ci_u32 s1, s1, write_v0_v15@gotpcrel32@hi+16
+; DAGISEL12-NEXT: v_dual_mov_b32 v0, v63 :: v_dual_mov_b32 v1, v62
+; DAGISEL12-NEXT: s_load_b64 s[0:1], s[0:1], 0x0
+; DAGISEL12-NEXT: v_dual_mov_b32 v2, v61 :: v_dual_mov_b32 v3, v60
+; DAGISEL12-NEXT: v_dual_mov_b32 v4, v59 :: v_dual_mov_b32 v5, v58
+; DAGISEL12-NEXT: v_dual_mov_b32 v6, v57 :: v_dual_mov_b32 v7, v56
+; DAGISEL12-NEXT: v_dual_mov_b32 v8, v47 :: v_dual_mov_b32 v9, v46
+; DAGISEL12-NEXT: v_dual_mov_b32 v10, v45 :: v_dual_mov_b32 v11, v44
+; DAGISEL12-NEXT: v_dual_mov_b32 v12, v43 :: v_dual_mov_b32 v13, v42
+; DAGISEL12-NEXT: v_dual_mov_b32 v14, v41 :: v_dual_mov_b32 v15, v40
+; DAGISEL12-NEXT: s_mov_b32 s9, s4
+; DAGISEL12-NEXT: s_mov_b32 s8, s3
+; DAGISEL12-NEXT: s_mov_b32 s4, s2
+; DAGISEL12-NEXT: s_wait_kmcnt 0x0
+; DAGISEL12-NEXT: s_swappc_b64 s[30:31], s[0:1]
+; DAGISEL12-NEXT: v_dual_mov_b32 v24, v0 :: v_dual_mov_b32 v25, v1
+; DAGISEL12-NEXT: v_dual_mov_b32 v26, v2 :: v_dual_mov_b32 v27, v3
+; DAGISEL12-NEXT: v_dual_mov_b32 v28, v4 :: v_dual_mov_b32 v29, v5
+; DAGISEL12-NEXT: v_dual_mov_b32 v30, v6 :: v_dual_mov_b32 v31, v7
+; DAGISEL12-NEXT: v_dual_mov_b32 v32, v8 :: v_dual_mov_b32 v33, v9
+; DAGISEL12-NEXT: v_dual_mov_b32 v34, v10 :: v_dual_mov_b32 v35, v11
+; DAGISEL12-NEXT: v_dual_mov_b32 v36, v12 :: v_dual_mov_b32 v37, v13
+; DAGISEL12-NEXT: v_dual_mov_b32 v38, v14 :: v_dual_mov_b32 v39, v15
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s10
+; DAGISEL12-NEXT: v_dual_mov_b32 v0, v24 :: v_dual_mov_b32 v1, v25
+; DAGISEL12-NEXT: v_dual_mov_b32 v2, v26 :: v_dual_mov_b32 v3, v27
+; DAGISEL12-NEXT: v_dual_mov_b32 v4, v28 :: v_dual_mov_b32 v5, v29
+; DAGISEL12-NEXT: v_dual_mov_b32 v6, v30 :: v_dual_mov_b32 v7, v31
+; DAGISEL12-NEXT: v_dual_mov_b32 v8, v32 :: v_dual_mov_b32 v9, v33
+; DAGISEL12-NEXT: v_dual_mov_b32 v10, v34 :: v_dual_mov_b32 v11, v35
+; DAGISEL12-NEXT: v_dual_mov_b32 v12, v36 :: v_dual_mov_b32 v13, v37
+; DAGISEL12-NEXT: v_dual_mov_b32 v14, v38 :: v_dual_mov_b32 v15, v39
+; DAGISEL12-NEXT: v_mov_b32_e32 v24, v0
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v24, v63
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v25, v1
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v25, v62
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v26, v2
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v26, v61
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v27, v3
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v27, v60
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v28, v4
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v28, v59
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v29, v5
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v29, v58
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v30, v6
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v30, v57
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v31, v7
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v31, v56
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v32, v8
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v32, v47
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v33, v9
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v33, v46
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v34, v10
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v34, v45
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v35, v11
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v35, v44
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v36, v12
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v36, v43
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v37, v13
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v37, v42
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v38, v14
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v38, v41
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v39, v15
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v39, v40
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, -1
+; DAGISEL12-NEXT: v_dual_mov_b32 v8, v24 :: v_dual_mov_b32 v9, v25
+; DAGISEL12-NEXT: v_dual_mov_b32 v10, v26 :: v_dual_mov_b32 v11, v27
+; DAGISEL12-NEXT: v_dual_mov_b32 v12, v28 :: v_dual_mov_b32 v13, v29
+; DAGISEL12-NEXT: v_dual_mov_b32 v14, v30 :: v_dual_mov_b32 v15, v31
+; DAGISEL12-NEXT: v_dual_mov_b32 v16, v32 :: v_dual_mov_b32 v17, v33
+; DAGISEL12-NEXT: v_dual_mov_b32 v18, v34 :: v_dual_mov_b32 v19, v35
+; DAGISEL12-NEXT: v_dual_mov_b32 v20, v36 :: v_dual_mov_b32 v21, v37
+; DAGISEL12-NEXT: v_dual_mov_b32 v22, v38 :: v_dual_mov_b32 v23, v39
+; DAGISEL12-NEXT: s_mov_b32 s0, s7
+; DAGISEL12-NEXT: s_mov_b32 s1, s6
+; DAGISEL12-NEXT: s_mov_b32 s2, s4
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s5
+; DAGISEL12-NEXT: s_setpc_b64 s[8:9]
+;
+; GISEL10-LABEL: wwm_write_to_arg_reg:
+; GISEL10: ; %bb.0: ; %entry
+; GISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GISEL10-NEXT: s_mov_b32 s32, 0
+; GISEL10-NEXT: s_or_saveexec_b32 s9, -1
+; GISEL10-NEXT: s_mov_b32 s11, s4
+; GISEL10-NEXT: v_mov_b32_e32 v40, v8
+; GISEL10-NEXT: v_mov_b32_e32 v41, v9
+; GISEL10-NEXT: v_mov_b32_e32 v42, v10
+; GISEL10-NEXT: v_mov_b32_e32 v43, v11
+; GISEL10-NEXT: v_mov_b32_e32 v44, v12
+; GISEL10-NEXT: v_mov_b32_e32 v45, v13
+; GISEL10-NEXT: v_mov_b32_e32 v46, v14
+; GISEL10-NEXT: v_mov_b32_e32 v47, v15
+; GISEL10-NEXT: v_mov_b32_e32 v56, v16
+; GISEL10-NEXT: v_mov_b32_e32 v57, v17
+; GISEL10-NEXT: v_mov_b32_e32 v58, v18
+; GISEL10-NEXT: v_mov_b32_e32 v59, v19
+; GISEL10-NEXT: v_mov_b32_e32 v60, v20
+; GISEL10-NEXT: v_mov_b32_e32 v61, v21
+; GISEL10-NEXT: v_mov_b32_e32 v62, v22
+; GISEL10-NEXT: v_mov_b32_e32 v63, v23
+; GISEL10-NEXT: s_mov_b32 exec_lo, s9
+; GISEL10-NEXT: s_or_saveexec_b32 s4, -1
+; GISEL10-NEXT: s_mov_b32 s6, s0
+; GISEL10-NEXT: s_mov_b32 s7, s1
+; GISEL10-NEXT: s_getpc_b64 s[0:1]
+; GISEL10-NEXT: s_add_u32 s0, s0, write_v0_v15@gotpcrel32@lo+4
+; GISEL10-NEXT: s_addc_u32 s1, s1, write_v0_v15@gotpcrel32@hi+12
+; GISEL10-NEXT: v_mov_b32_e32 v0, v40
+; GISEL10-NEXT: s_load_dwordx2 s[12:13], s[0:1], 0x0
+; GISEL10-NEXT: v_mov_b32_e32 v1, v41
+; GISEL10-NEXT: v_mov_b32_e32 v2, v42
+; GISEL10-NEXT: v_mov_b32_e32 v3, v43
+; GISEL10-NEXT: v_mov_b32_e32 v4, v44
+; GISEL10-NEXT: v_mov_b32_e32 v5, v45
+; GISEL10-NEXT: v_mov_b32_e32 v6, v46
+; GISEL10-NEXT: v_mov_b32_e32 v7, v47
+; GISEL10-NEXT: v_mov_b32_e32 v8, v56
+; GISEL10-NEXT: v_mov_b32_e32 v9, v57
+; GISEL10-NEXT: v_mov_b32_e32 v10, v58
+; GISEL10-NEXT: v_mov_b32_e32 v11, v59
+; GISEL10-NEXT: v_mov_b32_e32 v12, v60
+; GISEL10-NEXT: v_mov_b32_e32 v13, v61
+; GISEL10-NEXT: v_mov_b32_e32 v14, v62
+; GISEL10-NEXT: v_mov_b32_e32 v15, v63
+; GISEL10-NEXT: s_mov_b32 s8, s2
+; GISEL10-NEXT: s_mov_b32 s10, s3
+; GISEL10-NEXT: s_mov_b64 s[0:1], s[48:49]
+; GISEL10-NEXT: s_mov_b64 s[2:3], s[50:51]
+; GISEL10-NEXT: s_waitcnt lgkmcnt(0)
+; GISEL10-NEXT: s_swappc_b64 s[30:31], s[12:13]
+; GISEL10-NEXT: v_mov_b32_e32 v24, v0
+; GISEL10-NEXT: v_mov_b32_e32 v25, v1
+; GISEL10-NEXT: v_mov_b32_e32 v26, v2
+; GISEL10-NEXT: v_mov_b32_e32 v27, v3
+; GISEL10-NEXT: v_mov_b32_e32 v28, v4
+; GISEL10-NEXT: v_mov_b32_e32 v29, v5
+; GISEL10-NEXT: v_mov_b32_e32 v30, v6
+; GISEL10-NEXT: v_mov_b32_e32 v31, v7
+; GISEL10-NEXT: v_mov_b32_e32 v32, v8
+; GISEL10-NEXT: v_mov_b32_e32 v33, v9
+; GISEL10-NEXT: v_mov_b32_e32 v34, v10
+; GISEL10-NEXT: v_mov_b32_e32 v35, v11
+; GISEL10-NEXT: v_mov_b32_e32 v36, v12
+; GISEL10-NEXT: v_mov_b32_e32 v37, v13
+; GISEL10-NEXT: v_mov_b32_e32 v38, v14
+; GISEL10-NEXT: v_mov_b32_e32 v39, v15
+; GISEL10-NEXT: s_mov_b32 exec_lo, s4
+; GISEL10-NEXT: v_mov_b32_e32 v0, v24
+; GISEL10-NEXT: v_mov_b32_e32 v1, v25
+; GISEL10-NEXT: v_mov_b32_e32 v2, v26
+; GISEL10-NEXT: v_mov_b32_e32 v3, v27
+; GISEL10-NEXT: v_mov_b32_e32 v4, v28
+; GISEL10-NEXT: v_mov_b32_e32 v5, v29
+; GISEL10-NEXT: v_mov_b32_e32 v6, v30
+; GISEL10-NEXT: v_mov_b32_e32 v7, v31
+; GISEL10-NEXT: v_mov_b32_e32 v8, v32
+; GISEL10-NEXT: v_mov_b32_e32 v9, v33
+; GISEL10-NEXT: v_mov_b32_e32 v10, v34
+; GISEL10-NEXT: v_mov_b32_e32 v11, v35
+; GISEL10-NEXT: v_mov_b32_e32 v12, v36
+; GISEL10-NEXT: v_mov_b32_e32 v13, v37
+; GISEL10-NEXT: v_mov_b32_e32 v14, v38
+; GISEL10-NEXT: v_mov_b32_e32 v15, v39
+; GISEL10-NEXT: v_mov_b32_e32 v24, v0
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v24, v40
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v25, v1
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v25, v41
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v26, v2
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v26, v42
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v27, v3
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v27, v43
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v28, v4
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v28, v44
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v29, v5
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v29, v45
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v30, v6
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v30, v46
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v31, v7
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v31, v47
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v32, v8
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v32, v56
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v33, v9
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v33, v57
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v34, v10
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v34, v58
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v35, v11
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v35, v59
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v36, v12
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v36, v60
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v37, v13
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v37, v61
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v38, v14
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v38, v62
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v39, v15
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: v_mov_b32_e32 v39, v63
+; GISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL10-NEXT: s_mov_b32 exec_lo, -1
+; GISEL10-NEXT: v_mov_b32_e32 v8, v24
+; GISEL10-NEXT: v_mov_b32_e32 v9, v25
+; GISEL10-NEXT: v_mov_b32_e32 v10, v26
+; GISEL10-NEXT: v_mov_b32_e32 v11, v27
+; GISEL10-NEXT: v_mov_b32_e32 v12, v28
+; GISEL10-NEXT: v_mov_b32_e32 v13, v29
+; GISEL10-NEXT: v_mov_b32_e32 v14, v30
+; GISEL10-NEXT: v_mov_b32_e32 v15, v31
+; GISEL10-NEXT: v_mov_b32_e32 v16, v32
+; GISEL10-NEXT: v_mov_b32_e32 v17, v33
+; GISEL10-NEXT: v_mov_b32_e32 v18, v34
+; GISEL10-NEXT: v_mov_b32_e32 v19, v35
+; GISEL10-NEXT: v_mov_b32_e32 v20, v36
+; GISEL10-NEXT: v_mov_b32_e32 v21, v37
+; GISEL10-NEXT: v_mov_b32_e32 v22, v38
+; GISEL10-NEXT: v_mov_b32_e32 v23, v39
+; GISEL10-NEXT: s_mov_b32 s0, s6
+; GISEL10-NEXT: s_mov_b32 s1, s7
+; GISEL10-NEXT: s_mov_b32 s2, s8
+; GISEL10-NEXT: s_mov_b32 exec_lo, s5
+; GISEL10-NEXT: s_setpc_b64 s[10:11]
+;
+; DAGISEL10-LABEL: wwm_write_to_arg_reg:
+; DAGISEL10: ; %bb.0: ; %entry
+; DAGISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; DAGISEL10-NEXT: s_mov_b32 s32, 0
+; DAGISEL10-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v40, v23
+; DAGISEL10-NEXT: v_mov_b32_e32 v41, v22
+; DAGISEL10-NEXT: v_mov_b32_e32 v42, v21
+; DAGISEL10-NEXT: v_mov_b32_e32 v43, v20
+; DAGISEL10-NEXT: v_mov_b32_e32 v44, v19
+; DAGISEL10-NEXT: v_mov_b32_e32 v45, v18
+; DAGISEL10-NEXT: v_mov_b32_e32 v46, v17
+; DAGISEL10-NEXT: v_mov_b32_e32 v47, v16
+; DAGISEL10-NEXT: v_mov_b32_e32 v56, v15
+; DAGISEL10-NEXT: v_mov_b32_e32 v57, v14
+; DAGISEL10-NEXT: v_mov_b32_e32 v58, v13
+; DAGISEL10-NEXT: v_mov_b32_e32 v59, v12
+; DAGISEL10-NEXT: v_mov_b32_e32 v60, v11
+; DAGISEL10-NEXT: v_mov_b32_e32 v61, v10
+; DAGISEL10-NEXT: v_mov_b32_e32 v62, v9
+; DAGISEL10-NEXT: v_mov_b32_e32 v63, v8
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL10-NEXT: s_or_saveexec_b32 s12, -1
+; DAGISEL10-NEXT: s_mov_b32 s6, s1
+; DAGISEL10-NEXT: s_mov_b32 s7, s0
+; DAGISEL10-NEXT: s_getpc_b64 s[0:1]
+; DAGISEL10-NEXT: s_add_u32 s0, s0, write_v0_v15@gotpcrel32@lo+4
+; DAGISEL10-NEXT: s_addc_u32 s1, s1, write_v0_v15@gotpcrel32@hi+12
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v63
+; DAGISEL10-NEXT: s_load_dwordx2 s[10:11], s[0:1], 0x0
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v62
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v61
+; DAGISEL10-NEXT: v_mov_b32_e32 v3, v60
+; DAGISEL10-NEXT: v_mov_b32_e32 v4, v59
+; DAGISEL10-NEXT: v_mov_b32_e32 v5, v58
+; DAGISEL10-NEXT: v_mov_b32_e32 v6, v57
+; DAGISEL10-NEXT: v_mov_b32_e32 v7, v56
+; DAGISEL10-NEXT: v_mov_b32_e32 v8, v47
+; DAGISEL10-NEXT: v_mov_b32_e32 v9, v46
+; DAGISEL10-NEXT: v_mov_b32_e32 v10, v45
+; DAGISEL10-NEXT: v_mov_b32_e32 v11, v44
+; DAGISEL10-NEXT: v_mov_b32_e32 v12, v43
+; DAGISEL10-NEXT: v_mov_b32_e32 v13, v42
+; DAGISEL10-NEXT: v_mov_b32_e32 v14, v41
+; DAGISEL10-NEXT: v_mov_b32_e32 v15, v40
+; DAGISEL10-NEXT: s_mov_b32 s9, s4
+; DAGISEL10-NEXT: s_mov_b32 s8, s3
+; DAGISEL10-NEXT: s_mov_b32 s4, s2
+; DAGISEL10-NEXT: s_mov_b64 s[0:1], s[48:49]
+; DAGISEL10-NEXT: s_mov_b64 s[2:3], s[50:51]
+; DAGISEL10-NEXT: s_waitcnt lgkmcnt(0)
+; DAGISEL10-NEXT: s_swappc_b64 s[30:31], s[10:11]
+; DAGISEL10-NEXT: v_mov_b32_e32 v24, v0
+; DAGISEL10-NEXT: v_mov_b32_e32 v25, v1
+; DAGISEL10-NEXT: v_mov_b32_e32 v26, v2
+; DAGISEL10-NEXT: v_mov_b32_e32 v27, v3
+; DAGISEL10-NEXT: v_mov_b32_e32 v28, v4
+; DAGISEL10-NEXT: v_mov_b32_e32 v29, v5
+; DAGISEL10-NEXT: v_mov_b32_e32 v30, v6
+; DAGISEL10-NEXT: v_mov_b32_e32 v31, v7
+; DAGISEL10-NEXT: v_mov_b32_e32 v32, v8
+; DAGISEL10-NEXT: v_mov_b32_e32 v33, v9
+; DAGISEL10-NEXT: v_mov_b32_e32 v34, v10
+; DAGISEL10-NEXT: v_mov_b32_e32 v35, v11
+; DAGISEL10-NEXT: v_mov_b32_e32 v36, v12
+; DAGISEL10-NEXT: v_mov_b32_e32 v37, v13
+; DAGISEL10-NEXT: v_mov_b32_e32 v38, v14
+; DAGISEL10-NEXT: v_mov_b32_e32 v39, v15
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s12
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v24
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v25
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v26
+; DAGISEL10-NEXT: v_mov_b32_e32 v3, v27
+; DAGISEL10-NEXT: v_mov_b32_e32 v4, v28
+; DAGISEL10-NEXT: v_mov_b32_e32 v5, v29
+; DAGISEL10-NEXT: v_mov_b32_e32 v6, v30
+; DAGISEL10-NEXT: v_mov_b32_e32 v7, v31
+; DAGISEL10-NEXT: v_mov_b32_e32 v8, v32
+; DAGISEL10-NEXT: v_mov_b32_e32 v9, v33
+; DAGISEL10-NEXT: v_mov_b32_e32 v10, v34
+; DAGISEL10-NEXT: v_mov_b32_e32 v11, v35
+; DAGISEL10-NEXT: v_mov_b32_e32 v12, v36
+; DAGISEL10-NEXT: v_mov_b32_e32 v13, v37
+; DAGISEL10-NEXT: v_mov_b32_e32 v14, v38
+; DAGISEL10-NEXT: v_mov_b32_e32 v15, v39
+; DAGISEL10-NEXT: v_mov_b32_e32 v24, v0
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v24, v63
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v25, v1
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v25, v62
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v26, v2
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v26, v61
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v27, v3
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v27, v60
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v28, v4
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v28, v59
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v29, v5
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v29, v58
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v30, v6
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v30, v57
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v31, v7
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v31, v56
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v32, v8
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v32, v47
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v33, v9
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v33, v46
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v34, v10
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v34, v45
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v35, v11
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v35, v44
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v36, v12
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v36, v43
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v37, v13
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v37, v42
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v38, v14
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v38, v41
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v39, v15
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: v_mov_b32_e32 v39, v40
+; DAGISEL10-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v8, v24
+; DAGISEL10-NEXT: v_mov_b32_e32 v9, v25
+; DAGISEL10-NEXT: v_mov_b32_e32 v10, v26
+; DAGISEL10-NEXT: v_mov_b32_e32 v11, v27
+; DAGISEL10-NEXT: v_mov_b32_e32 v12, v28
+; DAGISEL10-NEXT: v_mov_b32_e32 v13, v29
+; DAGISEL10-NEXT: v_mov_b32_e32 v14, v30
+; DAGISEL10-NEXT: v_mov_b32_e32 v15, v31
+; DAGISEL10-NEXT: v_mov_b32_e32 v16, v32
+; DAGISEL10-NEXT: v_mov_b32_e32 v17, v33
+; DAGISEL10-NEXT: v_mov_b32_e32 v18, v34
+; DAGISEL10-NEXT: v_mov_b32_e32 v19, v35
+; DAGISEL10-NEXT: v_mov_b32_e32 v20, v36
+; DAGISEL10-NEXT: v_mov_b32_e32 v21, v37
+; DAGISEL10-NEXT: v_mov_b32_e32 v22, v38
+; DAGISEL10-NEXT: v_mov_b32_e32 v23, v39
+; DAGISEL10-NEXT: s_mov_b32 s0, s7
+; DAGISEL10-NEXT: s_mov_b32 s1, s6
+; DAGISEL10-NEXT: s_mov_b32 s2, s4
+; DAGISEL10-NEXT: s_mov_b32 exec_lo, s5
+; DAGISEL10-NEXT: s_setpc_b64 s[8:9]
+entry:
+ %entry_exec = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %entry_exec, label %shader, label %tail
+
+shader:
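+ ; The callee writes v0-v15, so the inactive lanes of the incoming %vgpr
+ ; must be saved and restored around the call (see the v40-v63 copies in
+ ; the checks above).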
+ %v0.15 = call amdgpu_gfx <16 x i32> @write_v0_v15(<16 x i32> %vgpr)
+ %vgpr.wwm = call <16 x i32> @llvm.amdgcn.strict.wwm.v16i32(<16 x i32> %v0.15)
+
+ br label %tail
+
+tail:
+ %vgpr.args = phi <16 x i32> [%vgpr, %entry], [%vgpr.wwm, %shader]
+ call void(ptr, i32, <3 x i32>, <16 x i32>, i32, ...) @llvm.amdgcn.cs.chain(ptr %callee, i32 %exec, <3 x i32> inreg %sgpr, <16 x i32> %vgpr.args, i32 0)
+ unreachable
+}
+
+declare amdgpu_gfx <16 x i32> @write_v0_v15(<16 x i32>)
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w64.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w64.ll
new file mode 100644
index 00000000000000..dc693e9adb4822
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w64.ll
@@ -0,0 +1,254 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -global-isel=1 -O2 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL12 %s
+; RUN: llc -global-isel=0 -O2 -mtriple=amdgcn -mcpu=gfx1200 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL12 %s
+; RUN: llc -global-isel=1 -O2 -mtriple=amdgcn -mcpu=gfx1030 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL10 %s
+; RUN: llc -global-isel=0 -O2 -mtriple=amdgcn -mcpu=gfx1030 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL10 %s
+
+; This shouldn't be too different from wave32, so we'll only test one case.
+
+define amdgpu_cs_chain void @basic(<3 x i32> inreg %sgpr, ptr inreg %callee, i64 inreg %exec, { i32, ptr addrspace(5), i32, i64 } %vgpr, i32 %x, i32 %y) {
+; GISEL12-LABEL: basic:
+; GISEL12: ; %bb.0: ; %entry
+; GISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GISEL12-NEXT: s_wait_expcnt 0x0
+; GISEL12-NEXT: s_wait_samplecnt 0x0
+; GISEL12-NEXT: s_wait_bvhcnt 0x0
+; GISEL12-NEXT: s_wait_kmcnt 0x0
+; GISEL12-NEXT: s_or_saveexec_b64 s[10:11], -1
+; GISEL12-NEXT: v_mov_b32_e32 v0, v8
+; GISEL12-NEXT: v_mov_b32_e32 v1, v9
+; GISEL12-NEXT: v_mov_b32_e32 v2, v10
+; GISEL12-NEXT: v_mov_b32_e32 v3, v11
+; GISEL12-NEXT: v_mov_b32_e32 v4, v12
+; GISEL12-NEXT: s_mov_b64 exec, s[10:11]
+; GISEL12-NEXT: s_mov_b32 s9, s4
+; GISEL12-NEXT: s_mov_b32 s4, s5
+; GISEL12-NEXT: s_mov_b32 s5, s6
+; GISEL12-NEXT: v_mov_b32_e32 v5, v13
+; GISEL12-NEXT: s_not_b64 exec, exec
+; GISEL12-NEXT: v_mov_b32_e32 v5, 0x47
+; GISEL12-NEXT: s_not_b64 exec, exec
+; GISEL12-NEXT: s_or_saveexec_b64 s[6:7], -1
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_cmp_ne_u32_e64 s[10:11], 0, v5
+; GISEL12-NEXT: v_mov_b32_e32 v5, s10
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GISEL12-NEXT: v_mov_b32_e32 v6, s11
+; GISEL12-NEXT: s_mov_b64 exec, s[6:7]
+; GISEL12-NEXT: v_mov_b32_e32 v7, v5
+; GISEL12-NEXT: v_add_nc_u32_e32 v9, 42, v13
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v8, v6
+; GISEL12-NEXT: s_mov_b32 s8, s3
+; GISEL12-NEXT: v_mov_b32_e32 v0, v0
+; GISEL12-NEXT: s_not_b64 exec, exec
+; GISEL12-NEXT: v_mov_b32_e32 v0, v0
+; GISEL12-NEXT: s_not_b64 exec, exec
+; GISEL12-NEXT: v_mov_b32_e32 v1, v1
+; GISEL12-NEXT: s_not_b64 exec, exec
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v1, v1
+; GISEL12-NEXT: s_not_b64 exec, exec
+; GISEL12-NEXT: v_mov_b32_e32 v2, v9
+; GISEL12-NEXT: s_not_b64 exec, exec
+; GISEL12-NEXT: v_mov_b32_e32 v2, v2
+; GISEL12-NEXT: s_not_b64 exec, exec
+; GISEL12-NEXT: v_mov_b32_e32 v3, v7
+; GISEL12-NEXT: v_mov_b32_e32 v4, v8
+; GISEL12-NEXT: s_not_b64 exec, exec
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GISEL12-NEXT: v_mov_b32_e32 v3, v3
+; GISEL12-NEXT: v_mov_b32_e32 v4, v4
+; GISEL12-NEXT: s_not_b64 exec, exec
+; GISEL12-NEXT: s_mov_b64 exec, -1
+; GISEL12-NEXT: v_mov_b32_e32 v8, v0
+; GISEL12-NEXT: v_mov_b32_e32 v9, v1
+; GISEL12-NEXT: v_mov_b32_e32 v10, v2
+; GISEL12-NEXT: v_mov_b32_e32 v11, v3
+; GISEL12-NEXT: v_mov_b32_e32 v12, v4
+; GISEL12-NEXT: s_mov_b64 exec, s[4:5]
+; GISEL12-NEXT: s_setpc_b64 s[8:9]
+;
+; DAGISEL12-LABEL: basic:
+; DAGISEL12: ; %bb.0: ; %entry
+; DAGISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; DAGISEL12-NEXT: s_wait_expcnt 0x0
+; DAGISEL12-NEXT: s_wait_samplecnt 0x0
+; DAGISEL12-NEXT: s_wait_bvhcnt 0x0
+; DAGISEL12-NEXT: s_wait_kmcnt 0x0
+; DAGISEL12-NEXT: s_or_saveexec_b64 s[8:9], -1
+; DAGISEL12-NEXT: v_mov_b32_e32 v1, v12
+; DAGISEL12-NEXT: v_mov_b32_e32 v0, v11
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, v10
+; DAGISEL12-NEXT: v_mov_b32_e32 v3, v9
+; DAGISEL12-NEXT: v_mov_b32_e32 v4, v8
+; DAGISEL12-NEXT: s_mov_b64 exec, s[8:9]
+; DAGISEL12-NEXT: v_mov_b32_e32 v5, v13
+; DAGISEL12-NEXT: s_not_b64 exec, exec
+; DAGISEL12-NEXT: v_mov_b32_e32 v5, 0x47
+; DAGISEL12-NEXT: s_not_b64 exec, exec
+; DAGISEL12-NEXT: s_or_saveexec_b64 s[8:9], -1
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_cmp_ne_u32_e64 s[10:11], 0, v5
+; DAGISEL12-NEXT: s_mov_b64 exec, s[8:9]
+; DAGISEL12-NEXT: v_mov_b32_e32 v6, s10
+; DAGISEL12-NEXT: v_add_nc_u32_e32 v8, 42, v13
+; DAGISEL12-NEXT: s_mov_b32 s7, s6
+; DAGISEL12-NEXT: s_mov_b32 s6, s5
+; DAGISEL12-NEXT: s_mov_b32 s5, s4
+; DAGISEL12-NEXT: s_mov_b32 s4, s3
+; DAGISEL12-NEXT: v_mov_b32_e32 v7, s11
+; DAGISEL12-NEXT: v_mov_b32_e32 v4, v4
+; DAGISEL12-NEXT: s_not_b64 exec, exec
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v4, v4
+; DAGISEL12-NEXT: s_not_b64 exec, exec
+; DAGISEL12-NEXT: v_mov_b32_e32 v3, v3
+; DAGISEL12-NEXT: s_not_b64 exec, exec
+; DAGISEL12-NEXT: v_mov_b32_e32 v3, v3
+; DAGISEL12-NEXT: s_not_b64 exec, exec
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, v8
+; DAGISEL12-NEXT: s_not_b64 exec, exec
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_2)
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, v2
+; DAGISEL12-NEXT: s_not_b64 exec, exec
+; DAGISEL12-NEXT: v_mov_b32_e32 v0, v6
+; DAGISEL12-NEXT: v_mov_b32_e32 v1, v7
+; DAGISEL12-NEXT: s_not_b64 exec, exec
+; DAGISEL12-NEXT: v_mov_b32_e32 v0, v0
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; DAGISEL12-NEXT: v_mov_b32_e32 v1, v1
+; DAGISEL12-NEXT: s_not_b64 exec, exec
+; DAGISEL12-NEXT: s_mov_b64 exec, -1
+; DAGISEL12-NEXT: v_mov_b32_e32 v8, v4
+; DAGISEL12-NEXT: v_mov_b32_e32 v9, v3
+; DAGISEL12-NEXT: v_mov_b32_e32 v10, v2
+; DAGISEL12-NEXT: v_mov_b32_e32 v11, v0
+; DAGISEL12-NEXT: v_mov_b32_e32 v12, v1
+; DAGISEL12-NEXT: s_mov_b64 exec, s[6:7]
+; DAGISEL12-NEXT: s_setpc_b64 s[4:5]
+;
+; GISEL10-LABEL: basic:
+; GISEL10: ; %bb.0: ; %entry
+; GISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GISEL10-NEXT: s_or_saveexec_b64 s[10:11], -1
+; GISEL10-NEXT: v_mov_b32_e32 v0, v8
+; GISEL10-NEXT: v_mov_b32_e32 v1, v9
+; GISEL10-NEXT: v_mov_b32_e32 v2, v10
+; GISEL10-NEXT: v_mov_b32_e32 v3, v11
+; GISEL10-NEXT: v_mov_b32_e32 v4, v12
+; GISEL10-NEXT: s_mov_b64 exec, s[10:11]
+; GISEL10-NEXT: s_mov_b32 s9, s4
+; GISEL10-NEXT: s_mov_b32 s4, s5
+; GISEL10-NEXT: s_mov_b32 s5, s6
+; GISEL10-NEXT: v_mov_b32_e32 v5, v13
+; GISEL10-NEXT: s_not_b64 exec, exec
+; GISEL10-NEXT: v_mov_b32_e32 v5, 0x47
+; GISEL10-NEXT: s_not_b64 exec, exec
+; GISEL10-NEXT: s_or_saveexec_b64 s[6:7], -1
+; GISEL10-NEXT: v_cmp_ne_u32_e64 s[10:11], 0, v5
+; GISEL10-NEXT: v_mov_b32_e32 v5, s10
+; GISEL10-NEXT: v_mov_b32_e32 v6, s11
+; GISEL10-NEXT: s_mov_b64 exec, s[6:7]
+; GISEL10-NEXT: v_mov_b32_e32 v7, v5
+; GISEL10-NEXT: v_add_nc_u32_e32 v9, 42, v13
+; GISEL10-NEXT: v_mov_b32_e32 v8, v6
+; GISEL10-NEXT: s_mov_b32 s8, s3
+; GISEL10-NEXT: v_mov_b32_e32 v0, v0
+; GISEL10-NEXT: s_not_b64 exec, exec
+; GISEL10-NEXT: v_mov_b32_e32 v0, v0
+; GISEL10-NEXT: s_not_b64 exec, exec
+; GISEL10-NEXT: v_mov_b32_e32 v1, v1
+; GISEL10-NEXT: s_not_b64 exec, exec
+; GISEL10-NEXT: v_mov_b32_e32 v1, v1
+; GISEL10-NEXT: s_not_b64 exec, exec
+; GISEL10-NEXT: v_mov_b32_e32 v2, v9
+; GISEL10-NEXT: s_not_b64 exec, exec
+; GISEL10-NEXT: v_mov_b32_e32 v2, v2
+; GISEL10-NEXT: s_not_b64 exec, exec
+; GISEL10-NEXT: v_mov_b32_e32 v3, v7
+; GISEL10-NEXT: v_mov_b32_e32 v4, v8
+; GISEL10-NEXT: s_not_b64 exec, exec
+; GISEL10-NEXT: v_mov_b32_e32 v3, v3
+; GISEL10-NEXT: v_mov_b32_e32 v4, v4
+; GISEL10-NEXT: s_not_b64 exec, exec
+; GISEL10-NEXT: s_mov_b64 exec, -1
+; GISEL10-NEXT: v_mov_b32_e32 v8, v0
+; GISEL10-NEXT: v_mov_b32_e32 v9, v1
+; GISEL10-NEXT: v_mov_b32_e32 v10, v2
+; GISEL10-NEXT: v_mov_b32_e32 v11, v3
+; GISEL10-NEXT: v_mov_b32_e32 v12, v4
+; GISEL10-NEXT: s_mov_b64 exec, s[4:5]
+; GISEL10-NEXT: s_setpc_b64 s[8:9]
+;
+; DAGISEL10-LABEL: basic:
+; DAGISEL10: ; %bb.0: ; %entry
+; DAGISEL10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; DAGISEL10-NEXT: s_or_saveexec_b64 s[8:9], -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v12
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v11
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v10
+; DAGISEL10-NEXT: v_mov_b32_e32 v3, v9
+; DAGISEL10-NEXT: v_mov_b32_e32 v4, v8
+; DAGISEL10-NEXT: s_mov_b64 exec, s[8:9]
+; DAGISEL10-NEXT: v_mov_b32_e32 v5, v13
+; DAGISEL10-NEXT: s_not_b64 exec, exec
+; DAGISEL10-NEXT: v_mov_b32_e32 v5, 0x47
+; DAGISEL10-NEXT: s_not_b64 exec, exec
+; DAGISEL10-NEXT: s_or_saveexec_b64 s[8:9], -1
+; DAGISEL10-NEXT: v_cmp_ne_u32_e64 s[10:11], 0, v5
+; DAGISEL10-NEXT: s_mov_b64 exec, s[8:9]
+; DAGISEL10-NEXT: v_mov_b32_e32 v6, s10
+; DAGISEL10-NEXT: v_add_nc_u32_e32 v8, 42, v13
+; DAGISEL10-NEXT: s_mov_b32 s7, s6
+; DAGISEL10-NEXT: s_mov_b32 s6, s5
+; DAGISEL10-NEXT: s_mov_b32 s5, s4
+; DAGISEL10-NEXT: s_mov_b32 s4, s3
+; DAGISEL10-NEXT: v_mov_b32_e32 v7, s11
+; DAGISEL10-NEXT: v_mov_b32_e32 v4, v4
+; DAGISEL10-NEXT: s_not_b64 exec, exec
+; DAGISEL10-NEXT: v_mov_b32_e32 v4, v4
+; DAGISEL10-NEXT: s_not_b64 exec, exec
+; DAGISEL10-NEXT: v_mov_b32_e32 v3, v3
+; DAGISEL10-NEXT: s_not_b64 exec, exec
+; DAGISEL10-NEXT: v_mov_b32_e32 v3, v3
+; DAGISEL10-NEXT: s_not_b64 exec, exec
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v8
+; DAGISEL10-NEXT: s_not_b64 exec, exec
+; DAGISEL10-NEXT: v_mov_b32_e32 v2, v2
+; DAGISEL10-NEXT: s_not_b64 exec, exec
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v6
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v7
+; DAGISEL10-NEXT: s_not_b64 exec, exec
+; DAGISEL10-NEXT: v_mov_b32_e32 v0, v0
+; DAGISEL10-NEXT: v_mov_b32_e32 v1, v1
+; DAGISEL10-NEXT: s_not_b64 exec, exec
+; DAGISEL10-NEXT: s_mov_b64 exec, -1
+; DAGISEL10-NEXT: v_mov_b32_e32 v8, v4
+; DAGISEL10-NEXT: v_mov_b32_e32 v9, v3
+; DAGISEL10-NEXT: v_mov_b32_e32 v10, v2
+; DAGISEL10-NEXT: v_mov_b32_e32 v11, v0
+; DAGISEL10-NEXT: v_mov_b32_e32 v12, v1
+; DAGISEL10-NEXT: s_mov_b64 exec, s[6:7]
+; DAGISEL10-NEXT: s_setpc_b64 s[4:5]
+entry:
+ %entry_exec = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %entry_exec, label %shader, label %tail
+
+shader:
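+ ; Same computation as in the wave32 test, but with a 64-bit EXEC mask,
+ ; so the ballot and the wwm value are i64.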
+ %nonwwm = add i32 %x, 42
+ %vgpr.1 = insertvalue { i32, ptr addrspace(5), i32, i64} %vgpr, i32 %nonwwm, 2
+
+ %full.vgpr = call i32 @llvm.amdgcn.set.inactive.i32(i32 %x, i32 71)
+ %non.zero = icmp ne i32 %full.vgpr, 0
+ %ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %non.zero)
+ %wwm = call i64 @llvm.amdgcn.strict.wwm.i64(i64 %ballot)
+ %vgpr.2 = insertvalue { i32, ptr addrspace(5), i32, i64} %vgpr.1, i64 %wwm, 3
+
+ br label %tail
+
+tail:
+ %vgpr.args = phi { i32, ptr addrspace(5), i32, i64} [%vgpr, %entry], [%vgpr.2, %shader]
+ call void(ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i64 }, i32, ...) @llvm.amdgcn.cs.chain(ptr %callee, i64 %exec, <3 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i64 } %vgpr.args, i32 0)
+ unreachable
+}
diff --git a/llvm/test/CodeGen/AMDGPU/pei-amdgpu-cs-chain.mir b/llvm/test/CodeGen/AMDGPU/pei-amdgpu-cs-chain.mir
index 765597fecd20e8..4b8b71a7400852 100644
--- a/llvm/test/CodeGen/AMDGPU/pei-amdgpu-cs-chain.mir
+++ b/llvm/test/CodeGen/AMDGPU/pei-amdgpu-cs-chain.mir
@@ -10,6 +10,7 @@
define amdgpu_cs_chain void @preserve_inactive_wwm() {ret void}
define amdgpu_cs_chain void @preserve_inactive_detected_wwm() {ret void}
define amdgpu_cs_chain void @dont_preserve_wwm_if_no_chain_calls() {ret void}
+ define amdgpu_cs_chain void @dont_preserve_wwm_if_init_whole_wave() {ret void}
define amdgpu_cs_chain void @dont_preserve_non_wwm() {ret void}
define amdgpu_cs_chain void @dont_preserve_v0_v7() {ret void}
define amdgpu_cs_chain void @dont_preserve_sgpr() {ret void}
@@ -133,6 +134,34 @@ body: |
S_ENDPGM 0
...
+---
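+# With hasInitWholeWave set, the wwm reserved registers v8/v9 are not
+# saved and restored around the chain call (no spill code in the checks
+# below).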
+name: dont_preserve_wwm_if_init_whole_wave
+tracksRegLiveness: true
+frameInfo:
+ hasTailCall: true
+machineFunctionInfo:
+ stackPtrOffsetReg: '$sgpr32'
+ returnsVoid: true
+ wwmReservedRegs:
+ - '$vgpr8'
+ - '$vgpr9'
+ hasInitWholeWave: true
+body: |
+ bb.0:
+ liveins: $sgpr0, $sgpr35, $vgpr8, $vgpr9
+
+ ; GCN-LABEL: name: dont_preserve_wwm_if_init_whole_wave
+ ; GCN: liveins: $sgpr0, $sgpr35, $vgpr8, $vgpr9
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: renamable $sgpr4_sgpr5 = SI_PC_ADD_REL_OFFSET target-flags(amdgpu-gotprel32-lo) @callee + 4, target-flags(amdgpu-gotprel32-hi) @callee + 12, implicit-def dead $scc
+ ; GCN-NEXT: renamable $sgpr4_sgpr5 = S_LOAD_DWORDX2_IMM killed renamable $sgpr4_sgpr5, 0, 0 :: (dereferenceable invariant load (p0) from got, addrspace 4)
+ ; GCN-NEXT: SI_CS_CHAIN_TC_W32 killed renamable $sgpr4_sgpr5, @callee, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $vgpr8
+ renamable $sgpr4_sgpr5 = SI_PC_ADD_REL_OFFSET target-flags(amdgpu-gotprel32-lo) @callee + 4, target-flags(amdgpu-gotprel32-hi) @callee + 12, implicit-def dead $scc
+ renamable $sgpr4_sgpr5 = S_LOAD_DWORDX2_IMM killed renamable $sgpr4_sgpr5, 0, 0 :: (dereferenceable invariant load (p0) from got, addrspace 4)
+ SI_CS_CHAIN_TC_W32 killed renamable $sgpr4_sgpr5, @callee, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $vgpr8
+
+...
+
---
name: dont_preserve_non_wwm
tracksRegLiveness: true
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/long-branch-reg-all-sgpr-used.ll b/llvm/test/CodeGen/MIR/AMDGPU/long-branch-reg-all-sgpr-used.ll
index 3b4ebef1529676..0f7a5f8e0941ad 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/long-branch-reg-all-sgpr-used.ll
+++ b/llvm/test/CodeGen/MIR/AMDGPU/long-branch-reg-all-sgpr-used.ll
@@ -42,6 +42,7 @@
; CHECK-NEXT: vgprForAGPRCopy: ''
; CHECK-NEXT: sgprForEXECCopy: '$sgpr100_sgpr101'
; CHECK-NEXT: longBranchReservedReg: ''
+; CHECK-NEXT: hasInitWholeWave: false
; CHECK-NEXT: body:
define amdgpu_kernel void @long_branch_used_all_sgprs(ptr addrspace(1) %arg, i32 %cnd) #0 {
entry:
@@ -307,6 +308,7 @@
; CHECK-NEXT: vgprForAGPRCopy: ''
; CHECK-NEXT: sgprForEXECCopy: '$sgpr100_sgpr101'
; CHECK-NEXT: longBranchReservedReg: ''
+; CHECK-NEXT: hasInitWholeWave: false
; CHECK-NEXT: body:
define amdgpu_kernel void @long_branch_high_num_sgprs_used(ptr addrspace(1) %arg, i32 %cnd) #0 {
entry:
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-after-pei.ll b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-after-pei.ll
index 138106632c1bc8..7759501ea42268 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-after-pei.ll
+++ b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-after-pei.ll
@@ -42,6 +42,7 @@
; AFTER-PEI-NEXT: vgprForAGPRCopy: ''
; AFTER-PEI-NEXT: sgprForEXECCopy: ''
; AFTER-PEI-NEXT: longBranchReservedReg: ''
+; AFTER-PEI-NEXT: hasInitWholeWave: false
; AFTER-PEI-NEXT: body:
define amdgpu_kernel void @scavenge_fi(ptr addrspace(1) %out, i32 %in) #0 {
%wide.sgpr0 = call <32 x i32> asm sideeffect "; def $0", "=s" () #0
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll
index 3046480b3c0d2a..4545c8bbeb3e6c 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll
+++ b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll
@@ -42,6 +42,7 @@
; CHECK-NEXT: vgprForAGPRCopy: ''
; CHECK-NEXT: sgprForEXECCopy: '$sgpr100_sgpr101'
; CHECK-NEXT: longBranchReservedReg: '$sgpr2_sgpr3'
+; CHECK-NEXT: hasInitWholeWave: false
; CHECK-NEXT: body:
define amdgpu_kernel void @uniform_long_forward_branch_debug(ptr addrspace(1) %arg, i32 %arg1) #0 !dbg !5 {
bb0:
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg.ll b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg.ll
index 3f6f0c909e8bbf..8215ba834170f2 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg.ll
+++ b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg.ll
@@ -42,6 +42,7 @@
; CHECK-NEXT: vgprForAGPRCopy: ''
; CHECK-NEXT: sgprForEXECCopy: '$sgpr100_sgpr101'
; CHECK-NEXT: longBranchReservedReg: '$sgpr2_sgpr3'
+; CHECK-NEXT: hasInitWholeWave: false
; CHECK-NEXT: body:
define amdgpu_kernel void @uniform_long_forward_branch(ptr addrspace(1) %arg, i32 %arg1) #0 {
bb0:
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir
index 4a3319043ede68..ebbb89b7816c58 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir
+++ b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-no-ir.mir
@@ -51,6 +51,7 @@
# FULL-NEXT: vgprForAGPRCopy: ''
# FULL-NEXT: sgprForEXECCopy: ''
# FULL-NEXT: longBranchReservedReg: ''
+# FULL-NEXT: hasInitWholeWave: false
# FULL-NEXT: body:
# SIMPLE: machineFunctionInfo:
@@ -154,6 +155,7 @@ body: |
# FULL-NEXT: vgprForAGPRCopy: ''
# FULL-NEXT: sgprForEXECCopy: ''
# FULL-NEXT: longBranchReservedReg: ''
+# FULL-NEXT: hasInitWholeWave: false
# FULL-NEXT: body:
# SIMPLE: machineFunctionInfo:
@@ -228,6 +230,7 @@ body: |
# FULL-NEXT: vgprForAGPRCopy: ''
# FULL-NEXT: sgprForEXECCopy: ''
# FULL-NEXT: longBranchReservedReg: ''
+# FULL-NEXT: hasInitWholeWave: false
# FULL-NEXT: body:
# SIMPLE: machineFunctionInfo:
@@ -303,6 +306,7 @@ body: |
# FULL-NEXT: vgprForAGPRCopy: ''
# FULL-NEXT: sgprForEXECCopy: ''
# FULL-NEXT: longBranchReservedReg: ''
+# FULL-NEXT: hasInitWholeWave: false
# FULL-NEXT: body:
# SIMPLE: machineFunctionInfo:
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll
index b3ed7376a1ede6..c8cf0391ba4bf6 100644
--- a/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll
+++ b/llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll
@@ -51,6 +51,7 @@
; CHECK-NEXT: vgprForAGPRCopy: ''
; CHECK-NEXT: sgprForEXECCopy: '$sgpr100_sgpr101'
; CHECK-NEXT: longBranchReservedReg: ''
+; CHECK-NEXT: hasInitWholeWave: false
; CHECK-NEXT: body:
define amdgpu_kernel void @kernel(i32 %arg0, i64 %arg1, <16 x i32> %arg2) {
%gep = getelementptr inbounds [512 x float], ptr addrspace(3) @lds, i32 0, i32 %arg0
@@ -96,6 +97,7 @@ define amdgpu_kernel void @kernel(i32 %arg0, i64 %arg1, <16 x i32> %arg2) {
; CHECK-NEXT: vgprForAGPRCopy: ''
; CHECK-NEXT: sgprForEXECCopy: '$sgpr100_sgpr101'
; CHECK-NEXT: longBranchReservedReg: ''
+; CHECK-NEXT: hasInitWholeWave: false
; CHECK-NEXT: body:
define amdgpu_ps void @ps_shader(i32 %arg0, i32 inreg %arg1) {
%gep = getelementptr inbounds [128 x i32], ptr addrspace(2) @gds, i32 0, i32 %arg0
@@ -165,6 +167,7 @@ define amdgpu_ps void @gds_size_shader(i32 %arg0, i32 inreg %arg1) #5 {
; CHECK-NEXT: vgprForAGPRCopy: ''
; CHECK-NEXT: sgprForEXECCopy: '$sgpr100_sgpr101'
; CHECK-NEXT: longBranchReservedReg: ''
+; CHECK-NEXT: hasInitWholeWave: false
; CHECK-NEXT: body:
define void @function() {
ret void
@@ -216,6 +219,7 @@ define void @function() {
; CHECK-NEXT: vgprForAGPRCopy: ''
; CHECK-NEXT: sgprForEXECCopy: '$sgpr100_sgpr101'
; CHECK-NEXT: longBranchReservedReg: ''
+; CHECK-NEXT: hasInitWholeWave: false
; CHECK-NEXT: body:
define void @function_nsz() #0 {
ret void