[llvm] [AMDGPU] Allocate i1 argument to SGPRs (PR #72461)
Jun Wang via llvm-commits
llvm-commits at lists.llvm.org
Wed Feb 7 11:42:02 PST 2024
================
@@ -715,6 +715,14 @@ bool SILowerI1Copies::lowerCopiesToI1() {
assert(!MI.getOperand(1).getSubReg());
if (!SrcReg.isVirtual() || (!isLaneMaskReg(SrcReg) && !isVreg1(SrcReg))) {
+ if (!SrcReg.isVirtual() &&
+ TII->getRegisterInfo().getRegSizeInBits(SrcReg, *MRI) == 64) {
+ // When calling convention allocates SGPR for i1, for GPUs with
+ // wavefront size 64, i1 return value is put in 64b SGPR.
+ assert(ST->isWave64());
+ continue;
+ }
+
----------------
jwanggit86 wrote:
> 1. The code already in SILowerI1Copies is nonsense which should be fixed
Maybe create a separate ticket?
>2. You shouldn't need to touch it
I don't see a better alternative. Do you have any suggestions?
Here is a recap of the full story. I'm listing the .ll file, the init and final DAGs, the instruction selection results and the machine code.
```diff
// .ll file
define i1 @i1_func_void() {
%val = load i1, ptr addrspace(1) undef
ret i1 %val
}
define void @test_call_external_i1_func_void() {
%val = call i1 @i1_func_void()
store volatile i1 %val, ptr addrspace(1) undef
ret void
}
// 1. Initial DAG
Initial selection DAG: %bb.0 'test_call_external_i1_func_void:'
SelectionDAG has 17 nodes:
...
t9: ch,glue = CALL t6, GlobalAddress:i64<ptr @i1_func_void> 0, TargetGlobalAddress:i64<ptr @i1_func_void> 0, Register:v4i32 $sgpr0_sgpr1_sgpr2_sgpr3, RegisterMask:Untyped, t6:1
t10: ch,glue = callseq_end # D:1 t9, TargetConstant:i64<0>, TargetConstant:i64<0>, t9:1
- t12: i1,ch,glue = CopyFromReg # D:1 t10, Register:i1 $sgpr0_sgpr1, t10:1
t14: i64 = Constant<0>
...
+ // Note that CopyFromReg is divergent
// 2. Final DAG
Optimized legalized selection DAG: %bb.0 'test_call_external_i1_func_void:'
SelectionDAG has 21 nodes:
...
t10: ch,glue = callseq_end # D:1 t9, TargetConstant:i64<0>, TargetConstant:i64<0>, t9:1
- t12: i1,ch,glue = CopyFromReg # D:1 t10, Register:i1 $sgpr0_sgpr1, t10:1
t29: i32 = zero_extend # D:1 t12
...
// 3. Instruction selection
===== Instruction selection ends:
Selected selection DAG: %bb.0 'test_call_external_i1_func_void:'
SelectionDAG has 23 nodes:
...
t10: i1,ch,glue = ADJCALLSTACKDOWN TargetConstant:i64<0>, TargetConstant:i64<0>, t9, t9:1
- t12: i1,ch,glue = CopyFromReg # D:1 t10:1, Register:i1 $sgpr0_sgpr1, t10:2
t29: i32 = V_CNDMASK_B32_e64 # D:1 TargetConstant:i32<0>, TargetConstant:i32<0>, TargetConstant:i32<0>, TargetConstant:i32<1>, t12
t27: ch = GLOBAL_STORE_BYTE<Mem:(volatile store (s8) into `ptr addrspace(1) undef`, addrspace 1)> # D:1 IMPLICIT_DEF:i64, t29, TargetConstant:i16<0>, TargetConstant:i32<0>, t12:1
...
// 4. Machine code at end of ISel
*** MachineFunction at end of ISel ***
# Machine code for function test_call_external_i1_func_void: IsSSA, TracksLiveness
...
ADJCALLSTACKDOWN 0, 0, implicit-def dead $scc
- %3:vreg_1 = COPY $sgpr0_sgpr1
- %5:sreg_64_xexec = COPY %3:vreg_1
%4:vgpr_32 = V_CNDMASK_B32_e64 0, 0, 0, 1, %5:sreg_64_xexec, implicit $exec
%6:sreg_64 = IMPLICIT_DEF
%7:vreg_64 = COPY %6:sreg_64
GLOBAL_STORE_BYTE killed %7:vreg_64, killed %4:vgpr_32, 0, 0, implicit $exec :: (volatile store (s8) into `ptr addrspace(1) undef`, addrspace 1)
...
```
In `InstrEmitter::EmitCopyFromReg()` the CopyFromReg insn transformed to the following:
```
%3:vreg_1 = COPY $sgpr0_sgpr1
%5:sreg_64_xexec = COPY %3:vreg_1
```
If I understand you correctly, I think you want the "vreg_1" in %3 to be "sreg_64_xexec". The relevant code of `EmitCopyFromReg()` is as follows:
```
void InstrEmitter::EmitCopyFromReg(...)
...
// Stick to the preferred register classes for legal types.
if (TLI->isTypeLegal(VT))
- UseRC = TLI->getRegClassFor(VT, Node->isDivergent()); // UseRC is set to vreg_1 here
...
const TargetRegisterClass *RC = nullptr;
if (i + II.getNumDefs() < II.getNumOperands()) {
RC = TRI->getAllocatableClass(
TII->getRegClass(II, i + II.getNumDefs(), TRI, *MF));
}
- if (!UseRC) // UseRC is already set, so it's not overwritten here
UseRC = RC;
...
if (VRBase) {
DstRC = MRI->getRegClass(VRBase);
} else if (UseRC) {
assert(TRI->isTypeLegalForClass(*UseRC, VT) &&
"Incompatible phys register def and uses!");
- DstRC = UseRC; // DstRC is set to UseRC, which is vreg_1
} else
DstRC = SrcRC;
...
```
As you can see, because `UseRC` is set to vreg_1 near the beginning of the function, the final `DstRC` is set to vreg_1 as well.
The code for `SITargetLowering::getRegClassFor()` is as follows:
```cpp
const TargetRegisterClass *SITargetLowering::getRegClassFor(MVT VT, bool isDivergent) const {
const TargetRegisterClass *RC = TargetLoweringBase::getRegClassFor(VT, false);
const SIRegisterInfo *TRI = Subtarget->getRegisterInfo();
if (RC == &AMDGPU::VReg_1RegClass && !isDivergent) // Returns SReg_64 only for non-divergent SDNode.
return Subtarget->getWavefrontSize() == 64 ? &AMDGPU::SReg_64RegClass
: &AMDGPU::SReg_32RegClass;
if (!TRI->isSGPRClass(RC) && !isDivergent)
return TRI->getEquivalentSGPRClass(RC);
else if (TRI->isSGPRClass(RC) && isDivergent)
return TRI->getEquivalentVGPRClass(RC);
return RC;
}
```
I don't think changing the behavior of `SITargetLowering::getRegClassFor()` is a good idea. So, I think the current solution, i.e. skipping the instruction, `%3:sreg_64 = COPY $sgpr0_sgpr1` in SILowerI1Copies is preferred.
Pls let me know if you have any suggestions.
https://github.com/llvm/llvm-project/pull/72461
More information about the llvm-commits
mailing list