[llvm] [AMDGPU] Allocate i1 argument to SGPRs (PR #72461)

Wed Feb 7 11:42:02 PST 2024

================
@@ -715,6 +715,14 @@ bool SILowerI1Copies::lowerCopiesToI1() {
       assert(!MI.getOperand(1).getSubReg());
 
       if (!SrcReg.isVirtual() || (!isLaneMaskReg(SrcReg) && !isVreg1(SrcReg))) {
+        if (!SrcReg.isVirtual() &&
+            TII->getRegisterInfo().getRegSizeInBits(SrcReg, *MRI) == 64) {
+          // When calling convention allocates SGPR for i1, for GPUs with
+          // wavefront size 64, i1 return value is put in 64b SGPR.
+          assert(ST->isWave64());
+          continue;
+        }
+
----------------
jwanggit86 wrote:

> 1. The code already in SILowerI1Copies is nonsense which should be fixed
Maybe create a separate ticket?

>2. You shouldn't need to touch it
I don't see a better alternative. Do you have any suggestions?

Here is a recap of the full story. I'm listing the .ll file, the init and final DAGs, the instruction selection results and the machine code.
```diff
// .ll file
define i1 @i1_func_void() {
  %val = load i1, ptr addrspace(1) undef
  ret i1 %val
}

define void @test_call_external_i1_func_void() {
  %val = call i1 @i1_func_void()
  store volatile i1 %val, ptr addrspace(1) undef
  ret void
}

// 1. Initial DAG
Initial selection DAG: %bb.0 'test_call_external_i1_func_void:'
SelectionDAG has 17 nodes:
...
  t9: ch,glue = CALL t6, GlobalAddress:i64<ptr @i1_func_void> 0, TargetGlobalAddress:i64<ptr @i1_func_void> 0, Register:v4i32 $sgpr0_sgpr1_sgpr2_sgpr3, RegisterMask:Untyped, t6:1
  t10: ch,glue = callseq_end # D:1 t9, TargetConstant:i64<0>, TargetConstant:i64<0>, t9:1
  - t12: i1,ch,glue = CopyFromReg # D:1 t10, Register:i1 $sgpr0_sgpr1, t10:1
  t14: i64 = Constant<0>
...

+ // Note that CopyFromReg is divergent

// 2. Final DAG
Optimized legalized selection DAG: %bb.0 'test_call_external_i1_func_void:'
SelectionDAG has 21 nodes:
...
  t10: ch,glue = callseq_end # D:1 t9, TargetConstant:i64<0>, TargetConstant:i64<0>, t9:1
  - t12: i1,ch,glue = CopyFromReg # D:1 t10, Register:i1 $sgpr0_sgpr1, t10:1
      t29: i32 = zero_extend # D:1 t12
...

// 3. Instruction selection
===== Instruction selection ends:

Selected selection DAG: %bb.0 'test_call_external_i1_func_void:'
SelectionDAG has 23 nodes:
...
  t10: i1,ch,glue = ADJCALLSTACKDOWN TargetConstant:i64<0>, TargetConstant:i64<0>, t9, t9:1
  - t12: i1,ch,glue = CopyFromReg # D:1 t10:1, Register:i1 $sgpr0_sgpr1, t10:2
      t29: i32 = V_CNDMASK_B32_e64 # D:1 TargetConstant:i32<0>, TargetConstant:i32<0>, TargetConstant:i32<0>, TargetConstant:i32<1>, t12
    t27: ch = GLOBAL_STORE_BYTE<Mem:(volatile store (s8) into `ptr addrspace(1) undef`, addrspace 1)> # D:1 IMPLICIT_DEF:i64, t29, TargetConstant:i16<0>, TargetConstant:i32<0>, t12:1
...

// 4. Machine code at end of ISel
*** MachineFunction at end of ISel ***
# Machine code for function test_call_external_i1_func_void: IsSSA, TracksLiveness

...
  ADJCALLSTACKDOWN 0, 0, implicit-def dead $scc
  - %3:vreg_1 = COPY $sgpr0_sgpr1
  - %5:sreg_64_xexec = COPY %3:vreg_1
  %4:vgpr_32 = V_CNDMASK_B32_e64 0, 0, 0, 1, %5:sreg_64_xexec, implicit $exec
  %6:sreg_64 = IMPLICIT_DEF
  %7:vreg_64 = COPY %6:sreg_64
  GLOBAL_STORE_BYTE killed %7:vreg_64, killed %4:vgpr_32, 0, 0, implicit $exec :: (volatile store (s8) into `ptr addrspace(1) undef`, addrspace 1)
...
```

In `InstrEmitter::EmitCopyFromReg()` the CopyFromReg insn transformed to the following:
```
  %3:vreg_1 = COPY $sgpr0_sgpr1
  %5:sreg_64_xexec = COPY %3:vreg_1
```
If I understand you correctly, I think you want the "vreg_1" in %3 to be "sreg_64_xexec". The relevant code of `EmitCopyFromReg()` is as follows:
```
void InstrEmitter::EmitCopyFromReg(...)
  ...
  // Stick to the preferred register classes for legal types.
  if (TLI->isTypeLegal(VT))
    - UseRC = TLI->getRegClassFor(VT, Node->isDivergent());  // UseRC is set to vreg_1 here
...
          const TargetRegisterClass *RC = nullptr;
          if (i + II.getNumDefs() < II.getNumOperands()) {
            RC = TRI->getAllocatableClass(
                TII->getRegClass(II, i + II.getNumDefs(), TRI, *MF));
          }
          - if (!UseRC)  // UseRC is already set, so it's not overwritten here
            UseRC = RC;
...
  if (VRBase) {
    DstRC = MRI->getRegClass(VRBase);
  } else if (UseRC) {
    assert(TRI->isTypeLegalForClass(*UseRC, VT) &&
           "Incompatible phys register def and uses!");
    - DstRC = UseRC;   // DstRC is set to UseRC, which is vreg_1
  } else
    DstRC = SrcRC;
...
```
As you can see, because `UseRC` is set to vreg_1 near the beginning of the function, the final `DstRC` is set to vreg_1 as well.

The code for `SITargetLowering::getRegClassFor()` is as follows:
```cpp
const TargetRegisterClass *SITargetLowering::getRegClassFor(MVT VT, bool isDivergent) const {
  const TargetRegisterClass *RC = TargetLoweringBase::getRegClassFor(VT, false);
  const SIRegisterInfo *TRI = Subtarget->getRegisterInfo();
  if (RC == &AMDGPU::VReg_1RegClass && !isDivergent)  // Returns SReg_64 only for non-divergent SDNode.  
    return Subtarget->getWavefrontSize() == 64 ? &AMDGPU::SReg_64RegClass
                                               : &AMDGPU::SReg_32RegClass;
  if (!TRI->isSGPRClass(RC) && !isDivergent)
    return TRI->getEquivalentSGPRClass(RC);
  else if (TRI->isSGPRClass(RC) && isDivergent)
    return TRI->getEquivalentVGPRClass(RC);

  return RC;
}
```

I don't think changing the behavior of `SITargetLowering::getRegClassFor()` is a good idea. So, I think the current solution, i.e. skipping the instruction, `%3:sreg_64 = COPY $sgpr0_sgpr1` in SILowerI1Copies is preferred.

Pls let me know if you have any suggestions. 

https://github.com/llvm/llvm-project/pull/72461