[PATCH] D133723: [AMDGPU][GFX11] Use VGPR_32_F128 for VOP1,2,C

Tue Sep 13 12:27:10 PDT 2022

Joe_Nash marked 6 inline comments as done.
Joe_Nash added a comment.

In D133723#3786055 <https://reviews.llvm.org/D133723#3786055>, @foad wrote:

>> We introduce a new register class VGPR_32_F128 which is used for
>> encodings VOP1, VOP2, and VOPC. This register class only has the first
>> 128 VGPRs, but is otherwise identical to VGPR_32. Therefore, VOP1, VOP2,
>> and VOPC instructions are correctly limited to use the first 128
>> VGPRs, while the other instructions can freely use all 256.
>
> This paragraph needs to explain that the new register class is used for 16-bit operands only. The whole point of this patch is that 32-bit operands will no longer be restricted to using the first 128 VGPRs.

done

================
Comment at: llvm/lib/Target/AMDGPU/GCNDPPCombine.cpp:606-608
+    assert(
+        (TII->get(OrigOp).Size != 4 || !llvm::AMDGPU::isTrue16Inst(OrigOp)) &&
+        "There should not be e32 True16 instructions pre-RA");
----------------
foad wrote:
> This may be true but I don't see why you're asserting it //here// in particular. Couldn't this go into isShrinkable, near the other check that you added?
isShrinkable guards against shrinking in this pass, but this assert was supposed to check if previous passes shrunk anything. In theory, it's not necessary, but I put the assert for future proofing.

================
Comment at: llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:3279-3281
+  assert(Opc != AMDGPU::V_FMAC_F16_T16_e32 &&
+         "V_FMAC_F16_T16_e32 is not supported and not expected to be present "
+         "pre-RA");
----------------
foad wrote:
> My understanding is that having a _T16_e32 instruction here should work correctly, but it is undesirable because it has more restrictive register classes than the _T16_e64 form. So perhaps we want a more general assertion in (or just before) the register allocator that there are no _T16_e32 instructions present? I'm not sure how or where to implement that.
>From the perspective of the RA, allocating a _T16_e32 will work correctly. 

In this particular function, there is a restriction. V_FMAC_F16_T16_e32 -> V_FMA_F16_gfx9_e64 should not be allowed, because the register class in the former is VGPR_32_F128 and in the latter is VGPR_32. So it would require a COPY or something, which we are not doing. 

================
Comment at: llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp:2880
                                                 unsigned Idx) const {
+  if (Idx == AMDGPU::RegisterPressureSets::VGPR_32_F128) {
+    return getRegPressureLimit(&AMDGPU::VGPR_32_F128RegClass,
----------------
rampitec wrote:
> You probably do not need PSet for it, it is not handled anywhere.
I don't fully understand this, but MachineLICMBase calls getRegPressureSetLimit and will hit llvm_unreachable without the PSet for VGPR_32_F128

================
Comment at: llvm/lib/Target/AMDGPU/SIRegisterInfo.td:557
+
+def VGPR_32_F128 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types.types), 32,
+                            (add (sequence "VGPR%u", 0, 127))> {
----------------
arsenm wrote:
> I don't know what "_F128" is supposed to mean. I read this as a class for long double
It is short for First 128. Is a set with only the first 128 VGPRs. I will add a comment noting this. 

================
Comment at: llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp:144

+/// Only shrink 16 bit registers after RA.
+/// Do not shrink the instruction if its registers are not expressible in the
----------------
foad wrote:
> I would prefer to change SIShrinkInstructions in a more general way that does not specifically check for True16 instructions: D133769
That seems fine to me. I will rebase on that when its landed.

================
Comment at: llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-sitofp.mir:91-93
+    ; GFX11-NEXT: %1:vgpr_32_f128 = nofpexcept V_CVT_F16_F32_T16_e32 [[V_CVT_F32_I32_e32_]], implicit $mode, implicit $exec
+    ; GFX11-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY %1
+    ; GFX11-NEXT: $vgpr0 = COPY [[COPY1]]
----------------
foad wrote:
> This is undesirable. We don't want to see _f128 classes here. We should be selecting the _e64 form of the V_CVT instruction instead. If you rebase on 3743f9afeb51e0b7bdf2269583f32b7e35369168 you will see the same problem for fptosi as well as sitofp.
> 
> The fix is to change the patterns in SIInstructions.td to always use the _e64 forms: https://reviews.llvm.org/differential/diff/459699/
> Pre-GFX11 this should be pretty harmless even if selecting the _e64 forms doesn't actually give any benefit.
I picked up those changes, thanks!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133723/new/

https://reviews.llvm.org/D133723