[PATCH] D109301: [AMDGPU] Enable copy between VGPR and AGPR classes during regalloc

Stanislav Mekhanoshin via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Sep 20 15:15:49 PDT 2021


rampitec requested changes to this revision.
rampitec added a comment.
This revision now requires changes to proceed.

There is an elephant in the room: this change kills partial spilling of wide tuples. I do not think we can afford that without addressing the problem first.

Note that to spill an AGPR to memory on gfx908 one needs an intermediate VGPR and an extra VALU instruction (32 extra VALU instructions in the worst case, plus nops). In the updated tests we can clearly see the code bloat and memory-traffic bloat caused by the missing partial spill.
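For reference, the expanded spill sequence looks roughly like this (illustrative MIR, taken from the pattern the updated tests show; gfx908 has no direct AGPR-to-memory store):

```
; gfx908: no AGPR store instruction, so each dword goes through a VGPR first.
$vgpr0 = V_ACCVGPR_READ_B32_e64 killed $agpr0, implicit $exec
BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, implicit $exec :: (store (s32) into %stack.0, addrspace 5)
```

For a wide tuple this pair repeats per dword, which is where the "32 extra VALU in the worst case" comes from.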



================
Comment at: llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp:399-400
+      return &AMDGPU::AV_64RegClass;
+    if (RC == &AMDGPU::VReg_64_Align2RegClass)
+      return &AMDGPU::AV_64_Align2RegClass;
+    if (RC == &AMDGPU::VReg_96RegClass)
----------------
cdevadas wrote:
> arsenm wrote:
> > I think with the intent of this function, you don't need to return aligned classes for aligned classes, the unaligned versions are fine. I guess this gets more complicated in the case where we're using scratch instructions that do require alignment for multi-dword spilling
> I will see the impact of scratch instructions separately.
The copy of a 64-bit VGPR will use V_PK_MOV_B32, which takes 64-bit operands, and those operands must therefore be register-aligned.
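A minimal sketch of why the aligned classes matter (illustrative gfx90a assembly; exact operand encoding assumed):

```
; A 64-bit VGPR copy lowers to a single packed move on gfx90a.
; Both the destination and source are 64-bit register pairs, and each
; pair must start at an even VGPR -- hence the *_Align2 register classes.
v_pk_mov_b32 v[0:1], v[2:3], v[2:3]
```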


================
Comment at: llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir:63
     ; MUBUF-V2A: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, implicit $exec, implicit-def $vgpr0_vgpr1, implicit $vgpr0_vgpr1 :: (store (s32) into %stack.0, addrspace 5)
-    ; MUBUF-V2A: $agpr0 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr1, implicit $exec, implicit killed $vgpr0_vgpr1
+    ; MUBUF-V2A: BUFFER_STORE_DWORD_OFFSET killed $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, 0, implicit $exec, implicit killed $vgpr0_vgpr1 :: (store (s32) into %stack.0 + 4, addrspace 5)
     ; MUBUF-V2A: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, implicit $exec, implicit-def $vgpr0_vgpr1 :: (load (s32) from %stack.0, addrspace 5)
----------------
So this is a clear and predictable regression: the partial spill is killed by this patch.


================
Comment at: llvm/test/CodeGen/AMDGPU/spill-agpr.ll:10
+; GCN-DAG:     v_accvgpr_read_b32 v[[VSPILL:[0-9]+]], a0
+; GCN-NOT:     buffer_store_dword
+; GCN-NOT:     buffer_load_dword
----------------
It looks like the test no longer exercises AGPR spills, but we still need to test them, at least for gfx908.


================
Comment at: llvm/test/CodeGen/AMDGPU/spill-agpr.mir:34
   ; GFX908-EXPANDED:   $vgpr0 = V_ACCVGPR_READ_B32_e64 killed $agpr0, implicit $exec
+  ; GFX908-EXPANDED:   BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, 0, implicit $exec :: (store (s32) into %stack.1, addrspace 5)
   ; GFX908-EXPANDED:   S_NOP 0, implicit-def renamable $agpr0
----------------
Another obvious regression.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109301/new/

https://reviews.llvm.org/D109301


