[PATCH] D109301: [AMDGPU] Enable copy between VGPR and AGPR classes during regalloc

Thu Nov 4 05:37:15 PDT 2021

cdevadas added inline comments.

================
Comment at: llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir:19
+    ; GCN: liveins: $vgpr0_vgpr1_vgpr2_vgpr3, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr24_agpr25_agpr26_agpr27, $agpr28_agpr29, $agpr30
+    ; GCN: SCRATCH_STORE_DWORDX4_SADDR killed $vgpr0_vgpr1_vgpr2_vgpr3, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s128) into %stack.0, align 4, addrspace 5)
+    ; GCN: $vgpr0_vgpr1_vgpr2_vgpr3 = SCRATCH_LOAD_DWORDX4_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s128) from %stack.0, align 4, addrspace 5)
----------------
rampitec wrote:
> cdevadas wrote:
> > rampitec wrote:
> > > arsenm wrote:
> > > > rampitec wrote:
> > > > > arsenm wrote:
> > > > > > rampitec wrote:
> > > > > > > This is greedy, not fastra, the same regression.
> > > > > > This isn't allocated at all, this is running just PEI. For CSRs I think we should either not have CSR AGPRs, or use splitCSR
> > > > > OK, can we see partial cross class copy anywhere?
> > > > Not really, but this is just a general problem with the allocator which I hope to look into soon. It doesn't know how to introduce new subranges to avoid conflicts or to spill the minimum set of required lanes
> > > Look, this is tolerable unless these are AGPRs with their 32 register tuples.
> > It is true, the Spiller during regalloc doesn't support partial/sub-range tuple spills.
> > But it's not the case with copy. Greedy did handle partial tuple copy into a superclass.
> > 
> > `max_256_vgprs_spill_9x32_2bb` in spill-vgpr-to-agpr.ll, for instance, used to get entire 1024 tuple spill stores and restores.
> > Now that has become minimal copies.
> > 
> > I see similar copies inserted during greedy allocator.
> > %1200.sub16_sub17_sub18_sub19:av_1024 = COPY %1201.sub16_sub17_sub18_sub19:vreg_1024  //// copy to the super class, AV.
> > %1202.sub16_sub17_sub18_sub19:vreg_1024 = COPY %1200.sub16_sub17_sub18_sub19:av_1024  //// later moving it back to V.
> > 
> > We currently don't have a test that captures it. I can include one.
> This test would be super helpful. We need to make sure we do not copy a whole tuple. Moreover, we need it all ways: v to a, a to v, and all tuple sizes. 
> This test would be super helpful. We need to make sure we do not copy a whole tuple. Moreover, we need it all ways: v to a, a to v, and all tuple sizes. 

I take my word back when I said regallc inserts sub-range tuple copies during `tryInstructionSplit`. It isn't entirely true.
Register coalescer inserts copies that now become copies to equivalent superclasses as we defined `getLargestLegalSuperClass` for the vector classes. The `trySplit` function doesn't do anything new to introduce subrange copies.
I failed to come up with a test case that introduce tuple subrange copies for all supported AMDGPU tuple sizes.
Like Matt said, regalloc needs a fix to better handle the subranges.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109301/new/

https://reviews.llvm.org/D109301