[PATCH] D105062: [AMDGPU] Reduce AGPR to AGPR copies with same source

Mon Jun 28 18:49:23 PDT 2021

vangthao added a comment.

In D105062#2845881 <https://reviews.llvm.org/D105062#2845881>, @rampitec wrote:

> In D105062#2845855 <https://reviews.llvm.org/D105062#2845855>, @arsenm wrote:
>
>> I think copyPhysReg should not have to do any nontrivial scanning. These situations where this matters should be resolvable before register allocation
>
> I concur. This is untrivial and error prone. There are quite a number of foldings for agpr copy and init in different places, this looks like a candidate for yet another one, starting from a relevant testcase. This place is really a last resort.
>
> Do you have a minimal ir test showing a problem you are solving?

I added a test case broadcast_a32_unable_to_propagate_v0 in accvgpr-copy.mir where we failed to propagate v0 when broadcasting $agpr0 to 31 other lanes. Using the trunk, this forces us to read $agpr0 31 times, write it to a vgpr register, and then copy it to the agpr lane. This issue occurs when an instruction is scheduled in-between the defining accvgpr_write and the COPIES and also writes over the same vgpr register as the defining accvgpr_write. In this case, a V_MOV_B32_e32 uses $vgpr0 and thus we cannot propagate $vgpr0 to the AGPR copies.

I believe this is really only an issue when we have a situation where multiple lanes are using the same source. Following the IR printed by LLVM before register allocation, the code looks somewhat like this

  %80:vgpr_32 = COPY %12:sgpr_32
  %81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  %82:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  %83:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  %84:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  %85:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  ...
  %70:areg_1024 = REG_SEQUENCE %81:agpr_32, %subreg.sub0, %82:agpr_32 ...

This seems fine to me. We are using the vgpr register instead of an agpr register before register allocation. But then the machine-cse pass eliminates this chain of instruction and creates a dependency on agpr register %81.

  %80:vgpr_32 = COPY %12:sgpr_32
  %81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  %70:areg_1024 = REG_SEQUENCE %81:agpr_32, %subreg.sub0, %81:agpr_32, %subreg.sub1, %81:agpr_32 ...

This is an issue because the REG_SEQUENCE will be expanded into multiple AGPR copies during twoaddressinstruction pass.

  %80:vgpr_32 = V_MOV_B32_e32 1123418112, implicit $exec
  %81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 killed %80:vgpr_32, implicit $exec
  undef %70.sub0:areg_1024 = COPY %81:agpr_32
  %70.sub1:areg_1024 = COPY %81:agpr_32
  %70.sub2:areg_1024 = COPY %81:agpr_32
  %70.sub3:areg_1024 = COPY %81:agpr_32
  ...

So we just opted to eliminate instructions and replaced them with AGPR copies that we will have to transform back into those same instructions we eliminated. Even worse if we happen to not be able to propagate %80:vgpr_32 as in the test scenario broadcast_a32_unable_to_propagate_v0.

Disabling machine-cse gives us something like this:

  64B       %80:vgpr_32 = V_MOV_B32_e32 1123418112, implicit $exec
  80B       %81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  96B       %82:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  112B      %83:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  128B      %84:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  144B      %85:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  ...
  608B      %70.sub1:areg_1024 = COPY %82:agpr_32
  624B      %70.sub2:areg_1024 = COPY %83:agpr_32
  640B      %70.sub3:areg_1024 = COPY %84:agpr_32
  656B      %70.sub4:areg_1024 = COPY %85:agpr_32
  672B      %70.sub5:areg_1024 = COPY %86:agpr_32
  ...

Which then the AGPR copies is folded into the accvgpr_write by the simple register coalescing pass.

  64B       %80:vgpr_32 = V_MOV_B32_e32 1123418112, implicit $exec
  80B       undef %70.sub0:areg_1024 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  96B       %70.sub1:areg_1024 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  112B      %70.sub2:areg_1024 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  128B      %70.sub3:areg_1024 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  144B      %70.sub4:areg_1024 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
  ...

This seems to be perfectly fine and is what we wanted since there are no AGPR copies and the accvgpr_writes are using vgpr sources. So I believe the issue here is that the machine-cse pass is removing something that we wish to keep and replacing it with something possibly worse. Eliminating accvgpr_write and replacing them with AGPR copies does not seem to be equivalent or better on targets that are not able to copy from AGPR to AGPR directly.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105062/new/

https://reviews.llvm.org/D105062