[PATCH] D123525: [AMDGPU] On gfx908, reserve VGPR for AGPR copy based on register budget.

Tue Apr 12 10:31:25 PDT 2022

hsmhsm added a comment.

In D123525#3446104 <https://reviews.llvm.org/D123525#3446104>, @rampitec wrote:

> Let's begin with the question: what problem are you trying to solve?

The problem, I am basically trying to solve is a case, where we have following situation - there is a vector ALU operation which requires 1024 wide VGPR, and the register budget in this case is 64. Since v32 is reserved, all the 1024 bit vectors starting from v1...v32 until v32...v63 are NOT usable. Only possible to use 1024 bit register is v0...v31. But, unfortunately, there is SGPR spills to v0 happening, and hence, v0...v31 is also NOT usable. So, RA fails.

Hence, we cannot (always) choose v32, probably the better way of handling it is:

(1)    While reserving the registers before RA, reserve highest available VGPR based on register budget, and then later
(2)    Shift it to lowest available VGPR after RA

Now, we have of couple of questions to answer

Q1.  Is it safe to always reserve highest available VGPR irrespective of constrained register budget?
Q2.  What is the best place to handle (2) ?
Q3.  Is SIInstrInfo::indirectCopyToAGPR() always doing right thing as expected?

Answer to Q1 is - we have no other choice at the moment.
Answer to Q2 is - I thought SIInstrInfo::indirectCopyToAGPR() is a right place to handle it. But, later based on offline discussion with Matt, looks like framelowering is probably the right place.
Answer to Q3 is - Based on my own debugging of SIInstrInfo::indirectCopyToAGPR() and discussing my findings with Matt, this function is actually broken. I better not touch it at the moment, at least for this patch.

================
Comment at: llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:628
   // or spill with the slot.
   while (RegNo-- && RS.FindUnusedReg(&AMDGPU::VGPR_32RegClass)) {
     Register Tmp2 = RS.scavengeRegister(&AMDGPU::VGPR_32RegClass, 0);
----------------
rampitec wrote:
> hsmhsm wrote:
> > rampitec wrote:
> > > It will never get replaced if RegNo == 0.
> > Yes, I noticed it. I think, we need to take a relook at it.
> This completely breaks the original logic and idea. This loop exists because there are 2 waitstates needed to reuse a VGPR used in the v_accvgpr_write on gfx908. When you are copying a wide register, say a 16 dword tuple, we would have to insert 30 nops is we are using a single register. Therefore this code attempts to use another register each time until we have copied 3 dwords. Hence the 'DestReg % 3'. This code just does not make sense anymore.
As I mentioned in my other reply (to your comment), this function is actually broken. I better not touch it at the moment, at least for this patch. And, I am going to handle the required change within framelowering.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123525/new/

https://reviews.llvm.org/D123525