[PATCH] D136169: [AMDGPU] Avoid SCC clobbering before S_CSELECT_B32

Thu Oct 20 06:20:39 PDT 2022

foad added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp:2216-2224
+          while (I != MBB->end()) {
+            if (*I != MI && I->readsRegister(AMDGPU::SCC)) {
+              NeedSaveSCC = true;
+              break;
+            }
+            if (*I != MI && I->definesRegister(AMDGPU::SCC))
+              break;
----------------
alex-t wrote:
> foad wrote:
> > alex-t wrote:
> > > alex-t wrote:
> > > > alex-t wrote:
> > > > > alex-t wrote:
> > > > > > foad wrote:
> > > > > > > alex-t wrote:
> > > > > > > > foad wrote:
> > > > > > > > > What is this loop for? Isn't the scavenger already supposed to know whether SCC is live?
> > > > > > > > > What is this loop for? Isn't the scavenger already supposed to know whether SCC is live?
> > > > > > > > 
> > > > > > > > The scavenger's knowledge depends on the accurate dead/kill flags insertion. This loop decreases the number of false-positive "live" SCC and accordingly unnecessary SCC saves/restores.
> > > > > > > > 
> > > > > > > > As Matt rightly pointed out, the dead/kill correct placement is out of this change scope.
> > > > > > > > It also is going to be a complex task.
> > > > > > > > 
> > > > > > > > Should I add a corresponding TODO here?
> > > > > > > Haven't dead/kill flags been automatically computed before this code runs?
> > > > > > > 
> > > > > > > If you do need this loop, shouldn't it conservatively assume that SCC is live at the end of the basic block?
> > > > > > Let me illustrate what I mean. For the code snipped below
> > > > > > ```
> > > > > >   (1) $sgpr3 = S_ADD_I32 $sgpr33, 32, implicit-def $scc
> > > > > >   (2) $vgpr51 = V_ADD_U32_e64 killed $sgpr3, killed $sgpr43, 0, implicit $exec
> > > > > >   (3) $sgpr3 = S_ADD_I32 $sgpr33, 32, implicit-def $scc
> > > > > >   (4) $vgpr52 = V_ADD_U32_e64 killed $sgpr3, killed $sgpr44, 0, implicit $exec
> > > > > > ```
> > > > > > scavenger considers SCC "live" between (1) and (3) because its internal iterator position points to (2). Calling the "advance" method inside the target hook breaks the outer PEI logic. Literally, PEI does not expect the scavenger state to be changed inside the TargetRegisterInfo::LowerFrameIndex method.
> > > > > > 
> > > > > > In this particular case, I can add the dead flag to the particular place:
> > > > > > ```
> > > > > > auto Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), TmpSReg)
> > > > > >             .addReg(FrameReg)
> > > > > >             .addImm(Offset);
> > > > > > Add->getOperand(3).setIsDead();
> > > > > > ```
> > > > > > Unfortunately, we have plenty of places like that. And for cases that are not as trivial, we are going to have plenty of odd code.
> > > > > > 
> > > > > > If you do need this loop, shouldn't it conservatively assume that SCC is live at the end of the basic block?
> > > > > 
> > > > > Good point, thanks.
> > > > > > If you do need this loop, shouldn't it conservatively assume that SCC is live at the end of the basic block?
> > > > > 
> > > > > Good point, thanks.
> > > > 
> > > > No, this does not work really. Given that we have a BB with just one SCC def, following the idea we would insert 
> > > > ```
> > > >   Some SCC def...
> > > >   $sgpr38 = S_CSELECT_B32 -1, 0, implicit $scc
> > > >   $sgpr8 = S_ADD_I32 $sgpr32, 2960, implicit-def $scc
> > > >   S_CMP_LG_U32 $sgpr38, 0, implicit-def $scc
> > > > ```
> > > > on each frame lowering S_ADD_I32 since each time we have S_CMP_LG_U32 defining SCC and no other defs to the end of the BB.
> > > > So, we will scavenge each time until we are out of registers.
> > > > If you do need this loop, shouldn't it conservatively assume that SCC is live at the end of the basic block?
> > > 
> > > In fact, the loop aims to restrict a big amount of false positive SCC "live" slots by simulating the register scavenger "forward" job. But if SCC live-in in some BB it is still in scavenger "usedregs".
> > > 
> > The loop still sounds like a hack to me that will have O(n^2) cost in the worst case. But I really don't know enough about PEI or the RegisterScavenger to review this properly.
> > The loop still sounds like a hack to me that will have O(n^2) cost in the worst case. But I really don't know enough about PEI or the RegisterScavenger to review this properly.
> 
> I don't like this approach either and would like to look for a neat solution. This is proposed as a workaround to stop Vulkan RT from failing.
> 
> Could you also clarify, why you consider it O(n^2)? The worst case is when we have to walk to the end of the BB. This is just on level loop and each instruction is visited once. 
In a large BB with n instructions that need frame index elimination, you will walk to end of the BB (worst case) for each one of them. So that is O(n^2) overall cost to run this pass.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136169/new/

https://reviews.llvm.org/D136169