[PATCH] D119475: [AMDGPU] Add scheduler pass to rematerialize trivial defs

Tue Feb 15 15:10:49 PST 2022

rampitec added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:822
+    return false;
+  } else {
+    // Occupancy is improved.
----------------
No else after return.

================
Comment at: llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:830
+      // not need to check RegionEnd since it can only be a call or terminator
+      // instruction and we do not touch those type of instructions.
+      for (unsigned DefI = 0, DefE = Regions.size(); DefI != DefE; ++DefI) {
----------------
You actually need to update RegionEnd because there can be a fallthrough without a terminator.

================
Comment at: llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:799
+  // occupancy. Keep the modified live-in set from second check.
+  if (!ImprovedAfterSinkingLiveThrus) {
+    // Keep a list of newly rematerialized instructions so that we can easily
----------------
vangthao wrote:
> rampitec wrote:
> > vangthao wrote:
> > > rampitec wrote:
> > > > vangthao wrote:
> > > > > rampitec wrote:
> > > > > > vangthao wrote:
> > > > > > > rampitec wrote:
> > > > > > > > vangthao wrote:
> > > > > > > > > rampitec wrote:
> > > > > > > > > > I do not think you really need a whole block above dealing with live-through estimations. First it is not guaranteed to help the RP if you sink a live-through. Second it is essentially the same as this block actually performing rematerialization, just less precise. Keep just rematerialization part.
> > > > > > > > > If the live-through is increasing RP in this block then by rematerializing don't we decrease RP by the live-throughs that we rematerialized?
> > > > > > > > I do not think LICM will hoist anything like that. Then even if so the code below will handle it as well.
> > > > > > > The code below only checks for trivially rematerializable defs used in the high RP block. If the def is live-through and used in a low RP block then it will not be checked. This checks for that and sinks those defs to lower RP blocks so we can decrease live-through RP.
> > > > > > I see 2 situations where something sinkable can be live-through:
> > > > > > 
> > > > > > 1)
> > > > > > ```
> > > > > > Def
> > > > > > loop:
> > > > > >   ...
> > > > > > cbr loop
> > > > > > Use
> > > > > > ```
> > > > > > There is absolutely no reason for any pass to hoist that Def high, most likely it will be sunk much earlier.
> > > > > > 
> > > > > > 2)
> > > > > > ```
> > > > > > Def
> > > > > > loop1:
> > > > > >   loop2:
> > > > > >     Use
> > > > > >   cbr loop2
> > > > > > cbr loop1
> > > > > > ```
> > > > > > Assuming you have a high live-through pressure in loop1 sinking Def to Use into loop2 will unlikely help it as the highest pressure is likely in the loop2 anyway.
> > > > > What about this situation? The def is live throughout the whole loop since it is needed in each iteration.
> > > > > 
> > > > > ```
> > > > > Def
> > > > > loop1:
> > > > >   ...
> > > > >   ... (high rp block)
> > > > >   ...
> > > > >   use (lower rp block)
> > > > >   cbr loop1
> > > > > ```
> > > > > 
> > > > This probably can happen, I doubt it is a common situation though. But then I still do not understand why code below will not handle it? You are only collecting instructions where defined register is in the live-in pressure of the high RP block, so it will be collected. Then below you are sinking it to the use. I.e. to me it will do the same thing automatically.
> > > We can handle both situations there but re-calculating RP is expensive so I split up the checks into two parts where we first check live-throughs then check defs that have a use in the high RP block.
> > > 
> > > For the first live-through check, it is simple to re-calculate RP since we can deduct RP from the cached pressure while the second check needs a RPTracker to go over the whole region each time we sink a def. If we can improve occupancy by only sinking live-throughs then that will reduce the amount of time we have to use RPTracker.
> > Besides unnecessary code complication I think you will need to recalculate RP anyway. You are going to reschedule the block to which you are sinking the instruction and since you may have several other blocks on the way this recalculation will be needed.
> > 
> > Speaking of which... It looks like if you have blocks in between this code will leave cached LiveIns, MBBLiveIns and Pressure for these blocks in an incorrect state.
> > 
> > I really like to keep it simple, at least until there is a proof a complication is necessary. Then even if it is needed it is easier to implement, debug and review it in a small steps.
> If this is the last pass, do we still want to fix the cached values for those blocks we are not re-scheduling?
What will happen if we add a new pass after? But updating all blocks in between can be tricky, you will need a dominator tree for that. Add a static_assert(LastStage == PreRARematerialize) and comment the reason.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119475/new/

https://reviews.llvm.org/D119475