[PATCH] [MachineLICM] Sink instructions only if they are unlikely to be executed

Daniel Jasper djasper at google.com
Fri Mar 20 15:25:46 PDT 2015


================
Comment at: lib/CodeGen/MachineLICM.cpp:835
@@ -818,8 +834,3 @@
       }
-      if (!B) {
-        B = MI.getParent();
-        continue;
-      }
-      B = DT->findNearestCommonDominator(B, MI.getParent());
-      if (!B) {
-        CanSink = false;
+      B = B ? DT->findNearestCommonDominator(B, MI.getParent())
+            : MI.getParent();
----------------
qcolombet wrote:
> Aborting on PHI does not make much sense.
> Instead, you should look for the common dominator for the use of the PHI, e.g., by looking at the dominator of the terminator of the related block. In that case, you wouldn’t insert after the first non-PHI instruction.
I don't think I understand this.

We are talking about an instruction that is currently in the loop preheader whose use is in a PHI. Doesn't that (at least in the vast majority of cases) mean that we need the value only when branching from the loop preheader into the loop? Thus, the instruction is currently in the best possible place.

================
Comment at: lib/CodeGen/MachineLICM.cpp:836
@@ +835,3 @@
+    // factor of a cost model.
+    if (HeaderFreq * ColdProb < MBFI->getBlockFreq(B))
+      continue;
----------------
qcolombet wrote:
> djasper wrote:
> > qcolombet wrote:
> > > djasper wrote:
> > > > qcolombet wrote:
> > > > > Shouldn’t we have a threshold in the other direction as well, i.e., Freq(B) < Freq(Preheader) * <some threshold>?
> > > > Could you elaborate what that might accomplish?
> > > > 
> > > > Also the frequency of the *pre*header doesn't really relate strongly to the frequencies within the loop AFAICT.
> > > Sure, but the instructions we are moving come from the preheader and we do not want to explode the number of times they are executed.
> > > Let's say the preheader is executed once and the loop 1,000,000 times. Then, even into the cold path of the loop, sinking the instructions is still rather expensive.
> > I think we will almost never have accurate information about that. And the cost model is also not easy. We are generally weighing up the cost of the additional computation in each loop vs. the cost of the additional live ranges of these registers and the register spills that might result. The latter might actually make the computation inside the loop *more* expensive, as demonstrated in the changed test code.
> > 
> > Here, we are very frequently accessing one of the struct's fields. If we hoist all of the GEPs out of the loop, we end up spilling them all onto the stack and each load becomes a load from memory. With this change, we keep the GEPs of the less frequently accessed fields inside the loops (where they are folded into the LEAs and potentially don't cause significant overhead over loading them from the stack onto which they would otherwise be spilled). We still pull out the GEP for the frequently accessed field and can actually keep that in a register, using only a cheap MOV inside the loop.
> > 
> > I think the long-term solution might be to order all instructions we could sink by frequency and then sink them into the loop starting from the least frequently executed until there is no more register pressure. Not entirely sure how to implement that correctly yet. My hope is that this patch is an incremental step towards that.
> > We are generally weighing up the cost of the additional computation in each loop vs. the cost of the additional live ranges of these registers and the register spills that might result.
> 
> I agree, but checking for frequencies to derive a heuristic for that sounds wrong to me.
> To simplify, let us assume that the spiller would insert the spill instructions at the exact same locations as the sinking algorithm.
> The first question we should ask ourselves is: are the reload instructions more expensive than the things we sank?
> If the answer is no, then there is no point in sinking. Frequency does not help with that.
> 
> Second, when do we consider the live-range we shorten/extend?
> E.g., what if you sink a = add b<kill> + c<kill>?
> You end up increasing the register pressure by one across the whole loop body and pushing the instruction into a more expensive location. Frequency does not help with that either.
> 
> The bottom line is, I believe that frequency has nothing to do with the heuristic we want in order to achieve better spill placement.
> Therefore, I do not think this is a step toward fixing the problem.
> 
> I am fine if you want to do experimentation to gather ideas, but seeing gains out of such a study seems more luck-based than an actual improvement.
> The first question we should ask ourselves is: are the reload instructions more expensive than the things we sank?
> If the answer is no, then there is no point in sinking. Frequency does not help with that.

I am not sure it is that easy. Register spills have other costs as well: at the very least, they increase the stack frame size and incur the cost of writing to the stack. If the instructions are only executed a handful of times, as is the case when they are unlikely to be executed, this additional cost is significant.

The model I have in mind in the long run is the following. Let's call the difference between the instruction cost and the cost of a spill-reload C. It is the additional cost that a sunken instruction will incur each time it is executed in the loop. You are saying that we should sink all the instructions where C is negative (the spill-reload is more expensive). This is correct to a certain extent, although we have to factor in register pressure because we might not need to spill at all; but that's a bit of a separate consideration. Let's further assume that the likelihood of execution is L. Now, for positive C, we should sink instructions if C*L is smaller than some threshold that factors in the cost of the spill itself, which I mentioned above. It is the latter part that I am getting at with this patch, and I do think that looking at probabilities is a step in the right direction (although maybe not the most important first step).
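To make this concrete, here is a minimal sketch of the decision rule I have in mind. Everything in it (the helper name, the cost inputs, and the threshold value) is a hypothetical illustration, not existing MachineLICM code:

  #include <iostream>

  // Hypothetical decision rule following the model above:
  //   InstrCost       - cost of executing the instruction once in the loop.
  //   SpillReloadCost - cost of reloading the value from its stack slot.
  //   L               - likelihood that the sunken instruction executes on a
  //                     given trip through the loop (0.0 .. 1.0).
  //   SpillOverhead   - one-time cost of the spill itself (the store and the
  //                     larger stack frame).
  static bool shouldSink(double InstrCost, double SpillReloadCost, double L,
                         double SpillOverhead) {
    // C is the additional cost the sunken instruction incurs each time it
    // executes in the loop, compared to a spill-reload.
    double C = InstrCost - SpillReloadCost;
    if (C < 0)
      return true; // Recomputing is cheaper than reloading: always sink.
    // For positive C, sink only while the expected extra cost per loop trip
    // stays below the one-time cost of the spill we avoid.
    return C * L < SpillOverhead;
  }

  int main() {
    // An instruction executed on ~1% of iterations: even though it is a bit
    // more expensive than a reload, sinking it avoids the spill.
    std::cout << shouldSink(/*InstrCost=*/2.0, /*SpillReloadCost=*/1.5,
                            /*L=*/0.01, /*SpillOverhead=*/1.0)
              << '\n'; // prints 1
  }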

Now, to explain where I am coming from: I have code with a loop that basically does deserialization, and inside the loop there is a large switch statement. In practice, each of the spilled registers is usually accessed either 0 or 1 times. Thus, sinking is very important.

Also, there is already code in MachineLICM that does something similar, in line 743: it checks whether the loop header has more than 25 successors, in which case it bails out of hoisting instructions. Now, of course 25 is significantly greater than the 20% probability I am using. But also, just looking at the successor count is not very helpful, as I think this condition would not fire if the loop contained a switch statement inside a surrounding branch in the loop body. We should probably also make this decision based on execution probabilities, as in the sketch below.
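For illustration, here is a hedged sketch of what a frequency-based version of that check could look like, mirroring the HeaderFreq * ColdProb comparison from this patch. The helper name, its placement, and the 20% threshold are assumptions, not part of the actual change:

  #include "llvm/CodeGen/MachineBasicBlock.h"
  #include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
  #include "llvm/Support/BranchProbability.h"

  using namespace llvm;

  // Hypothetical helper: returns true if MBB is reached on at most ~20% of
  // the executions of the loop header, i.e. it is cold relative to the
  // header, regardless of how the CFG is shaped (plain switch, switch
  // nested inside a branch, etc.).
  static bool isColdRelativeToHeader(const MachineBasicBlock *MBB,
                                     const MachineBasicBlock *Header,
                                     const MachineBlockFrequencyInfo *MBFI) {
    const BranchProbability ColdProb(1, 5); // 20%, hypothetical threshold.
    return MBFI->getBlockFreq(MBB) <= MBFI->getBlockFreq(Header) * ColdProb;
  }

Unlike a successor count, this looks at where the block actually sits in the profile, so a cold case nested inside a surrounding branch would still be recognized.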

http://reviews.llvm.org/D8451
