<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jan 8, 2015 at 12:33 PM, Philip Reames <span dir="ltr"><<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>

On 01/07/2015 05:33 PM, Chandler Carruth wrote:<br>

</span><span class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

How does this compare with classical approaches of loop peeling, partitioning, fission, or whatever you might call it?<br>

</blockquote></span>

I'm still developing a good sense for this, but let me lay out some observations.  Fair warning, this is going to be long and rambling.<br>

<br>

Let's start with a toy example:<br>

while(c) {<br>

  x = this->x;<br>

  y = this->y;<br>

  if (x == y) {<br>

    rare();<br>

  }<br>

}<br>

<br></blockquote><div><br></div><div>With profile info, speculative PRE can be performed for memory access that are not downsafe:</div><div><br></div><div>temp_c = c;</div><div>if (temp_c)</div><div>{</div><div>    x = this->x;</div><div>    y = this->y;</div><div>}</div><div>while (temp_c) {</div><div>   if (x == y)</div><div>     {</div><div>          rare();</div><div>          x = this->x;</div><div>          y = this->y;</div><div>      }</div><div>   temp_c = c;</div><div>}</div><div><br></div><div>If you can prove this->x etc are safe control speculative (e.g, seen a dominating memory access with the same base and offset), it can be </div><div><br></div><div>x = this->x;</div><div>y = this->y;</div><div>while (c) {</div><div>   if (x == y) {</div><div>      rare();</div><div>      x = this->x;</div><div>      y = this->y;</div><div>    }</div><div> }</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

If we could tell x and y were loop invariant, we could unswitch this loop.  However, the rare call clobbers our view of memory, so LICM fails, and thus unswitch fails.<br>

<br>

We'd like to apply PRE here and push the reload into the loop preheader and the rare case.  This currently fails for two reasons: 1) We don't allow code duplication during PRE, and 2) the canonical loop-simplify form of having a single latch means that PRE doesn't actually see the rare block at all, it sees the preheader and the join point after the if block.<br>

<br>

I think both of the previous are fixable:<br>

1) This calls for profile guided duplication.  If we can trade one hot load for two cold ones, that's great.  Also, there might be static heuristics we could construct.  (e.g. if it's not available in the preheader and one of many latches...)<br>

2) We need PRE to look back through merge blocks.  This should possibly even live in MemoryDependenceAnalysis.  Alternatively, we could create a "deconstruct loop simplify form" pass to run right before GVN to side step this issue.<br>

<br>

Neither is entirely simple to implement, but both are doable. Probably.<br>

<br>

In theory, we've now solved the same problem my loop nest transform does.  There's two snags: one minor(ish), one major.<br>

<br>

The minor one is that many of my rare conditions involve volatile loads.  MemDepAnalysis currently completely gives up when faced with volatile loads.  This is mostly fixable, but given the structure of the code may be fairly invasive.  Oddly, LICM actually performs better in this case.<br>

<br>

For the major one, let's tweak our example slightly:<br>

while(c) {<br>

  x = this->x;<br>

  if (x == null) return;<br>

  y = this->y;<br>

  if (y == null) return;<br>

  if (x == y) {<br>

    rare();<br>

  }<br>

}<br>

<br>

Let's assume that PRE worked as designed for 'x'.  We now need to perform a transform analogous to loop unswitch, but a bit different.  We need to move the condition loop exits into each of the predecessors where the answers not available.  I don't believe we have anything like this today, and it involves a non-trivial bit of code.  Even if we get that done, we're left with a pass ordering problem where we need to run PRE, then branch-PRE, then PRE, then branch-PRE, etc..  (In practice, these chains can be quite long due to null checks and range checks.)<br>

<br>

If we'd split this into a loop nest structure, we'd still have the chain, but a) that's easier to control with a custom pass order since LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do a bit of trivial loop unswitch for the terminator of the header appears quite tractable.<br></blockquote><div><br></div><div><br></div><div>if (c)</div><div>{</div><div>   x = this->x;</div><div>   if (!x) return;</div><div>   y = this->y;</div><div>   if (!y) return;</div><div>}</div><div>while (c) {</div><div>  if (x == y) {</div><div>      rare();</div><div>      if (!x) return;</div><div>      if (!y) return;</div><div>   }</div><div>}</div><div><br></div><div>The 'branch PRE' (hoisting, sinking, assertion prop and branch elimination) part is a little tricky to do though.</div><div><br></div><div>David</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

One other point worth mentioning is that LLVM does not currently have a loop peeling pass.  LoopRotate seems to get us most of the way there in common cases, but not always.  If we did peel the above loop once, we'd have the same merge block problem I mentioned previously (this time in the preheader *and* the latch), but the profitability becomes slightly more clear without the need for profile information or loop specific heuristics.<br>

<br>

To be clear, I'm not opposed to fixing some of the issues above.  I think that's a great idea, I'm just trying to find my way through to practical performance improvements with minimal invested effort.  I probably will in fact fix some of these since they show up in a number of other cases as well.<br>

<br>

<br>

Ok, so that's that.  Let's move on to why my proposed loop nest separation change is a bit scary.<br>

<br>

While the loop nest does allow things to be lifted from the inner loop to the outer loop - without any of the complex PRE or PGO reasoning required above - it suffers from all of the limitations of the existing loop optimization passes.  In particular, LICM may not actually be able to do much if the inner loop is still complicated enough to defeat either it's fairly simple alias analysis or fault analysis.  GVN/PRE is strictly more powerful - in theory if not always practice - here than what we have in LICM.<br>

<br>

We've also potentially created a loop nest which is harder to analyse than the original loop with two latches.  Consider this example:<br>

for(int i = 0; i < 20000) {<br>

   i++;<br>

   if(i % 20 != 0) continue;<br>

   a[i] = 0;<br>

}<br>

(This is not a real example of anything useful.  The important point is the early return is the common case.)<br>

<br>

This loop clearly runs exactly 20000 times.  ScalarEvolution almost certainly gets this right. (I haven't checked.)<br>

<br>

After separation, we'd have this:<br>

<br>

int i = 0;<br>

while(true) {<br>

  while(i < 20000) {<br>

     i++;<br>

     if(i % 20 == 0) { // now a loop exit<br>

       a[i] = 0;<br>

       goto next:<br>

     }<br>

}<br>

break:<br>

  next:<br>

}<br>

<br>

It's not clear to me that SE will always be able to tell that both the inner and outer loop runs a maximum of 20000 times.  (In this particular case it might actually optimize the inner loop substantially and thus get a more precise answer, but let's ignore that.)  We could potentially loose information about the loop by doing this transformation.  Of course, this concern applies to the *existing* loop nest separation heuristics too.  I don't have any particular reason to believe one heuristic is worse than the other, but I also have no reason not to believe that either.<br>

<br>

As a final point, the inner loop now has two exiting blocks not one.  I know at least some of our loop transforms give up on anything with more than one loop exit.  I'm less concerned by that since a) we should fix them, and b) the cases I've looked at don't look too hard.<br>

<br>

<br>

And that's everything that occurs to me at the moment.  I'll probably have to append in further responses as I remember more details, I know I'm forgetting some pieces.<span class="HOEnZb"><font color="#888888"><br>

<br>

Philip</font></span><div class="HOEnZb"><div class="h5"><br>

<br>

<br>

<br>

______________________________<u></u>_________________<br>

LLVM Developers mailing list<br>

<a href="mailto:LLVMdev@cs.uiuc.edu" target="_blank">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/<u></u>mailman/listinfo/llvmdev</a><br>

</div></div></blockquote></div><br></div></div>