[LLVMdev] Separating loop nests based on profile information?
Xinliang David Li
xinliangli at gmail.com
Mon Jan 12 20:28:39 PST 2015
On Thu, Jan 8, 2015 at 12:33 PM, Philip Reames <listmail at philipreames.com>
wrote:
>
> On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>
>> How does this compare with classical approaches of loop peeling,
>> partitioning, fission, or whatever you might call it?
>>
> I'm still developing a good sense for this, but let me lay out some
> observations. Fair warning, this is going to be long and rambling.
>
> Let's start with a toy example:
> while (c) {
>   x = this->x;
>   y = this->y;
>   if (x == y) {
>     rare();
>   }
> }
>
>
With profile info, speculative PRE can be performed for memory accesses that
are not downsafe:
temp_c = c;
if (temp_c) {
  x = this->x;
  y = this->y;
}
while (temp_c) {
  if (x == y) {
    rare();
    x = this->x;
    y = this->y;
  }
  temp_c = c;
}
If you can prove this->x etc. are safe to control-speculate (e.g., you have
seen a dominating memory access with the same base and offset), it can be:
x = this->x;
y = this->y;
while (c) {
  if (x == y) {
    rare();
    x = this->x;
    y = this->y;
  }
}
> If we could tell x and y were loop invariant, we could unswitch this
> loop. However, the rare call clobbers our view of memory, so LICM fails,
> and thus unswitch fails.
>
> We'd like to apply PRE here and push the reload into the loop preheader
> and the rare case. This currently fails for two reasons: 1) we don't allow
> code duplication during PRE, and 2) the canonical loop-simplify form, with
> its single latch, means that PRE doesn't actually see the rare block at
> all; it sees only the preheader and the join point after the if block.
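>
> For reference, here is a rough sketch of the shape PRE actually gets to
> look at after rotation and loop-simplify (my own rendering, not the output
> of any pass):
>
> preheader:
>   if (!c) goto exit;
> header:                 // preds: preheader, latch
>   x = this->x;          // PRE only queries the preheader and the latch
>   y = this->y;
>   if (x != y) goto latch;
> rare_block:
>   rare();               // the clobber is hidden behind the merge below
> latch:                  // preds: header, rare_block
>   if (c) goto header;
> exit:
>   ;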
>
> I think both of the previous are fixable:
> 1) This calls for profile guided duplication. If we can trade one hot
> load for two cold ones, that's great. Also, there might be static
> heuristics we could construct. (e.g. if it's not available in the
> preheader and one of many latches...)
> 2) We need PRE to look back through merge blocks. This should possibly
> even live in MemoryDependenceAnalysis. Alternatively, we could create a
> "deconstruct loop simplify form" pass to run right before GVN to sidestep
> this issue; a sketch of the resulting shape follows below.
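>
> For (2), a rough sketch of the "de-simplified" shape, where the rare block
> becomes a second latch and is therefore directly visible as a predecessor
> of the header (again just an illustration, not pass output):
>
> preheader:
>   if (!c) goto exit;
> header:                 // preds: preheader, not_rare, rare_block
>   x = this->x;          // PRE now sees the rare path explicitly and can
>   y = this->y;          // reload there while leaving the hot path alone
>   if (x == y) goto rare_block;
> not_rare:
>   if (c) goto header;
>   goto exit;
> rare_block:
>   rare();
>   if (c) goto header;
> exit:
>   ;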
>
> Neither is entirely simple to implement, but both are doable. Probably.
>
> In theory, we've now solved the same problem my loop nest transform does.
> There's two snags: one minor(ish), one major.
>
> The minor one is that many of my rare conditions involve volatile loads.
> MemDepAnalysis currently completely gives up when faced with volatile
> loads. This is mostly fixable, but given the structure of the code the fix
> may be fairly invasive. Oddly, LICM actually performs better in this case.
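>
> As a made-up illustration of that case (not one of my real examples),
> imagine the rare path is guarded by a volatile flag read:
>
> while (c) {
>   x = this->x;
>   y = this->y;
>   if (*pending) {       // 'pending' is a hypothetical volatile int *
>     rare();
>   }
> }
>
> MemDepAnalysis stops at the volatile load, so the load-PRE story above
> never even gets started on this shape.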
>
> For the major one, let's tweak our example slightly:
> while (c) {
>   x = this->x;
>   if (x == null) return;
>   y = this->y;
>   if (y == null) return;
>   if (x == y) {
>     rare();
>   }
> }
>
> Let's assume that PRE worked as designed for 'x'. We now need to perform
> a transform analogous to loop unswitch, but a bit different. We need to
> move the conditional loop exits into each of the predecessors where the
> answer is not available. I don't believe we have anything like this today,
> and it involves a non-trivial bit of code. Even if we get that done, we're
> left with a pass ordering problem where we need to run PRE, then
> branch-PRE, then PRE, then branch-PRE, etc. (In practice, these chains
> can be quite long due to null checks and range checks.)
>
> If we'd split this into a loop nest structure, we'd still have the chain,
> but a) that's easier to control with a custom pass order since LICM and
> LoopUnswitch are far cheaper than GVN/PRE and b) having LICM do a bit of
> trivial loop unswitch for the terminator of the header appears quite
> tractable.
>
if (c) {
  x = this->x;
  if (!x) return;
  y = this->y;
  if (!y) return;
}
while (c) {
  if (x == y) {
    rare();
    x = this->x;   // reload after rare(), which may clobber this->x/this->y
    if (!x) return;
    y = this->y;
    if (!y) return;
  }
}
The 'branch PRE' (hoisting, sinking, assertion prop and branch elimination)
part is a little tricky to do though.
David
>
> One other point worth mentioning is that LLVM does not currently have a
> loop peeling pass. LoopRotate seems to get us most of the way there in
> common cases, but not always. If we did peel the above loop once, we'd
> have the same merge block problem I mentioned previously (this time in the
> preheader *and* the latch), but the profitability becomes slightly
> clearer without the need for profile information or loop-specific heuristics.
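>
> For concreteness, a rough sketch of the toy loop peeled once (again my own
> rendering):
>
> if (c) {                // peeled first iteration
>   x = this->x;
>   y = this->y;
>   if (x == y) {
>     rare();
>   }
>   while (c) {           // remaining iterations
>     x = this->x;        // redundant unless rare() ran on the incoming path
>     y = this->y;
>     if (x == y) {
>       rare();
>     }
>   }
> }
>
> The loads in the loop body are now visibly redundant along the frequent
> paths, which is what makes the profitability easier to argue statically,
> but the loop header is still fed by merge points in both the peeled code
> and the latch.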
>
> To be clear, I'm not opposed to fixing some of the issues above. I think
> that's a great idea; I'm just trying to find my way through to practical
> performance improvements with minimal invested effort. I probably will in
> fact fix some of these since they show up in a number of other cases as
> well.
>
>
> Ok, so that's that. Let's move on to why my proposed loop nest separation
> change is a bit scary.
>
> While the loop nest does allow things to be lifted from the inner loop to
> the outer loop - without any of the complex PRE or PGO reasoning required
> above - it suffers from all of the limitations of the existing loop
> optimization passes. In particular, LICM may not actually be able to do
> much if the inner loop is still complicated enough to defeat either its
> fairly simple alias analysis or its fault analysis. GVN/PRE is strictly more
> powerful here - in theory, if not always in practice - than what we have in
> LICM.
>
> We've also potentially created a loop nest which is harder to analyse than
> the original loop with two latches. Consider this example:
> for (int i = 0; i < 20000;) {
>   i++;
>   if (i % 20 != 0) continue;
>   a[i] = 0;
> }
> (This is not a real example of anything useful. The important point is
> that the early continue is the common case.)
>
> This loop clearly runs exactly 20000 times. ScalarEvolution almost
> certainly gets this right. (I haven't checked.)
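>
> (One quick way to check - API spelling from memory, so treat it as an
> assumption - is to ask SE for the backedge-taken count from a pass that
> already has ScalarEvolution &SE and the Loop *L:
>
>   const SCEV *BTC = SE.getBackedgeTakenCount(L);
>   if (!isa<SCEVCouldNotCompute>(BTC))
>     BTC->dump();        // 19999 here, if SE does get it right
>
> or just run "opt -analyze -scalar-evolution" on the IR and read the
> backedge-taken count off the report.)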
>
> After separation, we'd have this:
>
> int i = 0;
> while (true) {
>   while (i < 20000) {
>     i++;
>     if (i % 20 == 0) { // now a loop exit
>       a[i] = 0;
>       goto next;
>     }
>   }
>   break;
> next:
>   ;
> }
>
> It's not clear to me that SE will always be able to tell that both the
> inner and outer loops run a maximum of 20000 times. (In this particular
> case it might actually optimize the inner loop substantially and thus get a
> more precise answer, but let's ignore that.) We could potentially lose
> information about the loop by doing this transformation. Of course, this
> concern applies to the *existing* loop nest separation heuristics too. I
> don't have any particular reason to believe one heuristic is worse than the
> other, but I have no particular reason to believe otherwise either.
>
> As a final point, the inner loop now has two exiting blocks, not one. I
> know at least some of our loop transforms give up on anything with more
> than one loop exit. I'm less concerned by that since a) we should fix
> them, and b) the cases I've looked at don't look too hard.
>
>
> And that's everything that occurs to me at the moment. I'll probably have
> to add more in follow-up responses as I remember details; I know I'm
> forgetting some pieces.
>
> Philip