[LLVMdev] Separating loop nests based on profile information?
Philip Reames
listmail at philipreames.com
Fri Jan 9 11:39:28 PST 2015
On 01/08/2015 12:33 PM, Philip Reames wrote:
>
> On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>> How does this compare with classical approaches of loop peeling,
>> partitioning, fission, or whatever you might call it?
> I'm still developing a good sense for this, but let me lay out some
> observations. Fair warning, this is going to be long and rambling.
>
> Let's start with a toy example:
> while(c) {
>   x = this->x;
>   y = this->y;
>   if (x == y) {
>     rare();
>   }
> }
>
> If we could tell x and y were loop invariant, we could unswitch this
> loop. However, the rare call clobbers our view of memory, so LICM
> fails, and thus unswitch fails.
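To make the target concrete, here is roughly what LICM plus unswitch would produce for that toy loop, written out as runnable C. The struct, the trip-count parameter, and the call counter are invented purely for illustration; the key assumption (stated in the comment) is the one that makes the transform legal.

```c
#include <assert.h>

struct S { int x, y; };

static int rare_calls;
static void rare(void) { rare_calls++; }

/* Original shape: reload the fields and re-test every iteration. */
static void original(struct S *self, int trips) {
  for (int c = 0; c < trips; c++) {
    int x = self->x;
    int y = self->y;
    if (x == y)
      rare();
  }
}

/* After LICM + unswitch: loads hoisted, compare evaluated once.
 * Legal only under the assumption that rare() cannot modify *self --
 * exactly the fact the clobber prevents LICM from proving. */
static void unswitched(struct S *self, int trips) {
  int x = self->x;
  int y = self->y;
  if (x == y) {
    for (int c = 0; c < trips; c++)
      rare();
  } else {
    for (int c = 0; c < trips; c++)
      ; /* hot path: no loads, no compare in the body */
  }
}
```
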
>
> We'd like to apply PRE here and push the reload into the loop
> preheader and the rare case. This currently fails for two reasons: 1)
> We don't allow code duplication during PRE, and 2) the canonical
> loop-simplify form of having a single latch means that PRE doesn't
> actually see the rare block at all, it sees the preheader and the join
> point after the if block.
>
> I think both of the previous are fixable:
> 1) This calls for profile guided duplication. If we can trade one hot
> load for two cold ones, that's great. Also, there might be static
> heuristics we could construct. (e.g. if it's not available in the
> preheader and one of many latches...)
> 2) We need PRE to look back through merge blocks. This should
> possibly even live in MemoryDependenceAnalysis. Alternatively, we
> could create a "deconstruct loop simplify form" pass to run right
> before GVN to side step this issue.
Point 2 here may be inaccurate. I've definitely seen problems here in
the past, but when I sat down to write some test cases, it looks like
the simple merge block for latch (with the load being considered in the
loop header) works just fine. I suspect my original example is actually
blocked by something else; I'm just not sure what that is yet.
One case where I can get PRE to fail is if there's a load in one of two
paths through the loop, in the preheader, but not in the header. Here's
the IR:
; Function Attrs: readonly
declare void @hold(i32) #0
declare void @clobber()
define i32 @test1(i1 %cnd, i32* %p) {
entry:
  %v2 = load i32* %p
  call void @hold(i32 %v2)
  br label %header

header:                    ; preds = %merge, %entry
  br i1 %cnd, label %bb1, label %bb2

bb1:                       ; preds = %header
  %v1 = load i32* %p
  call void @hold(i32 %v1)
  br label %merge

bb2:                       ; preds = %header
  call void @clobber()
  br label %merge

merge:                     ; preds = %bb2, %bb1
  br label %header
}

attributes #0 = { readonly }
I haven't looked to see what the issue here is yet.
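For reference, here is the shape PRE ought to produce for that test case, hand-applied and rewritten in C rather than IR. This is a sketch: the loop is bounded and the helpers are given bodies purely so the fragment is runnable (the original loop is infinite and the helpers are declarations), but the placement of the loads is the point.

```c
#include <assert.h>

static int mem = 5;          /* the location *p points at */
static int held[64];
static int nheld;

static void hold(int v) { held[nheld++] = v; }
static void clobber(void) { mem = 7; } /* may write through %p */

/* Hand-applied PRE on @test1: the bb1 load is fully redundant along
 * entry->header->bb1 (available from the preheader load), so PRE
 * should leave a compensation load only on the clobbering bb2 path
 * and reuse the value everywhere else. */
static void test1_after_pre(int cnd, int *p, int trips) {
  int v = *p;                /* preheader load, as in the original */
  hold(v);
  for (int i = 0; i < trips; i++) {
    if (cnd) {
      hold(v);               /* reuse: no load left in bb1 */
    } else {
      clobber();
      v = *p;                /* compensation load after the clobber */
    }
  }
}
```
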
>
> Neither is entirely simple to implement, but both are doable. Probably.
>
> In theory, we've now solved the same problem my loop nest transform
> does. There's two snags: one minor(ish), one major.
>
> The minor one is that many of my rare conditions involve volatile
> loads. MemDepAnalysis currently completely gives up when faced with
> volatile loads. This is mostly fixable, but given the structure of
> the code may be fairly invasive. Oddly, LICM actually performs better
> in this case.
>
> For the major one, let's tweak our example slightly:
> while(c) {
>   x = this->x;
>   if (x == null) return;
>   y = this->y;
>   if (y == null) return;
>   if (x == y) {
>     rare();
>   }
> }
>
> Let's assume that PRE worked as designed for 'x'. We now need to
> perform a transform analogous to loop unswitch, but a bit different.
> We need to move the conditional loop exits into each of the
> predecessors where the answer is not available. I don't believe we have anything
> like this today, and it involves a non-trivial bit of code. Even if
> we get that done, we're left with a pass ordering problem where we
> need to run PRE, then branch-PRE, then PRE, then branch-PRE, etc.
> (In practice, these chains can be quite long due to null checks and
> range checks.)
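For concreteness, here is the end state that PRE/branch-PRE chain is driving toward, hand-transformed into runnable C. The struct, counter, and trip-count parameter are invented for illustration; the assumption, as before, is that the fields are invariant and rare() can't change them.

```c
#include <assert.h>
#include <stddef.h>

struct S2 { int *x, *y; };

static int rare_calls2;
static void rare2(void) { rare_calls2++; }

/* Desired end state once the PRE / branch-PRE chain converges:
 * both loads and both null-check exits hoisted out of the loop,
 * then the invariant compare unswitched. */
static void hoisted(struct S2 *s, int trips) {
  int *x = s->x;
  if (x == NULL) return;
  int *y = s->y;
  if (y == NULL) return;
  if (x == y) {
    for (int c = 0; c < trips; c++)
      rare2();
  } else {
    for (int c = 0; c < trips; c++)
      ; /* hot loop: straight-line, no loads or checks left */
  }
}
```
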
>
> If we'd split this into a loop nest structure, we'd still have the
> chain, but a) that's easier to control with a custom pass order since
> LICM and LoopUnswitch are far cheaper than GVN/PRE and b) having LICM
> do a bit of trivial loop unswitch for the terminator of the header
> appears quite tractable.
>
> One other point worth mentioning is that LLVM does not currently have
> a loop peeling pass. LoopRotate seems to get us most of the way there
> in common cases, but not always. If we did peel the above loop once,
> we'd have the same merge block problem I mentioned previously (this
> time in the preheader *and* the latch), but the profitability becomes
> slightly more clear without the need for profile information or loop
> specific heuristics.
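To illustrate what peeling buys here, this is the first toy loop with one iteration peeled by hand (trip-count form, with hypothetical names; not the output of any existing pass). After the peel, the field values are available on entry to the remaining loop, which is what turns the in-loop loads into PRE candidates.

```c
#include <assert.h>

struct P { int x, y; };

static int rare_hits;
static void rare_p(void) { rare_hits++; }

/* The loop as written, with a concrete trip count standing in
 * for while(c). */
static void plain(struct P *s, int trips) {
  for (int c = 0; c < trips; c++) {
    if (s->x == s->y)
      rare_p();
  }
}

/* One iteration peeled: the loads in the peeled copy make the
 * loads inside the remaining loop fully redundant on the
 * fall-through edge (modulo the rare() clobber). */
static void peeled(struct P *s, int trips) {
  if (trips > 0) {
    if (s->x == s->y)       /* peeled first iteration */
      rare_p();
    for (int c = 1; c < trips; c++) {
      if (s->x == s->y)
        rare_p();
    }
  }
}
```
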
>
> To be clear, I'm not opposed to fixing some of the issues above. I
> think that's a great idea, I'm just trying to find my way through to
> practical performance improvements with minimal invested effort. I
> probably will in fact fix some of these since they show up in a number
> of other cases as well.
>
>
> Ok, so that's that. Let's move on to why my proposed loop nest
> separation change is a bit scary.
>
> While the loop nest does allow things to be lifted from the inner loop
> to the outer loop - without any of the complex PRE or PGO reasoning
> required above - it suffers from all of the limitations of the
> existing loop optimization passes. In particular, LICM may not
> actually be able to do much if the inner loop is still complicated
> enough to defeat either it's fairly simple alias analysis or fault
> analysis. GVN/PRE is strictly more powerful - in theory if not always
> practice - here than what we have in LICM.
>
> We've also potentially created a loop nest which is harder to analyse
> than the original loop with two latches. Consider this example:
> for(int i = 0; i < 20000;) {
>   i++;
>   if(i % 20 != 0) continue;
>   a[i] = 0;
> }
> (This is not a real example of anything useful. The important point
> is that the early continue is the common case.)
>
> This loop clearly runs exactly 20000 times. ScalarEvolution almost
> certainly gets this right. (I haven't checked.)
>
> After separation, we'd have this:
>
> int i = 0;
> while(true) {
>   while(i < 20000) {
>     i++;
>     if(i % 20 == 0) { // now a loop exit
>       a[i] = 0;
>       goto next;
>     }
>   }
>   break;
>   next:;
> }
>
> It's not clear to me that SE will always be able to tell that both the
> inner and outer loop runs a maximum of 20000 times. (In this
> particular case it might actually optimize the inner loop
> substantially and thus get a more precise answer, but let's ignore
> that.) We could potentially lose information about the loop by doing
> this transformation. Of course, this concern applies to the
> *existing* loop nest separation heuristics too. I don't have any
> particular reason to believe one heuristic is worse than the other,
> but I have no particular evidence either way.
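Setting the ScalarEvolution question aside, the two shapes can at least be checked for semantic equivalence directly. Here is both forms as runnable C, with the trip count and array made concrete stand-ins:

```c
#include <assert.h>
#include <string.h>

#define N 20000

static int a1[N + 1], a2[N + 1];

/* The single loop with two latches (continue is the common case). */
static void one_loop(int *a) {
  for (int i = 0; i < N;) {
    i++;
    if (i % 20 != 0) continue;
    a[i] = 0;
  }
}

/* The separated nest: the rare store path is now an inner-loop
 * exit, and the inner loop restarts from the outer one. */
static void nested(int *a) {
  int i = 0;
  while (1) {
    while (i < N) {
      i++;
      if (i % 20 == 0) { /* now a loop exit */
        a[i] = 0;
        goto next;
      }
    }
    break;
  next:;
  }
}
```

Both versions store to the same 1000 indices (every 20th element up to 20000), so the transformation is behavior-preserving even though the loop structure SE must reason about is quite different.
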
>
> As a final point, the inner loop now has two exiting blocks, not one.
> I know at least some of our loop transforms give up on anything with
> more than one loop exit. I'm less concerned by that since a) we
> should fix them, and b) the cases I've looked at don't look too hard.
>
>
> And that's everything that occurs to me at the moment. I'll probably
> have to append in further responses as I remember more details, I know
> I'm forgetting some pieces.
>
> Philip
>
>
>