[PATCH] D25963: [LoopUnroll] Implement profile-based loop peeling

Thu Oct 27 13:41:42 PDT 2016

> On Oct 26, 2016, at 12:38 PM, Xinliang David Li <davidxl at google.com> wrote:
> 
> 
> 
> On Wed, Oct 26, 2016 at 11:00 AM, Michael Kuperstein <mkuper at google.com <mailto:mkuper at google.com>> wrote:
> mkuper added a comment.
> 
> Hi Gerolf,
> 
> In https://reviews.llvm.org/D25963#579918 <https://reviews.llvm.org/D25963#579918>, @Gerolf wrote:
> 
> > Hi,
> >
> > Could you provide more background on this idea?
> 
> 
> I can't take credit for the idea - this is something GCC already does.
> 
> > What is your motivational use case? When the trip count is low why optimize?

Ok. I was wondering about specific apps...
> 
> The motivational use case is a loop with a low trip count that is nested inside a loop with a high trip count.
> Peeling the inner loop allows further passes in the optimization pipeline simplify the code for the iterations that actually run, making the outer loop's body better.
> Consider something like this:
> 
>   for (int i = 0; i < bignum; ++i) {
>     int ret = 0;
>     for (int j = 0; j < param; ++j)
>       ret += arr[i][j];
>     out[i] = ret;
>   }
> 
> Imagine param is usually 1.
> We can then peel this into:
> 
>   for (int i = 0; i < bignum; ++i) {
>     int ret = 0;
>     if (param == 0)
>       continue;
>     ret += arr[i][0]
>     for (int j = 1; j < param; ++j)
>       ret += arr[i][j];
>     out[i] = ret;
>   }

That makes a lot of sense. I think of peeling as a client/enabler optimization. So some pass(es) - like unrolller, vectorizer, ... needs to have the cost model proper and invoke it when appropriate. Intuitively turning peeling on only based on PGO is like rolling the dice - some wins, some losses. And I”m worried that my intuition turns out to be right.

-Gerolf

> 
> Which then becomes something morally equivalent to:
> 
>   for (int i = 0; i < bignum; ++i) {
>     if (param == 0)
>        continue;
>     if (param == 1) {
>       out[i] = arr[i][0];
>       continue;
>     }
>     ret = arr[i][0];
>     for (int j = 1; j < param; ++j)
>       ret += arr[i][j];
>     out[i] = ret;
>   }
> 
> So, we've improved the common case (param == 1) - we no longer have to initialize ret, we don't have to deal with the inner loop's IV, there's no add, just a direct assignment.
> 
> Beyond that, loop unswitching should also kick in which further improves.
> 
> 
> Another example of loop peeling helping eliminating initialization code:
> 
> int res[4] = { 0, 0, 0, 0};
> 
>  for (....) {
>      res[0] += ... ;
>      res[1] += ... ;
>      ...
>   }
> 
> Peeling one iteration off allows res[] to be directly initialized with first iteration's value.
> 
> Other benefits come from 
> 
> 1) more cross iteration optimizations cross peeled iterations
> 2) better scheduling freedom 
> 3) possibly improved branch prediction
> 4) other special cases where first couple of iterations of the loop do special things (allowing remaining loop body to be cleaner).
> 
> 
> David
> 
> David
>  
> 
> > If the profile is wrong and it actually is a hot loop for a regular/different input set peeling could hurt.
> 
> Sure, but this is true for any profile-guided optimization. PGO is only good with representative inputs. If an optimization was good regardless of input, we'd be doing it for non-PGO builds.
> 
> > There are also side effects on code size, register pressure etc. that could hurt performance.
> 
> Right. But that's not really different from any other kind of loop unrolling. Hence the thresholds, etc.
> 
> > Thanks
> > Gerolf
> 
> Thanks,
> Michael
> 
> 
> 
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:87
> +    VMap[*BB] = NewBB;
> +    if (Header == *BB)
> +      InsertTop->getTerminator()->setSuccessor(0, NewBB);
> ----------------
> davidxl wrote:
> > Can this be moved outside the loop?
> >
> > assert(VMap[Header]);
> > InsertTop->getTerminator()->setSuccessor(0, VMap[Header]);
> Right, both this and the Latch handling should be moved outside the loop, thanks.
> 
> 
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:90
> +
> +    if (Latch == *BB) {
> +      VMap.erase((*BB)->getTerminator());
> ----------------
> davidxl wrote:
> > This can be handled outside the loop too.
> Right, thanks.
> 
> 
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:91
> +    if (Latch == *BB) {
> +      VMap.erase((*BB)->getTerminator());
> +      BranchInst *LatchBR = cast<BranchInst>(NewBB->getTerminator());
> ----------------
> davidxl wrote:
> > What this this erase do?
> Nothing, nice catch!
> (It's stale - it's needed when you replace LatchBR instead of modifying it in-place.)
> 
> 
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:95
> +      // If this is the last iteration, branch past the loop, otherwise
> +      // branch to the next iteration (which may itself either be peeled off,
> +      // or the loop's preheader)
> ----------------
> davidxl wrote:
> > Is this a stale comment?
> No, but I guess it's not clear.
> 
> Let's say we're peeling off K iterations.
> 
> For iteration J in 1..K-1, we want the branch that terminates the copy of the latch to be:
> if (cond) goto header(J+1) else goto exit
> 
> For iteration K, we want to set this branch to be:
> if (cond) goto new-ph else goto exit.
> 
> Here, new-ph is the preheader of the new loop (that is, the loop that handles iterations >= K+1). Technically, this preheader should be empty, and only contains a branch to the loop header - the only reason it exists is to keep the loop in canonical form.
> Does this make sense? If it does, I'll try to improve the comment.
> 
> 
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:101
> +      // We no longer know anything about the branch probability.
> +      LatchBR->setMetadata(LLVMContext::MD_prof, nullptr);
> +    }
> ----------------
> davidxl wrote:
> > Why? I think we should update the branch probability here -- it depends on the what iteration of the peeled clone. If peel count < average/estimated trip count, then each peeled iteration should be more biased towards fall through. If peel_count == est trip_count, then the last peel iteration should be biased toward exit.
> You're right, it's not that we don't know anything - but we don't know enough. I'm not sure how to attach a reasonable number to this, without knowing the distribution.
> Do you have any suggestions? The trivial option would be to assume an extremely narrow distribution (the loop always exits after exactly K iterations), but that would mean having an extreme bias for all of the branches, and I'm not sure that's wise.
> 
> 
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:220
> +  }
> +
> +  // Now adjust the phi nodes in the loop header to get their initial values
> ----------------
> davidxl wrote:
> > The profile meta data of the original loop's back branch should be adjusted too.
> Right, I missed that, thanks.
> But, as above - I'm not sure by how much to adjust it.
> 
> 
> https://reviews.llvm.org/D25963 <https://reviews.llvm.org/D25963>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20161027/6aed36f3/attachment.html>