[PATCH] D25963: [LoopUnroll] Implement profile-based loop peeling
Gerolf Hoflehner via llvm-commits
llvm-commits at lists.llvm.org
Thu Oct 27 13:41:42 PDT 2016
> On Oct 26, 2016, at 12:38 PM, Xinliang David Li <davidxl at google.com> wrote:
>
>
>
> On Wed, Oct 26, 2016 at 11:00 AM, Michael Kuperstein <mkuper at google.com <mailto:mkuper at google.com>> wrote:
> mkuper added a comment.
>
> Hi Gerolf,
>
> In https://reviews.llvm.org/D25963#579918 <https://reviews.llvm.org/D25963#579918>, @Gerolf wrote:
>
> > Hi,
> >
> > Could you provide more background on this idea?
>
>
> I can't take credit for the idea - this is something GCC already does.
>
> > What is your motivational use case? When the trip count is low why optimize?
Ok. I was wondering about specific apps...
>
> The motivational use case is a loop with a low trip count that is nested inside a loop with a high trip count.
> Peeling the inner loop allows further passes in the optimization pipeline simplify the code for the iterations that actually run, making the outer loop's body better.
> Consider something like this:
>
> for (int i = 0; i < bignum; ++i) {
> int ret = 0;
> for (int j = 0; j < param; ++j)
> ret += arr[i][j];
> out[i] = ret;
> }
>
> Imagine param is usually 1.
> We can then peel this into:
>
> for (int i = 0; i < bignum; ++i) {
> int ret = 0;
> if (param == 0)
> continue;
> ret += arr[i][0]
> for (int j = 1; j < param; ++j)
> ret += arr[i][j];
> out[i] = ret;
> }
That makes a lot of sense. I think of peeling as a client/enabler optimization. So some pass(es) - like unrolller, vectorizer, ... needs to have the cost model proper and invoke it when appropriate. Intuitively turning peeling on only based on PGO is like rolling the dice - some wins, some losses. And I”m worried that my intuition turns out to be right.
-Gerolf
>
> Which then becomes something morally equivalent to:
>
> for (int i = 0; i < bignum; ++i) {
> if (param == 0)
> continue;
> if (param == 1) {
> out[i] = arr[i][0];
> continue;
> }
> ret = arr[i][0];
> for (int j = 1; j < param; ++j)
> ret += arr[i][j];
> out[i] = ret;
> }
>
> So, we've improved the common case (param == 1) - we no longer have to initialize ret, we don't have to deal with the inner loop's IV, there's no add, just a direct assignment.
>
> Beyond that, loop unswitching should also kick in which further improves.
>
>
> Another example of loop peeling helping eliminating initialization code:
>
> int res[4] = { 0, 0, 0, 0};
>
> for (....) {
> res[0] += ... ;
> res[1] += ... ;
> ...
> }
>
> Peeling one iteration off allows res[] to be directly initialized with first iteration's value.
>
> Other benefits come from
>
> 1) more cross iteration optimizations cross peeled iterations
> 2) better scheduling freedom
> 3) possibly improved branch prediction
> 4) other special cases where first couple of iterations of the loop do special things (allowing remaining loop body to be cleaner).
>
>
> David
>
> David
>
>
> > If the profile is wrong and it actually is a hot loop for a regular/different input set peeling could hurt.
>
> Sure, but this is true for any profile-guided optimization. PGO is only good with representative inputs. If an optimization was good regardless of input, we'd be doing it for non-PGO builds.
>
> > There are also side effects on code size, register pressure etc. that could hurt performance.
>
> Right. But that's not really different from any other kind of loop unrolling. Hence the thresholds, etc.
>
> > Thanks
> > Gerolf
>
> Thanks,
> Michael
>
>
>
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:87
> + VMap[*BB] = NewBB;
> + if (Header == *BB)
> + InsertTop->getTerminator()->setSuccessor(0, NewBB);
> ----------------
> davidxl wrote:
> > Can this be moved outside the loop?
> >
> > assert(VMap[Header]);
> > InsertTop->getTerminator()->setSuccessor(0, VMap[Header]);
> Right, both this and the Latch handling should be moved outside the loop, thanks.
>
>
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:90
> +
> + if (Latch == *BB) {
> + VMap.erase((*BB)->getTerminator());
> ----------------
> davidxl wrote:
> > This can be handled outside the loop too.
> Right, thanks.
>
>
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:91
> + if (Latch == *BB) {
> + VMap.erase((*BB)->getTerminator());
> + BranchInst *LatchBR = cast<BranchInst>(NewBB->getTerminator());
> ----------------
> davidxl wrote:
> > What this this erase do?
> Nothing, nice catch!
> (It's stale - it's needed when you replace LatchBR instead of modifying it in-place.)
>
>
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:95
> + // If this is the last iteration, branch past the loop, otherwise
> + // branch to the next iteration (which may itself either be peeled off,
> + // or the loop's preheader)
> ----------------
> davidxl wrote:
> > Is this a stale comment?
> No, but I guess it's not clear.
>
> Let's say we're peeling off K iterations.
>
> For iteration J in 1..K-1, we want the branch that terminates the copy of the latch to be:
> if (cond) goto header(J+1) else goto exit
>
> For iteration K, we want to set this branch to be:
> if (cond) goto new-ph else goto exit.
>
> Here, new-ph is the preheader of the new loop (that is, the loop that handles iterations >= K+1). Technically, this preheader should be empty, and only contains a branch to the loop header - the only reason it exists is to keep the loop in canonical form.
> Does this make sense? If it does, I'll try to improve the comment.
>
>
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:101
> + // We no longer know anything about the branch probability.
> + LatchBR->setMetadata(LLVMContext::MD_prof, nullptr);
> + }
> ----------------
> davidxl wrote:
> > Why? I think we should update the branch probability here -- it depends on the what iteration of the peeled clone. If peel count < average/estimated trip count, then each peeled iteration should be more biased towards fall through. If peel_count == est trip_count, then the last peel iteration should be biased toward exit.
> You're right, it's not that we don't know anything - but we don't know enough. I'm not sure how to attach a reasonable number to this, without knowing the distribution.
> Do you have any suggestions? The trivial option would be to assume an extremely narrow distribution (the loop always exits after exactly K iterations), but that would mean having an extreme bias for all of the branches, and I'm not sure that's wise.
>
>
> ================
> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:220
> + }
> +
> + // Now adjust the phi nodes in the loop header to get their initial values
> ----------------
> davidxl wrote:
> > The profile meta data of the original loop's back branch should be adjusted too.
> Right, I missed that, thanks.
> But, as above - I'm not sure by how much to adjust it.
>
>
> https://reviews.llvm.org/D25963 <https://reviews.llvm.org/D25963>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20161027/6aed36f3/attachment.html>
More information about the llvm-commits
mailing list