[PATCH] D25963: [LoopUnroll] Implement profile-based loop peeling

Dehao Chen via llvm-commits llvm-commits at lists.llvm.org
Thu Oct 27 14:29:56 PDT 2016


On Thu, Oct 27, 2016 at 1:41 PM, Gerolf Hoflehner via llvm-commits <
llvm-commits at lists.llvm.org> wrote:

>
> On Oct 26, 2016, at 12:38 PM, Xinliang David Li <davidxl at google.com>
> wrote:
>
>
>
> On Wed, Oct 26, 2016 at 11:00 AM, Michael Kuperstein <mkuper at google.com>
> wrote:
>
>> mkuper added a comment.
>>
>> Hi Gerolf,
>>
>> In https://reviews.llvm.org/D25963#579918, @Gerolf wrote:
>>
>> > Hi,
>> >
>> > Could you provide more background on this idea?
>>
>>
>> I can't take credit for the idea - this is something GCC already does.
>>
>> > What is your motivational use case? When the trip count is low why
>> optimize?
>>
>
> Ok. I was wondering about specific apps...
>

This was from a Google-internal application, which has some initialization
code followed by a short-trip-count loop that does "+=" operations on the
initialized value. LLVM tries to vectorize the loop, but as the trip count is
small, vectorization actually hurts performance because of the extra runtime
checks. But with loop peeling, the initialization code can be eliminated and
all taken back edges are transformed into fall-through branches.
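
To make the pattern concrete, here is a minimal sketch of the shape in
question (the names, types, and the typical trip count below are made up
for illustration, not taken from the actual application):

  /* Illustrative only: sum, vals, and n are hypothetical names; assume the
     profile says n is usually 1 or 2. */
  void accumulate(double *sum, const double *vals, int n) {
    *sum = 0.0;                  /* initialization code */
    for (int i = 0; i < n; ++i)  /* short-trip-count "+=" loop */
      *sum += vals[i];
  }

With the profile saying n is almost always 1, peeling one iteration turns
the common case into straight-line code, the store of 0.0 can fold into the
first "+=", and the vectorizer's runtime checks are avoided.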


>
>
>> The motivational use case is a loop with a low trip count that is nested
>> inside a loop with a high trip count.
>> Peeling the inner loop allows further passes in the optimization pipeline
>> to simplify the code for the iterations that actually run, making the outer
>> loop's body better.
>> Consider something like this:
>>
>>   for (int i = 0; i < bignum; ++i) {
>>     int ret = 0;
>>     for (int j = 0; j < param; ++j)
>>       ret += arr[i][j];
>>     out[i] = ret;
>>   }
>>
>> Imagine param is usually 1.
>> We can then peel this into:
>>
>>   for (int i = 0; i < bignum; ++i) {
>>     int ret = 0;
>>     if (param == 0)
>>       continue;
>>     ret += arr[i][0];
>>     for (int j = 1; j < param; ++j)
>>       ret += arr[i][j];
>>     out[i] = ret;
>>   }
>>
>
> That makes a lot of sense. I think of peeling as a client/enabler
> optimization. So some pass(es) - like unrolller, vectorizer, ... needs to
> have the cost model proper and invoke it when appropriate. Intuitively
> turning peeling on only based on PGO is like rolling the dice - some wins,
> some losses. And I'm worried that my intuition turns out to be right.
>

For the unroller/vectorizer, a good performance model is built to set the
unroll/vectorize factor. But these are all based on the assumption that the
trip count is large. If the assumption does not hold, as shown in the
example above, the performance could be worse. With PGO, the
unroller/vectorizer can be more selective, as it knows the trip count and
hotness of the loop.

For loop peeling, it's a different story because the peeling factor is
decided only by the trip count, which is only available from the profile. We
also have a performance model here: if the trip count (derived from the
profile) is too large, peeling will not be beneficial and thus will be
turned off. So the most important factor in deciding on loop peeling is the
trip count. That is why it should only be enabled with PGO.
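
Roughly, the decision amounts to something like the following sketch (the
function and parameter names, and the idea of a fixed threshold, are
illustrative, not the actual interface of the patch):

  /* Hypothetical sketch of the profile-driven peeling decision described
     above; not the patch's actual API. */
  unsigned choosePeelCount(unsigned ProfileTripCount, unsigned MaxPeelCount) {
    /* No profile data, or the loop usually runs too many iterations:
       peeling would only grow the code, so do not peel. */
    if (ProfileTripCount == 0 || ProfileTripCount > MaxPeelCount)
      return 0;
    /* Otherwise peel the number of iterations the profile says the loop
       typically executes, so the common case becomes straight-line code. */
    return ProfileTripCount;
  }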

Dehao

>
> -Gerolf
>
>
>> Which then becomes something morally equivalent to:
>>
>>   for (int i = 0; i < bignum; ++i) {
>>     if (param == 0)
>>        continue;
>>     if (param == 1) {
>>       out[i] = arr[i][0];
>>       continue;
>>     }
>>     int ret = arr[i][0];
>>     for (int j = 1; j < param; ++j)
>>       ret += arr[i][j];
>>     out[i] = ret;
>>   }
>>
>> So, we've improved the common case (param == 1) - we no longer have to
>> initialize ret, we don't have to deal with the inner loop's IV, there's no
>> add, just a direct assignment.
>>
>
> Beyond that, loop unswitching should also kick in, which improves the code further.
>
>
> Another example of loop peeling helping to eliminate initialization code:
>
> int res[4] = { 0, 0, 0, 0};
>
>  for (....) {
>      res[0] += ... ;
>      res[1] += ... ;
>      ...
>   }
>
> Peeling one iteration off allows res[] to be directly initialized with the
> first iteration's values.
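>
> A minimal sketch of what that looks like (elem() and n are illustrative
> stand-ins, and it assumes the loop runs at least once):
>
>   /* Hypothetical: elem(i, k) stands for whatever iteration i adds to
>      res[k]; assumes n >= 1. */
>   extern int elem(int i, int k);
>
>   void accumulate(int n, int res[4]) {
>     /* Peeled first iteration: res[] is initialized directly from
>        iteration 0, so the separate zero-initialization goes away. */
>     for (int k = 0; k < 4; ++k)
>       res[k] = elem(0, k);
>     /* Remaining iterations keep the original "+=" form. */
>     for (int i = 1; i < n; ++i)
>       for (int k = 0; k < 4; ++k)
>         res[k] += elem(i, k);
>   }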
>
> Other benefits come from
>
> 1) more cross-iteration optimizations across peeled iterations
> 2) better scheduling freedom
> 3) possibly improved branch prediction
> 4) other special cases where first couple of iterations of the loop do
> special things (allowing remaining loop body to be cleaner).
>
>
> David
>
>
>>
>> > If the profile is wrong and it actually is a hot loop for a
>> regular/different input set peeling could hurt.
>>
>> Sure, but this is true for any profile-guided optimization. PGO is only
>> good with representative inputs. If an optimization was good regardless of
>> input, we'd be doing it for non-PGO builds.
>>
>> > There are also side effects on code size, register pressure etc. that
>> could hurt performance.
>>
>> Right. But that's not really different from any other kind of loop
>> unrolling. Hence the thresholds, etc.
>>
>> > Thanks
>> > Gerolf
>>
>> Thanks,
>> Michael
>>
>>
>>
>> ================
>> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:87
>> +    VMap[*BB] = NewBB;
>> +    if (Header == *BB)
>> +      InsertTop->getTerminator()->setSuccessor(0, NewBB);
>> ----------------
>> davidxl wrote:
>> > Can this be moved outside the loop?
>> >
>> > assert(VMap[Header]);
>> > InsertTop->getTerminator()->setSuccessor(0, VMap[Header]);
>> Right, both this and the Latch handling should be moved outside the loop,
>> thanks.
>>
>>
>> ================
>> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:90
>> +
>> +    if (Latch == *BB) {
>> +      VMap.erase((*BB)->getTerminator());
>> ----------------
>> davidxl wrote:
>> > This can be handled outside the loop too.
>> Right, thanks.
>>
>>
>> ================
>> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:91
>> +    if (Latch == *BB) {
>> +      VMap.erase((*BB)->getTerminator());
>> +      BranchInst *LatchBR = cast<BranchInst>(NewBB->getTerminator());
>> ----------------
>> davidxl wrote:
>> > What does this erase do?
>> Nothing, nice catch!
>> (It's stale - it's needed when you replace LatchBR instead of modifying
>> it in-place.)
>>
>>
>> ================
>> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:95
>> +      // If this is the last iteration, branch past the loop, otherwise
>> +      // branch to the next iteration (which may itself either be peeled
>> off,
>> +      // or the loop's preheader)
>> ----------------
>> davidxl wrote:
>> > Is this a stale comment?
>> No, but I guess it's not clear.
>>
>> Let's say we're peeling off K iterations.
>>
>> For iteration J in 1..K-1, we want the branch that terminates the copy of
>> the latch to be:
>> if (cond) goto header(J+1) else goto exit
>>
>> For iteration K, we want to set this branch to be:
>> if (cond) goto new-ph else goto exit.
>>
>> Here, new-ph is the preheader of the new loop (that is, the loop that
>> handles iterations >= K+1). Technically, this preheader should be empty
>> and contain only a branch to the loop header - the only reason it exists
>> is to keep the loop in canonical form.
>> Does this make sense? If it does, I'll try to improve the comment.
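>>
>> In C-like terms, for K = 2 the peeled code has roughly this shape (the
>> labels and the summing body are invented purely for illustration):
>>
>>   int sum(const int *arr, int n) {
>>     int ret = 0;
>>     if (n <= 0) goto exit;      /* loop guard */
>>     ret += arr[0];              /* peeled iteration 1 */
>>     if (n <= 1) goto exit;      /* latch copy 1: next peeled iteration or exit */
>>     ret += arr[1];              /* peeled iteration 2 (the last one peeled) */
>>     if (n <= 2) goto exit;      /* latch copy 2: new-ph or exit */
>>     /* new-ph would sit here: empty, it only keeps the remaining loop in
>>        canonical form. */
>>     for (int j = 2; j < n; ++j) /* the loop handling iterations >= K+1 */
>>       ret += arr[j];
>>   exit:
>>     return ret;
>>   }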
>>
>>
>> ================
>> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:101
>> +      // We no longer know anything about the branch probability.
>> +      LatchBR->setMetadata(LLVMContext::MD_prof, nullptr);
>> +    }
>> ----------------
>> davidxl wrote:
>> > Why? I think we should update the branch probability here -- it depends
>> on which iteration of the peeled clone it is. If peel count <
>> average/estimated trip count, then each peeled iteration should be more
>> biased towards fall through. If peel_count == est trip_count, then the last
>> peel iteration should be biased toward exit.
>> You're right, it's not that we don't know anything - but we don't know
>> enough. I'm not sure how to attach a reasonable number to this, without
>> knowing the distribution.
>> Do you have any suggestions? The trivial option would be to assume an
>> extremely narrow distribution (the loop always exits after exactly K
>> iterations), but that would mean having an extreme bias for all of the
>> branches, and I'm not sure that's wise.
>>
>>
>> ================
>> Comment at: lib/Transforms/Utils/LoopUnrollPeel.cpp:220
>> +  }
>> +
>> +  // Now adjust the phi nodes in the loop header to get their initial
>> values
>> ----------------
>> davidxl wrote:
>> > The profile metadata of the original loop's back branch should be
>> adjusted too.
>> Right, I missed that, thanks.
>> But, as above - I'm not sure by how much to adjust it.
>>
>>
>> https://reviews.llvm.org/D25963
>
>
>

