[PATCH] D25963: [LoopUnroll] Implement profile-based loop peeling

Hal Finkel via llvm-commits llvm-commits at lists.llvm.org
Wed Oct 26 14:07:31 PDT 2016


hfinkel added a comment.

In https://reviews.llvm.org/D25963#579367, @mkuper wrote:

> In https://reviews.llvm.org/D25963#579328, @hfinkel wrote:
>
> > As a high-level comment, it would be nice to also have loop metadata to specify a typical trip count (or trip counts).
> >
> > Intel, for example, has (https://software.intel.com/en-us/node/524502):
> >
> >   #pragma loop_count(n)
> >   
> >
> > which asks the optimizer to optimize for a trip count of n. Moreover, and perhaps more importantly, it also supports:
> >
> >   #pragma loop_count(n1, n2, ...)
> >   
> >
> > which asks for specializations for trip counts n1, n2, etc.
> >
> > Also supported by Intel's compiler is:
> >
> >   #pragma loop_count min(n),max(n),avg(n)
> >   
>
>
> I agree this would be nice, but I think it's somewhat orthogonal.
>  We can start with an implementation of "estimated trip count" that relies on branch weights, and refine to use more specialized metadata if/when we have it.


Agreed.
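
For reference, the documented form reads at the source level something like this (illustrative only: the spelling is Intel's, the names are placeholders, and as far as I know Clang has no equivalent pragma today):

  /* Hypothetical use of Intel's documented loop_count hint on a loop whose
     trip count is usually small; a, b, and n are made-up names. */
  void scaled_add(float *a, const float *b, int n) {
  #pragma loop_count min(1),max(8),avg(4)
    for (int i = 0; i < n; ++i)
      a[i] += b[i];
  }

If we ever grow such a hint, lowering it to loop metadata that the peeling heuristic can read seems like the natural path, but, as you say, that can come later.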

> 
> 
>> FWIW, obviously part of the problem with the average is that you might miss the common trip counts. A loop that is generally executed with a trip count of 3 or 5 might end up with an average near 4; I'm not sure what the best thing would be to do in that case.
> 
> Right, but at least for sampling-based PGO, I think average is the best we're going to get. (Instrumentation can probably do better, and user hints certainly can).
>  I'm not entirely sure this is a problem, though. We want to optimize for the common case, and I think the average gives us that - in the "0.5 * 3 + 0.5 * 5" case, if we peel off 4 iterations, then nearly 90% (7 of every 8) of the dynamically executed iterations will hit the peeled-off section - all 3 iterations of the "3 trips" case, and 4 of the 5 iterations of the "5 trips" case. That is hopefully better than leaving the loop as is.

I agree. Thanks for explaining this, because I did not understand what was happening. I thought that you were peeling off a fixed number of iterations as a single block; you're not. This gives a different performance vs. applicability tradeoff, and I think it probably makes more sense for PGO-driven information.
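
To make sure I now have the right picture: the peeled iterations are laid out one after another, each guarded by its own exit test, rather than as one block guarded by a single "at least 4 trips" check. Roughly, at the source level (only a sketch of the control-flow shape; the patch of course operates on IR, and body() and n are placeholders):

  /* Sketch: peeling 4 iterations of "for (i = 0; i < n; ++i) body(i);".
     Each peeled copy keeps its own exit check, so a run with a trip count
     of 3 executes body(0..2) and exits without reaching the residual loop. */
  void body(int i);                       /* the original loop body */

  void peeled(int n) {
    if (n > 0) { body(0);
      if (n > 1) { body(1);
        if (n > 2) { body(2);
          if (n > 3) { body(3);
            for (int i = 4; i < n; ++i)   /* residual, now colder, loop */
              body(i);
          }
        }
      }
    }
  }

That shape is why peeling to the average still pays off for the mixed 3/5 case: the common short trips fall entirely into the straight-line copies.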


https://reviews.llvm.org/D25963




