[PATCH] D25963: [LoopUnroll] Implement profile-based loop peeling
Hal Finkel via llvm-commits
llvm-commits at lists.llvm.org
Wed Oct 26 14:07:31 PDT 2016
hfinkel added a comment.
In https://reviews.llvm.org/D25963#579367, @mkuper wrote:
> In https://reviews.llvm.org/D25963#579328, @hfinkel wrote:
>
> > As a high-level comment, it would be nice to also have loop metadata to specify a typical trip count (or trip counts).
> >
> > Intel, for example, has (https://software.intel.com/en-us/node/524502):
> >
> > #pragma loop_count(n)
> >
> >
> > which asks the optimizer to optimize for a trip count of n. Moreover, and perhaps more importantly, it also supports:
> >
> > #pragma loop_count(n1, n2, ...)
> >
> >
> > which asks for specializations for trip counts n1, n2, etc.
> >
> > Also supported by Intel's compiler is:
> >
> > #pragma loop_count min(n),max(n),avg(n)
> >
>
>
> I agree this would be nice, but I think it's somewhat orthogonal.
> We can start with an implementation of "estimated trip count" that relies on branch weights, and refine to use more specialized metadata if/when we have it.
Agreed.
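For reference, the Intel pragmas quoted above would be used roughly like this; the function, array, and trip counts below are made up for illustration, and whether we would expose this as a source-level pragma or only as loop metadata is still an open question:

  // Illustrative only: hint that this loop usually runs 3 or 5 times
  // (syntax per the Intel documentation linked above).
  double sum_first_n(const double *a, int n) {
    double sum = 0.0;
  #pragma loop_count(3, 5)
    for (int i = 0; i < n; ++i)
      sum += a[i];
    return sum;
  }

  // Alternatively, give the optimizer min/max/average estimates:
  //   #pragma loop_count min(1),max(8),avg(4)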
>
>
>> FWIW, obviously part of the problem with the average is that you might miss the common trip counts. A loop that is generally executed with a trip count of 3 or 5 might end up with an average near 4; I'm not sure what the best thing would be to do in that case.
>
> Right, but at least for sampling-based PGO, I think the average is the best we're going to get. (Instrumentation can probably do better, and user hints certainly can.)
> I'm not entirely sure this is a problem, though. We want to optimize for the common case, and I think the average gives us that. In the "0.5 * 3 + 0.5 * 5" case, if we peel off 4 iterations, then 87.5% of the dynamically executed iterations (3.5 of the 4 expected per loop entry) will hit the peeled-off section: all iterations of the "3 trips" case, and 4 of the 5 iterations of the "5 trips" case. That is hopefully better than leaving the loop as is.
I agree. Thanks for explaining this, because I did not understand what was happening: I thought that you were peeling off a fixed number of iterations as a single block. You're not. This will give a different performance vs. applicability tradeoff; I think it probably makes more sense for PGO-driven information.
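To spell out my corrected understanding, here is a rough C++ sketch of the shape of the transformation; this is not the patch's actual codegen, and body, the peel count of 4, and the 3-or-5 trip counts are just the example numbers from above:

  // Original loop; the profile says n is 3 about half the time, 5 otherwise.
  static void body(int i) { (void)i; /* loop body */ }

  void original(int n) {
    for (int i = 0; i < n; ++i)
      body(i);
  }

  // After peeling 4 iterations: each peeled copy keeps its own exit check,
  // so a 3-trip execution leaves after the third copy and never reaches
  // the residual loop at all.
  void peeled(int n) {
    int i = 0;
    if (i >= n) return; body(i++); // peeled iteration 1
    if (i >= n) return; body(i++); // peeled iteration 2
    if (i >= n) return; body(i++); // peeled iteration 3
    if (i >= n) return; body(i++); // peeled iteration 4
    for (; i < n; ++i)             // residual loop, rarely entered
      body(i);
  }

  // Coverage under the 50/50 example: 0.5*3 + 0.5*4 = 3.5 of the
  // 0.5*3 + 0.5*5 = 4 expected iterations land in the peeled copies (87.5%).

Seen this way, short executions still benefit even though the peel count overshoots them, which is why targeting the average seems reasonable.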
https://reviews.llvm.org/D25963