[llvm] r265337 - Enable unroll for constant bound loops when TripCount is not modulo of unroll factor, reducing it to maximum power-of-2 that satisfies threshold limit.

Mon Apr 4 16:17:43 PDT 2016

That sounds reasonable to me.

—Owen

> On Apr 4, 2016, at 2:55 PM, Evgeny Stupachenko <evstupac at gmail.com> wrote:
> 
>> I don’t follow what your proposed change is?
> Do not create a fixup loop for "lane-predicated" architectures.
> I agree that for this type of architecture we should keep the loop
> iterations inside the loop.
> 
> It looks like the compromise is to enable unrolling for constant-bound
> loops whose TripCount is not a multiple of the unroll factor only when
> "-unroll-runtime" is true.
> If that is OK, I'll prepare the corresponding patch today.
> 
> On Mon, Apr 4, 2016 at 2:31 PM, Owen Anderson <resistor at mac.com> wrote:
>> I don’t follow what your proposed change is?
>> 
>> —Owen
>> 
>>> On Apr 4, 2016, at 2:28 PM, Evgeny Stupachenko <evstupac at gmail.com> wrote:
>>> 
>>> Sounds reasonable. Why not include the check? By default, unrolling does
>>> not generate a fixup loop (even in later passes it turns out to be a
>>> number of peeled iterations, not a loop).
>>> 
>>> On Mon, Apr 4, 2016 at 2:19 PM, Owen Anderson <resistor at mac.com> wrote:
>>>> More generally, for any lane-predicated architecture, the introduction of a
>>>> fixup loop is generally a bad idea.
>>>> 
>>>> —Owen
>>>> 
>>>> On Apr 4, 2016, at 1:52 PM, via llvm-commits <llvm-commits at lists.llvm.org>
>>>> wrote:
>>>> 
>>>> Oh, absolutely; it seems reasonable for runtime unrolling (since usually
>>>> with runtime unrolling you can’t avoid a fixup loop at all unless you
>>>> actually know the trip count is divisible by some N, which seems fairly
>>>> unlikely). I can see partial unrolling being useful in this way in some
>>>> cases, but it’s not what we want (and not what it did before); do you need
>>>> partial unrolling to work this way for your target?
>>>> 
>>>> —escha
>>>> 
>>>> On Apr 4, 2016, at 1:45 PM, Evgeny Stupachenko <evstupac at gmail.com> wrote:
>>>> 
>>>> if (Count <= 1 && Unrolling == Runtime) {
>>>> 
>>>> Of course, I mean putting this check somewhere else in the code.
>>>> Just allow this type of unrolling when runtime unrolling is enabled.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Apr 4, 2016 at 1:42 PM, Fiona Glaser <fglaser at apple.com> wrote:
>>>> 
>>>> 
>>>> On Apr 4, 2016, at 1:41 PM, via llvm-commits <llvm-commits at lists.llvm.org>
>>>> wrote:
>>>> 
>>>> 
>>>> On Apr 4, 2016, at 1:35 PM, Evgeny Stupachenko <evstupac at gmail.com> wrote:
>>>> 
>>>> Before the patch the loop
>>>> for (i = 0; i < 15; i++)
>>>> loop_body;
>>>> was not unrolled,
>>>> 
>>>> the loop
>>>> for (i = 0; i < 16; i++)
>>>> loop_body;
>>>> was unrolled
>>>> 
>>>> the loop
>>>> for (i = 0; i < n; i++)
>>>> loop_body;
>>>> was unrolled
>>>> 
>>>> Why should we avoid unrolling if the threshold lets us unroll a loop?
>>>> The point of unrolling (right now) is to reduce induction-variable and
>>>> compare/branch costs.
>>>> 
>>>> One possible solution is to add " && Unrolling == Runtime":
>>>> 
>>>>  if (Count <= 1 && Unrolling == Runtime) {
>>>> 
>>>> 
>>>> 
>>>> What do you mean? That code is already under this branch:
>>>> 
>>>> if (Unrolling == Partial) {
>>>> 
>>>> So it would never trigger, if I’m reading this right.
>>>> 
>>>> However, I still do not understand why we should avoid unrolling if
>>>> the threshold lets us unroll a loop.
>>>> For the cases where unrolling is unprofitable, there should be
>>>> corresponding heuristics.
>>>> What is your case?
>>>> 
>>>> 
>>>> You’ve changed the definition of “partial” unrolling from what it did
>>>> before, which makes me somewhat nervous in general. Our specific use-case for
>>>> partial unrolling is that GPUs want to reduce latency, so a big loop with
>>>> high-latency memory operations in it (too big to fully unroll) should be
>>>> partially unrolled to trade some number of registers for some amount of
>>>> latency reduction. However, suppose the following case occurs:
>>>> 
>>>> Trip count: 15
>>>> Max unroll count: 8
>>>> 
>>>> This means we unroll 8 times, then create a fixup loop that runs 7 times
>>>> afterwards. Now we have the absolute worst of both worlds: our register
>>>> count has gone up a lot because of the unroll, but we still have a lot of
>>>> latency because of the fixup loop, so we’ll probably end up losing
>>>> performance overall.
>>>> 
>>>> —escha
>>>> 
>>>> 
>>>> Corrected example:
>>>> 
>>>> Trip count: 13
>>>> Max unroll count: 8
>>>> Fixup loop size: 5
>>>> 
>>>> (The 15 case wouldn’t happen because it’d do a modulo-unroll of size 5).
>>>> 
>>>> —escha
>>>> 
>>>> 
>>
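
The trip-count arithmetic discussed in this thread can be sketched as follows. This is only an illustration under stated assumptions, not the code from r265337 or from the LLVM loop unroller: the names UnrollPlan, largestDivisorAtMost, largestPowerOfTwoAtMost, and planUnroll are hypothetical, and the "largest divisor first" step is inferred from the remark that a trip count of 15 would get a modulo-unroll of size 5. The power-of-2 fallback follows the commit title.

```cpp
#include <algorithm>

// Hypothetical illustration types and helpers; none of these names exist in LLVM.
struct UnrollPlan {
  unsigned count;      // copies of the loop body inside the unrolled loop
  unsigned remainder;  // iterations left over for a fixup loop
};

// Largest c with 1 < c <= maxCount that divides tripCount evenly, or 1 if none.
unsigned largestDivisorAtMost(unsigned tripCount, unsigned maxCount) {
  for (unsigned c = std::min(tripCount, maxCount); c > 1; --c)
    if (tripCount % c == 0)
      return c;
  return 1;
}

// Largest power of two that does not exceed limit (returns 1 for limit < 2).
unsigned largestPowerOfTwoAtMost(unsigned limit) {
  unsigned p = 1;
  while (p <= limit / 2)
    p *= 2;
  return p;
}

// Assumed policy: prefer a count that divides the trip count (no fixup loop);
// otherwise fall back to the largest power of two within the limit and leave
// the remaining iterations to a fixup loop.
UnrollPlan planUnroll(unsigned tripCount, unsigned maxCount) {
  unsigned divisor = largestDivisorAtMost(tripCount, maxCount);
  if (divisor > 1)
    return {divisor, 0};
  unsigned pow2 = largestPowerOfTwoAtMost(std::min(tripCount, maxCount));
  if (pow2 <= 1)
    return {1, 0};  // nothing to unroll
  return {pow2, tripCount % pow2};
}
```

Under these assumptions, planUnroll(15, 8) picks a count of 5 with no fixup loop, while planUnroll(13, 8) picks a count of 8 and leaves 5 iterations for the fixup loop, which is the case the GPU concern above is about.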