[llvm] r265337 - Enable unroll for constant bound loops when TripCount is not modulo of unroll factor, reducing it to maximum power-of-2 that satisfies threshold limit.

Mon Apr 4 13:45:53 PDT 2016

>   if (Count <= 1 && Unrolling == Runtime) {
Sure, I meant adding this check somewhere else in the code.
The idea is to allow this type of unrolling only when runtime unrolling is set.


On Mon, Apr 4, 2016 at 1:42 PM, Fiona Glaser <fglaser at apple.com> wrote:
>
> On Apr 4, 2016, at 1:41 PM, via llvm-commits <llvm-commits at lists.llvm.org>
> wrote:
>
>
> On Apr 4, 2016, at 1:35 PM, Evgeny Stupachenko <evstupac at gmail.com> wrote:
>
> Before the patch the loop
> for (i = 0; i < 15; i++)
>  loop_body;
> was not unrolled,
>
> the loop
> for (i = 0; i < 16; i++)
>  loop_body;
> was unrolled
>
> the loop
> for (i = 0; i < n; i++)
>  loop_body;
> was unrolled
>
> Why should we avoid unrolling if the threshold lets us unroll a loop?
> The point of unrolling (right now) is to reduce induction variable and
> compare/branch costs.
>
> One possible solution is to add " && Unrolling == Runtime":
>
>      if (Count <= 1 && Unrolling == Runtime) {
>
>
>
> What do you mean? That code is already under this branch:
>
>   if (Unrolling == Partial) {
>
> So it would never trigger, if I’m reading this right.
>
> However, I still do not understand why we should avoid unrolling if the
> threshold lets us unroll a loop.
> For the cases where unrolling is unprofitable, there should be
> corresponding heuristics.
> What is your case?
>
>
> You’ve changed the definition of “partial” unrolling from what it did
> before, which makes me somewhat nervous in general. Our specific use-case for
> partial unrolling is that GPUs want to reduce latency, so a big loop with
> high-latency memory operations in it (too big to fully unroll) should be
> partially unrolled to trade some number of registers for some amount of
> latency reduction. However, suppose the following case occurs:
>
> Trip count: 15
> Max unroll count: 8
>
> This means we unroll 8 times, then create a fixup loop that runs 7 times
> afterwards. Now we have the absolute worst of both worlds: our register
> count has gone up a lot because of the unroll, but we still have a lot of
> latency because of the fixup loop, so we’ll probably end up losing
> performance overall.
>
> —escha
>
>
> Corrected example:
>
> Trip count: 13
> Max unroll count: 8
> Fixup loop size: 5
>
> (The 15 case wouldn’t happen because it’d do a modulo-unroll of size 5).
>
> —escha
>
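
For reference, the arithmetic in the examples above can be sketched as follows. This is only an illustration of the behaviour described in this thread and in the commit title (prefer an unroll count that divides the trip count, otherwise fall back to the largest power of 2 under the limit and run the leftover iterations in a fixup loop). It is not the actual LoopUnroll code; the names planUnroll, largestDivisorUpTo, largestPowerOf2UpTo and UnrollPlan are hypothetical, and MaxCount stands in for whatever count the threshold allows.

```cpp
// Hypothetical sketch, not LLVM's implementation.
struct UnrollPlan {
  unsigned Count;           // chosen unroll count
  unsigned FixupIterations; // iterations left for the fixup (remainder) loop
};

// Largest C in [2, MaxCount] with TripCount % C == 0, or 1 if none exists.
// A count found here is a "modulo" unroll: no fixup loop is needed.
unsigned largestDivisorUpTo(unsigned TripCount, unsigned MaxCount) {
  for (unsigned C = MaxCount; C > 1; --C)
    if (TripCount % C == 0)
      return C;
  return 1;
}

// Largest power of 2 that does not exceed MaxCount (1 if MaxCount < 2).
unsigned largestPowerOf2UpTo(unsigned MaxCount) {
  unsigned C = 1;
  while (C * 2 <= MaxCount)
    C *= 2;
  return C;
}

// Assumed policy: try a modulo unroll first, then fall back to a
// power-of-2 count and leave TripCount % Count iterations to a fixup loop.
UnrollPlan planUnroll(unsigned TripCount, unsigned MaxCount) {
  unsigned Divisor = largestDivisorUpTo(TripCount, MaxCount);
  if (Divisor > 1)
    return {Divisor, 0};
  unsigned Count = largestPowerOf2UpTo(MaxCount);
  return {Count, TripCount % Count};
}
```

Under these assumptions, planUnroll(15, 8) picks a count of 5 with no fixup loop, while planUnroll(13, 8) picks a count of 8 and leaves 5 iterations for the fixup loop, which is the case escha is worried about.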