[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops

Sat Aug 17 00:08:13 PDT 2013

At 2013-08-16 22:32:30,"Tobias Grosser" <tobias at grosser.es> wrote:
>>
>> Yes, I have changed the original code to the form you suggested:
>>    for (i
>>        for (j
>>            ...
>>                 x=1
>
>Sorry, I meant
>                  x[0] +=
>

It is interesting that Polly would run much faster if we change the "x=0" to "X[0]=0" or "X[0]+=0 in the nested loops.
First, if the nested loop only contains a statement "X[0]+=0", like this:
// SingleAcc.c
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
  int n = ((argc == 2) ? atoi(argv[1]) : 46);
  int a, b, c, d, e, f, x=0;
  int X[10];
  for (a=0; a<n; a++)
    for (b=0; b<n; b++)
      for (c=0; c<n; c++)
        for (d=0; d<n; d++)
          for (e=0; e<n; e++)
            for (f=0; f<n; f++)
              X[0]+=1;  // could be X[0]+=1, X[0]=1 or x=1
  printf("%d\n", X[0]); // could be X[0] or x
  return 0;
}
Then the scops results would be:

$ clang -O0 -emit-llvm -S SingleAcc.c -o SingleAcc.ll
# Run everything up to the offending pass
$ polly-opt SingleAcc.ll -targetlibinfo -no-aa ... -polly-indvars -o SingleAcc.preopt.ll -S
# Print the scop info:
$ polly-opt -basicaa -polly-prepare -polly-scops SingleAcc.preopt.ll -analyze 
(*Note: you can download SingleAcc.preopt.ll on http://llvm.org/bugs/attachment.cgi?id=11059)

Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end32' in function 'main':
    Context:
    [cond.reload] -> {  : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
    p0: %cond.reload
    Statements {
    Stmt_for_body16
            Domain :=
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and i2 >= 0 and i2 <= -1 + cond.reload and i3 >= 0 and i3 <= -1 + cond.reload and i4 >= 0 and i4 <= -1 + cond.reload and i5 >= 0 and i5 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering :=
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> scattering[0, i0, 0, i1, 0, i2, 0, i3, 0, i4, 0, i5, 0] };
            ReadAccess := 
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] };
            MustWriteAccess := 
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] };
    }

Similarly, the scops result for "X[0]=0" would be:

$ polly-opt -basicaa -polly-prepare -polly-scops SingleAssign.preopt.ll -analyze 
(*Note: you can download SingleAssign.preopt.ll on http://llvm.org/bugs/attachment.cgi?id=11060)

Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end32' in function 'main':
    Context:
    [cond.reload] -> {  : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
    p0: %cond.reload
    Statements {
    Stmt_for_body16
            Domain :=
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and i2 >= 0 and i2 <= -1 + cond.reload and i3 >= 0 and i3 <= -1 + cond.reload and i4 >= 0 and i4 <= -1 + cond.reload and i5 >= 0 and i5 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering :=
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> scattering[0, i0, 0, i1, 0, i2, 0, i3, 0, i4, 0, i5, 0] };
            MustWriteAccess := 
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] };
    }

Unfortunately, the scops for "x=1" is much more complicated as follows:

$ polly-opt -basicaa -polly-prepare -polly-scops nestedloop.preopt.ll -analyze 
(*Note: you can download nestedloop.preopt.ll on http://llvm.org/bugs/attachment.cgi?id=11045)

Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end31' in function 'main':
    Context:
    [cond.reload] -> {  : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
    p0: %cond.reload
    Statements {
    Stmt_for_cond2_preheader
            Domain :=
                [cond.reload] -> { Stmt_for_cond2_preheader[i0] : i0 >= 0 and i0 <= -1 + cond.reload };
            Scattering :=
                [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> scattering[0, i0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess := 
                [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_018_reg2mem[0] };
            MustWriteAccess := 
                [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_018_reload_s2a[0] };
            MustWriteAccess := 
                [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_1_lcssa_reg2mem[0] };
    Stmt_for_cond5_preheader_lr_ph
            Domain :=
                [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] : i0 >= 0 and i0 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering :=
                [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> scattering[0, i0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess := 
                [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> MemRef_x_018_reload_s2a[0] };
            MustWriteAccess := 
                [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> MemRef_x_114_reg2mem[0] };
    Stmt_for_cond5_preheader
            Domain :=
                [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering :=
                [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> scattering[0, i0, 2, i1, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess := 
                [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_114_reg2mem[0] };
            MustWriteAccess := 
                [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_114_reload_s2a[0] };
            MustWriteAccess := 
                [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_2_lcssa_reg2mem[0] };
        [...]
    Stmt_for_inc29
            Domain :=
                [cond.reload] -> { Stmt_for_inc29[i0] : i0 >= 0 and i0 <= -1 + cond.reload };
            Scattering :=
                [cond.reload] -> { Stmt_for_inc29[i0] -> scattering[0, i0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess := 
                [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_1_lcssa_reg2mem[0] };
            MustWriteAccess := 
                [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_1_lcssa_lcssa_reg2mem[0] };
            MustWriteAccess := 
                [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_018_reg2mem[0] };
    Stmt_for_cond_for_end31_crit_edge
            Domain :=
                [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] };
            Scattering :=
                [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> scattering[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess := 
                [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> MemRef_x_1_lcssa_lcssa_reg2mem[0] };
            MustWriteAccess := 
                [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> MemRef_x_0_lcssa_reg2mem[0] };
    }

The compile time of polly-dependence analysis for the three cases would be:
loop with "X[0]+=1":0.130s
loop with "X[0] =1": 0.053s
loop with "x=1":1.037s

>> There is no data dependence at all. However,  it seems that -polly-prepare still introduces a lot of allocs and store instructions.
>>
>>> Meaning one alloc instruction and a single load and store in the
>>> innermost loop, without any non-induction variable PHI nodes.
>>>
>>> When implementing this in C, LLVM may introduce again PHI nodes. So it
>>> may possibly be necessary to write this directly in LLVM-IR.
>
>This is the important one. I wonder what the performance of the 
>dependence analysis is for increasing loop depth, if we have only a 
>single statement, that just reads and writes from the very same memory 
>location, but that does not have all the mess from the 
>non-induction-variable PHI nodes.

If we change the "x=1" to "X[0]+=1", then there would be only one "store" instruction and one "load" instructions in the loop no matter how much the depth of the loop nest. In such a case, the compile time would increase exponentially as the depth of the loop nest increases:

nest-2: 0.063s
nest-4: 0.110s
nest-6: 0.221s
nest-8: 0.465s
nest-10: 0.893s
nest-12: 1.710s

Cheers,
Star Tan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130817/6307b94f/attachment.html>