[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops

Tobias Grosser tobias at grosser.es
Sat Aug 17 08:22:32 PDT 2013


On 08/17/2013 12:08 AM, Star Tan wrote:
> At 2013-08-16 22:32:30,"Tobias Grosser" <tobias at grosser.es> wrote:
>>>
>>> Yes, I have changed the original code to the form you suggested:
>>>     for (i
>>>         for (j
>>>             ...
>>>                  x=1
>>
>> Sorry, I meant
>>                   x[0] +=
>>
>
>
> It is interesting that Polly runs much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.

Indeed, this is interesting.

> First, if the nested loop only contains a statement "X[0]+=1", like this:
[...]

> The compile time of polly-dependence analysis for the three cases is:
> loop with "X[0]+=1": 0.130s
> loop with "X[0] =1": 0.053s
> loop with "x=1":     1.037s

OK. So it seems that when reductions are converted to PHI nodes, 
compiles get a lot slower. This is possibly due to the increased number 
of statements as well as the increased number of data locations modelled.

I see two directions we can approach this problem with:

1) Optimize isl to work faster

We can check whether there are optimization opportunities in isl that 
would make this code faster.

2) Generate a less complicated SCoP

If possible, detect that the SCoP can be modelled with fewer statements 
and fewer memory locations, and do so.

I did not look into either of the two, so those are basically open 
questions (that may not be easy to solve). If you are interested, you
may want to take a look.

>>> There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of alloc and store instructions.
>>>
>>>> Meaning one alloc instruction and a single load and store in the
>>>> innermost loop, without any non-induction variable PHI nodes.
>>>>
>>>> When implementing this in C, LLVM may introduce again PHI nodes. So it
>>>> may possibly be necessary to write this directly in LLVM-IR.
>>
>> This is the important one. I wonder what the performance of the
>> dependence analysis is for increasing loop depth if we have only a
>> single statement that just reads and writes the very same memory
>> location, but does not carry all the mess from the
>> non-induction-variable PHI nodes.
>
>
> If we change the "x=1" to "X[0]+=1", there is only one "store" instruction and one "load" instruction in the loop, no matter how deep the loop nest is. In that case, the compile time increases exponentially as the depth of the loop nest increases:
>
>
> nest-2: 0.063s
> nest-4: 0.110s
> nest-6: 0.221s
> nest-8: 0.465s
> nest-10: 0.893s
> nest-12: 1.710s

OK, so we still see exponential growth with the loop depth, but in 
general we seem to be a lot faster. It is not surprising that the 
complexity grows exponentially with the loop depth. However, seeing that 
even for a 12-level loop nest we spend only about two seconds seems OK.

Cheers,
Tobias
