[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops

Tobias Grosser tobias at grosser.es
Fri Aug 16 07:32:30 PDT 2013


On 08/16/2013 02:42 AM, Star Tan wrote:
> At 2013-08-16 12:44:02,"Tobias Grosser" <tobias at grosser.es> wrote:
>> Hi,
>>
>> I tried to reproduce your findings, but could not do so.
>
>
> Sorry, I did not include all of the code in my previous email because it seemed a little too long and complicated.
> You can refer to the detailed C code and LLVM IR code on http://llvm.org/bugs/show_bug.cgi?id=16843
> There are four attachments for our C code and LLVM IR code:
>
>
> nestedloop.c (http://llvm.org/bugs/attachment.cgi?id=11043): the simplified C code.
> nestedloop.ll (http://llvm.org/bugs/attachment.cgi?id=11044): the basic LLVM IR code.
> nestedloop.preopt.ll (http://llvm.org/bugs/attachment.cgi?id=11045): the preprocessed LLVM IR code.
> nestedloop.prepare.ll (http://llvm.org/bugs/attachment.cgi?id=11046): the LLVM IR code after running the polly-prepare pass.

Thanks, this works for me.

>> # Time with and without preoptimization
>> ~$ time polly-opt -basicaa -polly-prepare -polly-dependences
>> out.preopt.ll -disable-output
>> [..]
>>
>> Within -O3 the overhead is still far from the 60x slowdown we have
>> seen on the other test case. A 2x slowdown for a kernel that we fully
>> analyse and code generate is not optimal, but probably not an issue. The
>> question is whether the increasing depth of the loop nest can yield this
>> 60x slowdown.
>
>
> This is because the depth of the loop nest is only two. If we increase the depth of the loop nest, the compile time increases significantly. I have posted results for the original code, which contains 6 nested loops, at http://llvm.org/bugs/show_bug.cgi?id=16843#c5
>
>
> In fact, the compile time grows exponentially as the depth of the loop nest increases:
> 2-nested: ~2X
> 4-nested: ~10X
> 6-nested: ~50X
> 8-nested: ~200X
> 10-nested: ~700X
> 12-nested: ~2000X

Perfect. This is what I was expecting. Let's see what we can do about it.

>> I think there are two angles we can have a look at:
>>
>> 1) Does -polly-prepare introduce unneeded dependences?
>>
>> This means, could we model the scalar dependences more efficiently than
>> how they are modeled by the array accesses introduced by -polly-prepare.
>
>
> This is really necessary. In fact, there should be no dependences at all in our code example: the loop nest contains only a single, constant assignment "a=1". Unfortunately, Polly still reports very complex data dependences.

Sure, LLVM may even be able to replace this entire loop by a closed form 
expression. However, it is also a good test case to challenge Polly. 
Let's see what we can do.

>> 2) What triggers the 60x slowdown in dependence analysis
>>
>> Is this due to the increasing loop nest depth? Or the increasing number
>> of allocations introduced by -polly-prepare?
>>
> I have found that the compile time grows exponentially as the depth of the loop nest increases. However, increasing the loop nest depth also increases the number of allocations and store instructions.
>
>
>> An interesting experiment would be to see if the dependence analysis
>> runs faster on a loop that uses a single element array to implement the
>> reduction:
>>
>> for (i
>>    for (j
>>       ...
>>         X[0] = ...
>
>
> Yes, I have changed the original code to the form you suggested:
>    for (i
>        for (j
>            ...
>                 x=1

Sorry, I meant
                  x[0] +=


> There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of alloca and store instructions.
>
>
>> Meaning one alloc instruction and a single load and store in the
>> innermost loop, without any non-induction variable PHI nodes.
>>
>> When implementing this in C, LLVM may introduce again PHI nodes. So it
>> may possibly be necessary to write this directly in LLVM-IR.

This is the important one. I wonder what the performance of the 
dependence analysis is for increasing loop depth if we have only a 
single statement that just reads and writes the very same memory 
location, but does not have all the mess from the 
non-induction-variable PHI nodes.

Tobi
