[LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence

Star Tan tanmx_star at yeah.net
Thu Aug 8 18:27:49 PDT 2013


At 2013-08-08 22:28:33,"Tobias Grosser" <tobias at grosser.es> wrote:
>On 08/08/2013 01:29 AM, Star Tan wrote:
>> Hi all,
>>
>>
>> I have summarized the top 10 compiler passes for Polly when compiling LLVM test-ssuite. Results can be viewed on:
>>      https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10
>>
>>
>> Based on the comparison between "clang -O3" and "polly -O3" listed on:
>>      http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14
>
>Please compare against clang -O3 -load LLVMPolly.so, otherwise 
>especially the compile time of the small binaries includes the overhead 
>of loading the Polly shared object file.

In fact, the compile-time overhead of loading the Polly shared object file is very small (usually less than 1%). This overhead can be viewed at:
    http://188.40.87.11:8000/db_default/v4/nts/15?compare_to=14&baseline=14

>> we can see Polly's compile-time overhead mainly results from some expensive Polly passes such as PollyDependence, PollyOptimization and PollyCodegen. In particular, I notice that PollyDependence can lead to significant extra compile-time overhead. Its compile-time percentage for some expensive benchmarks can be summarized as:
>>      nestedloop:  41.4% (Polly - Calculate dependence)
>>      salsa20:       98.5% (Polly - Calculate dependence)
>>      seidel-2d:     72.1% (Polly - Calculate dependence)
>>      multiplies:     54.3% (Polly - Calculate dependence)
>>      Puzzle:         22.8% (Polly - Calculate dependence)
>>
>>
>> As a result, it is critical to improve the PollyDependence pass.  I have previously committed a patch to split the start value from SCEVAddRecExpr (r187728), which improves the PollyDependence pass for some benchmarks as follows:
>>       http://188.40.87.11:8000/db_default/v4/nts/25?baseline=18&compare_to=18
>
>Those are very nice results. Removing 80% of the compile time for the lu 
>benchmark is a very nice outcome.
>
>> Obviously, it is not enough.
>
>Sure, there is still plenty left to be optimised. However, when Polly 
>detects a code region that it can optimise it seems fair that we spend 
>some time in analysing and optimising it. Especially if our 
>transformation improves the run-time performance.
>
>> I have investigated the salsa20 benchmark, in which the PollyDependence pass consumes 98.5% of total compile time. The key code of the salsa20 benchmark consists of a series of array operations:
>>    #define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
>>    for (i = 20;i > 0;i -= 2) {
>>      x[ 4] ^= R(x[ 0]+x[12], 7);  x[ 8] ^= R(x[ 4]+x[ 0], 9);
>>      x[12] ^= R(x[ 8]+x[ 4],13);  x[ 0] ^= R(x[12]+x[ 8],18);
>>      ....
>>      x[14] ^= R(x[13]+x[12],13);  x[15] ^= R(x[14]+x[13],18);
>>    }
>> which would be translated into a series of affine functions:
>>        ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[12] };
>>        ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[4] };
>>        MustWriteAccess :=    { Stmt_for_body5[i0] -> MemRef_x[4] };
>>        ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[0] };
>>        ...
>> Consequently, the PollyDependence pass would produce very complex Read/Write/MayWrite as follows:
>>
>>
>>        Read: { Stmt_for_body[i0] -> MemRef_in[i0] : i0 >= 0 and i0 <=
>> 15; Stmt_for_body5[i0] -> MemRef_x[15] : (i0 >= 0 and i0 <= 9) or (i0
>> >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or
>> (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] ->
>> MemRef_x[9] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >=
>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] ->
>> MemRef_x[4] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >=
>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] ->
>> MemRef_x[11] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >=
>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] ->
>> MemRef_x[2] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >=
>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] ->
>> MemRef_x[6] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >=
>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] ->
>> MemRef_x[14] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >=
>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] ->
>> MemRef_x[8] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >=
>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] ->
>> MemRef_x[10] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >=
>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0
>> >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[0] : (i0 >= 0 and i0
>> <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and
>> i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
>> Stmt_for_body5[i0] -> MemRef_x[13] : (i0 >= 0 and i0 <= 9) or (i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
>> Stmt_for_body5[i0] -> MemRef_x[1] : (i0 >= 0 and i0 <= 9) or (i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
>> Stmt_for_body5[i0] -> MemRef_x[3] : (i0 >= 0 and i0 <= 9) or (i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
>> Stmt_for_body5[i0] -> MemRef_x[7] : (i0 >= 0 and i0 <= 9) or (i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
>> Stmt_for_body5[i0] -> MemRef_x[12] : (i0 >= 0 and i0 <= 9) or (i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
>> Stmt_for_body5[i0] -> MemRef_x[5] : (i0 >= 0 and i0 <= 9) or (i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0
>> >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) }
>>        Write: ...
>>
>>
>> As we can see, the reason why PollyDependence leads to expensive compile time lies in the very complex Read/Write/MayWrite structures.  In fact, we can run isl_union_map_coalesce() on the Read/MayWrite/MustWrite maps before feeding them into the ISL dependence analysis:
>
>Very nice.
>
>> diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp
>> index 9f918f3..39c3fb6 100644
>> --- a/lib/Analysis/Dependences.cpp
>> +++ b/lib/Analysis/Dependences.cpp
>> @@ -95,6 +95,10 @@ void Dependences::calculateDependences(Scop &S) {
>>    collectInfo(S, &Read, &Write, &MayWrite, &Schedule);
>>
>> +  Read = isl_union_map_coalesce(Read);
>> +  Write = isl_union_map_coalesce(Write);
>> +  MayWrite = isl_union_map_coalesce(MayWrite);
>> +
>>    DEBUG(dbgs() << "Read: " << Read << "\n"
>
>This patch is unreadable in the mail. However, the one you submitted 
>looked good and was committed.
>
>> With this patch, we can reduce the compile-time percentage of PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage of PollyDependence for other benchmarks, such as "nestedloop", is still very high.
>
>It would be good to get an up-to-date comparison with the latest patch 
>having gone into Polly.

Yes, you can view the comparison on:
    http://188.40.87.11:8000/db_default/v4/nts/26?compare_to=25&baseline=25

Results show that this patch is very effective for several benchmarks, such as salsa20 (compile time reduced by 97.72%), Obsequi (54.62%), seidel-2d (48.64%), and telecomm-gsm (33.71%).
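To illustrate why coalescing helps so much here: in the salsa20 Read map above, each MemRef_x[k] carries four or six identical disjuncts like "(i0 >= 0 and i0 <= 9)", and the dependence computation pays for every one of them. The toy Python sketch below (purely illustrative, not the isl library — `coalesce` is a hypothetical stand-in for isl_union_map_coalesce) shows the effect of merging duplicate and overlapping interval disjuncts:

```python
# Toy illustration: each access constraint is modeled as a set of
# (lo, hi) interval disjuncts over the loop index i0. "Coalescing"
# merges duplicate and overlapping disjuncts into a minimal list,
# mirroring what isl_union_map_coalesce does for the Read/Write/
# MayWrite union maps before dependence analysis.

def coalesce(disjuncts):
    """Merge duplicate, overlapping, or adjacent (lo, hi) intervals."""
    merged = []
    for lo, hi in sorted(set(disjuncts)):  # dedup, then sort by lo
        if merged and lo <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

# MemRef_x[15] appeared with six identical disjuncts (0 <= i0 <= 9):
raw = [(0, 9)] * 6
print(coalesce(raw))  # -> [(0, 9)], a single disjunct
```

The point is that the cost of the downstream analysis grows with the number of disjuncts, so shrinking six copies of the same constraint to one directly shrinks the problem isl has to solve.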

>I did not yet look at the nestedloop benchmark, but it sounds basically 
>like a benchmark only consisting of loop nests that we can optimise. 
>This is definitely interesting to look into. Both in respect of how fast
>we can analyse it faster, but also if we are able to improve the 
>performance of the generated code. Especially if we improve the 
>execution performance some additional compile time is justified.

Yes. nestedloop.c is a very simple benchmark that contains a single six-deep loop nest, as follows:
    int n = ((argc == 2) ? atoi(argv[1]) : LENGTH);
    int a, b, c, d, e, f, x=0;
    for (a=0; a<n; a++)
        for (b=0; b<n; b++)
            for (c=0; c<n; c++)
                for (d=0; d<n; d++)
                    for (e=0; e<n; e++)
                        for (f=0; f<n; f++)
                            x++;
Polly significantly increases the compile time from 0.0320s to 2.3320s (about 73x), but it also reduces the execution time from 0.048s to 0.004s (12x). Maybe it is worth it, but I think it would be even better if we could reduce the compile time without hurting the execution performance.
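One way to see why such a large run-time win is even possible: the six-deep nest only increments x, so the whole nest is equivalent to the closed form x = n**6 (modulo integer wrap-around in C). An optimizer that recognizes this can collapse the inner loops entirely. The sketch below checks that equivalence for small n; it is an illustration of the opportunity, not a claim about what Polly literally emits:

```python
# nestedloop's kernel: six nested loops, each running n iterations,
# whose only effect is x += 1. The loop nest therefore computes n**6.

def nestedloop_naive(n):
    x = 0
    for a in range(n):
        for b in range(n):
            for c in range(n):
                for d in range(n):
                    for e in range(n):
                        for f in range(n):
                            x += 1
    return x

def nestedloop_closed_form(n):
    # Equivalent closed form (ignoring C integer overflow).
    return n ** 6

for n in (0, 1, 3, 5):
    assert nestedloop_naive(n) == nestedloop_closed_form(n)
print("ok")
```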

> My plan is to continue investigating why the PollyDependence pass leads to such high compile-time overhead. Could anyone give me some comments or suggestions?
>
>You already made nice progress. I propose to just continue by measuring 
>the remaining overhead, taking one of the top cases and analysing it. We 
>then see on a case by case basis what can be done.
>
>Cheers,
>Tobias
>

Thanks for your suggestion.

Cheers,
Star Tan