[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

Fri May 3 09:30:17 PDT 2013

Dear Tobias and all LLVM/Polly developers,

Thank you very much for all your help and advice. 

I have submitted my proposal to GSoC 2013 application system:
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/star/1. 
Some tables and paragraphs are simplified to make it more readable on GSoC official pages. Any suggestion or comment would be appreciated.

At 2013-05-03 19:05:05,"Tobias Grosser" <tobias at grosser.es> wrote:
>On 05/03/2013 11:39 AM, Star Tan wrote:
>> Dear Tobias,
>>
>>
>> Thank you very much for your very helpful advice.
>>
>>
>> Yes, -debug-pass and -time-passes are two very useful and powerful
>> options when evaluating the compile-time of each compiler pass. They
>> are exactly what I need! With these options, I can step into details
>> of the compile-time overhead of each pass. I have finished some
>> preliminary testing based on two randomly selected files from
>> PolyBench and MediaBench. Results are listed on
>> https://gist.github.com/tanstar/5508153 .
>
>Great. Thanks for the quick update and the nice formatting of your
>proposal.
>
>> In this project, I try to reduce Polly compile-time overhead by
>> revising Polly passes.  For this purpose, I firstly try to find out
>> where the compile-time overhead comes from. When Polly optimizes a
>> program, it takes the following steps:
>>
>> * step1 - Polly canonicalization: prepare basic information and
>> complete some transformations.
>> * step2 - LLVM IR to Polly description: detect valid scops and
>> translates them into polyhedral representation.
>> * step3 - Polly optimization: analyze and optimize polyhedral scops.
>> * step4 - Polly description to LLVM-IR: translates the polyhedral
>> description back into LLVM-IR.
>
>I may make sense to list in the attachments which passes exactly belong
>to these different steps.
OK, I will add further information in Table 5.
>
>> Based on these results, I plan to revise all canonicalization passes,
>> code generation passes and optimization passes in Polly. Then, I will
>> try my best to reduce the compile-time overhead for each part of them.
>>
>> In the following 12 weeks, my schedule for this project includes the following stages:
>> * Stage1 (1 week): set up a Polly performance tester and integrate the automation testing environment.
>> * Stage2 (1 week): collect and analyze the detailed profiling information for Polly compile process;
>> * Stage3 (3 weeks): revise and improve some expensive Polly passes or LLVM passes that Polly depends on;
>> * Stage4 (3 weeks): revise canonicalization passes of Polly and try to remove most of canonicalization passes;
>> * Stage5 (1 week): revise the pass ordering problem and let Polly bail out if it cannot benefit programs;
>> * Stage6 (1 week): revise and improve Polly code generation passes;
>> * Stage7 (1 week): revise and improve Polly optimization passes;
>> * Stage8 (1 week): revise other parts of Polly.
>
>I believe that before we should probably move Polly rather early into
>the default -O3 optimization chain. You may take some time to profile
>Polly and to understand it better, but we should not do extensive
>optimizations before moving it. As long as we can ensure the number of
>scops detected remains the same, this should be good enough. We can then
>spend the time optimizing the Polly integration in the normal -O3 chain.
I agree with you.  I will try to understand Polly better as soon as possible by profiling Polly and stepping into the details of some critical Polly passes. It may take a little time before we can  start optimizing the Polly integration in the normal -O3 chain.

>
>> ### 3. Project Details
>>
>> #### Stage1 -- Set up a Polly performance tester to track the compile time. [Week 1]
>>
>> The first important work is to pick criteria to evaluate my work in
>> this project. This project targets to reduce compile-time overhead,
>> but at the same time it must not degrade the performance. To simplify
>> the testing, I will use the number of scops that optimized by Polly as
>> the performance criteria. As a result, our Polly performance tester
>> should contains two parts:
>>
>> * Code performance: the number of scops optimized by Polly should not
>> be degraded.
>> * Compile-time Overhead: both the total compile-time overhead and the
>> percentage of each Polly pass are both important.
>>
>> Furthermore, I currently use some scripts to collect the compile-time
>> statistics, but it still requires some tricky and manual work. My plan
>> is to set up an automation testing environment and integrate it into
>> Polly.
>
>Yes, this is nice. It would be great if we could get some hardware to
>run such tests regularly. I will check if we can find a sponsor for
>this?
Thank you. That would be great if we can find a sponsor.
Otherwise, I think I have to run all test on my laptop.

>
>> **Week 3**: revise the "Polly - Calculate dependence" pass. Since this
>> pass leads to significant compile-time overhead in both PolyBench and
>> MediaBench, I will investigate the details of this pass and try to
>> reduce the compile-time overhead.
>> **Week 4**: revise the "Polly - Detect static control parts (SCoPs)"
>> and "Polly - Create polyhedral description of Scops" pass. As shown in
>> table 3 and table 4, these two passes are expensive. However, they are
>> also very important to Polly because all Polly optimizations are based
>> on scops detected in these two passes. I will step into the details of
>> the two passes and make sure no performance loss happens no matter how
>> much compile-time overhead is reduced.
>> **Week 5**: revise other top ten expensive Polly passes. For example,
>> "Combine redundant instructions" pass runs several times because of
>> Polly and it can causes a large compile-time overhead. I will try to
>> eliminate or reduce the run of this pass.
>>
>> In this stage, I will also step into some existing bugs related to
>> those expensive passes discussed above. For example, the bug
>> llvm.org/PR15643 is related to the "Polly - Calculate dependence"
>> pass. Although it has brought some discussions in LLVM/Polly mail
>> list, there is still no feasible solution. I will try my best to fix
>> up such bugs when I step into the details of those passes.
>
>I actually pointed out a solution. We need to change Polly or
>recanonicalize the SCEVs we analyse as the current way the SCEV are
>canonicalized makes Polly introduce different parameters for expressions
>{p + 1, +, 1}_<l> and {p + 2, +, 1}_<l> in case the loop 'l' is not part
>of the scop.
>
>> Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with
>> 2GB DDR2 memory. Each benchmark is run multiple times and data are
>> collected using ministat (https://github.com/codahale/ministat).
>> Polly is built with Cloog in Release+Assert mode. All Polly, LLVM and
>> Cloog source code are checked out from SVN at April 24, 2013.
>
>Can you give the svn revision / git hash for the checkouts of LLVM,
>Polly, isl, cloog and polybench.
I use the following git clones: http://llvm.org/git/llvm.git, http://llvm.org/git/polly.git and http://llvm.org/git/clang.git. 
I usually run "git pull" command for all the three repositories before I start a new evaluation. Of course sometimes the LLVM is not up-to-date because rebuilding LLVM is really time consuming.  Their git hash value are:
Cloog: 77c44782f4cbf6d60a7dbfe996b0649336ec7205
Polly:   5f3abecad642a6646c53cbd39879091da87c3c37
LLVM: d050e96133fac8565e3bb1eabe9a587dd5a6ac4d
>
>> Table 1 and table 2 show the compile-time overhead of Polly. Five cases are tested:
>> (alias pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly)
>> * **clang**:  clang -O3
>> * **pBasic**: clang -O3 -load LLVMPolly.so
>> * **pNoGen**: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generatorr=none
>> * **pNoOpt**: pollycc -O3 -mllvm -polly-optimizer=none
>> * **pOpt**:   pollycc -O3
>
>You probably want to add a couple of additional parameters here to
>ensure we detect all scops in polybench. I would add -mllvm
>-polly-ignore-aliasing and -DPOLYBENCH_USE_SCALAR_LB
>
>You can use '-mllvm -debug-only=polly-cloog' to see if the kernels
>are properly detected. I get the following output:
>
>alias polly-clang
>alias polly-clang='~/Projekte/polly/build/bin/clang -Xclang -load 
>-Xclang ~/Projekte/polly/build/lib/LLVMPolly.so'
>grosser at tobilaptop:~/Projekte/polybench$ polly-clang -O3 -mllvm -polly 
>-mllvm -debug-only=polly-cloog linear-algebra/kernels/gemm/gemm.c -I 
>utilities/ utilities/polybench.c  -mllvm -polly-ignore-aliasing 
>-DPOLYBENCH_USE_SCALAR_LB
>:: init_array : entry.split => for.end56
>if ((nj >= 1) && (nk >= 1) && (p_1 >= 1) && (p_4 >= 1)) {
>   for (c2=0;c2<=p_4-1;c2+=32) {
>     for 
>(c3=max(-32*floord(p_1-12*p_4+10,32)-32*p_4,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32) 
>{
>       for 
>(c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_4-1);c4++) 
>{
>         for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) {
>           Stmt_for_body41(c4,-20*c4-c5);
>         }
>       }
>     }
>   }
>}
>if ((ni >= 1) && (nk >= 1) && (p_0 >= 1) && (p_4 >= 1)) {
>   for (c2=0;c2<=p_0-1;c2+=32) {
>     for 
>(c3=max(-32*floord(-12*p_0+p_4+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_4+10,32)-640);c3<=-20*c2;c3+=32) 
>{
>       for 
>(c4=max(ceild(-c3-p_4-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++) 
>{
>         for (c5=max(c3,-20*c4-p_4+1);c5<=min(-20*c4,c3+31);c5++) {
>           Stmt_for_body18(c4,-20*c4-c5);
>         }
>       }
>     }
>   }
>}
>if ((ni >= 1) && (nj >= 1) && (p_0 >= 1) && (p_1 >= 1)) {
>   for (c2=0;c2<=p_0-1;c2+=32) {
>     for 
>(c3=max(-32*floord(-12*p_0+p_1+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32) 
>{
>       for 
>(c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++) 
>{
>         for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) {
>           Stmt_for_body3(c4,-20*c4-c5);
>         }
>       }
>     }
>   }
>}
>Stmt_entry_split();
>:: kernel_gemm : for.cond1.preheader => for.end28
>for (c2=0;c2<=1023;c2+=32) {
>   for (c3=0;c3<=1023;c3+=32) {
>     for (c4=c2;c4<=c2+31;c4++) {
>       for (c5=c3;c5<=c3+31;c5++) {
>         Stmt_for_body3(c4,c5);
>       }
>     }
>   }
>}
>for (c2=0;c2<=1023;c2+=32) {
>   for (c3=0;c3<=1023;c3+=32) {
>     for (c4=0;c4<=1023;c4+=32) {
>       for (c5=c2;c5<=c2+31;c5++) {
>         for (c6=c3;c6<=c3+31;c6++) {
>           for (c7=c4;c7<=c4+31;c7++) {
>             Stmt_for_body8(c5,c6,c7);
>           }
>         }
>       }
>     }
>   }
>}
>:: polybench_flush_cache : for.inc => for.end
>for (c1=0;c1<=4194559;c1++) {
>   Stmt_for_inc(c1);
>}
>
>As a first step, we should ensure that Polly is fast for these flags. As
>subsequent steps we should then ensure that the absence of these flags
>does not increase compile-time.
Thanks for your advice. I will investigate more flags and enrich the content of table 1 and table 2.

>
>> Table 3 and table 4 show the compile-time overhead of top 15 passes in
>> polly-opt. we first generate LLVM IR files by the command "clang -O3
>> -emit-llvm XX.c -o XX.ll", then we run opt to collect the compile time
>> of each pass using the command "opt -basicaa  -load LLVMPolly.so -O3
>> -polly -polly-codegen-scev XX.ll -S -o XX.opt.ll -time-passes"
>
>You want to extract the .ll file with 'clang -O0'. Otherwise you run
>polly on code that is already -O3 optimized, making the runs not
>comparable to the clang integrated ones and also unrealistic as they do
>not reflect what we do when running Polly from within clang.
>
>It would be interesting to understand the doitgen results better. The
>time in the optimizer is only 0.408 seconds, whereas the increase from
>pBasic to pOpt is 0.897 - 0.151 = 0.746 seconds. This seems surprising.
>Is this because of running polly on -O3 optimized code, is Polly
>producing bigger .ll files which yield to longer object file emmission
>time or is there a measurement problem?
 I run three groups of commands:
Group 1:  "clang -O0 -emit-llvm" (0.082s) + "polly-opt -O3" (0.589s)
Group 2:  "clang -O3 -emit-llvm" (0.128s) + "polly-opt -O3" (0.402s)
Group 3: "clang -O3 -mllvm -polly -emit-llvm" (0.848s)
*Note: I did not use -emit-llvm in previous evaluation, which means it take a little time to translate LLVM IR to X86 assembler code, so the time in Group 3 (0.848s) is a little less than pOpt  listed in table 1 (0.897)
You are right. I should use "clang -O0 -emit-llvm" to generate LLVM IR for the subsequent Polly optimization.
However, I notice that there is still a gap beween Group 1 (0.671s in total) and Group 3 (0.848s in total). I guess this is because Polly generates bigger and more complicated LLVM IR, which slows down clang passes after Polly in Group 3 commands. 
Is there any option like -time-passes when running clang? Such kind of options can help to find out where the extra compile time spends on.

>
>That's it for now. Thanks for the update!
>
>Cheers,
>Tobi
>

Best wishes,
Star Tan

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130504/f2468e86/attachment.html>