[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

Fri May 3 04:05:05 PDT 2013

On 05/03/2013 11:39 AM, Star Tan wrote:
> Dear Tobias,
>
>
> Thank you very much for your very helpful advice.
>
>
> Yes, -debug-pass and -time-passes are two very useful and powerful
> options when evaluating the compile-time of each compiler pass. They
> are exactly what I need! With these options, I can step into details
> of the compile-time overhead of each pass. I have finished some
> preliminary testing based on two randomly selected files from
> PolyBench and MediaBench. Results are listed on
> https://gist.github.com/tanstar/5508153 .

Great. Thanks for the quick update and the nice formatting of your
proposal.

> In this project, I try to reduce Polly compile-time overhead by
> revising Polly passes.  For this purpose, I firstly try to find out
> where the compile-time overhead comes from. When Polly optimizes a
> program, it takes the following steps:
>
> * step1 - Polly canonicalization: prepare basic information and
> complete some transformations.
> * step2 - LLVM IR to Polly description: detect valid scops and
> translates them into polyhedral representation.
> * step3 - Polly optimization: analyze and optimize polyhedral scops.
> * step4 - Polly description to LLVM-IR: translates the polyhedral
> description back into LLVM-IR.

I may make sense to list in the attachments which passes exactly belong
to these different steps.

> Based on these results, I plan to revise all canonicalization passes,
> code generation passes and optimization passes in Polly. Then, I will
> try my best to reduce the compile-time overhead for each part of them.
>
> In the following 12 weeks, my schedule for this project includes the following stages:
> * Stage1 (1 week): set up a Polly performance tester and integrate the automation testing environment.
> * Stage2 (1 week): collect and analyze the detailed profiling information for Polly compile process;
> * Stage3 (3 weeks): revise and improve some expensive Polly passes or LLVM passes that Polly depends on;
> * Stage4 (3 weeks): revise canonicalization passes of Polly and try to remove most of canonicalization passes;
> * Stage5 (1 week): revise the pass ordering problem and let Polly bail out if it cannot benefit programs;
> * Stage6 (1 week): revise and improve Polly code generation passes;
> * Stage7 (1 week): revise and improve Polly optimization passes;
> * Stage8 (1 week): revise other parts of Polly.

I believe that before we should probably move Polly rather early into
the default -O3 optimization chain. You may take some time to profile
Polly and to understand it better, but we should not do extensive
optimizations before moving it. As long as we can ensure the number of
scops detected remains the same, this should be good enough. We can then
spend the time optimizing the Polly integration in the normal -O3 chain.

> ### 3. Project Details
>
> #### Stage1 -- Set up a Polly performance tester to track the compile time. [Week 1]
>
> The first important work is to pick criteria to evaluate my work in
> this project. This project targets to reduce compile-time overhead,
> but at the same time it must not degrade the performance. To simplify
> the testing, I will use the number of scops that optimized by Polly as
> the performance criteria. As a result, our Polly performance tester
> should contains two parts:
>
> * Code performance: the number of scops optimized by Polly should not
> be degraded.
> * Compile-time Overhead: both the total compile-time overhead and the
> percentage of each Polly pass are both important.
>
> Furthermore, I currently use some scripts to collect the compile-time
> statistics, but it still requires some tricky and manual work. My plan
> is to set up an automation testing environment and integrate it into
> Polly.

Yes, this is nice. It would be great if we could get some hardware to
run such tests regularly. I will check if we can find a sponsor for
this?

> **Week 3**: revise the "Polly - Calculate dependence" pass. Since this
> pass leads to significant compile-time overhead in both PolyBench and
> MediaBench, I will investigate the details of this pass and try to
> reduce the compile-time overhead.
> **Week 4**: revise the "Polly - Detect static control parts (SCoPs)"
> and "Polly - Create polyhedral description of Scops" pass. As shown in
> table 3 and table 4, these two passes are expensive. However, they are
> also very important to Polly because all Polly optimizations are based
> on scops detected in these two passes. I will step into the details of
> the two passes and make sure no performance loss happens no matter how
> much compile-time overhead is reduced.
> **Week 5**: revise other top ten expensive Polly passes. For example,
> "Combine redundant instructions" pass runs several times because of
> Polly and it can causes a large compile-time overhead. I will try to
> eliminate or reduce the run of this pass.
>
> In this stage, I will also step into some existing bugs related to
> those expensive passes discussed above. For example, the bug
> llvm.org/PR15643 is related to the "Polly - Calculate dependence"
> pass. Although it has brought some discussions in LLVM/Polly mail
> list, there is still no feasible solution. I will try my best to fix
> up such bugs when I step into the details of those passes.

I actually pointed out a solution. We need to change Polly or
recanonicalize the SCEVs we analyse as the current way the SCEV are
canonicalized makes Polly introduce different parameters for expressions
{p + 1, +, 1}_<l> and {p + 2, +, 1}_<l> in case the loop 'l' is not part
of the scop.

> Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with
> 2GB DDR2 memory. Each benchmark is run multiple times and data are
> collected using ministat (https://github.com/codahale/ministat).
> Polly is built with Cloog in Release+Assert mode. All Polly, LLVM and
> Cloog source code are checked out from SVN at April 24, 2013.

Can you give the svn revision / git hash for the checkouts of LLVM,
Polly, isl, cloog and polybench.

> Table 1 and table 2 show the compile-time overhead of Polly. Five cases are tested:
> (alias pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly)
> * **clang**:  clang -O3
> * **pBasic**: clang -O3 -load LLVMPolly.so
> * **pNoGen**: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generatorr=none
> * **pNoOpt**: pollycc -O3 -mllvm -polly-optimizer=none
> * **pOpt**:   pollycc -O3

You probably want to add a couple of additional parameters here to
ensure we detect all scops in polybench. I would add -mllvm
-polly-ignore-aliasing and -DPOLYBENCH_USE_SCALAR_LB

You can use '-mllvm -debug-only=polly-cloog' to see if the kernels
are properly detected. I get the following output:

alias polly-clang
alias polly-clang='~/Projekte/polly/build/bin/clang -Xclang -load 
-Xclang ~/Projekte/polly/build/lib/LLVMPolly.so'
grosser at tobilaptop:~/Projekte/polybench$ polly-clang -O3 -mllvm -polly 
-mllvm -debug-only=polly-cloog linear-algebra/kernels/gemm/gemm.c -I 
utilities/ utilities/polybench.c  -mllvm -polly-ignore-aliasing 
-DPOLYBENCH_USE_SCALAR_LB
:: init_array : entry.split => for.end56
if ((nj >= 1) && (nk >= 1) && (p_1 >= 1) && (p_4 >= 1)) {
   for (c2=0;c2<=p_4-1;c2+=32) {
     for 
(c3=max(-32*floord(p_1-12*p_4+10,32)-32*p_4,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32) 
{
       for 
(c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_4-1);c4++) 
{
         for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) {
           Stmt_for_body41(c4,-20*c4-c5);
         }
       }
     }
   }
}
if ((ni >= 1) && (nk >= 1) && (p_0 >= 1) && (p_4 >= 1)) {
   for (c2=0;c2<=p_0-1;c2+=32) {
     for 
(c3=max(-32*floord(-12*p_0+p_4+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_4+10,32)-640);c3<=-20*c2;c3+=32) 
{
       for 
(c4=max(ceild(-c3-p_4-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++) 
{
         for (c5=max(c3,-20*c4-p_4+1);c5<=min(-20*c4,c3+31);c5++) {
           Stmt_for_body18(c4,-20*c4-c5);
         }
       }
     }
   }
}
if ((ni >= 1) && (nj >= 1) && (p_0 >= 1) && (p_1 >= 1)) {
   for (c2=0;c2<=p_0-1;c2+=32) {
     for 
(c3=max(-32*floord(-12*p_0+p_1+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32) 
{
       for 
(c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++) 
{
         for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) {
           Stmt_for_body3(c4,-20*c4-c5);
         }
       }
     }
   }
}
Stmt_entry_split();
:: kernel_gemm : for.cond1.preheader => for.end28
for (c2=0;c2<=1023;c2+=32) {
   for (c3=0;c3<=1023;c3+=32) {
     for (c4=c2;c4<=c2+31;c4++) {
       for (c5=c3;c5<=c3+31;c5++) {
         Stmt_for_body3(c4,c5);
       }
     }
   }
}
for (c2=0;c2<=1023;c2+=32) {
   for (c3=0;c3<=1023;c3+=32) {
     for (c4=0;c4<=1023;c4+=32) {
       for (c5=c2;c5<=c2+31;c5++) {
         for (c6=c3;c6<=c3+31;c6++) {
           for (c7=c4;c7<=c4+31;c7++) {
             Stmt_for_body8(c5,c6,c7);
           }
         }
       }
     }
   }
}
:: polybench_flush_cache : for.inc => for.end
for (c1=0;c1<=4194559;c1++) {
   Stmt_for_inc(c1);
}

As a first step, we should ensure that Polly is fast for these flags. As
subsequent steps we should then ensure that the absence of these flags
does not increase compile-time.

> Table 3 and table 4 show the compile-time overhead of top 15 passes in
> polly-opt. we first generate LLVM IR files by the command "clang -O3
> -emit-llvm XX.c -o XX.ll", then we run opt to collect the compile time
> of each pass using the command "opt -basicaa  -load LLVMPolly.so -O3
> -polly -polly-codegen-scev XX.ll -S -o XX.opt.ll -time-passes"

You want to extract the .ll file with 'clang -O0'. Otherwise you run
polly on code that is already -O3 optimized, making the runs not
comparable to the clang integrated ones and also unrealistic as they do
not reflect what we do when running Polly from within clang.

It would be interesting to understand the doitgen results better. The
time in the optimizer is only 0.408 seconds, whereas the increase from
pBasic to pOpt is 0.897 - 0.151 = 0.746 seconds. This seems surprising.
Is this because of running polly on -O3 optimized code, is Polly
producing bigger .ll files which yield to longer object file emmission
time or is there a measurement problem?

That's it for now. Thanks for the update!

Cheers,
Tobi