[LLVMdev] [Polly]GSoC Proposal: Reducing LLVM-Polly Compiling overhead

Mon Mar 18 05:40:50 PDT 2013

On 03/17/2013 11:54 PM, Star Tan wrote:
>   Hello Tobi,
>
> I am interested in Polly project. Polly seems to be a very promising tool to find out program parallelization based on LLVM infrastructure.   However, I find that Polly analysis and optimization can consume significant compiling time, so I propose a GSoC project to reduce Polly compiling time and I hope my work can make the Polly tool more applicable for all LLVM users.

Dear Star Tan,

the project your propose sounds both interesting and important. As you 
correctly pointed out, maintaining fast compile times when Polly is 
enabled is a very important point. Especially, if we want to think of 
enabling Polly by default.

> I have done some preliminary experiments.  My experimental environment is built as follows:
> First, I build the LLVM, Clang and Polly using -O3 option, so all of these tools can be run in best case;
> Second,  I evaluate the compiling time of Polybench using the following options:
>        Clang-O3:  clang -O3  (the basic clang without polly)
>        Polly-basic:  clang -Xclang -load -Xclang LLVMPolly.so -O3  (load polly but no use of polly optimization)
>        Polly-optimize: clang -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -O3 (use polly optimization)

I believe this setup is a very good start. However, to draw useful 
conclusions we probably need more information:

1) Did you build LLVM/Polly in release mode?

The compile times with and without Polly look very high. Compiling 2mm 
with a current version of clang takes 0.076 seconds for me, whereas you
report a compile time of 0.780 seconds. This may caused by performance
differences of our machines or, what I believe is more likely, that you
did not compile LLVM and Polly in Release mode. Profiling of debug 
builds may give a wrong idea on where compile time is spent. Can you
verify you built in Release mode?

2) Compile time impact when Polly does not optimize the program code

The benchmarks you have chosen can be optimized well by Polly. This is
obviously a nice thing, but it also implies that Polly transforms and
optimizes a large percentage of this code. In cases, where Polly can 
create an optimized version of the code, (limited) increases in compile 
time are often acceptable. However, one area where we should be super 
critical with compile time is the case when Polly is enabled, but when 
it can not perform any useful transformation.

For polybench, you can get a first idea of the current overhead, by 
timing only the polly-detect part, but by disabling the polly 
optimizations and the polly code-generation. However, to get a more 
realistic picture, you probably want to run Polly on the LLVM 
test-suite. I know of some in-house nightly testers, that track Polly
performance on the LLVM test suite. Your work would be even another 
reason to get a public version of such a tester. I will see if I can get 
hardware to do so. Meanwhile, you can run it on your own machine.

> The preliminary experimental results are as follows: (benchmarks are collected from Po
>
> |
> | Clang
> (seconds) | Polly-basic
> (seconds) | Polly-optimize
> (seconds) | Polly-load overhead | Polly-optimize
> overhead |
> | 2mm.c | 0.786 | 0.802 | 1.944 | 2.0% | 147.3% |
> | correlation.c | 0.782 | 0.792 | 2.005 | 1.3% | 156.4% |
> | gesummv.c | 0.583 | 0.603 | 1.08 | 3.4% | 85.2% |
> | ludcmp.c | 0.787 | 0.806 | 2.475 | 2.4% | 214.5% |
> | 3mm.c | 0.786 | 0.811 | 2.617 | 3.2% | 233.0% |
> | covariance.c | 0.73 | 0.74 | 2.294 | 1.4% | 214.2% |
> | gramschmidt.c | 0.63 | 0.643 | 1.134 | 2.1% | 80.0% |
> | seidel.c | 0.632 | 0.645 | 2.036 | 2.1% | 222.2% |
> | adi.c | 0.8 | 0.811 | 3.044 | 1.4% | 280.5% |
> | doitgen.c | 0.742 | 0.752 | 2.32 | 1.3% | 212.7% |
> | instrument.c | 0.445 | 0.45 | 0.495 | 1.1% | 11.2% |

It is interesting to see that the only file that does not contain a 
kernel that is optimized by polly, but just some auxiliary functions has 
a very low compile time overhead. This may imply that most of the 
compile time overhead is due to optimizations Polly can perform.
However, we should only draw conclusions from this after we verified 
those numbers are from a Release build.

> Experimental results show that Polly analysis and optimization can leads to 142% extra compiling overhead, which maybe unacceptable in many large software building. As a result, it is an urgent task to reduce the compiling time of Polly analysis and optimization.
>
> My plan for this proposal is to reduce Polly compiling overhead step by step:
> 1) Investigate the details of Polly, find out how much time spent on analysis and how much time spent on optimization. Of course it is very important to distinguish the overhead on codes where  Polly is applicable and codes where Polly is not applicable.
> 2) Profile the Polly to find out the hot code and find out the sources of compiling overhead; Based on the profiling, I will try to rewrite the hot code to improving the compiling process.
> 3) Remove expensive analysis passes. For example, the scop detection currently requires both the post-dominance analysis
> as well as the dominance frontier analysis. Not requiring these up front (at all) would be great.
> 4) Canonicalization passes scheduled before Polly.  Before running Polly, we currently schedule a set of passes to canonicalize the LLVM-IR on which the scop detection is run on. If I can reduce the number of preparing passes, then the compiling overhead can be reduced.
> 5) Find out other ways to reduce compiling overhead.

Very good points.

I believe the general direction is great. As a next step I propose to 
continue your profiling work to get a better idea of the performance 
characteristics of Polly and to understand where compile time is spent.
I have some ideas myself, but I would prefer that you do this analysis 
independent of my suggestions to double check my own findings. I invite 
you to regularly report your findings. I should be able to both help you 
to ensure that the numbers you get allow us to draw correct conclusions 
and also to point you to source code changes that you could perform to 
reduce the compile time.

Thanks again for this interesting proposal.

All the best,
Tobias