[LLVMdev] [Polly]GSoC Proposal: Reducing LLVM-Polly Compiling overhead

tanmx_star tanmx_star at yeah.net
Sat Mar 23 09:23:24 PDT 2013


Dear Tobias,

Sorry for the late reply. 

I have checked the experiment and found that some of the data was mismatched because of an incorrect manual copy-and-paste, so I have written a shell script to collect the data automatically. The newest data is listed in the attached file.
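For reference, such a collection script can look roughly like the following. This is a simplified sketch, not the exact contents of the attached polly_compile.sh; the file names, paths, and the use of GNU time are placeholders/assumptions.

```shell
#!/bin/sh
# Sketch of an automated compile-time harness. Assumes GNU time is
# available at /usr/bin/time (its -f '%e' prints elapsed wall-clock
# seconds on stderr) and that LLVMPolly.so is on the loader path.
CLANG=clang
POLLY="-Xclang -load -Xclang LLVMPolly.so"

for src in *.c; do
  t_base=$( { /usr/bin/time -f '%e' $CLANG -O3 -c "$src" -o /dev/null; } 2>&1 )
  t_load=$( { /usr/bin/time -f '%e' $CLANG -O3 $POLLY -c "$src" -o /dev/null; } 2>&1 )
  t_opt=$(  { /usr/bin/time -f '%e' $CLANG -O3 $POLLY -mllvm -polly -c "$src" -o /dev/null; } 2>&1 )
  # Penalty relative to plain clang, e.g. (0.75 - 0.155) / 0.155 = 383.9%.
  echo "$t_base $t_opt" | awk -v f="$src" '{printf "%s %.1f%%\n", f, ($2-$1)/$1*100}'
done
```

The awk step reproduces the "Polly-optimize penalty" column of the tables below directly from the two raw timings.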

Tobias, I have made a simple HTML page (attached polly-compiling-overhead.html) to show the experimental data and my plans for this project. I think a public webpage would be helpful for our further discussion. If possible, could you put it on the Polly website (either as a public link or a temporary page)?
As a next step, I will try to remove unnecessary code transformations in the canonicalization phase.

Thank you very much for your warm help.

Best Regards,
Star Tan


From: Tobias Grosser
Date: 2013-03-20 21:06
To: Star Tan
CC: llvmdev
Subject: Re: [Polly]GSoC Proposal: Reducing LLVM-Polly Compiling overhead
On 03/19/2013 11:02 AM, Star Tan wrote:
>
> Dear Tobias Grosser,
>
> Today I rebuilt LLVM-Polly in Release mode. The configuration of my testing machine is: Intel Pentium Dual CPU T2390 (1.86GHz) with 2GB DDR2 memory.
> I evaluated Polly using PolyBench and Mediabench. Evaluating the whole LLVM test-suite takes too long, so I chose only Mediabench from it.

OK. This is a good baseline.

> The preliminary results for Polly's compile-time overhead are listed as follows:
>
> Table 1: Compiling time overhead of Polly for PolyBench.
>
> |               | Clang (seconds) | Polly-load (seconds) | Polly-optimize (seconds) | Polly-load penalty | Polly-optimize penalty |
> | 2mm.c         | 0.155 | 0.158 | 0.75  | 1.9% | 383.9% |
> | correlation.c | 0.132 | 0.133 | 0.319 | 0.8% | 141.7% |
> | gesummv.c     | 0.152 | 0.157 | 0.794 | 3.3% | 422.4% |
> | ludcmp.c      | 0.157 | 0.159 | 0.391 | 1.3% | 149.0% |
> | 3mm.c         | 0.103 | 0.109 | 0.122 | 5.8% | 18.4% |
> | covariance.c  | 0.16  | 0.163 | 1.346 | 1.9% | 741.3% |

This is a very large slowdown. On my system I get

0.06 sec for Polly-load
0.09 sec for Polly-optimize

What exact version of Polybench did you use? What compiler
flags did you use to compile the benchmark?
Also, did you run the executables several times? How large is the
standard deviation of the results? (You can use a tool like ministat to 
calculate these values [1])
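A sketch of the repeated-run measurement suggested here; the benchmark file (gemm.c), sample count, and GNU-time invocation are illustrative assumptions:

```shell
# Take several samples per configuration so the variance is visible;
# a single run on a loaded machine can easily be off by the margins
# discussed in this thread. GNU time's -f '%e' prints seconds to stderr.
for i in 1 2 3 4 5; do
  { /usr/bin/time -f '%e' clang -O3 -c gemm.c -o /dev/null; } 2>> clang.times
  { /usr/bin/time -f '%e' clang -O3 -Xclang -load -Xclang LLVMPolly.so \
      -mllvm -polly -c gemm.c -o /dev/null; } 2>> polly.times
done
# ministat compares the two sample sets and reports whether the
# difference is statistically significant:
ministat clang.times polly.times
# Without ministat, mean and standard deviation via awk:
awk '{s+=$1; ss+=$1*$1; n++}
     END {m=s/n; printf "mean=%.3f sd=%.3f\n", m, sqrt(ss/n - m*m)}' polly.times
```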

> | gramschmidt.c | 0.159 | 0.167 | 1.023 | 5.0% | 543.4% |
> | seidel.c      | 0.125 | 0.13  | 0.285 | 4.0% | 128.0% |
> | adi.c         | 0.155 | 0.156 | 0.953 | 0.6% | 514.8% |
> | doitgen.c     | 0.124 | 0.128 | 0.298 | 3.2% | 140.3% |
> | instrument.c  | 0.149 | 0.151 | 0.837 | 1.3% | 461.7% |

This number is surprising. In your last numbers you reported 
Polly-optimize as taking 0.495 sec in debug mode. The time you now
report for the release mode is almost twice as much. Can you verify
this number please?

> | atax.c | 0.135 | 0.136 | 0.917 | 0.7% | 579.3% |
> | gemm.c | 0.161 | 0.162 | 1.839 | 0.6% | 1042.2% |

This number also looks fishy. In debug mode you reported 1.327 seconds
for Polly-optimize. Again, the debug build is faster than the release build.

> | jacobi-2d-imper.c | 0.16 | 0.161 | 0.649 | 0.6% | 305.6% |
> | bicg.c | 0.149 | 0.152 | 0.444 | 2.0% | 198.0% |
> | gemver.c | 0.135 | 0.136 | 0.416 | 0.7% | 208.1% |
> | lu.c | 0.143 | 0.148 | 0.398 | 3.5% | 178.3% |
> | Average | | | | 2.20% | 362.15% |

Otherwise, those numbers look like a good start. Maybe you can put them
on some website/wiki/document where you can extend them as you proceed 
with benchmarking.

> Table 2: Compiling time overhead of Polly for Mediabench (selected from the LLVM test-suite).
> |         | Clang (seconds) | Polly-load (seconds) | Polly-optimize (seconds) | Polly-load penalty | Polly-optimize penalty |
> | adpcm   | 0.18   | 0.187  | 0.218  | 3.9% | 21.1% |
> | g721    | 0.538  | 0.538  | 0.803  | 0.0% | 49.3% |
> | gsm     | 2.869  | 2.936  | 4.789  | 2.3% | 66.9% |
> | mpeg2   | 3.026  | 3.072  | 4.662  | 1.5% | 54.1% |
> | jpeg    | 13.083 | 13.248 | 22.488 | 1.3% | 71.9% |
> | Average |        |        |        | 1.80% | 52.65% |


I ran jpeg myself to verify these numbers on my machine. I got:

A: -O3
B: -O3 -load LLVMPolly.so
C: -O3 -load LLVMPolly.so -mllvm -polly
D: -O3 -load LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
E: -O3 -load LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
    -mllvm -polly-code-generator=none

           A     B     C     D     E
| jpeg | 5.1 | 5.2 | 8.0 | 7.9 | 5.5

The overhead between A and C is similar to the one you report. Hence, 
the numbers seem to be correct.

I also added two more runs D and E to figure out where the slowdown 
comes from. As you can see most of the slow down disappears when we
do not do code generation. This either means that the polly code 
generation itself is slow or that the LLVM passes afterwards need more
time due to the code we generated (it contains many opportunities for 
scalar simplifications). It would be interesting to see if this holds 
for the other benchmarks and to investigate the actual reasons for the 
slowdown. It is also interesting to see that just running Polly, but 
without applying optimizations does not slow down the compilation a lot. 
Does this also hold for other benchmarks?
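The D/E comparison above generalizes to a small script. A sketch, with clang's plugin-loading syntax spelled out and benchmark sources assumed to be in the current directory (GNU time assumed for timing):

```shell
# Time configurations A-E from the table above for every source file,
# to check whether disabling Polly's code generation recovers most of
# the compile time on other benchmarks as well.
POLLY="-Xclang -load -Xclang LLVMPolly.so"
A="-O3"
B="-O3 $POLLY"
C="-O3 $POLLY -mllvm -polly"
D="$C -mllvm -polly-optimizer=none"
E="$D -mllvm -polly-code-generator=none"
for src in *.c; do
  for cfg in "$A" "$B" "$C" "$D" "$E"; do
    t=$( { /usr/bin/time -f '%e' clang $cfg -c "$src" -o /dev/null; } 2>&1 )
    printf '%s\t%s\t%s\n' "$src" "$cfg" "$t"
  done
done
```

Configuration E strips both the optimizer and the code generator, so the A-to-E gap isolates the cost of Polly's analyses alone.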

> As shown in these two tables, Polly significantly increases compile time when it actually transforms the code, as it does for PolyBench: on average, it increases compile time by 4.5x. Even for Mediabench, where Polly does not improve the efficiency of the generated code, it still increases compile time by 1.5x.
> Based on this observation, I think we should not only reduce Polly's analysis and optimization time, but also make it bail out early when it cannot improve the efficiency of the generated code. That is very important if Polly is to be enabled by default for LLVM users.

Bailing out early is definitely something we can think about.

To get started here, you could e.g. look into the jpeg benchmark and 
investigate on which files Polly is spending a lot of time, where 
exactly the time is spent and what kind of SCoPs Polly is optimizing. In 
case we do not expect any benefit, we may skip code generation entirely.
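One way to see where the time goes in a single file is clang's per-pass timing report. A sketch; jcmarker.c stands in for whichever jpeg source file turns out to be expensive:

```shell
# -ftime-report makes LLVM print per-pass compile times, which shows
# whether the time is spent in Polly's own passes or in the later
# scalar clean-up passes running on the code Polly generated.
clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly \
    -ftime-report -c jcmarker.c -o /dev/null 2> time-report.txt
# Pull out the Polly-related entries from the report:
grep -i polly time-report.txt
```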

Thanks again for your interesting analysis.

Cheers,
Tobi

[1] https://github.com/codahale/ministat
-------------- next part --------------
A non-text attachment was scrubbed...
Name: polly-compiling-overhead.html
Type: application/octet-stream
Size: 8687 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130324/3a85931c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: polly_build.sh
Type: application/octet-stream
Size: 1177 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130324/3a85931c/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: polly_compile.sh
Type: application/octet-stream
Size: 1213 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130324/3a85931c/attachment-0002.obj>

