[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

Tobias Grosser tobias at grosser.es
Thu May 2 04:12:37 PDT 2013

On 04/26/2013 05:08 AM, tanmx_star wrote:
> Hi all,


thanks for the update and sorry for the delay in reviewing. I just had a 
look at your proposal.

> I have updated my GSoS proposal: "FastPolly: Reducing LLVM-Polly Compiling overhead" (https://gist.github.com/tanstar/5441808).  I think the pass ordering problem you discussed early can be also investigated in this project!

Yes, figuring out the optimal path ordering sequence is very good.

> Is there any comment or advice about my proposal?  I appreciate all your help and advice.

> 1. Summary:
> LLVM-Polly is a promising polyhedral optimizer for data-locality and
> parallelism, which takes advantage of multi-cores, cache hierarchies,
> short vector instructions as well as dedicated accelerators. However,
> Polly analysis and optimization can lead to significant compiling
> overhead, which makes it much less attractive for LLVM users. I argue
> that maintaining fast compiling time when Polly is enabled is very
> important, especially if we want to think of enabling Polly in default.
> Based on this assumption, I try to reduce Polly compiling overhead in
> this project.

Sounds good.

> 2. Motivation:
> LLVM is an incredible open-source project. It has been widely in C/C++

                                          You miss a verb here ^^^

> compilers, high-level synthesis compilers, virtual machines, optimizing
> tools, etc. As a graduate student, I am going to work on compiler
> analysis and optimization, especially on program vectorization and
> parallelization. I find Polly is a very useful and powerful polyhedral
> optimizer. I would like to use this tool and contribute to this project.
> When I was using Polly tool, I found that Polly optimization can lead to
No need for 'tool' here  ^^^

> significant compiling overhead. On average, polly optimization will
> increase the compiling time by 393% for PolyBench benchmarks and by 53%
> for MediaBench benchmarks compared with clang. That means if you want to
> gain from Polly, you have to pay 4 times extra compiling overhead. Even
> if you do not want to gain much from Polly, you still have to pay 53%
> compiling overhead. Such expensive compiling overhead would make the
> Polly much less attractive to LLVM users.

Good point.

> In this project, I try to reduce Polly compiling overhead by removing

I would call it 'compile-time overhead' instead of 'compiling overhead'.

> unnecessary passes and improving critical passes.  For this purpose, I
> firstly try to find out where the compiling overhead comes from. When
> Polly optimizes a program, it takes the following steps: 1) Polly
> canonicalization: prepare some basic information and do some basic
> transformation, such as loop-simplify and region-simplify.  2) LLVM-IR
> to Polly description: detect polly scops and translates the detected
> scops into a polyhedral representation.  3) Polly optimization: analyze
> and optimize polyhedral scops.  4) Polly description to LLVM-IR:
> translates the polyhedral description back into new LLVM-IR.
> In attched table 1 and 2, pBasic shows the overhead of loading the

> LLVMPolly.so; pNoGen shows the overhead of step 1) and 2); pNoOpt shows
> the overhead of step 1), 2) and 4). So the compiling overhead of Polly
> can be divided into three parts:
> PolyBench: canonicalization(13%-1%=12%), code generation(248%-13%=235%)
> and optimization(393%-248%=145%) MediaBench:canonicalization( 9%-1%=
> 8%), code generation( 43%- 9%= 34%) and optimization( 53%- 43%= 10%)

Thanks for adding numbers for pNoGen. Having only 10% runtime increase 
if Polly is not used is a good sign, especially for the amount of 
canonicalization passes we run. This makes me confident we can get it to 
an even smaller number.

The other numbers are large, but there are likely ways to improve on 
this significantly. Also, it would be good to show at least for one 
benchmark which passes the different numbers actually contain. (You can 
use -debug-pass=Structure for this). E.g. the code generation time looks
rather large. I suppose most of the time is not actually spent in code 
generation, but also in the LLVM passes such as common subexpression 
elimination that have more LLVM-IR to work on or clean up after Polly 
was run.

Also, I believe the names of your columns, and the command line options
given above are a little out of sync. I could e.g. not find a 
description for pBasic

> Based on these results, I plan to reduce Polly compiling overhead by the
> following steps: First, I will try to remove unnecessary
> canonicalization passes to reduce canonicalization time; Second, I will
> try to remove or rewrite expensive analysis passes to reduce
> optimization overhead; Third, I will try to improve the code generation
> passes to reduce code generation overhead. Another interesting work is
> to let the polly bail out early, which can be very helpful to save
> compiling overhead if Polly cannot benefit the program.

OK, this sounds like a reasonable approach. Some more points may be 
worth adding:

1) It is important to pick criteria you can evaluate your work on

It is a good start that you identified two benchmarks. Especially 
looking into non-polybench code is very valuable. You should make sure 
that you evaluate your work throughout the project to see the benefit
of your changes. In fact, it may even be worthwhile to set up a Polly 
performance tester to track the compile time with Polly enabled and how
your changes influence it.

2) Add some specific bug reports you are planning to lock into

This bug report shows a large performance problem in Polly that is 
mainly due to creating a very difficult dependency analysis problem:

There was a larger discussion on the Polly mailing list that discusses 
this bug.

> 3. Details about the project:
> StageI -- Remove unnecessary canonicalization transformation. [Week 1-2]
> Polly relies on some canonicalization passes to simplify the following
> analysis and optimization. Canonicalization passes include
> loop-simplify, region-simplify, Induction variable canonicalization and
> block independent. For example, region-simplify pass is run to simplify
> the region to single entry and single exit edge before -polly-detect.
> However, such approach will introduce unnecessary modifications that
> increase compile time even in the cases where Polly cannot optimize the
> code.
> A first step is to remove -region-simplify pass. For this purpose, I
> have modified the scop detection pass and polly code generation pass to
> allow scops with multiple entry edges and multiple exit edges. Details
> can be referred to the following patch files: (Thanks for all the help
> from Polly group)

> r179673: Remove unneeded RegionSimplify pass r179586: Support SCoPs with
> multiple entry edges r179159: Support SCoPs with multiple exit edges
> r179158: Codegen: Replace region exit and entries recursively r179157:
> RegionInfo: Add helpers to replace entry/exit recursively r178530:
> ScopDetection: Use isTopLevelRegion
> In this project, I plan to spend two weeks to reduce canonicalization
> overhead.

It was a good idea to write down what you plan to do each week.

> Week 1:  Profile the compiling overhead of each canonicalization pass,
> including PromoteMemoryToRegisterPass, CFGSimplificationPass,
> ReassociatePass, LoopRotatePass, InstructionCombiningPass,
> IndVarSimplifyPass, CodePreparationPass and LoopSimplifyPass.  Week 2:
> Remove or improve one or two most expensive canonicalization passes. I
> will also try to revise the pass ordering to move some expensive
> canonicalization passes later.

Instead of speeding up the canonicalization passes your focus should 
really be integrating Polly into the -O3 pass chain without the need to 
have any additional canonicalization passes. This part is not so much 
about the patch itself that implements it. It rather requires careful 
analysis how the number of detected scops changes when moving Polly.
At the moment we optimized for optimal scop coverage while neglecting 
compile time. Now we want both, optimal scop coverage and good compile time.

Another point that can be mentioned is removing the need for induction
variable canonicalization. We currently do this using the -polly-indvars 
pass. However, the new option -polly-codegen-scev enables us to remove 
this pass entirely. This could also be an interesting performance
problem as -polly-codegen-scev produces a lot cleaner LLVM-IR at code 
generation time, which may take more time to generate but it may also
require less time to be cleaned up. This could also be interesting to 

> StageII -- Remove or rewrite expensive analysis passes for compiling
> performance. [Week 3-5]
> There are many optimization libraries for Polly, such as ScopLib, Pluto,
> ISL and Jason optimization. To balance the tradeoff between code
> performance and compiling overhead, I will profile each optimization
> library and try to improve some of these libraries to reduce compiling
> overhead.

The only relevant one is currently isl. It may in some cases be useful 
to compare against Pluto so. No need to optimize scoplib or JSON.

> Week 3:  Profile the compiling overhead of each Polly optimization
> library, including ScopLib, Pluto, ISL and Jason.

Instead of profiling per library, I would rather profile per Polly pass
using --time-passes

You could do this later for several programs, but it would be good to 
have this already today for a single program to get an idea where time 
is spent and what needs optimization.

> Week 4:  Profile the
> compiling overhead of each optimization pass for one or two libraries
> (such as ISL and ScopLib). For example, ISL optimization provides many
> optimization passes, such as dependence simplify, schedule optimization,
> and various loop transformation.  Week 5: remove some expensive
> optimization passes and rewrite some critical but expensive optimization
> passes.

> StageIII -- Improve code generation passes for compiling performance.
> [Week 6-9]
> Our evalutions show that polly code generation passes are very
> expensive, especially for some benchmarks like ludcmp.c and adi.c. Polly
> code generation passes can increase the compiling time by 500% or more
> (See table 1). My plan is to improve various code generation passes.

Can you verify your numbers here. You report for ludcmp the following:

	   	clang	pBasic	pNoOpt	pNoGen	pOPt
ludcmp.c	0.157	0.1602	0.2002	1.0761	1.3175

			pBasic%	pNoGen%	pNoOpt%	pOpt%
			2.04%	27.52%	585.41%	739.17%

I have the feeling the headings of the pNoGen% and pNoOpt% columns have 
been switched accidentally. At least from the numbers above, I see an 
increase from 0.16 to 0.20 for code generation, which is far from being 
a 500% increase. On the other side, the optimization itself seems to add 
a larger amount of time as well as the code generation of the optimized 
code. O

> Week 6:  Profile the compiling overhead of each Polly code generation
> pass, especially for ISL code generation.  Week 7:  Remove unnecessary
> analysis for code generation. Currently, Polly code generation pass
> dependents on a lot of  analysis passes such as DominatorTree,
> IslAstInfo, RegionInfo, ScalarEvolution, ScopDetection, ScopInfo. I will
> try to remove some of expensive analysis passes.

Those passes add little overhead to the code generation. In fact the 
analysis is normally already available, such that these analysis 
requirements are for free. They have been added her mainly to allow the 
code generation to update them, such that we do not need to spend time 
rebuilding them later.

 > Week 8-9: Rewrite some
> expensive functions for Polly code generation based on profiling
> information.

This is still very vague. I propose to

> StageIV -- Let Polly bail out early. [Week 10]
> Week 10: Add support in canicalization step or optimization step to
        Typo ----->        canonicalization

> allow Polly boil out early if it cannot benefit programs.

> StageV -- Improve other parts. [Week 11-12]
> Week 11: Improve other parts of Polly. Especially, I will focus on some
> expensive helper functions such as TempScop analysis. This helper
> function is critical and expensive.

How do you know TempScop is expensive?

> Week 12: Integrate all improvements
> and evaluate the whole Polly with multiple benchmarks.

I think the only way to do this project is to continuously evaluate your 
changes on Polybench and mediabench and to directly integrate them
into the svn repository. This should be made clear at the beginning and
I believe it is very fine to spend more time on the individual steps,
such that we can make sure the changes are properly evaluated and 

> 4. Profit for LLVM users and Polly users
> This project can benefit both LLVM users and Polly users. For LLVM
> users, our project will make the Polly more acceptable if it can
> provides extra performance gains within little extra compiling overhead.
> For Polly users, this project will make the Polly more powerful by
> significantly reducing compiling overhead and improving code quality.


You could make your goals more concrete saying that we want to show that
by enabling Polly we can significantly optimizing the polybench 
benchmarks, while at the same time no prohibitively large compile time 
increase can be seen for mediabench. Reaching this goal would be a great
step forward.

> [Attachments]
> Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with
> 2GB DDR2 memory. Each benchmark is run multiple times and data are
> collected using ministat (https://github.com/codahale/ministat). Results
> are shown in table 1 and table 2.  Five cases are tested: (alias
> pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly) *clang: clang -O3
> *pLoad: clang -O3 -load LLVMPolly.so *pNoGen:pollycc -O3 -mllvm
> -polly-optimizer=none -mllvm -polly-code-generatorr=none *pNoOpt:pollycc
> -O3 -mllvm -polly-optimizer=none *polly: pollycc -O3
> Table 1: Compile time for PolyBench (Seconds, each benchmark is run 10
> times)
> 		clang	pBasic	pNoOpt	pNoGen	pOPt	pBasic%	pNoGen%
> pNoOpt%	pOpt% 2mm.c     	0.1521	0.1593	0.1711	0.3235	0.7247
> 4.73%	12.49%	112.69%	376.46% atax.c  	0.1386	0.1349	0.1449
> 0.2066	0.313	0.00%	0.00%	49.06%	125.83% covariance.c	0.1498
> 0.1517	0.1526	0.3561	0.7706	1.27%	1.87%	137.72%	414.42% gemver.c
> 0.1562	0.1587	0.1724	0.2674	0.3936	1.60%	10.37%	71.19%	151.99%
> instrument.c	0.1062	0.1075	0.1124	0.123	0.1216	0.00%	5.84%
> 15.82%	14.50% ludcmp.c	0.157	0.1602	0.2002	1.0761	1.3175	2.04%
> 27.52%	585.41%	739.17% 3mm.c   	0.1529	0.1559	0.1826	0.4134
> 1.0436	1.96%	19.42%	170.37%	582.54% bicg.c   	0.1244	0.1268
> 0.1353	0.1977	0.2828	1.93%	8.76%	58.92%	127.33% doitgen.c
> 0.1492	0.1505	0.1644	0.3325	0.8971	0.00%	10.19%	122.86%	501.27%
> gesummv.c	0.1224	0.1279	0.134	0.1999	0.2937	4.49%	9.48%
> 63.32%	139.95% jacobi.c	0.1444	0.1506	0.1592	0.3912	0.8494
> 0.00%	10.25%	170.91%	488.23% seidel.c	0.1337	0.1353	0.1462
> 0.6299	0.9155	0.00%	9.35%	371.13%	584.74% adi.c   	0.1593
> 0.1621	0.1835	1.4375	1.849	1.76%	15.19%	802.39%	1060.70%
> correlation.c	0.1579	0.1596	0.1802	0.3393	0.6337	1.08%	14.12%
> 114.88%	301.33% gemm.c   	0.1407	0.1432	0.1576	0.2421	0.4477
> 1.78%	12.01%	72.07%	218.20% gramschmidt.c	0.1331	0.1349	0.1509
> 0.3069	0.4138	0.00%	13.37%	130.58%	210.89% lu.c    	0.1419
> 0.1443	0.1581	0.3156	0.3943	1.69%	11.42%	122.41%	177.87% average
> 1.26%	13.22%	248.47%	393.80%

To improve readability, it may be worth ensuring this fits into 80 
columns. You may be able to reduce the number of digits used here.

You could probably increase the readability of your proposal further if 
you use markdown. See here for an example of how a markdown file looks 
at github: https://gist.github.com/micmcg/976172 and here the raw version

You basically need to use the file ending '.md' and you can then use 
markdown syntax to format your text. The very same syntax will also 
improve the readability of the proposal on the mailing list.

All the best,


More information about the llvm-dev mailing list