[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead
tanmx_star at yeah.net
Fri May 3 04:40:13 PDT 2013
Thanks for your timely reply. Your advice is really helpful.
I have updated the proposal on https://gist.github.com/tanstar/5508153. Major differences include:
(1) Added tables 3 and 4 to show the compile-time overhead of the top 15 hot passes;
(2) Described a new schedule for the project. The new schedule pays more attention to reducing the compile-time overhead of hot passes and consists of eight stages;
(3) Enriched the proposal with concrete work plans for each stage;
(4) Rewrote the proposal in markdown to make it more readable.
At 2013-05-02 19:12:37,"Tobias Grosser" <tobias at grosser.es> wrote:
>On 04/26/2013 05:08 AM, tanmx_star wrote:
>> Hi all,
>thanks for the update and sorry for the delay in reviewing. I just had a
>look at your proposal.
>> I have updated my GSoC proposal: "FastPolly: Reducing LLVM-Polly Compiling overhead" (https://gist.github.com/tanstar/5441808). I think the pass ordering problem you discussed earlier can also be investigated in this project!
>Yes, figuring out the optimal pass ordering sequence is very good.
>> Is there any comment or advice about my proposal? I appreciate all your help and advice.
>> 1. Summary:
>> LLVM-Polly is a promising polyhedral optimizer for data-locality and
>> parallelism, which takes advantage of multi-cores, cache hierarchies,
>> short vector instructions as well as dedicated accelerators. However,
>> Polly analysis and optimization can lead to significant compiling
>> overhead, which makes it much less attractive for LLVM users. I argue
>> that maintaining fast compiling time when Polly is enabled is very
>> important, especially if we want to think of enabling Polly by default.
>> Based on this assumption, I try to reduce Polly compiling overhead in
>> this project.
>> 2. Motivation:
>> LLVM is an incredible open-source project. It has been widely in C/C++
> You miss a verb here ^^^
>> compilers, high-level synthesis compilers, virtual machines, optimizing
>> tools, etc. As a graduate student, I am going to work on compiler
>> analysis and optimization, especially on program vectorization and
>> parallelization. I find Polly is a very useful and powerful polyhedral
>> optimizer. I would like to use this tool and contribute to this project.
>> When I was using Polly tool, I found that Polly optimization can lead to
>No need for 'tool' here ^^^
>> significant compiling overhead. On average, Polly optimization will
>> increase the compiling time by 393% for PolyBench benchmarks and by 53%
>> for MediaBench benchmarks compared with clang. That means if you want to
>> gain from Polly, you have to pay 4 times extra compiling overhead. Even
>> if you do not want to gain much from Polly, you still have to pay 53%
>> compiling overhead. Such expensive compiling overhead would make
>> Polly much less attractive to LLVM users.
>> In this project, I try to reduce Polly compiling overhead by removing
>I would call it 'compile-time overhead' instead of 'compiling overhead'.
>> unnecessary passes and improving critical passes. For this purpose, I
>> first try to find out where the compiling overhead comes from. When
>> Polly optimizes a program, it takes the following steps: 1) Polly
>> canonicalization: prepares basic information and performs basic
>> transformations, such as loop-simplify and region-simplify. 2) LLVM-IR
>> to Polly description: detects Polly scops and translates the detected
>> scops into a polyhedral representation. 3) Polly optimization: analyzes
>> and optimizes polyhedral scops. 4) Polly description to LLVM-IR:
>> translates the polyhedral description back into new LLVM-IR.
>> In the attached tables 1 and 2, pBasic shows the overhead of loading
>> LLVMPolly.so; pNoGen shows the overhead of steps 1) and 2); pNoOpt shows
>> the overhead of steps 1), 2) and 4). So the compiling overhead of Polly
>> can be divided into three parts:
>> PolyBench:  canonicalization (13% - 1% = 12%), code generation (248% - 13% = 235%)
>>             and optimization (393% - 248% = 145%)
>> MediaBench: canonicalization (9% - 1% = 8%), code generation (43% - 9% = 34%)
>>             and optimization (53% - 43% = 10%)
>Thanks for adding numbers for pNoGen. Having only 10% runtime increase
>if Polly is not used is a good sign, especially for the amount of
>canonicalization passes we run. This makes me confident we can get it to
>an even smaller number.
>The other numbers are large, but there are likely ways to improve on
>this significantly. Also, it would be good to show at least for one
>benchmark which passes the different numbers actually contain. (You can
>use -debug-pass=Structure for this). E.g. the code generation time looks
>rather large. I suppose most of the time is not actually spent in code
>generation, but also in the LLVM passes such as common subexpression
>elimination that have more LLVM-IR to work on or clean up after Polly.
>Also, I believe the names of your columns, and the command line options
>given above are a little out of sync. I could e.g. not find a
>description for pBasic
Sorry, pBasic means pLoad. I have added the description for pBasic in the new proposal.
>> Based on these results, I plan to reduce Polly compiling overhead by the
>> following steps: First, I will try to remove unnecessary
>> canonicalization passes to reduce canonicalization time; Second, I will
>> try to remove or rewrite expensive analysis passes to reduce
>> optimization overhead; Third, I will try to improve the code generation
>> passes to reduce code generation overhead. Another interesting piece of
>> work is to let Polly bail out early, which can be very helpful in saving
>> compiling overhead when Polly cannot benefit the program.
>OK, this sounds like a reasonable approach. Some more points may be
>1) It is important to pick criteria you can evaluate your work on
>It is a good start that you identified two benchmarks. Especially
>looking into non-polybench code is very valuable. You should make sure
>that you evaluate your work throughout the project to see the benefit
>of your changes. In fact, it may even be worthwhile to set up a Polly
>performance tester to track the compile time with Polly enabled and how
>your changes influence it.
Yes, you are right. Picking criteria for continuous evaluation is very important prerequisite work. I have added an extra stage (stage 1) for it. In my opinion, the "number of scops optimized by Polly" can serve as the performance criterion, while the "total compile-time overhead" and the "compile-time overhead of each Polly pass" can serve as the compile-time criteria. I will set up the testing environment and integrate it into the Polly SVN repository as soon as possible.
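[Editorial note: a minimal sketch of such a measurement harness, assuming Python; the command lines shown in comments are placeholders modeled on the pollycc alias from the tables, not the actual test-infrastructure scripts:]

```python
import statistics
import subprocess
import time

def time_command(cmd, runs=10):
    """Run `cmd` several times and return the median wall-clock time,
    mirroring how each benchmark is compiled repeatedly and summarized
    with ministat in the proposal."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def overhead_percent(base, with_polly):
    """Compile-time overhead relative to the plain clang baseline."""
    return (with_polly / base - 1.0) * 100.0

# Hypothetical usage, with flags as in the proposal's pollycc alias:
#   base  = time_command(["clang", "-O3", "-c", "2mm.c"])
#   polly = time_command(["clang", "-O3", "-load", "LLVMPolly.so",
#                         "-mllvm", "-polly", "-c", "2mm.c"])
#   print(overhead_percent(base, polly))
```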
>2) Add some specific bug reports you are planning to look into
>This bug report shows a large performance problem in Polly that is
>mainly due to creating a very difficult dependency analysis problem:
>There was a larger discussion on the Polly mailing list that discusses
I have added this kind of work plan to stage 3 in the new proposal.
>> 3. Details about the project:
>> StageI -- Remove unnecessary canonicalization transformation. [Week 1-2]
>> Polly relies on some canonicalization passes to simplify the following
>> analysis and optimization. The canonicalization passes include
>> loop-simplify, region-simplify, induction variable canonicalization and
>> independent blocks. For example, the region-simplify pass is run to give
>> each region a single entry edge and a single exit edge before
>> -polly-detect. However, such an approach introduces unnecessary
>> modifications that increase compile time even in cases where Polly
>> cannot optimize the program.
>> A first step is to remove -region-simplify pass. For this purpose, I
>> have modified the scop detection pass and polly code generation pass to
>> allow scops with multiple entry edges and multiple exit edges. Details
>> can be found in the following patches (thanks for all the help from the
>> Polly group):
>> r179673: Remove unneeded RegionSimplify pass
>> r179586: Support SCoPs with multiple entry edges
>> r179159: Support SCoPs with multiple exit edges
>> r179158: Codegen: Replace region exit and entries recursively
>> r179157: RegionInfo: Add helpers to replace entry/exit recursively
>> r178530: ScopDetection: Use isTopLevelRegion
>> In this project, I plan to spend two weeks reducing the canonicalization
>> overhead:
>It was a good idea to write down what you plan to do each week.
>> Week 1: Profile the compiling overhead of each canonicalization pass,
>> including PromoteMemoryToRegisterPass, CFGSimplificationPass,
>> ReassociatePass, LoopRotatePass, InstructionCombiningPass,
>> IndVarSimplifyPass, CodePreparationPass and LoopSimplifyPass.
>> Week 2: Remove or improve the one or two most expensive canonicalization
>> passes. I will also try to revise the pass ordering to move some
>> expensive canonicalization passes later.
>Instead of speeding up the canonicalization passes your focus should
>really be integrating Polly into the -O3 pass chain without the need to
>have any additional canonicalization passes. This part is not so much
>about the patch itself that implements it. It rather requires careful
>analysis of how the number of detected scops changes when moving Polly.
>At the moment we optimized for optimal scop coverage while neglecting
>compile time. Now we want both, optimal scop coverage and good compile time.
>Another point that can be mentioned is removing the need for induction
>variable canonicalization. We currently do this using the -polly-indvars
>pass. However, the new option -polly-codegen-scev enables us to remove
>this pass entirely. This could also be an interesting performance
>problem as -polly-codegen-scev produces a lot cleaner LLVM-IR at code
>generation time, which may take more time to generate but it may also
>require less time to be cleaned up. This could also be interesting to
Work plans for this are added to stage 4 in the new proposal.
You are right, it would be great if we could completely remove the canonicalization passes. I will first try to remove -polly-indvars, and then investigate the other canonicalization passes.
>> StageII -- Remove or rewrite expensive analysis passes for compiling
>> performance. [Week 3-5]
>> There are many optimization libraries for Polly, such as ScopLib, Pluto,
>> ISL and JSON. To balance the tradeoff between code
>> performance and compiling overhead, I will profile each optimization
>> library and try to improve some of these libraries to reduce compiling
>The only relevant one is currently isl. It may in some cases be useful
>to compare against Pluto, though. No need to optimize scoplib or JSON.
Yes, your comments are right. However, it seems Polly uses CLooG to generate code by default, which is much slower than ISL. Do you mean we will use ISL by default in the future?
>> Week 3: Profile the compiling overhead of each Polly optimization
>> library, including ScopLib, Pluto, ISL and JSON.
>Instead of profiling per library, I would rather profile per Polly pass
>You could do this later for several programs, but it would be good to
>have this already today for a single program to get an idea where time
>is spent and what needs optimization.
Yes, the new proposal pays more attention to profiling and improving each Polly pass. "--time-passes" is really a very useful option!
>> Week 4: Profile the compiling overhead of each optimization pass for one
>> or two libraries (such as ISL and ScopLib). For example, ISL provides
>> many optimization passes, such as dependence simplification, schedule
>> optimization, and various loop transformations.
>> Week 5: Remove some expensive optimization passes and rewrite some
>> critical but expensive optimization passes.
>> StageIII -- Improve code generation passes for compiling performance.
>> [Week 6-9]
>> Our evaluations show that Polly code generation passes are very
>> expensive, especially for some benchmarks like ludcmp.c and adi.c. Polly
>> code generation passes can increase the compiling time by 500% or more
>> (See table 1). My plan is to improve various code generation passes.
>Can you verify your numbers here? You report the following for ludcmp:
> clang pBasic pNoOpt pNoGen pOPt
>ludcmp.c 0.157 0.1602 0.2002 1.0761 1.3175
> pBasic% pNoGen% pNoOpt% pOpt%
> 2.04% 27.52% 585.41% 739.17%
>I have the feeling the headings of the pNoGen% and pNoOpt% columns have
>been switched accidentally. At least from the numbers above, I see an
>increase from 0.16 to 0.20 for code generation, which is far from being
>a 500% increase. On the other side, the optimization itself seems to add
>a larger amount of time as well as the code generation of the optimized
Sorry, the right order should be "clang pBasic pNoGen pNoOpt pOPt pBasic% pNoGen% pNoOpt% pOpt%". I have fixed this problem in the new proposal.
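[Editorial note: with the corrected column order, each percentage column is just the raw time divided by the clang baseline, minus one. A quick illustrative check against the ludcmp.c row quoted above:]

```python
# Recompute the ludcmp.c percentage columns from the raw times in table 1,
# using plain clang as the baseline. With the corrected column order
# (clang, pBasic, pNoGen, pNoOpt, pOpt), each percentage is time/clang - 1.
clang, pbasic, pnogen, pnoopt, popt = 0.157, 0.1602, 0.2002, 1.0761, 1.3175

def pct(t, base=clang):
    """Overhead of configuration time `t` over the clang baseline, in %."""
    return round((t / base - 1.0) * 100.0, 2)

print(pct(pbasic), pct(pnogen), pct(pnoopt), pct(popt))
# → 2.04 27.52 585.41 739.17, matching the table's percentage row
```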
>> Week 6: Profile the compiling overhead of each Polly code generation
>> pass, especially for ISL code generation.
>> Week 7: Remove unnecessary analysis for code generation. Currently, the
>> Polly code generation pass depends on a lot of analysis passes, such as
>> DominatorTree, IslAstInfo, RegionInfo, ScalarEvolution, ScopDetection
>> and ScopInfo. I will try to remove some of the expensive analysis passes.
>Those passes add little overhead to the code generation. In fact the
>analysis is normally already available, such that these analysis
>requirements are for free. They have been added here mainly to allow the
>code generation to update them, such that we do not need to spend time
>rebuilding them later.
>> Week 8-9: Rewrite some expensive functions for Polly code generation
>> based on profiling results.
>This is still very vague. I propose to
>> StageIV -- Let Polly bail out early. [Week 10]
>> Week 10: Add support in canicalization step or optimization step to
> Typo -----> canonicalization
>> allow Polly to bail out early if it cannot benefit the program.
>> StageV -- Improve other parts. [Week 11-12]
>> Week 11: Improve other parts of Polly. In particular, I will focus on
>> some expensive helper functions such as the TempScop analysis. This
>> helper is critical and expensive.
>How do you know TempScop is expensive?
>> Week 12: Integrate all improvements
>> and evaluate the whole Polly with multiple benchmarks.
>I think the only way to do this project is to continuously evaluate your
>changes on Polybench and mediabench and to directly integrate them
>into the svn repository. This should be made clear at the beginning and
>I believe it is very fine to spend more time on the individual steps,
>such that we can make sure the changes are properly evaluated and
Yes, I will set up the environment as soon as possible and integrate it into the Polly SVN repository. I have already finished some scripts, and I think this work can be done in the next week.
>> 4. Profit for LLVM users and Polly users
>> This project can benefit both LLVM users and Polly users. For LLVM
>> users, this project will make Polly more acceptable if it provides
>> extra performance gains at the cost of little extra compiling overhead.
>> For Polly users, this project will make Polly more powerful by
>> significantly reducing compiling overhead and improving code quality.
>You could make your goals more concrete saying that we want to show that
>by enabling Polly we can significantly optimize the polybench
>benchmarks, while at the same time no prohibitively large compile time
>increase can be seen for mediabench. Reaching this goal would be a great
>> Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with
>> 2GB DDR2 memory. Each benchmark is run multiple times and data are
>> collected using ministat (https://github.com/codahale/ministat). Results
>> are shown in table 1 and table 2. Five cases are tested (with the alias
>> pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly"):
>> *clang:  clang -O3
>> *pLoad:  clang -O3 -load LLVMPolly.so
>> *pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
>> *pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none
>> *polly:  pollycc -O3
>> Table 1: Compile time for PolyBench (seconds; each benchmark is run 10
>> times)
>>                 clang  pBasic  pNoOpt  pNoGen    pOPt  pBasic%  pNoGen%  pNoOpt%    pOpt%
>> 2mm.c          0.1521  0.1593  0.1711  0.3235  0.7247    4.73%   12.49%  112.69%  376.46%
>> atax.c         0.1386  0.1349  0.1449  0.2066  0.313     0.00%    0.00%   49.06%  125.83%
>> covariance.c   0.1498  0.1517  0.1526  0.3561  0.7706    1.27%    1.87%  137.72%  414.42%
>> gemver.c       0.1562  0.1587  0.1724  0.2674  0.3936    1.60%   10.37%   71.19%  151.99%
>> instrument.c   0.1062  0.1075  0.1124  0.123   0.1216    0.00%    5.84%   15.82%   14.50%
>> ludcmp.c       0.157   0.1602  0.2002  1.0761  1.3175    2.04%   27.52%  585.41%  739.17%
>> 3mm.c          0.1529  0.1559  0.1826  0.4134  1.0436    1.96%   19.42%  170.37%  582.54%
>> bicg.c         0.1244  0.1268  0.1353  0.1977  0.2828    1.93%    8.76%   58.92%  127.33%
>> doitgen.c      0.1492  0.1505  0.1644  0.3325  0.8971    0.00%   10.19%  122.86%  501.27%
>> gesummv.c      0.1224  0.1279  0.134   0.1999  0.2937    4.49%    9.48%   63.32%  139.95%
>> jacobi.c       0.1444  0.1506  0.1592  0.3912  0.8494    0.00%   10.25%  170.91%  488.23%
>> seidel.c       0.1337  0.1353  0.1462  0.6299  0.9155    0.00%    9.35%  371.13%  584.74%
>> adi.c          0.1593  0.1621  0.1835  1.4375  1.849     1.76%   15.19%  802.39% 1060.70%
>> correlation.c  0.1579  0.1596  0.1802  0.3393  0.6337    1.08%   14.12%  114.88%  301.33%
>> gemm.c         0.1407  0.1432  0.1576  0.2421  0.4477    1.78%   12.01%   72.07%  218.20%
>> gramschmidt.c  0.1331  0.1349  0.1509  0.3069  0.4138    0.00%   13.37%  130.58%  210.89%
>> lu.c           0.1419  0.1443  0.1581  0.3156  0.3943    1.69%   11.42%  122.41%  177.87%
>> average                                                  1.26%   13.22%  248.47%  393.80%
>To improve readability, it may be worth ensuring this fits into 80
>columns. You may be able to reduce the number of digits used here.
>You could probably increase the readability of your proposal further if
>you use markdown. See here for an example of how a markdown file looks
>at github: https://gist.github.com/micmcg/976172 and here the raw version
>You basically need to use the file ending '.md' and you can then use
>markdown syntax to format your text. The very same syntax will also
>improve the readability of the proposal on the mailing list.
Thank you so much for your very helpful advice. I have rewritten the proposal in markdown. It is really interesting and powerful.
>All the best,