<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial">Dear Tobias,<div><br></div><div>Thanks for your timely reply. Your advice is really helpful.</div><div><br></div><div>I have updated the proposal at <a href="https://gist.github.com/tanstar/5508153" target="_blank" style="line-height: 1.7;">https://gist.github.com/tanstar/5508153</a>. The major changes are:</div><div>(1) Added tables 3 and 4 to show the compile-time overhead of the top 15 hot passes;</div><div>(2) Described a new schedule for this project. The new schedule pays more attention to reducing the compile-time overhead of hot passes and now includes eight stages;</div><div>(3) Enriched the proposal with concrete work plans for each stage;</div><div>(4) Rewrote the proposal in <span style="white-space: pre-wrap; line-height: 1.7;">markdown to make it more readable.</span></div><div><br></div><div><span style="white-space: pre-wrap; line-height: 1.7;">At 2013-05-02 19:12:37,"Tobias Grosser" <<a href="mailto:tobias@grosser.es">tobias@grosser.es</a>> wrote:</span></div><div><pre>>On 04/26/2013 05:08 AM, tanmx_star wrote:
>> Hi all,
>
>Hi,
>
>thanks for the update and sorry for the delay in reviewing. I just had a
>look at your proposal.
>
>
>> I have updated my GSoC proposal: "FastPolly: Reducing LLVM-Polly Compiling overhead" (https://gist.github.com/tanstar/5441808). I think the pass ordering problem you discussed earlier can also be investigated in this project!
>
>Yes, figuring out the optimal pass ordering sequence is very good.
>
>> Is there any comment or advice about my proposal? I appreciate all your help and advice.
>
>
>> 1. Summary:
>>
>> LLVM-Polly is a promising polyhedral optimizer for data-locality and
>> parallelism, which takes advantage of multi-cores, cache hierarchies,
>> short vector instructions as well as dedicated accelerators. However,
>> Polly analysis and optimization can lead to significant compiling
>> overhead, which makes it much less attractive for LLVM users. I argue
>> that maintaining fast compiling time when Polly is enabled is very
>> important, especially if we want to think of enabling Polly in default.
>> Based on this assumption, I try to reduce Polly compiling overhead in
>> this project.
>
>Sounds good.
>
>
>> 2. Motivation:
>>
>> LLVM is an incredible open-source project. It has been widely in C/C++
>
> You miss a verb here ^^^
>
>> compilers, high-level synthesis compilers, virtual machines, optimizing
>> tools, etc. As a graduate student, I am going to work on compiler
>> analysis and optimization, especially on program vectorization and
>> parallelization. I find Polly is a very useful and powerful polyhedral
>> optimizer. I would like to use this tool and contribute to this project.
>>
>> When I was using Polly tool, I found that Polly optimization can lead to
>No need for 'tool' here ^^^
>
>> significant compiling overhead. On average, polly optimization will
>> increase the compiling time by 393% for PolyBench benchmarks and by 53%
>> for MediaBench benchmarks compared with clang. That means if you want to
>> gain from Polly, you have to pay 4 times extra compiling overhead. Even
>> if you do not want to gain much from Polly, you still have to pay 53%
>> compiling overhead. Such expensive compiling overhead would make the
>> Polly much less attractive to LLVM users.
>
>Good point.
>
>> In this project, I try to reduce Polly compiling overhead by removing
>
>I would call it 'compile-time overhead' instead of 'compiling overhead'.
>
>> unnecessary passes and improving critical passes. For this purpose, I
>> firstly try to find out where the compiling overhead comes from. When
>> Polly optimizes a program, it takes the following steps: 1) Polly
>> canonicalization: prepare some basic information and do some basic
>> transformation, such as loop-simplify and region-simplify. 2) LLVM-IR
>> to Polly description: detect polly scops and translates the detected
>> scops into a polyhedral representation. 3) Polly optimization: analyze
>> and optimize polyhedral scops. 4) Polly description to LLVM-IR:
>> translates the polyhedral description back into new LLVM-IR.
>>
>> In attched table 1 and 2, pBasic shows the overhead of loading the
> attached
>
>> LLVMPolly.so; pNoGen shows the overhead of step 1) and 2); pNoOpt shows
>> the overhead of step 1), 2) and 4). So the compiling overhead of Polly
>> can be divided into three parts:
>> PolyBench: canonicalization(13%-1%=12%), code generation(248%-13%=235%)
>> and optimization(393%-248%=145%) MediaBench:canonicalization( 9%-1%=
>> 8%), code generation( 43%- 9%= 34%) and optimization( 53%- 43%= 10%)
>
>Thanks for adding numbers for pNoGen. Having only 10% runtime increase
>if Polly is not used is a good sign, especially for the amount of
>canonicalization passes we run. This makes me confident we can get it to
>an even smaller number.
>
>The other numbers are large, but there are likely ways to improve on
>this significantly. Also, it would be good to show at least for one
>benchmark which passes the different numbers actually contain. (You can
>use -debug-pass=Structure for this). E.g. the code generation time looks
>rather large. I suppose most of the time is not actually spent in code
>generation, but also in the LLVM passes such as common subexpression
>elimination that have more LLVM-IR to work on or clean up after Polly
>was run.</pre><pre>>
>Also, I believe the names of your columns, and the command line options
>given above are a little out of sync. I could e.g. not find a
>description for pBasic</pre><pre><span style="line-height: 1.7;">Sorry, </span>pBasic means pLoad. I have added the description for pBasic in the new proposal.</pre><pre>>
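</pre><pre>By the way, to make the three-part breakdown above easy to recheck, I now compute it with a small script like the following (just a sketch; the function and argument names are my own shorthand for the pBasic%, pNoGen%, pNoOpt% and pOpt% columns of tables 1 and 2):

```python
# Sketch: split Polly's total compile-time overhead into the three parts
# discussed above. Inputs are the measured overheads in percent over plain
# clang (the pBasic%, pNoGen%, pNoOpt% and pOpt% columns).
def breakdown(p_basic, p_nogen, p_noopt, p_opt):
    return {
        "canonicalization": p_nogen - p_basic,  # steps 1) and 2), minus pass loading
        "code generation": p_noopt - p_nogen,   # step 4)
        "optimization": p_opt - p_noopt,        # step 3)
    }

# PolyBench averages from table 1: 1%, 13%, 248%, 393%
print(breakdown(1, 13, 248, 393))
# {'canonicalization': 12, 'code generation': 235, 'optimization': 145}
```
</pre><pre>>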
>> Based on these results, I plan to reduce Polly compiling overhead by the
>> following steps: First, I will try to remove unnecessary
>> canonicalization passes to reduce canonicalization time; Second, I will
>> try to remove or rewrite expensive analysis passes to reduce
>> optimization overhead; Third, I will try to improve the code generation
>> passes to reduce code generation overhead. Another interesting work is
>> to let the polly bail out early, which can be very helpful to save
>> compiling overhead if Polly cannot benefit the program.
>
>OK, this sounds like a reasonable approach. Some more points may be
>worth adding:
>
>1) It is important to pick criteria you can evaluate your work on
>
>It is a good start that you identified two benchmarks. Especially
>looking into non-polybench code is very valuable. You should make sure
>that you evaluate your work throughout the project to see the benefit
>of your changes. In fact, it may even be worthwhile to set up a Polly
>performance tester to track the compile time with Polly enabled and how
>your changes influence it.</pre><pre>Yes, you are right. <span style="line-height: 1.7;">Picking criteria for continuous evaluation</span><span style="line-height: 1.7;"> is very important prerequisite work, so I have added an extra stage (stage 1) for it. In my opinion, the "number of scops optimized by Polly" can serve as the performance criterion, while the "total compile-time overhead" and </span><span style="line-height: 1.7;">"compile-time overhead of each Polly pass" </span><span style="line-height: 1.7;">can serve as the compile-time overhead criteria. I will set up the testing environment and integrate it into the Polly SVN repository as soon as possible.</span></pre><pre>>
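</pre><pre>As a first sketch of what the tracking harness might look like (all the names and the tolerance value below are my own assumptions, not existing Polly code):

```python
# Sketch of the two compile-time criteria: total overhead over plain clang,
# and a per-pass regression check between two revisions.
def overhead_percent(t_polly, t_clang):
    """Total compile-time overhead of Polly relative to plain clang."""
    return (t_polly / t_clang - 1.0) * 100.0

def regressions(baseline, current, tolerance=5.0):
    """Passes whose time grew by more than `tolerance` percent since baseline."""
    bad = []
    for name, t_new in current.items():
        t_old = baseline.get(name)
        if t_old and (t_new / t_old - 1.0) * 100.0 > tolerance:
            bad.append(name)
    return bad

# ludcmp.c from table 1: 0.157s with clang, 1.3175s with Polly enabled
print(round(overhead_percent(1.3175, 0.157), 1))  # 739.2, i.e. the 739.17% above
```
</pre><pre>>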
>2) Add some specific bug reports you are planning to look into
>
>This bug report shows a large performance problem in Polly that is
>mainly due to creating a very difficult dependency analysis problem:
>llvm.org/PR15643
>
>There was a larger discussion on the Polly mailing list that discusses
>this bug.</pre><pre>I have added work plans for this kind of issue to stage 3 in the new proposal.</pre><pre><span style="line-height: 1.7;">></span></pre><pre>>> 3. Details about the project:
>>
>> StageI -- Remove unnecessary canonicalization transformation. [Week 1-2]
>>
>> Polly relies on some canonicalization passes to simplify the following
>> analysis and optimization. Canonicalization passes include
>> loop-simplify, region-simplify, Induction variable canonicalization and
>> block independent. For example, region-simplify pass is run to simplify
>> the region to single entry and single exit edge before -polly-detect.
>> However, such approach will introduce unnecessary modifications that
>> increase compile time even in the cases where Polly cannot optimize the
>> code.
>>
>> A first step is to remove -region-simplify pass. For this purpose, I
>> have modified the scop detection pass and polly code generation pass to
>> allow scops with multiple entry edges and multiple exit edges. Details
>> can be referred to the following patch files: (Thanks for all the help
>> from Polly group)
>
>> r179673: Remove unneeded RegionSimplify pass r179586: Support SCoPs with
>> multiple entry edges r179159: Support SCoPs with multiple exit edges
>> r179158: Codegen: Replace region exit and entries recursively r179157:
>> RegionInfo: Add helpers to replace entry/exit recursively r178530:
>> ScopDetection: Use isTopLevelRegion
>>
>> In this project, I plan to spend two weeks to reduce canonicalization
>> overhead.
>
>It was a good idea to write down what you plan to do each week.
>
>> Week 1: Profile the compiling overhead of each canonicalization pass,
>> including PromoteMemoryToRegisterPass, CFGSimplificationPass,
>> ReassociatePass, LoopRotatePass, InstructionCombiningPass,
>> IndVarSimplifyPass, CodePreparationPass and LoopSimplifyPass. Week 2:
>> Remove or improve one or two most expensive canonicalization passes. I
>> will also try to revise the pass ordering to move some expensive
>> canonicalization passes later.
>
>Instead of speeding up the canonicalization passes your focus should
>really be integrating Polly into the -O3 pass chain without the need to
>have any additional canonicalization passes. This part is not so much
>about the patch itself that implements it. It rather requires careful
>analysis how the number of detected scops changes when moving Polly.
>At the moment we optimized for optimal scop coverage while neglecting
>compile time. Now we want both, optimal scop coverage and good compile time.</pre><pre>>
>Another point that can be mentioned is removing the need for induction
>variable canonicalization. We currently do this using the -polly-indvars
>pass. However, the new option -polly-codegen-scev enables us to remove
>this pass entirely. This could also be an interesting performance
>problem as -polly-codegen-scev produces a lot cleaner LLVM-IR at code
>generation time, which may take more time to generate but it may also
>require less time to be cleaned up. This could also be interesting to
>investigate.
></pre><pre>Work plans for this are added to stage 4 in the new proposal.</pre><pre>You are right, it would be great if we could completely remove the <span style="line-height: 1.7;">canonicalization passes. </span><span style="line-height: 1.7;">I will try to remove -polly-indvars first, and then investigate the other canonicalization passes.</span><span style="line-height: 1.7;"> </span></pre><pre>>
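</pre><pre>To make the scop-coverage part of this concrete, I plan to compare the number of detected scops before and after each such change, roughly like this (a sketch; the counts would come from the scop detection statistics, and the sample numbers below are made up):

```python
# Sketch: flag benchmarks that lose scops after a change such as removing
# -polly-indvars or moving Polly in the -O3 pass chain. The dictionaries map
# benchmark name to the number of scops Polly detected; the data is invented.
def lost_coverage(before, after):
    lost = {}
    for bench, n_old in before.items():
        n_new = after.get(bench, 0)
        if n_old > n_new:
            lost[bench] = (n_old, n_new)
    return lost

before = {"2mm.c": 1, "adi.c": 2, "seidel.c": 1}
after = {"2mm.c": 1, "adi.c": 1, "seidel.c": 1}
print(lost_coverage(before, after))  # {'adi.c': (2, 1)}
```
</pre><pre>>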
>
>> StageII -- Remove or rewrite expensive analysis passes for compiling
>> performance. [Week 3-5]
>>
>> There are many optimization libraries for Polly, such as ScopLib, Pluto,
>> ISL and Jason optimization. To balance the tradeoff between code
> JSON
>> performance and compiling overhead, I will profile each optimization
>> library and try to improve some of these libraries to reduce compiling
>> overhead.
>
>The only relevant one is currently isl. It may in some cases be useful
>to compare against Pluto so. No need to optimize scoplib or JSON.</pre><pre>Yes, you are right. However, <span style="line-height: 1.7;">it seems Polly uses Cloog for code generation by default, which is much slower than ISL. Do you mean ISL will become the default in the future?</span></pre><pre>>
>> Week 3: Profile the compiling overhead of each Polly optimization
>> library, including ScopLib, Pluto, ISL and Jason.
>
>Instead of profiling per library, I would rather profile per Polly pass
>using --time-passes
>
>You could do this later for several programs, but it would be good to
>have this already today for a single program to get an idea where time
>is spent and what needs optimization.</pre><pre>Yes, the new proposal pays more attention to profiling and improving each Polly pass. "<span style="line-height: 1.7;">--time-passes" is a really useful option!</span></pre><pre>>
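</pre><pre>For the per-pass profiles, a small helper to rank the hottest passes from the timing report might look like this (a sketch: I assume a simplified "seconds  pass name" line format here; the real --time-passes output has more columns, so the real parser will need to be stricter):

```python
# Sketch: rank the hottest passes from a -time-passes style report.
# Assumed simplified input format: one pass per line, "seconds  pass name".
def hottest_passes(report, n=3):
    rows = []
    for line in report.strip().splitlines():
        secs, name = line.strip().split(None, 1)
        rows.append((float(secs), name))
    rows.sort(reverse=True)  # most expensive pass first
    return [name for _, name in rows[:n]]

sample = """
0.0412  Polly - Detect static control parts (SCoPs)
0.1030  Polly - Create polyhedral description of Scops
0.0050  Dominator Tree Construction
"""
print(hottest_passes(sample, 2))
# ['Polly - Create polyhedral description of Scops',
#  'Polly - Detect static control parts (SCoPs)']
```
</pre><pre>>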
>> Week 4: Profile the
>> compiling overhead of each optimization pass for one or two libraries
>> (such as ISL and ScopLib). For example, ISL optimization provides many
>> optimization passes, such as dependence simplify, schedule optimization,
>> and various loop transformation. Week 5: remove some expensive
>> optimization passes and rewrite some critical but expensive optimization
>> passes.
>
>
>
>
>> StageIII -- Improve code generation passes for compiling performance.
>> [Week 6-9]
>>
>> Our evalutions show that polly code generation passes are very
>> expensive, especially for some benchmarks like ludcmp.c and adi.c. Polly
>> code generation passes can increase the compiling time by 500% or more
>> (See table 1). My plan is to improve various code generation passes.
>
>Can you verify your numbers here. You report for ludcmp the following:
>
> clang pBasic pNoOpt pNoGen pOPt
>ludcmp.c 0.157 0.1602 0.2002 1.0761 1.3175
>
> pBasic% pNoGen% pNoOpt% pOpt%
> 2.04% 27.52% 585.41% 739.17%
>
>I have the feeling the headings of the pNoGen% and pNoOpt% columns have
>been switched accidentally. At least from the numbers above, I see an
>increase from 0.16 to 0.20 for code generation, which is far from being
>a 500% increase. On the other side, the optimization itself seems to add
>a larger amount of time as well as the code generation of the optimized
>code.</pre><pre>Sorry, the right order should be "<span style="line-height: 1.7;">clang pBasic pNoGen pNoOpt pOpt </span><span style="line-height: 1.7;">pBasic% pNoGen% pNoOpt% pOpt%". I have fixed this problem in the new proposal.</span></pre><pre><span style="line-height: 1.7;">>> Week 6: Profile the compiling overhead of each Polly code generation</span></pre><pre>>> pass, especially for ISL code generation. Week 7: Remove unnecessary
>> analysis for code generation. Currently, Polly code generation pass
>> dependents on a lot of analysis passes such as DominatorTree,
>> IslAstInfo, RegionInfo, ScalarEvolution, ScopDetection, ScopInfo. I will
>> try to remove some of expensive analysis passes.
>
>Those passes add little overhead to the code generation. In fact the
>analysis is normally already available, such that these analysis
>requirements are for free. They have been added her mainly to allow the
>code generation to update them, such that we do not need to spend time
>rebuilding them later.
>
> > Week 8-9: Rewrite some
>> expensive functions for Polly code generation based on profiling
>> information.
>
>This is still very vague. I propose to
>
>> StageIV -- Let Polly bail out early. [Week 10]
>>
>> Week 10: Add support in canicalization step or optimization step to
> Typo -----> canonicalization
>
>> allow Polly boil out early if it cannot benefit programs.
>
>
>> StageV -- Improve other parts. [Week 11-12]
>>
>> Week 11: Improve other parts of Polly. Especially, I will focus on some
>> expensive helper functions such as TempScop analysis. This helper
>> function is critical and expensive.
>
>How do you know TempScop is expensive?</pre><pre><span style="line-height: 1.7;">></span></pre><pre>>> Week 12: Integrate all improvements
>> and evaluate the whole Polly with multiple benchmarks.
>
>I think the only way to do this project is to continuously evaluate your
>changes on Polybench and mediabench and to directly integrate them
>into the svn repository. This should be made clear at the beginning and
>I believe it is very fine to spend more time on the individual steps,
>such that we can make sure the changes are properly evaluated and
>integrated.</pre><pre>Yes, I will set up the environment as soon as possible and integrate it into the Polly SVN repository. I have already finished some scripts, and I think this work can be done within the next week.</pre><pre><span style="line-height: 1.7;">></span></pre><pre>>> 4. Profit for LLVM users and Polly users
>>
>> This project can benefit both LLVM users and Polly users. For LLVM
>> users, our project will make the Polly more acceptable if it can
>> provides extra performance gains within little extra compiling overhead.
>> For Polly users, this project will make the Polly more powerful by
>> significantly reducing compiling overhead and improving code quality.
>
>Nice.
>
>You could make your goals more concrete saying that we want to show that
>by enabling Polly we can significantly optimizing the polybench
>benchmarks, while at the same time no prohibitively large compile time
>increase can be seen for mediabench. Reaching this goal would be a great
>step forward.
>
>> [Attachments]
>>
>> Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with
>> 2GB DDR2 memory. Each benchmark is run multiple times and data are
>> collected using ministat (https://github.com/codahale/ministat). Results
>> are shown in table 1 and table 2. Five cases are tested: (alias
>> pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly) *clang: clang -O3
>> *pLoad: clang -O3 -load LLVMPolly.so *pNoGen:pollycc -O3 -mllvm
>> -polly-optimizer=none -mllvm -polly-code-generatorr=none *pNoOpt:pollycc
>> -O3 -mllvm -polly-optimizer=none *polly: pollycc -O3
>>
>> Table 1: Compile time for PolyBench (Seconds, each benchmark is run 10
>> times)
>>
>> clang pBasic pNoOpt pNoGen pOPt pBasic% pNoGen%
>> pNoOpt% pOpt% 2mm.c 0.1521 0.1593 0.1711 0.3235 0.7247
>> 4.73% 12.49% 112.69% 376.46% atax.c 0.1386 0.1349 0.1449
>> 0.2066 0.313 0.00% 0.00% 49.06% 125.83% covariance.c 0.1498
>> 0.1517 0.1526 0.3561 0.7706 1.27% 1.87% 137.72% 414.42% gemver.c
>> 0.1562 0.1587 0.1724 0.2674 0.3936 1.60% 10.37% 71.19% 151.99%
>> instrument.c 0.1062 0.1075 0.1124 0.123 0.1216 0.00% 5.84%
>> 15.82% 14.50% ludcmp.c 0.157 0.1602 0.2002 1.0761 1.3175 2.04%
>> 27.52% 585.41% 739.17% 3mm.c 0.1529 0.1559 0.1826 0.4134
>> 1.0436 1.96% 19.42% 170.37% 582.54% bicg.c 0.1244 0.1268
>> 0.1353 0.1977 0.2828 1.93% 8.76% 58.92% 127.33% doitgen.c
>> 0.1492 0.1505 0.1644 0.3325 0.8971 0.00% 10.19% 122.86% 501.27%
>> gesummv.c 0.1224 0.1279 0.134 0.1999 0.2937 4.49% 9.48%
>> 63.32% 139.95% jacobi.c 0.1444 0.1506 0.1592 0.3912 0.8494
>> 0.00% 10.25% 170.91% 488.23% seidel.c 0.1337 0.1353 0.1462
>> 0.6299 0.9155 0.00% 9.35% 371.13% 584.74% adi.c 0.1593
>> 0.1621 0.1835 1.4375 1.849 1.76% 15.19% 802.39% 1060.70%
>> correlation.c 0.1579 0.1596 0.1802 0.3393 0.6337 1.08% 14.12%
>> 114.88% 301.33% gemm.c 0.1407 0.1432 0.1576 0.2421 0.4477
>> 1.78% 12.01% 72.07% 218.20% gramschmidt.c 0.1331 0.1349 0.1509
>> 0.3069 0.4138 0.00% 13.37% 130.58% 210.89% lu.c 0.1419
>> 0.1443 0.1581 0.3156 0.3943 1.69% 11.42% 122.41% 177.87% average
>> 1.26% 13.22% 248.47% 393.80%
>
>To improve readability, it may be worth ensuring this fits into 80
>columns. You may be able to reduce the number of digits used here.
>
>You could probably increase the readability of your proposal further if
>you use markdown. See here for an example of how a markdown file looks
>at github: https://gist.github.com/micmcg/976172 and here the raw version
>https://gist.github.com/micmcg/976172/raw/70f1e0db278340bd8167c98fb880979b4571e847/gistfile1.md
>
>You basically need to use the file ending '.md' and you can then use
>markdown syntax to format your text. The very same syntax will also
>improve the readability of the proposal on the mailing list.</pre><pre>Thank you so much for your very helpful advice. I have rewritten the proposal in markdown. It is really an interesting and powerful tool.</pre><pre>>
>All the best,
>
>Tobias
>
</pre><pre>Best regards,</pre><pre>Star Tan</pre><pre><br></pre></div></div></div>