<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial">Dear Tobias,<div><br></div><div>Thanks for your timely reply. Your advice is really helpful.</div><div><br></div><div>I have updated the proposal at <a href="https://gist.github.com/tanstar/5508153" target="_blank" style="line-height: 1.7;">https://gist.github.com/tanstar/5508153</a>. The major changes are:</div><div>(1) Added tables 3 and 4 to show the compile-time overhead of the top 15 hot passes;</div><div>(2) Described a new schedule for this project. The new schedule pays more attention to reducing the compile-time overhead of hot passes and now includes eight stages;</div><div>(3) Enriched the proposal with concrete work plans for each stage;</div><div>(4) Rewrote the proposal in <span style="white-space: pre-wrap; line-height: 1.7;">markdown to make it more readable.</span></div><div><br></div><div><span style="white-space: pre-wrap; line-height: 1.7;">At 2013-05-02 19:12:37,"Tobias Grosser" <<a href="mailto:tobias@grosser.es">tobias@grosser.es</a>> wrote:</span></div><div><pre>>On 04/26/2013 05:08 AM, tanmx_star wrote:
>> Hi all,
>
>Hi,
>
>thanks for the update and sorry for the delay in reviewing. I just had a
>look at your proposal.
>
>
>> I have updated my GSoC proposal: "FastPolly: Reducing LLVM-Polly Compiling overhead" (https://gist.github.com/tanstar/5441808). I think the pass ordering problem you discussed earlier can also be investigated in this project!
>
>Yes, figuring out the optimal pass ordering sequence is very good.
>
>> Is there any comment or advice about my proposal? I appreciate all your help and advice.
>
>
>> 1. Summary:
>>
>> LLVM-Polly is a promising polyhedral optimizer for data-locality and
>> parallelism, which takes advantage of multi-cores, cache hierarchies,
>> short vector instructions as well as dedicated accelerators. However,
>> Polly analysis and optimization can lead to significant compiling
>> overhead, which makes it much less attractive for LLVM users. I argue
>> that maintaining fast compiling time when Polly is enabled is very
>> important, especially if we want to think of enabling Polly in default.
>> Based on this assumption, I try to reduce Polly compiling overhead in
>> this project.
>
>Sounds good.
>
>
>> 2. Motivation:
>>
>> LLVM is an incredible open-source project. It has been widely in C/C++
>
> You miss a verb here ^^^
>
>> compilers, high-level synthesis compilers, virtual machines, optimizing
>> tools, etc. As a graduate student, I am going to work on compiler
>> analysis and optimization, especially on program vectorization and
>> parallelization. I find Polly is a very useful and powerful polyhedral
>> optimizer. I would like to use this tool and contribute to this project.
>>
>> When I was using Polly tool, I found that Polly optimization can lead to
>No need for 'tool' here ^^^
>
>> significant compiling overhead. On average, polly optimization will
>> increase the compiling time by 393% for PolyBench benchmarks and by 53%
>> for MediaBench benchmarks compared with clang. That means if you want to
>> gain from Polly, you have to pay 4 times extra compiling overhead. Even
>> if you do not want to gain much from Polly, you still have to pay 53%
>> compiling overhead. Such expensive compiling overhead would make the
>> Polly much less attractive to LLVM users.
>
>Good point.
>
>> In this project, I try to reduce Polly compiling overhead by removing
>
>I would call it 'compile-time overhead' instead of 'compiling overhead'.
>
>> unnecessary passes and improving critical passes. For this purpose, I
>> firstly try to find out where the compiling overhead comes from. When
>> Polly optimizes a program, it takes the following steps: 1) Polly
>> canonicalization: prepare some basic information and do some basic
>> transformation, such as loop-simplify and region-simplify. 2) LLVM-IR
>> to Polly description: detect polly scops and translates the detected
>> scops into a polyhedral representation. 3) Polly optimization: analyze
>> and optimize polyhedral scops. 4) Polly description to LLVM-IR:
>> translates the polyhedral description back into new LLVM-IR.
>>
>> In attched table 1 and 2, pBasic shows the overhead of loading the
> attached
>
>> LLVMPolly.so; pNoGen shows the overhead of step 1) and 2); pNoOpt shows
>> the overhead of step 1), 2) and 4). So the compiling overhead of Polly
>> can be divided into three parts:
>> PolyBench: canonicalization(13%-1%=12%), code generation(248%-13%=235%)
>> and optimization(393%-248%=145%) MediaBench:canonicalization( 9%-1%=
>> 8%), code generation( 43%- 9%= 34%) and optimization( 53%- 43%= 10%)
>
>Thanks for adding numbers for pNoGen. Having only 10% runtime increase
>if Polly is not used is a good sign, especially for the amount of
>canonicalization passes we run. This makes me confident we can get it to
>an even smaller number.
>
>The other numbers are large, but there are likely ways to improve on
>this significantly. Also, it would be good to show at least for one
>benchmark which passes the different numbers actually contain. (You can
>use -debug-pass=Structure for this). E.g. the code generation time looks
>rather large. I suppose most of the time is not actually spent in code
>generation, but also in the LLVM passes such as common subexpression
>elimination that have more LLVM-IR to work on or clean up after Polly
>was run.</pre><pre>>
>Also, I believe the names of your columns, and the command line options
>given above are a little out of sync. I could e.g. not find a
>description for pBasic</pre><pre><span style="line-height: 1.7;">Sorry, </span>pBasic means pLoad. I have added the description for pBasic in the new proposal.</pre><pre>>
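</pre><pre>By the way, to make the three-part breakdown above easy to recheck, I now compute it with a small script like the following (just a sketch; the function and argument names are my own shorthand for the pBasic%, pNoGen%, pNoOpt% and pOpt% columns of tables 1 and 2):

```python
# Sketch: split Polly's total compile-time overhead into the three parts
# discussed above. Inputs are the measured overheads in percent over plain
# clang (the pBasic%, pNoGen%, pNoOpt% and pOpt% columns).
def breakdown(p_basic, p_nogen, p_noopt, p_opt):
    return {
        "canonicalization": p_nogen - p_basic,  # steps 1) and 2), minus pass loading
        "code generation": p_noopt - p_nogen,   # step 4)
        "optimization": p_opt - p_noopt,        # step 3)
    }

# PolyBench averages from table 1: 1%, 13%, 248%, 393%
print(breakdown(1, 13, 248, 393))
# {'canonicalization': 12, 'code generation': 235, 'optimization': 145}
```
</pre><pre>>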
>> Based on these results, I plan to reduce Polly compiling overhead by the
>> following steps: First, I will try to remove unnecessary
>> canonicalization passes to reduce canonicalization time; Second, I will
>> try to remove or rewrite expensive analysis passes to reduce
>> optimization overhead; Third, I will try to improve the code generation
>> passes to reduce code generation overhead. Another interesting work is
>> to let the polly bail out early, which can be very helpful to save
>> compiling overhead if Polly cannot benefit the program.
>
>OK, this sounds like a reasonable approach. Some more points may be
>worth adding:
>
>1) It is important to pick criteria you can evaluate your work on
>
>It is a good start that you identified two benchmarks. Especially
>looking into non-polybench code is very valuable. You should make sure
>that you evaluate your work throughout the project to see the benefit
>of your changes. In fact, it may even be worthwhile to set up a Polly
>performance tester to track the compile time with Polly enabled and how
>your changes influence it.</pre><pre>Yes, you are right. <span style="line-height: 1.7;">Picking criteria for continuous evaluation</span><span style="line-height: 1.7;"> is very important prerequisite work, so I have added an extra stage (stage 1) for it. In my opinion, the "number of scops optimized by Polly" can serve as the performance criterion, while the "total compile-time overhead" and </span><span style="line-height: 1.7;">"compile-time overhead of each Polly pass" </span><span style="line-height: 1.7;">can serve as the compile-time overhead criteria. I will set up the testing environment and integrate it into the Polly SVN repository as soon as possible.</span></pre><pre>>
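</pre><pre>As a first sketch of what the tracking harness might look like (all the names and the tolerance value below are my own assumptions, not existing Polly code):

```python
# Sketch of the two compile-time criteria: total overhead over plain clang,
# and a per-pass regression check between two revisions.
def overhead_percent(t_polly, t_clang):
    """Total compile-time overhead of Polly relative to plain clang."""
    return (t_polly / t_clang - 1.0) * 100.0

def regressions(baseline, current, tolerance=5.0):
    """Passes whose time grew by more than `tolerance` percent since baseline."""
    bad = []
    for name, t_new in current.items():
        t_old = baseline.get(name)
        if t_old and (t_new / t_old - 1.0) * 100.0 > tolerance:
            bad.append(name)
    return bad

# ludcmp.c from table 1: 0.157s with clang, 1.3175s with Polly enabled
print(round(overhead_percent(1.3175, 0.157), 1))  # 739.2, i.e. the 739.17% above
```
</pre><pre>>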
>2) Add some specific bug reports you are planning to look into
>
>This bug report shows a large performance problem in Polly that is
>mainly due to creating a very difficult dependency analysis problem:
>llvm.org/PR15643
>
>There was a larger discussion on the Polly mailing list that discusses
>this bug.</pre><pre>I have added work plans for this kind of issue to stage 3 in the new proposal.</pre><pre><span style="line-height: 1.7;">></span></pre><pre>>> 3. Details about the project:
>>
>> StageI -- Remove unnecessary canonicalization transformation. [Week 1-2]
>>
>> Polly relies on some canonicalization passes to simplify the following
>> analysis and optimization. Canonicalization passes include
>> loop-simplify, region-simplify, Induction variable canonicalization and
>> block independent. For example, region-simplify pass is run to simplify
>> the region to single entry and single exit edge before -polly-detect.
>> However, such approach will introduce unnecessary modifications that
>> increase compile time even in the cases where Polly cannot optimize the
>> code.
>>
>> A first step is to remove -region-simplify pass. For this purpose, I
>> have modified the scop detection pass and polly code generation pass to
>> allow scops with multiple entry edges and multiple exit edges. Details
>> can be referred to the following patch files: (Thanks for all the help
>> from Polly group)
>
>> r179673: Remove unneeded RegionSimplify pass r179586: Support SCoPs with
>> multiple entry edges r179159: Support SCoPs with multiple exit edges
>> r179158: Codegen: Replace region exit and entries recursively r179157:
>> RegionInfo: Add helpers to replace entry/exit recursively r178530:
>> ScopDetection: Use isTopLevelRegion
>>
>> In this project, I plan to spend two weeks to reduce canonicalization
>> overhead.
>
>It was a good idea to write down what you plan to do each week.
>
>> Week 1: Profile the compiling overhead of each canonicalization pass,
>> including PromoteMemoryToRegisterPass, CFGSimplificationPass,
>> ReassociatePass, LoopRotatePass, InstructionCombiningPass,
>> IndVarSimplifyPass, CodePreparationPass and LoopSimplifyPass. Week 2:
>> Remove or improve one or two most expensive canonicalization passes. I
>> will also try to revise the pass ordering to move some expensive
>> canonicalization passes later.
>
>Instead of speeding up the canonicalization passes your focus should
>really be integrating Polly into the -O3 pass chain without the need to
>have any additional canonicalization passes. This part is not so much
>about the patch itself that implements it. It rather requires careful
>analysis how the number of detected scops changes when moving Polly.
>At the moment we optimized for optimal scop coverage while neglecting
>compile time. Now we want both, optimal scop coverage and good compile time.</pre><pre>>
>Another point that can be mentioned is removing the need for induction
>variable canonicalization. We currently do this using the -polly-indvars
>pass. However, the new option -polly-codegen-scev enables us to remove
>this pass entirely. This could also be an interesting performance
>problem as -polly-codegen-scev produces a lot cleaner LLVM-IR at code
>generation time, which may take more time to generate but it may also
>require less time to be cleaned up. This could also be interesting to
>investigate.
></pre><pre>Work plans for this are added to stage 4 in the new proposal.</pre><pre>You are right, it would be great if we could completely remove the <span style="line-height: 1.7;">canonicalization passes. </span><span style="line-height: 1.7;">I will try to remove -polly-indvars first, and then investigate the other canonicalization passes.</span><span style="line-height: 1.7;"> </span></pre><pre>>
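</pre><pre>To make the scop-coverage part of this concrete, I plan to compare the number of detected scops before and after each such change, roughly like this (a sketch; the counts would come from the scop detection statistics, and the sample numbers below are made up):

```python
# Sketch: flag benchmarks that lose scops after a change such as removing
# -polly-indvars or moving Polly in the -O3 pass chain. The dictionaries map
# benchmark name to the number of scops Polly detected; the data is invented.
def lost_coverage(before, after):
    lost = {}
    for bench, n_old in before.items():
        n_new = after.get(bench, 0)
        if n_old > n_new:
            lost[bench] = (n_old, n_new)
    return lost

before = {"2mm.c": 1, "adi.c": 2, "seidel.c": 1}
after = {"2mm.c": 1, "adi.c": 1, "seidel.c": 1}
print(lost_coverage(before, after))  # {'adi.c': (2, 1)}
```
</pre><pre>>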
>
>> StageII -- Remove or rewrite expensive analysis passes for compiling
>> performance. [Week 3-5]
>>
>> There are many optimization libraries for Polly, such as ScopLib, Pluto,
>> ISL and Jason optimization. To balance the tradeoff between code
> JSON
>> performance and compiling overhead, I will profile each optimization
>> library and try to improve some of these libraries to reduce compiling
>> overhead.
>
>The only relevant one is currently isl. It may in some cases be useful
>to compare against Pluto so. No need to optimize scoplib or JSON.</pre><pre>Yes, you are right. However, <span style="line-height: 1.7;">it seems Polly uses Cloog for code generation by default, which is much slower than ISL. Do you mean ISL will become the default in the future?</span></pre><pre>>
>> Week 3: Profile the compiling overhead of each Polly optimization
>> library, including ScopLib, Pluto, ISL and Jason.
>
>Instead of profiling per library, I would rather profile per Polly pass
>using --time-passes
>
>You could do this later for several programs, but it would be good to
>have this already today for a single program to get an idea where time
>is spent and what needs optimization.</pre><pre>Yes, the new proposal pays more attention to profiling and improving each Polly pass. "<span style="line-height: 1.7;">--time-passes" is a really useful option!</span></pre><pre>>
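</pre><pre>For the per-pass profiles, a small helper to rank the hottest passes from the timing report might look like this (a sketch: I assume a simplified "seconds  pass name" line format here; the real --time-passes output has more columns, so the real parser will need to be stricter):

```python
# Sketch: rank the hottest passes from a -time-passes style report.
# Assumed simplified input format: one pass per line, "seconds  pass name".
def hottest_passes(report, n=3):
    rows = []
    for line in report.strip().splitlines():
        secs, name = line.strip().split(None, 1)
        rows.append((float(secs), name))
    rows.sort(reverse=True)  # most expensive pass first
    return [name for _, name in rows[:n]]

sample = """
0.0412  Polly - Detect static control parts (SCoPs)
0.1030  Polly - Create polyhedral description of Scops
0.0050  Dominator Tree Construction
"""
print(hottest_passes(sample, 2))
# ['Polly - Create polyhedral description of Scops',
#  'Polly - Detect static control parts (SCoPs)']
```
</pre><pre>>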
>> Week 4: Profile the
>> compiling overhead of each optimization pass for one or two libraries
>> (such as ISL and ScopLib). For example, ISL optimization provides many
>> optimization passes, such as dependence simplify, schedule optimization,
>> and various loop transformation. Week 5: remove some expensive
>> optimization passes and rewrite some critical but expensive optimization
>> passes.
>
>
>
>
>> StageIII -- Improve code generation passes for compiling performance.
>> [Week 6-9]
>>
>> Our evalutions show that polly code generation passes are very
>> expensive, especially for some benchmarks like ludcmp.c and adi.c. Polly
>> code generation passes can increase the compiling time by 500% or more
>> (See table 1). My plan is to improve various code generation passes.
>
>Can you verify your numbers here. You report for ludcmp the following:
>
> clang pBasic pNoOpt pNoGen pOPt
>ludcmp.c 0.157 0.1602 0.2002 1.0761 1.3175
>
> pBasic% pNoGen% pNoOpt% pOpt%
> 2.04% 27.52% 585.41% 739.17%
>
>I have the feeling the headings of the pNoGen% and pNoOpt% columns have
>been switched accidentally. At least from the numbers above, I see an
>increase from 0.16 to 0.20 for code generation, which is far from being
>a 500% increase. On the other side, the optimization itself seems to add
>a larger amount of time as well as the code generation of the optimized
>code.</pre><pre>Sorry, the right order should be "<span style="line-height: 1.7;">clang pBasic pNoGen pNoOpt pOpt </span><span style="line-height: 1.7;">pBasic% pNoGen% pNoOpt% pOpt%". I have fixed this problem in the new proposal.</span></pre><pre><span style="line-height: 1.7;">>> Week 6: Profile the compiling overhead of each Polly code generation</span></pre><pre>>> pass, especially for ISL code generation. Week 7: Remove unnecessary
>> analysis for code generation. Currently, Polly code generation pass
>> dependents on a lot of analysis passes such as DominatorTree,
>> IslAstInfo, RegionInfo, ScalarEvolution, ScopDetection, ScopInfo. I will
>> try to remove some of expensive analysis passes.
>
>Those passes add little overhead to the code generation. In fact the
>analysis is normally already available, such that these analysis
>requirements are for free. They have been added her mainly to allow the
>code generation to update them, such that we do not need to spend time
>rebuilding them later.
>
> > Week 8-9: Rewrite some
>> expensive functions for Polly code generation based on profiling
>> information.
>
>This is still very vague. I propose to
>
>> StageIV -- Let Polly bail out early. [Week 10]
>>
>> Week 10: Add support in canicalization step or optimization step to
> Typo -----> canonicalization
>
>> allow Polly boil out early if it cannot benefit programs.
>
>
>> StageV -- Improve other parts. [Week 11-12]
>>
>> Week 11: Improve other parts of Polly. Especially, I will focus on some
>> expensive helper functions such as TempScop analysis. This helper
>> function is critical and expensive.
>
>How do you know TempScop is expensive?</pre><pre><span style="line-height: 1.7;">></span></pre><pre>>> Week 12: Integrate all improvements
>> and evaluate the whole Polly with multiple benchmarks.
>
>I think the only way to do this project is to continuously evaluate your
>changes on Polybench and mediabench and to directly integrate them
>into the svn repository. This should be made clear at the beginning and
>I believe it is very fine to spend more time on the individual steps,
>such that we can make sure the changes are properly evaluated and
>integrated.</pre><pre>Yes, I will set up the environment as soon as possible and integrate it into the Polly SVN repository. I have already finished some scripts, and I think this work can be done within the next week.</pre><pre><span style="line-height: 1.7;">></span></pre><pre>>> 4. Profit for LLVM users and Polly users
>>
>> This project can benefit both LLVM users and Polly users. For LLVM
>> users, our project will make the Polly more acceptable if it can
>> provides extra performance gains within little extra compiling overhead.
>> For Polly users, this project will make the Polly more powerful by
>> significantly reducing compiling overhead and improving code quality.
>
>Nice.
>
>You could make your goals more concrete saying that we want to show that
>by enabling Polly we can significantly optimizing the polybench
>benchmarks, while at the same time no prohibitively large compile time
>increase can be seen for mediabench. Reaching this goal would be a great
>step forward.
>
>> [Attachments]
>>
>> Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with
>> 2GB DDR2 memory. Each benchmark is run multiple times and data are
>> collected using ministat (https://github.com/codahale/ministat). Results
>> are shown in table 1 and table 2. Five cases are tested: (alias
>> pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly) *clang: clang -O3
>> *pLoad: clang -O3 -load LLVMPolly.so *pNoGen:pollycc -O3 -mllvm
>> -polly-optimizer=none -mllvm -polly-code-generatorr=none *pNoOpt:pollycc
>> -O3 -mllvm -polly-optimizer=none *polly: pollycc -O3
>>
>> Table 1: Compile time for PolyBench (Seconds, each benchmark is run 10
>> times)
>>
>> clang pBasic pNoOpt pNoGen pOPt pBasic% pNoGen%
>> pNoOpt% pOpt% 2mm.c 0.1521 0.1593 0.1711 0.3235 0.7247
>> 4.73% 12.49% 112.69% 376.46% atax.c 0.1386 0.1349 0.1449
>> 0.2066 0.313 0.00% 0.00% 49.06% 125.83% covariance.c 0.1498
>> 0.1517 0.1526 0.3561 0.7706 1.27% 1.87% 137.72% 414.42% gemver.c
>> 0.1562 0.1587 0.1724 0.2674 0.3936 1.60% 10.37% 71.19% 151.99%
>> instrument.c 0.1062 0.1075 0.1124 0.123 0.1216 0.00% 5.84%
>> 15.82% 14.50% ludcmp.c 0.157 0.1602 0.2002 1.0761 1.3175 2.04%
>> 27.52% 585.41% 739.17% 3mm.c 0.1529 0.1559 0.1826 0.4134
>> 1.0436 1.96% 19.42% 170.37% 582.54% bicg.c 0.1244 0.1268
>> 0.1353 0.1977 0.2828 1.93% 8.76% 58.92% 127.33% doitgen.c
>> 0.1492 0.1505 0.1644 0.3325 0.8971 0.00% 10.19% 122.86% 501.27%
>> gesummv.c 0.1224 0.1279 0.134 0.1999 0.2937 4.49% 9.48%
>> 63.32% 139.95% jacobi.c 0.1444 0.1506 0.1592 0.3912 0.8494
>> 0.00% 10.25% 170.91% 488.23% seidel.c 0.1337 0.1353 0.1462
>> 0.6299 0.9155 0.00% 9.35% 371.13% 584.74% adi.c 0.1593
>> 0.1621 0.1835 1.4375 1.849 1.76% 15.19% 802.39% 1060.70%
>> correlation.c 0.1579 0.1596 0.1802 0.3393 0.6337 1.08% 14.12%
>> 114.88% 301.33% gemm.c 0.1407 0.1432 0.1576 0.2421 0.4477
>> 1.78% 12.01% 72.07% 218.20% gramschmidt.c 0.1331 0.1349 0.1509
>> 0.3069 0.4138 0.00% 13.37% 130.58% 210.89% lu.c 0.1419
>> 0.1443 0.1581 0.3156 0.3943 1.69% 11.42% 122.41% 177.87% average
>> 1.26% 13.22% 248.47% 393.80%
>
>To improve readability, it may be worth ensuring this fits into 80
>columns. You may be able to reduce the number of digits used here.
>
>You could probably increase the readability of your proposal further if
>you use markdown. See here for an example of how a markdown file looks
>at github: https://gist.github.com/micmcg/976172 and here the raw version
>https://gist.github.com/micmcg/976172/raw/70f1e0db278340bd8167c98fb880979b4571e847/gistfile1.md
>
>You basically need to use the file ending '.md' and you can then use
>markdown syntax to format your text. The very same syntax will also
>improve the readability of the proposal on the mailing list.</pre><pre>Thank you so much for your very helpful advice. I have rewritten the proposal in markdown. It is really an interesting and powerful tool.</pre><pre>>
>All the best,
>
>Tobias
>
</pre><pre>Best regards,</pre><pre>Star Tan</pre><pre><br></pre></div></div></div>