<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><span style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7; white-space: pre-wrap;">At 2013-07-31 22:50:57,"Tobias Grosser" <</span><a href="mailto:tobias@grosser.es" style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7; white-space: pre-wrap;">tobias@grosser.es</a><span style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7; white-space: pre-wrap;">> wrote:</span><br><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">>On 07/30/2013 10:03 AM, Star Tan wrote:

>> Hi Tobias and all Polly developers,

>>

>> I have re-evaluated the Polly compile-time performance using the newest

>> LLVM/Polly source code.  You can view the results on

>> http://188.40.87.11:8000

>> <http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median>.

>>

>> In particular, I also evaluated our r187102 patch file that avoids expensive

>> failure string operations in normal execution. Specifically, I evaluated

>> two cases for it:

>>

>> Polly-NoCodeGen: clang -O3 -load LLVMPolly.so -mllvm

>> -polly-optimizer=none -mllvm -polly-code-generator=none

>> http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=9&baseline=9&aggregation_fn=median

>> Polly-Opt: clang -O3 -load LLVMPolly.so -mllvm -polly

>> http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=11&baseline=11&aggregation_fn=median

>>

>> The "Polly-NoCodeGen" case is mainly used to compare the compile-time

>> performance for the polly-detect pass. As shown in the results, our

>> patch file could significantly reduce the compile-time overhead for some

>> benchmarks such as tramp3dv4

>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.355=2> (24.2%), simple_types_constant_folding

>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.366=2>(12.6%),

>> oggenc

>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.331=2>(9.1%),

>> loop_unroll

>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.235=2>(7.8%)

>

>Very nice!

>

>Though I am surprised to also see performance regressions. They are all 

>in very shortly executing kernels, so they may very well be measuring 

>noise. Is this really the case?</pre><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">Yes, it seems that short-running benchmarks always show large unexpected noise, even when we run 10 samples per test. </pre><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">I have changed the ignore_small abs value from the original 0.01 to 0.05, which means benchmarks with a performance delta of less than 0.05s are skipped. In that case, <span style="font-size: 14px; line-height: 1.7;">the results seem to be much more stable. </span></pre><pre><span style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">However, I have noticed that there are many other Polly patches between the two versions r</span>185399 and r187116. They may also affect the compile-time performance. I will re-evaluate the LLVM test-suite to see the performance improvements caused only by our patch. </pre><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">>

>Also, it may be interesting to compare against the non-polly case to see

>how much overhead there is still due to our scop detection.

>

>> The "Polly-opt" case is used to compare the whole compile-time

>> performance of Polly. Since our patch file mainly affects the

>> Polly-Detect pass, it shows similar performance to "Polly-NoCodeGen". As

>> shown in results, it reduces the compile-time overhead of some

>> benchmarks such as tramp3dv4

>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.355=2> (23.7%), simple_types_constant_folding

>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.366=2>(12.9%),

>> oggenc

>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.331=2>(8.3%),

>> loop_unroll

>> <http://188.40.87.11:8000/db_default/v4/nts/16/graph?test.235=2>(7.5%)

>>

>> Finally, I also evaluated the performance of the ScopBottomUp patch that

>> changes the top-down scop detection into bottom-up scop detection.

>> Results can be viewed by:

>> pNoCodeGen-ScopBottomUp: clang -O3 -load LLVMPolly.so (vs.

>> LLVMPolly-ScopBottomUp.so)  -mllvm -polly-optimizer=none -mllvm

>> -polly-code-generator=none

>> http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median

>> pOpt-ScopBottomUp: clang -O3 -load LLVMPolly.so (vs.

>> LLVMPolly-ScopBottomUp.so)  -mllvm -polly

>> http://188.40.87.11:8000/db_default/v4/nts/19?compare_to=18&baseline=18&aggregation_fn=median

>> (*Both of these results are based on LLVM r187116, which has included

>> the r187102 patch file that we discussed above)

>>

>> Please note that this patch file will lead to some errors in

>> Polly-tests, so the data shown here cannot be regarded as reliable

>> results. For example, this patch can significantly reduce the

>> compile-time overhead of SingleSource/Benchmarks/Shootout/nestedloop

>> <http://188.40.87.11:8000/db_default/v4/nts/19/graph?test.17=2> only

>> because it regards the nested loop as an invalid scop and skips all

>> following transformations and optimizations. However, I evaluated it

>> here to see its potential performance impact.  Based on the results

>> shown on

>> http://188.40.87.11:8000/db_default/v4/nts/21?compare_to=16&baseline=16&aggregation_fn=median,

>> we can see detecting scops bottom-up may further reduce Polly

>> compile-time by more than 10%.

>

>Interesting. For some reason it also regresses huffbench quite a bit. </pre><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">This is because the ScopBottomUp patch file invalidates scop detection for huffbench. The run-times of huffbench with different options are as follows:</pre><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">clang: 19.1680s  (see runid=14)</pre><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">polly without ScopBottomUp patch file: 14.8340s (see runid=16)</pre><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">polly with ScopBottomUp patch file: 19.2920s (see runid=21)</pre><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">As you can see, with the ScopBottomUp patch file the execution performance is almost the same as with plain clang. That is because no valid scops are detected with this patch file at all.</pre><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;"><br></pre><pre style="color: rgb(0, 0, 0); font-family: arial; font-size: 14px; line-height: 1.7;">>:-( I think here an up-to-date non-polly to polly comparison would come in 

>handy to see which benchmarks we still see larger performance 

>regressions. And if the bottom-up scop detection actually helps here.

>As this is a larger patch, we should really have a need for it before 

>switching to it.

>

I have evaluated Polly compile-time performance for the following options:</pre><pre>  clang: clang -O3 (runid:14) </pre><pre>  pBasic: clang -O3 -load LLVMPolly.so (runid:15) </pre><pre>  pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none (runid:16) </pre><pre>  pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none (runid:17) </pre><pre>  pOpt: pollycc -O3 (runid:18)</pre><pre>For example, you can view the comparison between "clang" and "pNoGen" at:</pre><pre><pre style="font-size: 14px; line-height: 1.7;">http://188.40.87.11:8000/db_default/v4/nts/16?compare_to=14&baseline=14</pre><pre style="font-size: 14px; line-height: 1.7;">It shows that, without the optimizer and code generator, Polly leads to less than 30% extra compile-time overhead. </pre><pre style="font-size: 14px; line-height: 1.7;">For the execution performance, it is interesting that pNoGen not only significantly improves the execution performance for some benchmarks (nestedloop/huffbench) but also significantly reduces the execution performance for another set of benchmarks (gcc-loops/lpbench).</pre><pre style="font-size: 14px; line-height: 1.7;"><br></pre><pre style="font-size: 14px; line-height: 1.7;">Thanks,</pre><pre style="font-size: 14px; line-height: 1.7;">Star Tan</pre></pre></div>
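The small-delta filtering described above (medians over repeated samples, with deltas under the 0.05s threshold skipped) can be sketched roughly as follows. This is only an illustrative Python sketch, not the actual implementation of the reporting tool: the function name significant_deltas, the parameter min_abs_delta, and the sample timings in the usage note are all hypothetical.

```python
from statistics import median


def significant_deltas(baseline, current, min_abs_delta=0.05):
    """Compare per-benchmark median times and skip small absolute deltas.

    baseline / current: dict mapping benchmark name to a list of timing
    samples in seconds (hypothetical input layout, assumed positive).
    Returns a dict mapping name to (delta_seconds, delta_percent).
    """
    report = {}
    for name, base_samples in baseline.items():
        if name not in current:
            continue  # benchmark missing from the newer run
        base = median(base_samples)
        cur = median(current[name])
        delta = cur - base
        # Skip benchmarks whose change is below the noise threshold.
        if abs(delta) >= min_abs_delta:
            report[name] = (delta, 100.0 * delta / base)
    return report
```

With made-up samples such as baseline = {"long": [10.0, 10.2, 9.8], "short": [0.02, 0.03, 0.01]} and current = {"long": [7.5, 7.6, 7.4], "short": [0.04, 0.05, 0.03]}, only "long" is reported; "short" doubles in relative terms but its 0.02s delta falls under the threshold, which is the kind of noise a larger threshold is meant to hide.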