[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization

Sun Sep 8 20:18:56 PDT 2013

At 2013-09-09 05:52:35,"Tobias Grosser" <tobias at grosser.es> wrote:

>On 09/08/2013 08:03 PM, Star Tan wrote:
>> Hello all,
>>
>>
>> I have done some basic experiments on Polly's canonicalization passes and I found that the SCEV canonicalization has a significant impact on both compile-time and execution-time performance.
>
>Interesting.
>
>> Detailed results for SCEV and default canonicalization can be viewed on: http://188.40.87.11:8000/db_default/v4/nts/32 (or 33, 34)
>>     *pNoGen with SCEV canonicalization (run 32): -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
>>     *pNoGen with default canonicalization (run 33): -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
>>     *pBasic without any canonicalization (run 34): -O3 -Xclang -load -Xclang LLVMPolly.so
>>
>>
>> Impact of SCEV canonicalization:
>>      http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34
>> Impact of default canonicalization:
>>      http://188.40.87.11:8000/db_default/v4/nts/33?compare_to=34&baseline=34
>> Comparison of SCEV canonicalization with default canonicalization:
>>      http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33
>>
>>
>> As we expected, both SCEV canonicalization and default canonicalization will slightly increase the compile-time overhead (at most 30% extra compile-time). They also lead to some execution-time regressions and improvements.
>>
>>
>> The only difference between SCEV canonicalization and default canonicalization is the "IndVarSimplify" pass as shown in the code RegisterPasses.cpp:212:
>>        if (!SCEVCodegen)
>>          PM.add(polly::createIndVarSimplifyPass());
>
>There are actually more differences (see grep -R SCEVCodegen polly/), 
>but the other differences will mainly be code generation differences.
Thanks for your reminder. Since we are currently focusing on canonicalization passes, the other differences for code generation do not matter.

>> However, I find it is interesting to look into the comparison between SCEV canonicalization and default canonicalization (http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33):
>
>Yes, this is definitely a good start.
>
>> First of all, we can expect SCEV canonicalization to have better compile-time performance since it avoids the "IndVarSimplify" pass. In fact, it improves compile time by more than 5% for 32 benchmarks, especially for the following benchmarks:
>>          MultiSource/Applications/lemon/lemon                                         -11.02%
>>          SingleSource/Benchmarks/Misc/oourafft                                        -10.53%
>>          SingleSource/Benchmarks/Linpack/linpack-pc                                   -10.00%
>>          MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan              -8.31%
>>          MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt         -8.18%
>>
>>
>> Second, we find that SCEV canonicalization shows both execution-time regressions and improvements compared with default canonicalization. In fact, there are many execution-time regressions, such as:
>>          SingleSource/Benchmarks/Shootout/nestedloop        +16363.64%
>>          SingleSource/Benchmarks/Shootout-C++/nestedloop    +16200.00%
>Those two have a huge impact. Understanding what is going on here would 
>be nice.
Yes, I am investigating these cases.
>> I think the execution-time performance regression is mainly because of the unexpected performance improvements from non-SCEV canonicalization, as shown in the following bug: http://llvm.org/bugs/show_bug.cgi?id=17153. As a next step, I will try to find out why "IndVarSimplify" can produce better code. If we can eliminate the "IndVarSimplify" canonicalization but still produce high-quality code, then we can gain better compile-time performance without execution-time performance loss.
>
>Previous experience has shown that the indvars pass as we run it in 
>Polly can unpredictably change performance both negatively and 
>positively. It was disabled as people did not manage to eliminate all 
>regressions it introduced, such that the positive performance changes 
>could not really be valued.
>
>So regarding performance tuning, I do not think we need to get this 
>optimal. As soon as -polly-codegen-scev reaches similar performance to
>the original approach, we are fine.
I see. I agree with you. I think we care more about compile-time performance for Polly's canonicalization passes since no Polly optimization or Polly code generation happens here.

>Also, I wonder if your runs include the dependence analysis. If this is 
>the case, the numbers are very good. Otherwise, 30% overhead seems still 
>to be a little bit much.
I think no Polly Dependence analysis is involved since our compiling command is:  
clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none  -mllvm -polly-codegen-scev
Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:
SingleSource/Benchmarks/Misc/flops	28.57%
MultiSource/Benchmarks/MiBench/security-sha/security-sha	22.22%
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes	21.05%
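
A minimal sketch of how such a ">20% overhead" list could be derived from two sets of compile times, assuming the reported percentages are the relative change of a run against its baseline. The function names and the timings used in the usage note are hypothetical and are not taken from the LNT runs above.

```python
def percent_change(baseline, current):
    """Relative change of `current` against `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def over_threshold(baseline_times, current_times, threshold=20.0):
    """Return (benchmark, change) pairs above `threshold` percent, largest first."""
    changes = [
        (name, percent_change(baseline_times[name], current_times[name]))
        for name in baseline_times
        if name in current_times  # skip benchmarks missing from one run
    ]
    slow = [(name, change) for name, change in changes if change > threshold]
    return sorted(slow, key=lambda pair: pair[1], reverse=True)
```

For example, with hypothetical compile times {"bench_a": 2.0, "bench_b": 4.0} for the baseline and {"bench_a": 2.5, "bench_b": 4.2} for the Polly run, only bench_a (25% slower) is reported.
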
When I look into the compile-time for the flops benchmark using "-ftime-report", I find the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
The top 5 passes when compiled with Polly canonicalization passes are:
   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
   0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)  Combine redundant instructions
   0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)  Greedy Register Allocator
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)  Global Value Numbering
   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)  Polly - Create polyhedral description of Scops

But the top 5 passes for plain clang are:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
We can see that "Combine redundant instructions" is invoked many times, and the extra invocation introduced by Polly's canonicalization is the most expensive one. We have seen this problem before, and I need to look into the details of the canonicalization passes related to "Combine redundant instructions".
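
Since the same pass appears on several rows, the comparison above can be made easier by summing the rows per pass name. The following is only a sketch: it assumes each row consists of "seconds ( percent%)" columns followed by the pass name, and that the wall time is the last timing column, as in the two tables above. The sample rows are copied from the clang table; the function name is hypothetical.

```python
import re
from collections import defaultdict

# One timing column looks like "0.0120 ( 25.0%)"; a row is several columns, then the pass name.
COLUMN = r"\d+\.\d+\s+\(\s*\d+\.\d+%\)"
ROW = re.compile(rf"^\s*((?:{COLUMN}\s+)+)(\S.*?)\s*$")

CLANG_REPORT = """\
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
"""


def wall_time_per_pass(report_text):
    """Return {pass name: (number of rows, summed wall time in seconds)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in report_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or blank line
        columns, name = match.groups()
        seconds = re.findall(r"(\d+\.\d+)\s+\(", columns)
        totals[name][0] += 1
        totals[name][1] += float(seconds[-1])  # assumption: wall time is the last column
    return {name: (count, total) for name, (count, total) in totals.items()}


if __name__ == "__main__":
    ranked = sorted(wall_time_per_pass(CLANG_REPORT).items(), key=lambda item: -item[1][1])
    for name, (count, total) in ranked:
        print(f"{total:.4f}s in {count} row(s)  {name}")
```

Note that this only aggregates the rows that were pasted in. A full "-ftime-report" output lists every pass, so the real totals for "Combine redundant instructions" may be larger than what the top-5 excerpts show.
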
BTW, I want to point out that although SCEV-based Polly canonicalization (with -polly-codegen-scev) runs faster than default canonicalization, it leads to 5 extra compile errors and 3 extra runtime errors, as shown on http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34.
I have done some basic analysis for one of the compile errors (7zip-benchmark). Results can be viewed on http://llvm.org/bugs/show_bug.cgi?id=17159
Best,
Star Tan
