[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization

Fri Sep 13 18:51:10 PDT 2013

Hello all,

I have evaluated the compile-time and execution-time performance of Polly's canonicalization passes. Details are available at http://188.40.87.11:8000/db_default/v4/nts/recent_activity. There are four runs:
pollyBasic (run 45): clang -O3 -Xclang -load -Xclang LLVMPolly.so
pollyNoGenSCEV (run 44): clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-codegen-scev
pollyNoGenSCEV_1comb (run 46): same options as pollyNoGenSCEV, but with the first "InstructionCombining" canonicalization pass removed when building LLVMPolly.so
pollyNoGenSCEV_nocan (run 47): same options as pollyNoGenSCEV, but with all canonicalization passes removed (only "createCodePreparationPass" is kept) when building LLVMPolly.so

First, let's look at the results of removing the first "InstructionCombining" pass, like this:
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
  PM.add(llvm::createPromoteMemoryToRegisterPass());
//  PM.add(llvm::createInstructionCombiningPass());  // this is the most expensive canonicalization pass for the flops benchmark
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createTailCallEliminationPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createReassociatePass());
  PM.add(llvm::createLoopRotatePass());
  PM.add(llvm::createInstructionCombiningPass());
  PM.add(polly::createCodePreparationPass());
}
Results are shown at http://188.40.87.11:8000/db_default/v4/nts/46?baseline=44&compare_to=44. Thirteen benchmarks improve their compile time by more than 5% simply by removing the first "createInstructionCombiningPass". The top 5 benchmarks are:
SingleSource/Regression/C++/2003-09-29-NonPODsByValue	-38.46%
SingleSource/Benchmarks/Misc/flops	-19.30%
SingleSource/Benchmarks/Misc/himenobmtxpa	-12.94%
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes	-12.68%
MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000	-10.68%
Unfortunately, there are also two serious execution-time performance regressions:
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding	204.19%
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog	44.58%
Looking into the simple_types_constant_folding benchmark, I find that the regression is mainly caused by an unexpected impact of createPromoteMemoryToRegisterPass(). Removing "createPromoteMemoryToRegisterPass" eliminates the execution-time regression for simple_types_constant_folding. Right now, I have no idea why "createPromoteMemoryToRegisterPass" would lead to such a large execution-time regression.

http://188.40.87.11:8000/db_default/v4/nts/46?baseline=45&compare_to=45 shows the extra compile-time overhead of Polly's canonicalization passes without the first "InstructionCombining" pass. With that pass removed, the extra compile-time overhead of Polly canonicalization is at most 13.5%, which is much smaller than the original canonicalization overhead (>20%).

Second, let's look at the total impact of the Polly canonicalization passes by removing all optional canonicalization passes, as follows:
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
//  PM.add(llvm::createPromoteMemoryToRegisterPass());
//  PM.add(llvm::createInstructionCombiningPass());  // this is the most expensive canonicalization pass for the flops benchmark
//  PM.add(llvm::createCFGSimplificationPass());
//  PM.add(llvm::createTailCallEliminationPass());
//  PM.add(llvm::createCFGSimplificationPass());
//  PM.add(llvm::createReassociatePass());
//  PM.add(llvm::createLoopRotatePass());
//  PM.add(llvm::createInstructionCombiningPass());
  PM.add(polly::createCodePreparationPass());
}
As shown at http://188.40.87.11:8000/db_default/v4/nts/47?baseline=45&compare_to=45, the extra compile-time overhead is very small (5.04% at most) when all optional Polly canonicalization passes are removed. However, I think it is not practical to remove all of these canonicalizations, since that could hurt the performance of Polly's optimizations. I will further evaluate Polly's performance (with optimization and code generation) for the case where all optional canonicalization passes are removed.

As a simple informal conclusion, I think we should revise Polly's canonicalization passes. At least we should consider removing the first "InstructionCombining" pass! 

Best,
Star Tan

At 2013-09-13 12:46:33,"Star Tan" <tanmx_star at yeah.net> wrote:

At 2013-09-09 13:07:07,"Tobias Grosser" <tobias at grosser.es> wrote:

>On 09/09/2013 05:18 AM, Star Tan wrote:
>>
>> At 2013-09-09 05:52:35,"Tobias Grosser" <tobias at grosser.es> wrote:
>>
>>> On 09/08/2013 08:03 PM, Star Tan wrote:
>>> Also, I wonder if your runs include the dependence analysis. If this is
>>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>>> to be a little bit much.
>> I think no Polly Dependence analysis is involved since our compiling command is:
>> clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none  -mllvm -polly-codegen-scev
>> Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:
>
>I believe so too, but please verify with -debug-pass=Structure
I have verified. It indeed does not involve Polly Dependence analysis. The "Polly Dependence Pass" time is still high for some benchmarks, as we have discussed before.
>> SingleSource/Benchmarks/Misc/flops	28.57%
>> MultiSource/Benchmarks/MiBench/security-sha/security-sha	22.22%
>> MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes	21.05%
>> When I look into the compile time of the flops benchmark using "-ftime-report", I find that the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
>> The top 5 passes when compiled with Polly canonicalization passes:
>>     ---User Time---   --User+System--   ---Wall Time---  --- Name ---
>>     0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)  Combine redundant instructions
>>     0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
>>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)  Greedy Register Allocator
>>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)  Global Value Numbering
>>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)  Polly - Create polyhedral description of Scops
>>
>> But the top 5 passes for clang are:
>>     ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
>>     0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
>>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
>>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
>>     0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
>>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
>> We can see that "Combine redundant instructions" is invoked many times, and the extra invocation introduced by Polly's canonicalization is the most significant one. We have found this problem before, and I need to look into the details of the canonicalization passes related to "Combine redundant instructions".
>
>OK.

By investigating the flops benchmark, I find that the key is the first "InstructionCombining" pass in the series of canonicalization passes listed below:
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
  PM.add(llvm::createPromoteMemoryToRegisterPass());
  PM.add(llvm::createInstructionCombiningPass());  // this is the most expensive canonicalization pass for the flops benchmark
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createTailCallEliminationPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createReassociatePass());
  PM.add(llvm::createLoopRotatePass());
  PM.add(llvm::createInstructionCombiningPass());
  if (!SCEVCodegen)
    PM.add(polly::createIndVarSimplifyPass());
  PM.add(polly::createCodePreparationPass());
}
If we remove the first "InstructionCombining" pass, the compile time is reduced by more than 10%. The results reported by -ftime-report become very similar to the case without Polly canonicalization:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0120 ( 23.1%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0138 ( 21.5%)  X86 DAG->DAG Instruction Selection
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0045 (  7.1%)  Greedy Register Allocator
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0042 (  6.6%)  Polly - Create polyhedral description of Scops
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0038 (  5.9%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  4.5%)  Global Value Numbering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0027 (  4.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.1%)  Combine redundant instructions
Similar results have been found for the whetstone benchmark. I will run a full test with the LLVM test-suite tonight to see whether removing the pass is similarly effective for other test-suite benchmarks.
@Tobias, do you have any idea about the performance impact and other consequences of removing this canonicalization pass? In my opinion, it should not be important, since we still run the "InstructionCombining" pass after the "createLoopRotatePass" pass, and in fact there are many more runs of the "InstructionCombining" pass after this point.
Best,
Star Tan