[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization

Star Tan tanmx_star at yeah.net
Thu Sep 12 21:46:33 PDT 2013


At 2013-09-09 13:07:07,"Tobias Grosser" <tobias at grosser.es> wrote:

>On 09/09/2013 05:18 AM, Star Tan wrote:
>>
>> At 2013-09-09 05:52:35,"Tobias Grosser" <tobias at grosser.es> wrote:
>>
>>> On 09/08/2013 08:03 PM, Star Tan wrote:
>>> Also, I wonder if your runs include the dependence analysis. If this is
>>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>>> to be a little bit much.
>> I think no Polly Dependence analysis is involved since our compiling command is:
>> clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none  -mllvm -polly-codegen-scev
>> Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:
>
>I believe so too, but please verify with -debug-pass=Structure
I have verified it; the run indeed does not involve the Polly Dependence analysis. The "Polly Dependence" pass itself is still expensive for some benchmarks, as we have discussed before.
>> SingleSource/Benchmarks/Misc/flops	28.57%
>> MultiSource/Benchmarks/MiBench/security-sha/security-sha	22.22%
>> MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes	21.05%
>> When I look into the compile time of the flops benchmark using "-ftime-report", I find that the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
>> The top 5 passes when compiled with the Polly canonicalization passes are:
>>     ---User Time---   --User+System--   ---Wall Time---  --- Name ---
>>     0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)  Combine redundant instructions
>>     0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
>>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)  Greedy Register Allocator
>>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)  Global Value Numbering
>>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)  Polly - Create polyhedral description of Scops
>>
>> But the top 5 passes for clang is:
>>     ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
>>     0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
>>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
>>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
>>     0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
>>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
>> We can see that "Combine redundant instructions" is invoked many times, and the extra invocation introduced by Polly's canonicalization is the most significant one. We found this problem before, and I need to look into the details of the canonicalization passes related to "Combine redundant instructions".
>
>OK.

By investigating the flops benchmark, I find that the key is the first "InstructionCombining" pass in the series of canonicalization passes listed below:
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
  PM.add(llvm::createPromoteMemoryToRegisterPass());
  PM.add(llvm::createInstructionCombiningPass()); // the most expensive canonicalization pass for the flops benchmark
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createTailCallEliminationPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createReassociatePass());
  PM.add(llvm::createLoopRotatePass());
  PM.add(llvm::createInstructionCombiningPass());
  if (!SCEVCodegen)
    PM.add(polly::createIndVarSimplifyPass());
  PM.add(polly::createCodePreparationPass());
}
If we remove the first "InstructionCombining" pass, the compile time is reduced by more than 10%, and the results reported by -ftime-report become very similar to the case without Polly canonicalization:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0120 ( 23.1%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0138 ( 21.5%)  X86 DAG->DAG Instruction Selection
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0045 (  7.1%)  Greedy Register Allocator
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0042 (  6.6%)  Polly - Create polyhedral description of Scops
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0038 (  5.9%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  4.5%)  Global Value Numbering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0027 (  4.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.1%)  Combine redundant instructions
Similar results have been found in the whetstone benchmark. I will run the full LLVM test-suite tonight to see whether removing this pass is similarly effective for the other benchmarks.
@Tobias, do you have any idea about the performance impact or other consequences of removing this canonicalization pass? In my opinion, it should not matter much, since we still run an "InstructionCombining" pass after "createLoopRotatePass", and in fact there are many more runs of "InstructionCombining" after this point.
Best,
Star Tan