<div dir="ltr">And finally a reduced test case:<div><font face="monospace">$ cat q.cc<br>struct S {<br>  template<int N><br>  bool f() const;<br>};<br>int f(const S& s) {<br>  int d = 0;<br>  if (s.f<1>()) ++d;<br>  // 4998 lines skipped<br>  if (s.f<5000>()) ++d;<br>  if (d == 0) {<br>    return s.f<-1>();<br>  }<br>  return 0;<br>}<br>$ ./clang-11004 --target=x86_64--linux-gnu -O1 -fslp-vectorize  -c -xc++ q.cc -o q.o -ftime-report<br>===-------------------------------------------------------------------------===<br>                      ... Pass execution timing report ...<br>===-------------------------------------------------------------------------===<br>  Total Execution Time: 8.7398 seconds (8.8367 wall clock)<br><br>   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---<br>   7.9827 ( 91.6%)   0.0081 ( 29.3%)   7.9908 ( 91.4%)   8.0861 ( 91.5%)  SLPVectorizerPass<br>   0.1950 (  2.2%)   0.0002 (  0.6%)   0.1952 (  2.2%)   0.1962 (  2.2%)  InstCombinePass<br></font></div><div><font face="monospace">...</font></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, May 21, 2022 at 2:39 AM Alexey Bataev via Phabricator <<a href="mailto:reviews@reviews.llvm.org">reviews@reviews.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">ABataev added a comment.<br>
<br>
In D114171#3528932 <<a href="https://reviews.llvm.org/D114171#3528932" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528932</a>>, @alexfh wrote:<br>
<br>
> In D114171#3528930 <<a href="https://reviews.llvm.org/D114171#3528930" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528930</a>>, @ABataev wrote:<br>
><br>
>> In D114171#3528903 <<a href="https://reviews.llvm.org/D114171#3528903" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528903</a>>, @alexfh wrote:<br>
>><br>
>>> In D114171#3528894 <<a href="https://reviews.llvm.org/D114171#3528894" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528894</a>>, @ABataev wrote:<br>
>>><br>
>>>> In D114171#3528866 <<a href="https://reviews.llvm.org/D114171#3528866" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528866</a>>, @alexfh wrote:<br>
>>>><br>
>>>>> In D114171#3528093 <<a href="https://reviews.llvm.org/D114171#3528093" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528093</a>>, @ABataev wrote:<br>
>>>>><br>
>>>>>> Aha, I committed 4e271fc49517362a9333371fb1ab7e865d4c1b0e <<a href="https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e" rel="noreferrer" target="_blank">https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e</a>> earlier today, which should improve it. Try to update the compiler.<br>
>>>>><br>
>>>>> Thanks! This patch makes the SLP vectorizer pass much faster on the problematic input, but it doesn't completely compensate for the slowdown introduced here. This is what -ftime-report looks like after 4e271fc49517362a9333371fb1ab7e865d4c1b0e <<a href="https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e" rel="noreferrer" target="_blank">https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e</a>>:<br>
>>>>><br>
>>>>>   ===-------------------------------------------------------------------------===<br>
>>>>>                         ... Pass execution timing report ...<br>
>>>>>   ===-------------------------------------------------------------------------===<br>
>>>>>     Total Execution Time: 308.9466 seconds (308.9519 wall clock)<br>
>>>>>   <br>
>>>>>      ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---<br>
>>>>>     132.7263 ( 45.9%)   0.0001 (  0.0%)  132.7264 ( 43.0%)  132.7351 ( 43.0%)  SLPVectorizerPass<br>
>>>>>     48.0866 ( 16.6%)   5.7199 ( 29.3%)  53.8065 ( 17.4%)  53.8123 ( 17.4%)  ModuleInlinerWrapperPass<br>
>>>>>     47.3402 ( 16.4%)   5.4604 ( 27.9%)  52.8007 ( 17.1%)  52.8060 ( 17.1%)  DevirtSCCRepeatedPass<br>
>>>>>     22.2568 (  7.7%)   0.3222 (  1.6%)  22.5790 (  7.3%)  22.5785 (  7.3%)  GVNPass<br>
>>>>>     11.3194 (  3.9%)   1.0507 (  5.4%)  12.3701 (  4.0%)  12.3520 (  4.0%)  InstCombinePass<br>
>>>>>      4.8834 (  1.7%)   1.2548 (  6.4%)   6.1382 (  2.0%)   6.1350 (  2.0%)  InlinerPass<br>
>>>>><br>
>>>>> And this is how it looked at 38d0df557706940af5d7110bdf662590449f8a60 <<a href="https://reviews.llvm.org/rG38d0df557706940af5d7110bdf662590449f8a60" rel="noreferrer" target="_blank">https://reviews.llvm.org/rG38d0df557706940af5d7110bdf662590449f8a60</a>> (the closest commit before 7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd <<a href="https://reviews.llvm.org/rG7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd" rel="noreferrer" target="_blank">https://reviews.llvm.org/rG7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd</a>> where I could compile clang):<br>
>>>>><br>
>>>>>   ===-------------------------------------------------------------------------===<br>
>>>>>                         ... Pass execution timing report ...<br>
>>>>>   ===-------------------------------------------------------------------------===<br>
>>>>>     Total Execution Time: 181.4693 seconds (181.4723 wall clock)<br>
>>>>>   <br>
>>>>>      ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---<br>
>>>>>     48.1675 ( 29.7%)   5.6593 ( 29.3%)  53.8269 ( 29.7%)  53.8332 ( 29.7%)  ModuleInlinerWrapperPass<br>
>>>>>     47.4270 ( 29.2%)   5.3983 ( 28.0%)  52.8253 ( 29.1%)  52.8315 ( 29.1%)  DevirtSCCRepeatedPass<br>
>>>>>     22.1909 ( 13.7%)   0.2916 (  1.5%)  22.4824 ( 12.4%)  22.4825 ( 12.4%)  GVNPass<br>
>>>>>     11.1989 (  6.9%)   1.0292 (  5.3%)  12.2281 (  6.7%)  12.2204 (  6.7%)  InstCombinePass<br>
>>>>>      4.9943 (  3.1%)   1.1928 (  6.2%)   6.1871 (  3.4%)   6.1842 (  3.4%)  InlinerPass<br>
>>>>>      5.2197 (  3.2%)   0.0000 (  0.0%)   5.2198 (  2.9%)   5.2201 (  2.9%)  SLPVectorizerPass<br>
>>>>><br>
>>>>> Note the 5s -> 130s jump in time spent in SLPVectorizerPass. I'll grab the profile for the updated binary, but it will take some time. And yes, still trying to reduce the test case.<br>
>>>><br>
>>>> The perf profile should help, thanks.<br>
>>><br>
>>> Looking at this I wonder whether SmallPtrSet is not that small? :)<br>
>>><br>
>>>   -   86.47%     0.00%  clang    clang               [.] cc1_main<br>
>>>      - cc1_main<br>
>>>         - 86.47% clang::ExecuteCompilerInvocation<br>
>>>            - 86.47% clang::CompilerInstance::ExecuteAction<br>
>>>               - 86.47% clang::FrontendAction::Execute<br>
>>>                  - 86.47% clang::ParseAST<br>
>>>                     - 80.62% clang::BackendConsumer::HandleTranslationUnit<br>
>>>                        - 80.17% clang::EmitBackendOutput<br>
>>>                           - 63.55% (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline<br>
>>>                              - 63.55% llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run<br>
>>>                                 - 46.20% llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run<br>
>>>                                    - 46.17% llvm::ModuleToFunctionPassAdaptor::run<br>
>>>                                       - 45.90% llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run<br>
>>>                                          - 45.89% llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run<br>
>>>                                             - 43.23% llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run<br>
>>>                                                  llvm::SLPVectorizerPass::run<br>
>>>                                                - llvm::SLPVectorizerPass::runImpl<br>
>>>                                                   - 43.20% llvm::SLPVectorizerPass::vectorizeChainsInBlock<br>
>>>                                                      - 42.84% llvm::SLPVectorizerPass::vectorizeSimpleInstructions<br>
>>>                                                         - 41.02% llvm::SLPVectorizerPass::vectorizeRootInstruction<br>
>>>                                                            - 40.93% (anonymous namespace)::HorizontalReduction::tryToReduce<br>
>>>                                                               - 18.19% llvm::slpvectorizer::BoUpSLP::buildTree<br>
>>>                                                                    9.89% llvm::SmallPtrSetImplBase::insert_imp_big<br>
>>>                                                                  - 6.24% llvm::slpvectorizer::BoUpSLP::buildTree_rec<br>
>>>                                                                     - 3.23% llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&)::$_32::operator()<br>
>>>                                                                        + 1.61% llvm::DenseMapBase<llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, llvm::V<br>
>>>                                                                       0.85% getSameOpcode<br>
>>>                                                                     - 0.78% llvm::slpvectorizer::BoUpSLP::newTreeEntry<br>
>>>                                                                          0.72% llvm::SmallPtrSetImplBase::insert_imp_big<br>
>>>                                                                    0.98% llvm::SmallPtrSetImplBase::FindBucketFor<br>
>>>                                                                 8.57% llvm::SmallPtrSetImplBase::FindBucketFor<br>
>>>                                                               + 4.71% llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseM<br>
>>>                                                                 3.23% (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*)::{lambda(bool)#1}::operator()<br>
>>>                                                                 0.70% memset<br>
>>>                                                         + 1.82% tryToVectorizeSequence<llvm::Value><br>
>><br>
>> Ok, thanks, will improve it on Monday. We can avoid creating the SmallPtrSet multiple times, and I'll check for other possible optimizations too.<br>
><br>
> Significant time seems to be spent in this loop as well:<br>
><br>
>   for (unsigned Cnt = 0; Cnt < NumReducedVals; ++Cnt) {<br>
>     if (Cnt >= Pos && Cnt < Pos + ReduxWidth)<br>
>       continue;<br>
>     unsigned NumOps = VectorizedVals.lookup(Candidates[Cnt]) +<br>
>                       std::count(VL.begin(), VL.end(), Candidates[Cnt]);<br>
>     if (NumOps != ReducedValsToOps.find(Candidates[Cnt])->second.size())<br>
>       LocalExternallyUsedValues[Candidates[Cnt]];<br>
>   }<br>
<br>
Will fix it, thanks!<br>
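<br>
For context on the loop quoted above: std::count rescans VL once per candidate, so the cost of each pass grows with the product of the candidate count and the size of VL.<br>
What follows is only a minimal sketch of one way to avoid the rescans, not the change that landed in LLVM. It assumes plain int stand-ins for llvm::Value* and uses hypothetical helper names (countByScan, countByTable); the idea is to count the occurrences in VL once and then look them up per candidate:<br>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for llvm::Value*; any hashable, comparable type works.
using Value = int;

// Baseline, mirroring the quoted loop: VL is rescanned for every candidate.
std::vector<std::size_t> countByScan(const std::vector<Value> &Candidates,
                                     const std::vector<Value> &VL) {
  std::vector<std::size_t> Counts;
  Counts.reserve(Candidates.size());
  for (Value C : Candidates)
    Counts.push_back(
        static_cast<std::size_t>(std::count(VL.begin(), VL.end(), C)));
  return Counts;
}

// Alternative: build an occurrence table for VL once, then do one hash
// lookup per candidate instead of one linear scan per candidate.
std::vector<std::size_t> countByTable(const std::vector<Value> &Candidates,
                                      const std::vector<Value> &VL) {
  std::unordered_map<Value, std::size_t> Occurrences;
  for (Value V : VL)
    ++Occurrences[V];

  std::vector<std::size_t> Counts;
  Counts.reserve(Candidates.size());
  for (Value C : Candidates) {
    auto It = Occurrences.find(C);
    Counts.push_back(It == Occurrences.end() ? 0 : It->second);
  }
  return Counts;
}
```

Both helpers return the same counts; only the amount of work per candidate differs. Whether the table pays off in the real pass depends on how large VL and the candidate list get in practice.<br>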
<br>
<br>
Repository:<br>
  rG LLVM Github Monorepo<br>
<br>
CHANGES SINCE LAST ACTION<br>
  <a href="https://reviews.llvm.org/D114171/new/" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171/new/</a><br>
<br>
<a href="https://reviews.llvm.org/D114171" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171</a><br>
<br>
</blockquote></div>