<div dir="ltr">And finally a reduced test case:<div><font face="monospace">$ cat q.cc<br>struct S {<br> template<int N><br> bool f() const;<br>};<br>int f(const S& s) {<br> int d = 0;<br> if (s.f<1>()) ++d;<br> // 4998 lines skipped<br> if (s.f<5000>()) ++d;<br> if (d == 0) {<br> return s.f<-1>();<br> }<br> return 0;<br>}<br>$ ./clang-11004 --target=x86_64--linux-gnu -O1 -fslp-vectorize -c -xc++ q.cc -o q.o -ftime-report<br>===-------------------------------------------------------------------------===<br> ... Pass execution timing report ...<br>===-------------------------------------------------------------------------===<br> Total Execution Time: 8.7398 seconds (8.8367 wall clock)<br><br> ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---<br> 7.9827 ( 91.6%) 0.0081 ( 29.3%) 7.9908 ( 91.4%) 8.0861 ( 91.5%) SLPVectorizerPass<br> 0.1950 ( 2.2%) 0.0002 ( 0.6%) 0.1952 ( 2.2%) 0.1962 ( 2.2%) InstCombinePass<br></font></div><div><font face="monospace">...</font></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, May 21, 2022 at 2:39 AM Alexey Bataev via Phabricator <<a href="mailto:reviews@reviews.llvm.org">reviews@reviews.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">ABataev added a comment.<br>
<br>
In D114171#3528932 <<a href="https://reviews.llvm.org/D114171#3528932" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528932</a>>, @alexfh wrote:<br>
<br>
> In D114171#3528930 <<a href="https://reviews.llvm.org/D114171#3528930" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528930</a>>, @ABataev wrote:<br>
><br>
>> In D114171#3528903 <<a href="https://reviews.llvm.org/D114171#3528903" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528903</a>>, @alexfh wrote:<br>
>><br>
>>> In D114171#3528894 <<a href="https://reviews.llvm.org/D114171#3528894" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528894</a>>, @ABataev wrote:<br>
>>><br>
>>>> In D114171#3528866 <<a href="https://reviews.llvm.org/D114171#3528866" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528866</a>>, @alexfh wrote:<br>
>>>><br>
>>>>> In D114171#3528093 <<a href="https://reviews.llvm.org/D114171#3528093" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171#3528093</a>>, @ABataev wrote:<br>
>>>>><br>
>>>>>> Aha, I committed 4e271fc49517362a9333371fb1ab7e865d4c1b0e <<a href="https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e" rel="noreferrer" target="_blank">https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e</a>> earlier today, which should improve it. Try to update the compiler.<br>
>>>>><br>
>>>>> Thanks! This patch makes the SLP vectorizer pass much faster on the problematic input, but it doesn't completely compensate for the slowdown introduced here. This is what -ftime-report looks like after 4e271fc49517362a9333371fb1ab7e865d4c1b0e <<a href="https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e" rel="noreferrer" target="_blank">https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e</a>>:<br>
>>>>><br>
>>>>> ===-------------------------------------------------------------------------===<br>
>>>>> ... Pass execution timing report ...<br>
>>>>> ===-------------------------------------------------------------------------===<br>
>>>>> Total Execution Time: 308.9466 seconds (308.9519 wall clock)<br>
>>>>> <br>
>>>>> ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---<br>
>>>>> 132.7263 ( 45.9%) 0.0001 ( 0.0%) 132.7264 ( 43.0%) 132.7351 ( 43.0%) SLPVectorizerPass<br>
>>>>> 48.0866 ( 16.6%) 5.7199 ( 29.3%) 53.8065 ( 17.4%) 53.8123 ( 17.4%) ModuleInlinerWrapperPass<br>
>>>>> 47.3402 ( 16.4%) 5.4604 ( 27.9%) 52.8007 ( 17.1%) 52.8060 ( 17.1%) DevirtSCCRepeatedPass<br>
>>>>> 22.2568 ( 7.7%) 0.3222 ( 1.6%) 22.5790 ( 7.3%) 22.5785 ( 7.3%) GVNPass<br>
>>>>> 11.3194 ( 3.9%) 1.0507 ( 5.4%) 12.3701 ( 4.0%) 12.3520 ( 4.0%) InstCombinePass<br>
>>>>> 4.8834 ( 1.7%) 1.2548 ( 6.4%) 6.1382 ( 2.0%) 6.1350 ( 2.0%) InlinerPass<br>
>>>>><br>
>>>>> And this is how it looked at 38d0df557706940af5d7110bdf662590449f8a60 <<a href="https://reviews.llvm.org/rG38d0df557706940af5d7110bdf662590449f8a60" rel="noreferrer" target="_blank">https://reviews.llvm.org/rG38d0df557706940af5d7110bdf662590449f8a60</a>> (the closest commit before 7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd <<a href="https://reviews.llvm.org/rG7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd" rel="noreferrer" target="_blank">https://reviews.llvm.org/rG7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd</a>> where I could compile clang):<br>
>>>>><br>
>>>>> ===-------------------------------------------------------------------------===<br>
>>>>> ... Pass execution timing report ...<br>
>>>>> ===-------------------------------------------------------------------------===<br>
>>>>> Total Execution Time: 181.4693 seconds (181.4723 wall clock)<br>
>>>>> <br>
>>>>> ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---<br>
>>>>> 48.1675 ( 29.7%) 5.6593 ( 29.3%) 53.8269 ( 29.7%) 53.8332 ( 29.7%) ModuleInlinerWrapperPass<br>
>>>>> 47.4270 ( 29.2%) 5.3983 ( 28.0%) 52.8253 ( 29.1%) 52.8315 ( 29.1%) DevirtSCCRepeatedPass<br>
>>>>> 22.1909 ( 13.7%) 0.2916 ( 1.5%) 22.4824 ( 12.4%) 22.4825 ( 12.4%) GVNPass<br>
>>>>> 11.1989 ( 6.9%) 1.0292 ( 5.3%) 12.2281 ( 6.7%) 12.2204 ( 6.7%) InstCombinePass<br>
>>>>> 4.9943 ( 3.1%) 1.1928 ( 6.2%) 6.1871 ( 3.4%) 6.1842 ( 3.4%) InlinerPass<br>
>>>>> 5.2197 ( 3.2%) 0.0000 ( 0.0%) 5.2198 ( 2.9%) 5.2201 ( 2.9%) SLPVectorizerPass<br>
>>>>><br>
>>>>> Note the 5s -> 130s jump in time spent in SLPVectorizerPass. I'll grab the profile for the updated binary, but it will take some time. And yes, still trying to reduce the test case.<br>
>>>><br>
>>>> The perf profile should help, thanks<br>
>>><br>
>>> Looking at this I wonder whether SmallPtrSet is not that small? :)<br>
>>><br>
>>> - 86.47% 0.00% clang clang [.] cc1_main<br>
>>> - cc1_main<br>
>>> - 86.47% clang::ExecuteCompilerInvocation<br>
>>> - 86.47% clang::CompilerInstance::ExecuteAction<br>
>>> - 86.47% clang::FrontendAction::Execute<br>
>>> - 86.47% clang::ParseAST<br>
>>> - 80.62% clang::BackendConsumer::HandleTranslationUnit<br>
>>> - 80.17% clang::EmitBackendOutput<br>
>>> - 63.55% (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline<br>
>>> - 63.55% llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run<br>
>>> - 46.20% llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run<br>
>>> - 46.17% llvm::ModuleToFunctionPassAdaptor::run<br>
>>> - 45.90% llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run<br>
>>> - 45.89% llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run<br>
>>> - 43.23% llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run<br>
>>> llvm::SLPVectorizerPass::run<br>
>>> - llvm::SLPVectorizerPass::runImpl<br>
>>> - 43.20% llvm::SLPVectorizerPass::vectorizeChainsInBlock<br>
>>> - 42.84% llvm::SLPVectorizerPass::vectorizeSimpleInstructions<br>
>>> - 41.02% llvm::SLPVectorizerPass::vectorizeRootInstruction<br>
>>> - 40.93% (anonymous namespace)::HorizontalReduction::tryToReduce<br>
>>> - 18.19% llvm::slpvectorizer::BoUpSLP::buildTree<br>
>>> 9.89% llvm::SmallPtrSetImplBase::insert_imp_big<br>
>>> - 6.24% llvm::slpvectorizer::BoUpSLP::buildTree_rec<br>
>>> - 3.23% llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&)::$_32::operator()<br>
>>> + 1.61% llvm::DenseMapBase<llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, llvm::V<br>
>>> 0.85% getSameOpcode<br>
>>> - 0.78% llvm::slpvectorizer::BoUpSLP::newTreeEntry<br>
>>> 0.72% llvm::SmallPtrSetImplBase::insert_imp_big<br>
>>> 0.98% llvm::SmallPtrSetImplBase::FindBucketFor<br>
>>> 8.57% llvm::SmallPtrSetImplBase::FindBucketFor<br>
>>> + 4.71% llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseM<br>
>>> 3.23% (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*)::{lambda(bool)#1}::operator()<br>
>>> 0.70% memset<br>
>>> + 1.82% tryToVectorizeSequence<llvm::Value><br>
>><br>
>> Ok, thanks, will improve it on Monday. We can avoid creating the SmallPtrSet multiple times, and I'll check for other possible optimizations too.<br>
><br>
> Significant time seems to be spent in this loop as well:<br>
><br>
> for (unsigned Cnt = 0; Cnt < NumReducedVals; ++Cnt) {<br>
> if (Cnt >= Pos && Cnt < Pos + ReduxWidth)<br>
> continue;<br>
> unsigned NumOps = VectorizedVals.lookup(Candidates[Cnt]) +<br>
> std::count(VL.begin(), VL.end(), Candidates[Cnt]);<br>
> if (NumOps != ReducedValsToOps.find(Candidates[Cnt])->second.size())<br>
> LocalExternallyUsedValues[Candidates[Cnt]];<br>
> }<br>
<br>
Will fix it, thanks!<br>
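For illustration only, here is a minimal sketch of why the quoted loop can get expensive and one way the repeated scan might be avoided. It is not the fix that was committed. The loop calls std::count over VL once per candidate, so each pass costs roughly NumReducedVals * VL.size() comparisons. Counting the contents of VL once into a hash map makes each candidate lookup constant time on average. The names Val, countPerCandidateNaive and countPerCandidateHashed are hypothetical, and plain ints stand in for llvm::Value* so the sketch compiles without LLVM headers.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for llvm::Value*; ints keep the sketch self-contained.
using Val = int;

// Baseline, mirroring the quoted loop: rescan VL for every candidate.
// Cost is O(Candidates.size() * VL.size()).
std::vector<unsigned> countPerCandidateNaive(const std::vector<Val> &Candidates,
                                             const std::vector<Val> &VL) {
  std::vector<unsigned> Out;
  Out.reserve(Candidates.size());
  for (Val C : Candidates)
    Out.push_back(static_cast<unsigned>(std::count(VL.begin(), VL.end(), C)));
  return Out;
}

// Alternative: tally VL once, then do one hash lookup per candidate.
// Cost is O(Candidates.size() + VL.size()) on average.
std::vector<unsigned> countPerCandidateHashed(const std::vector<Val> &Candidates,
                                              const std::vector<Val> &VL) {
  std::unordered_map<Val, unsigned> Occurrences;
  for (Val V : VL)
    ++Occurrences[V];

  std::vector<unsigned> Out;
  Out.reserve(Candidates.size());
  for (Val C : Candidates) {
    auto It = Occurrences.find(C);
    Out.push_back(It == Occurrences.end() ? 0u : It->second);
  }
  return Out;
}
```

Both functions return the same counts; only the amount of work differs. Inside LLVM a DenseMap would presumably be the natural container, but that is an assumption and not something stated in this thread.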
<br>
<br>
Repository:<br>
rG LLVM Github Monorepo<br>
<br>
<a href="https://reviews.llvm.org/D114171" rel="noreferrer" target="_blank">https://reviews.llvm.org/D114171</a><br>
<br>
</blockquote></div>