[PATCH] D114171: [SLP]Improve reductions analysis and emission, part 1.
Alexander Kornienko via llvm-commits
llvm-commits at lists.llvm.org
Sun May 22 18:12:06 PDT 2022
And finally a reduced test case:
$ cat q.cc
struct S {
  template<int N>
  bool f() const;
};
int f(const S& s) {
  int d = 0;
  if (s.f<1>()) ++d;
  // 4998 lines skipped
  if (s.f<5000>()) ++d;
  if (d == 0) {
    return s.f<-1>();
  }
  return 0;
}
$ ./clang-11004 --target=x86_64--linux-gnu -O1 -fslp-vectorize -c -xc++ q.cc -o q.o -ftime-report
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 8.7398 seconds (8.8367 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   7.9827 ( 91.6%)   0.0081 ( 29.3%)   7.9908 ( 91.4%)   8.0861 ( 91.5%)  SLPVectorizerPass
   0.1950 (  2.2%)   0.0002 (  0.6%)   0.1952 (  2.2%)   0.1962 (  2.2%)  InstCombinePass
...
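
The listing above elides 4998 lines. To regenerate the full input, a small generator along the following lines may be enough. This is only a sketch: it assumes the skipped lines follow the same `if (s.f<i>()) ++d;` pattern for i = 2..4999, which the message does not state explicitly, and `makeReducedTestCase` is a hypothetical name, not something from the thread.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: builds the text of q.cc with `n` consecutive
// `if (s.f<i>()) ++d;` lines (the test case above corresponds to n = 5000).
// Assumption: the elided lines use the same pattern as the first and last one.
std::string makeReducedTestCase(int n) {
  std::ostringstream out;
  out << "struct S {\n"
         "  template<int N>\n"
         "  bool f() const;\n"
         "};\n"
         "int f(const S& s) {\n"
         "  int d = 0;\n";
  // One conditional increment per template instantiation.
  for (int i = 1; i <= n; ++i)
    out << "  if (s.f<" << i << ">()) ++d;\n";
  out << "  if (d == 0) {\n"
         "    return s.f<-1>();\n"
         "  }\n"
         "  return 0;\n"
         "}\n";
  return out.str();
}
```

Writing the result of `makeReducedTestCase(5000)` to q.cc should give a file of the shape shown above. Smaller values of `n` may be useful for checking how the time spent in SLPVectorizerPass grows with input size.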
On Sat, May 21, 2022 at 2:39 AM Alexey Bataev via Phabricator <reviews at reviews.llvm.org> wrote:
> ABataev added a comment.
>
> In D114171#3528932 <https://reviews.llvm.org/D114171#3528932>, @alexfh wrote:
>
> > In D114171#3528930 <https://reviews.llvm.org/D114171#3528930>, @ABataev wrote:
> >
> >> In D114171#3528903 <https://reviews.llvm.org/D114171#3528903>, @alexfh wrote:
> >>
> >>> In D114171#3528894 <https://reviews.llvm.org/D114171#3528894>, @ABataev wrote:
> >>>
> >>>> In D114171#3528866 <https://reviews.llvm.org/D114171#3528866>, @alexfh wrote:
> >>>>
> >>>>> In D114171#3528093 <https://reviews.llvm.org/D114171#3528093>, @ABataev wrote:
> >>>>>
> >>>>>> Aha, I committed 4e271fc49517362a9333371fb1ab7e865d4c1b0e <https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e> earlier today, which should improve it. Try to update the compiler.
> >>>>>
> >>>>> Thanks! This patch makes the SLP vectorizer pass much faster on the problematic input, but it doesn't completely compensate for the slowdown introduced here. This is what -ftime-report looks like after 4e271fc49517362a9333371fb1ab7e865d4c1b0e <https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e>:
> >>>>>
> >>>>> ===-------------------------------------------------------------------------===
> >>>>>                       ... Pass execution timing report ...
> >>>>> ===-------------------------------------------------------------------------===
> >>>>> Total Execution Time: 308.9466 seconds (308.9519 wall clock)
> >>>>>
> >>>>>  ---User Time---    --System Time--    --User+System--    ---Wall Time---   --- Name ---
> >>>>> 132.7263 ( 45.9%)    0.0001 (  0.0%)  132.7264 ( 43.0%)  132.7351 ( 43.0%)  SLPVectorizerPass
> >>>>>  48.0866 ( 16.6%)    5.7199 ( 29.3%)   53.8065 ( 17.4%)   53.8123 ( 17.4%)  ModuleInlinerWrapperPass
> >>>>>  47.3402 ( 16.4%)    5.4604 ( 27.9%)   52.8007 ( 17.1%)   52.8060 ( 17.1%)  DevirtSCCRepeatedPass
> >>>>>  22.2568 (  7.7%)    0.3222 (  1.6%)   22.5790 (  7.3%)   22.5785 (  7.3%)  GVNPass
> >>>>>  11.3194 (  3.9%)    1.0507 (  5.4%)   12.3701 (  4.0%)   12.3520 (  4.0%)  InstCombinePass
> >>>>>   4.8834 (  1.7%)    1.2548 (  6.4%)    6.1382 (  2.0%)    6.1350 (  2.0%)  InlinerPass
> >>>>>
> >>>>> And this is how it looked at 38d0df557706940af5d7110bdf662590449f8a60 <https://reviews.llvm.org/rG38d0df557706940af5d7110bdf662590449f8a60> (the closest commit before 7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd <https://reviews.llvm.org/rG7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd> where I could compile clang):
> >>>>>
> >>>>> ===-------------------------------------------------------------------------===
> >>>>>                       ... Pass execution timing report ...
> >>>>> ===-------------------------------------------------------------------------===
> >>>>> Total Execution Time: 181.4693 seconds (181.4723 wall clock)
> >>>>>
> >>>>>  ---User Time---    --System Time--    --User+System--    ---Wall Time---   --- Name ---
> >>>>>  48.1675 ( 29.7%)    5.6593 ( 29.3%)   53.8269 ( 29.7%)   53.8332 ( 29.7%)  ModuleInlinerWrapperPass
> >>>>>  47.4270 ( 29.2%)    5.3983 ( 28.0%)   52.8253 ( 29.1%)   52.8315 ( 29.1%)  DevirtSCCRepeatedPass
> >>>>>  22.1909 ( 13.7%)    0.2916 (  1.5%)   22.4824 ( 12.4%)   22.4825 ( 12.4%)  GVNPass
> >>>>>  11.1989 (  6.9%)    1.0292 (  5.3%)   12.2281 (  6.7%)   12.2204 (  6.7%)  InstCombinePass
> >>>>>   4.9943 (  3.1%)    1.1928 (  6.2%)    6.1871 (  3.4%)    6.1842 (  3.4%)  InlinerPass
> >>>>>   5.2197 (  3.2%)    0.0000 (  0.0%)    5.2198 (  2.9%)    5.2201 (  2.9%)  SLPVectorizerPass
> >>>>>
> >>>>> Note the 5s -> 130s jump in time spent in SLPVectorizerPass. I'll grab the profile for the updated binary, but it will take some time. And yes, still trying to reduce the test case.
> >>>>
> >>>> The perf profile should help, thanks
> >>>
> >>> Looking at this I wonder whether SmallPtrSet is not that small? :)
> >>>
> >>> - 86.47% 0.00% clang clang [.] cc1_main
> >>> - cc1_main
> >>> - 86.47% clang::ExecuteCompilerInvocation
> >>> - 86.47% clang::CompilerInstance::ExecuteAction
> >>> - 86.47% clang::FrontendAction::Execute
> >>> - 86.47% clang::ParseAST
> >>> - 80.62% clang::BackendConsumer::HandleTranslationUnit
> >>> - 80.17% clang::EmitBackendOutput
> >>> - 63.55% (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline
> >>> - 63.55% llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run
> >>> - 46.20% llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run
> >>> - 46.17% llvm::ModuleToFunctionPassAdaptor::run
> >>> - 45.90% llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
> >>> - 45.89% llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run
> >>> - 43.23% llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
> >>>   llvm::SLPVectorizerPass::run
> >>> - llvm::SLPVectorizerPass::runImpl
> >>> - 43.20% llvm::SLPVectorizerPass::vectorizeChainsInBlock
> >>> - 42.84% llvm::SLPVectorizerPass::vectorizeSimpleInstructions
> >>> - 41.02% llvm::SLPVectorizerPass::vectorizeRootInstruction
> >>> - 40.93% (anonymous namespace)::HorizontalReduction::tryToReduce
> >>> - 18.19% llvm::slpvectorizer::BoUpSLP::buildTree
> >>>   9.89% llvm::SmallPtrSetImplBase::insert_imp_big
> >>> - 6.24% llvm::slpvectorizer::BoUpSLP::buildTree_rec
> >>> - 3.23% llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&)::$_32::operator()
> >>> + 1.61% llvm::DenseMapBase<llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, llvm::V...
> >>>   0.85% getSameOpcode
> >>> - 0.78% llvm::slpvectorizer::BoUpSLP::newTreeEntry
> >>>   0.72% llvm::SmallPtrSetImplBase::insert_imp_big
> >>>   0.98% llvm::SmallPtrSetImplBase::FindBucketFor
> >>>   8.57% llvm::SmallPtrSetImplBase::FindBucketFor
> >>> + 4.71% llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseM...
> >>>   3.23% (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*)::{lambda(bool)#1}::operator()
> >>>   0.70% memset
> >>> + 1.82% tryToVectorizeSequence<llvm::Value>
> >>
> >> Ok, thanks, will improve it on Monday. We can avoid creating the SmallPtrSet multiple times, and I'll check for other possible optimizations too.
> >
> > Significant time seems to be spent in this loop as well:
> >
> >   for (unsigned Cnt = 0; Cnt < NumReducedVals; ++Cnt) {
> >     if (Cnt >= Pos && Cnt < Pos + ReduxWidth)
> >       continue;
> >     unsigned NumOps = VectorizedVals.lookup(Candidates[Cnt]) +
> >                       std::count(VL.begin(), VL.end(), Candidates[Cnt]);
> >     if (NumOps != ReducedValsToOps.find(Candidates[Cnt])->second.size())
> >       LocalExternallyUsedValues[Candidates[Cnt]];
> >   }
>
> Will fix it, thanks!
>
>
> Repository:
> rG LLVM Github Monorepo
>
> CHANGES SINCE LAST ACTION
> https://reviews.llvm.org/D114171/new/
>
> https://reviews.llvm.org/D114171
>
>
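
The loop quoted above calls std::count over VL once per candidate, so each pass over the candidates rescans VL from the start. Whether that scan is what dominates the profile is not established in the thread. Assuming it contributes, one common way to avoid the repeated scans is to count the occurrences once and look them up afterwards. The sketch below uses plain standard-library containers rather than LLVM's ADT types, and the names `countOccurrences` and `occurrences` are hypothetical. It is not the fix that was applied to the SLP vectorizer.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the per-candidate
// std::count(VL.begin(), VL.end(), V): a single pass over VL builds a
// table mapping each value to the number of times it occurs.
template <typename T>
std::unordered_map<T, std::size_t> countOccurrences(const std::vector<T> &vl) {
  std::unordered_map<T, std::size_t> counts;
  for (const T &v : vl)
    ++counts[v];
  return counts;
}

// Lookup that mirrors std::count's result, including 0 for values that
// do not appear in VL. Uses find() so the table is never modified.
template <typename T>
std::size_t occurrences(const std::unordered_map<T, std::size_t> &counts,
                        const T &v) {
  auto it = counts.find(v);
  return it == counts.end() ? 0 : it->second;
}
```

With such a table built once per VL, the `std::count(...)` term in the loop would become an `occurrences(counts, Candidates[Cnt])` lookup, trading a linear scan per candidate for an average constant-time hash lookup at the cost of building the table up front.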