[PATCH] D114171: [SLP]Improve reductions analysis and emission, part 1.

Alexander Kornienko via llvm-commits llvm-commits at lists.llvm.org
Sun May 22 18:12:06 PDT 2022


And finally a reduced test case:
$ cat q.cc
struct S {
  template<int N>
  bool f() const;
};
int f(const S& s) {
  int d = 0;
  if (s.f<1>()) ++d;
  // 4998 lines skipped
  if (s.f<5000>()) ++d;
  if (d == 0) {
    return s.f<-1>();
  }
  return 0;
}
$ ./clang-11004 --target=x86_64--linux-gnu -O1 -fslp-vectorize -c -xc++ q.cc -o q.o -ftime-report
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 8.7398 seconds (8.8367 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   7.9827 ( 91.6%)   0.0081 ( 29.3%)   7.9908 ( 91.4%)   8.0861 ( 91.5%)  SLPVectorizerPass
   0.1950 (  2.2%)   0.0002 (  0.6%)   0.1952 (  2.2%)   0.1962 (  2.2%)  InstCombinePass
...
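
For anyone who wants to regenerate the input instead of typing it, here is a
minimal sketch of a generator. It assumes the skipped lines simply repeat the
same pattern for every N between the first and last shown; makeReducedTest is
a made-up helper name, not something from the patch or from clang.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: returns the text of a q.cc-like file containing `n`
// consecutive `if (s.f<i>()) ++d;` statements, for i = 1..n.
// Assumption: the lines elided from the reduced test follow this pattern.
std::string makeReducedTest(int n) {
  assert(n >= 0 && "number of generated statements must be non-negative");
  std::ostringstream os;
  os << "struct S {\n"
        "  template<int N>\n"
        "  bool f() const;\n"
        "};\n"
        "int f(const S& s) {\n"
        "  int d = 0;\n";
  // One reduction operand per line; all of them feed the same counter `d`.
  for (int i = 1; i <= n; ++i)
    os << "  if (s.f<" << i << ">()) ++d;\n";
  os << "  if (d == 0) {\n"
        "    return s.f<-1>();\n"
        "  }\n"
        "  return 0;\n"
        "}\n";
  return os.str();
}
```

Writing the result of makeReducedTest(5000) to q.cc should give a file of the
same shape as the one above; smaller values of n may be handy for checking how
the SLPVectorizerPass time grows with the number of statements.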

On Sat, May 21, 2022 at 2:39 AM Alexey Bataev via Phabricator <
reviews at reviews.llvm.org> wrote:

> ABataev added a comment.
>
> In D114171#3528932 <https://reviews.llvm.org/D114171#3528932>, @alexfh
> wrote:
>
> > In D114171#3528930 <https://reviews.llvm.org/D114171#3528930>, @ABataev
> wrote:
> >
> >> In D114171#3528903 <https://reviews.llvm.org/D114171#3528903>, @alexfh
> wrote:
> >>
> >>> In D114171#3528894 <https://reviews.llvm.org/D114171#3528894>,
> @ABataev wrote:
> >>>
> >>>> In D114171#3528866 <https://reviews.llvm.org/D114171#3528866>,
> @alexfh wrote:
> >>>>
> >>>>> In D114171#3528093 <https://reviews.llvm.org/D114171#3528093>,
> @ABataev wrote:
> >>>>>
> >>>>>> Aha, I committed 4e271fc49517362a9333371fb1ab7e865d4c1b0e <
> https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e>
> earlier today, which should improve it. Try to update the compiler.
> >>>>>
> >>>>> Thanks! This patch makes the SLP vectorizer pass much faster on the
> problematic input, but it doesn't completely compensate for the slowdown
> introduced here. This is what -ftime-report looks like after
> 4e271fc49517362a9333371fb1ab7e865d4c1b0e <
> https://reviews.llvm.org/rG4e271fc49517362a9333371fb1ab7e865d4c1b0e>:
> >>>>>
> >>>>>   ===-------------------------------------------------------------------------===
> >>>>>                         ... Pass execution timing report ...
> >>>>>   ===-------------------------------------------------------------------------===
> >>>>>     Total Execution Time: 308.9466 seconds (308.9519 wall clock)
> >>>>>
> >>>>>      ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
> >>>>>     132.7263 ( 45.9%)   0.0001 (  0.0%)  132.7264 ( 43.0%)  132.7351 ( 43.0%)  SLPVectorizerPass
> >>>>>      48.0866 ( 16.6%)   5.7199 ( 29.3%)   53.8065 ( 17.4%)   53.8123 ( 17.4%)  ModuleInlinerWrapperPass
> >>>>>      47.3402 ( 16.4%)   5.4604 ( 27.9%)   52.8007 ( 17.1%)   52.8060 ( 17.1%)  DevirtSCCRepeatedPass
> >>>>>      22.2568 (  7.7%)   0.3222 (  1.6%)   22.5790 (  7.3%)   22.5785 (  7.3%)  GVNPass
> >>>>>      11.3194 (  3.9%)   1.0507 (  5.4%)   12.3701 (  4.0%)   12.3520 (  4.0%)  InstCombinePass
> >>>>>       4.8834 (  1.7%)   1.2548 (  6.4%)    6.1382 (  2.0%)    6.1350 (  2.0%)  InlinerPass
> >>>>>
> >>>>> And this is how it looked at
> 38d0df557706940af5d7110bdf662590449f8a60 <
> https://reviews.llvm.org/rG38d0df557706940af5d7110bdf662590449f8a60> (the
> closest commit before 7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd <
> https://reviews.llvm.org/rG7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd>
> where I could compile clang):
> >>>>>
> >>>>>   ===-------------------------------------------------------------------------===
> >>>>>                         ... Pass execution timing report ...
> >>>>>   ===-------------------------------------------------------------------------===
> >>>>>     Total Execution Time: 181.4693 seconds (181.4723 wall clock)
> >>>>>
> >>>>>      ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
> >>>>>     48.1675 ( 29.7%)   5.6593 ( 29.3%)  53.8269 ( 29.7%)  53.8332 ( 29.7%)  ModuleInlinerWrapperPass
> >>>>>     47.4270 ( 29.2%)   5.3983 ( 28.0%)  52.8253 ( 29.1%)  52.8315 ( 29.1%)  DevirtSCCRepeatedPass
> >>>>>     22.1909 ( 13.7%)   0.2916 (  1.5%)  22.4824 ( 12.4%)  22.4825 ( 12.4%)  GVNPass
> >>>>>     11.1989 (  6.9%)   1.0292 (  5.3%)  12.2281 (  6.7%)  12.2204 (  6.7%)  InstCombinePass
> >>>>>      4.9943 (  3.1%)   1.1928 (  6.2%)   6.1871 (  3.4%)   6.1842 (  3.4%)  InlinerPass
> >>>>>      5.2197 (  3.2%)   0.0000 (  0.0%)   5.2198 (  2.9%)   5.2201 (  2.9%)  SLPVectorizerPass
> >>>>>
> >>>>> Note the 5s -> 130s jump in time spent in SLPVectorizerPass. I'll
> grab the profile for the updated binary, but it will take some time. And
> yes, still trying to reduce the test case.
> >>>>
> >>>> The perf profile should help, thanks
> >>>
> >>> Looking at this I wonder whether SmallPtrSet is not that small? :)
> >>>
> >>>   -   86.47%     0.00%  clang    clang               [.] cc1_main
> >>>      - cc1_main
> >>>         - 86.47% clang::ExecuteCompilerInvocation
> >>>            - 86.47% clang::CompilerInstance::ExecuteAction
> >>>               - 86.47% clang::FrontendAction::Execute
> >>>                  - 86.47% clang::ParseAST
> >>>                     - 80.62% clang::BackendConsumer::HandleTranslationUnit
> >>>                        - 80.17% clang::EmitBackendOutput
> >>>                           - 63.55% (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline
> >>>                              - 63.55% llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run
> >>>                                 - 46.20% llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run
> >>>                                    - 46.17% llvm::ModuleToFunctionPassAdaptor::run
> >>>                                       - 45.90% llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
> >>>                                          - 45.89% llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run
> >>>                                             - 43.23% llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
> >>>                                                  llvm::SLPVectorizerPass::run
> >>>                                                - llvm::SLPVectorizerPass::runImpl
> >>>                                                   - 43.20% llvm::SLPVectorizerPass::vectorizeChainsInBlock
> >>>                                                      - 42.84% llvm::SLPVectorizerPass::vectorizeSimpleInstructions
> >>>                                                         - 41.02% llvm::SLPVectorizerPass::vectorizeRootInstruction
> >>>                                                            - 40.93% (anonymous namespace)::HorizontalReduction::tryToReduce
> >>>                                                               - 18.19% llvm::slpvectorizer::BoUpSLP::buildTree
> >>>                                                                    9.89% llvm::SmallPtrSetImplBase::insert_imp_big
> >>>                                                                  - 6.24% llvm::slpvectorizer::BoUpSLP::buildTree_rec
> >>>                                                                     - 3.23% llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&)::$_32::operator()
> >>>                                                                        + 1.61% llvm::DenseMapBase<llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, llvm::V...
> >>>                                                                          0.85% getSameOpcode
> >>>                                                                     - 0.78% llvm::slpvectorizer::BoUpSLP::newTreeEntry
> >>>                                                                          0.72% llvm::SmallPtrSetImplBase::insert_imp_big
> >>>                                                                    0.98% llvm::SmallPtrSetImplBase::FindBucketFor
> >>>                                                                 8.57% llvm::SmallPtrSetImplBase::FindBucketFor
> >>>                                                               + 4.71% llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseM...
> >>>                                                                 3.23% (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*)::{lambda(bool)#1}::operator()
> >>>                                                                 0.70% memset
> >>>                                                         + 1.82% tryToVectorizeSequence<llvm::Value>
> >>
> >> Ok, thanks, will improve it on Monday. We can avoid creating
> SmallPtrSet multiple times, and I'll check for other possible optimizations too.
> >
> > Significant time seems to be spent in this loop as well:
> >
> >   for (unsigned Cnt = 0; Cnt < NumReducedVals; ++Cnt) {
> >     if (Cnt >= Pos && Cnt < Pos + ReduxWidth)
> >       continue;
> >     unsigned NumOps = VectorizedVals.lookup(Candidates[Cnt]) +
> >                       std::count(VL.begin(), VL.end(), Candidates[Cnt]);
> >     if (NumOps != ReducedValsToOps.find(Candidates[Cnt])->second.size())
> >       LocalExternallyUsedValues[Candidates[Cnt]];
> >   }
>
> Will fix it, thanks!
>
>
> Repository:
>   rG LLVM Github Monorepo
>
> CHANGES SINCE LAST ACTION
>   https://reviews.llvm.org/D114171/new/
>
> https://reviews.llvm.org/D114171
>
>
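
The loop quoted above calls std::count over VL once per candidate, so that part
of the work grows with the product of the two sizes. A common way to avoid this,
sketched below with plain integers standing in for llvm::Value*, is to count
the occurrences in VL once and then look each candidate up. This is only an
illustration of the general technique, not the fix that was actually
committed; ValueId, occurrencesNaive and occurrencesPrecounted are made-up
names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for llvm::Value* so the sketch has no LLVM dependency.
using ValueId = int;

// Same shape as the quoted loop: one linear std::count per candidate,
// i.e. O(candidates.size() * vl.size()) comparisons in total.
std::vector<std::size_t> occurrencesNaive(const std::vector<ValueId>& candidates,
                                          const std::vector<ValueId>& vl) {
  std::vector<std::size_t> result;
  result.reserve(candidates.size());
  for (ValueId c : candidates)
    result.push_back(
        static_cast<std::size_t>(std::count(vl.begin(), vl.end(), c)));
  return result;
}

// Alternative: build a histogram of vl once, then do one hash lookup per
// candidate, i.e. O(candidates.size() + vl.size()) expected time.
std::vector<std::size_t> occurrencesPrecounted(
    const std::vector<ValueId>& candidates, const std::vector<ValueId>& vl) {
  std::unordered_map<ValueId, std::size_t> counts;
  counts.reserve(vl.size());
  for (ValueId v : vl)
    ++counts[v];

  std::vector<std::size_t> result;
  result.reserve(candidates.size());
  for (ValueId c : candidates) {
    auto it = counts.find(c);
    result.push_back(it == counts.end() ? 0 : it->second);
  }
  assert(result.size() == candidates.size());
  return result;
}
```

Both functions return the same counts; only the amount of work differs. In
the real pass the histogram would presumably be keyed on Value* and built once
per VL, outside the loop over Candidates.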