[PATCH] D84108: [SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline

Eric Christopher via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Aug 21 08:41:51 PDT 2020


echristo added a comment.

In D84108#2230308 <https://reviews.llvm.org/D84108#2230308>, @lebedev.ri wrote:

> In D84108#2229882 <https://reviews.llvm.org/D84108#2229882>, @dmgreen wrote:
>
>> I would have been using the old one, the default. I think we still see some performance/codesize problems with the switch to the new one.
>
> That might be the pattern then, just another old-pm vs new-pm difference for the switch to track.
> But I'm not seeing an umbrella bug for performance regressions on https://bugs.llvm.org/show_bug.cgi?id=46649?
>
> In D84108#2228500 <https://reviews.llvm.org/D84108#2228500>, @echristo wrote:
>
>> In D84108#2227701 <https://reviews.llvm.org/D84108#2227701>, @lebedev.ri wrote:
>>
>>> In D84108#2227371 <https://reviews.llvm.org/D84108#2227371>, @echristo wrote:
>>>
>>>> One more follow up here:
>>>>
>>>> One set of benchmarks we're seeing this in are also in the llvm test harness:
>>>>
>>>> llvm_multisource_tsvc.TSVC-ControlFlow-flt and llvm_multisource_misc.Ptrdist-yacr2
>>>>
>>>> in an FDO mode (so we have some decent branch hints for the code).
>>>>
>>>> You should be able to duplicate at least some of the slowdown there.
>>>
>>> That sadly doesn't tell me much.
>>> Can you please provide the reproduction steps, and ideally the IR with/without this change?
>>
>> It should be as straightforward as -O3 -fexperimental-new-pass-manager; if it isn't, we'll look a bit more. Those tests are in the llvm test-suite, so they should be easy to get as part of your checkout.
>
> F12721865: old.tar.xz <https://reviews.llvm.org/F12721865>
>
> F12721867: new.tar.xz <https://reviews.llvm.org/F12721867>
> Uhm, for N=9, I'm afraid I'm not seeing anything horrible like that:
>
>   $ /repositories/llvm-test-suite/utils/compare.py --lhs-name old-newpm --rhs-name new-newpm llvm-test-suite-old-NewPM/res-*.json vs llvm-test-suite-new-NewPM/res-*.json --merge-average --filter-hash
>   Tests: 41
>   Metric: exec_time
>   
>   Program                                        old-newpm new-newpm diff 
>    test-suite...marks/Ptrdist/yacr2/yacr2.test     0.77      0.74    -4.0%
>    test-suite...s/Ptrdist/anagram/anagram.test     0.76      0.78     3.0%
>    test-suite...flt/InductionVariable-flt.test     2.90      2.95     1.9%
>    test-suite...lFlow-flt/ControlFlow-flt.test     3.48      3.44    -1.4%
>    test-suite...ing-dbl/NodeSplitting-dbl.test     3.09      3.13     1.1%
>    test-suite...dbl/InductionVariable-dbl.test     4.02      4.05     1.0%
>    test-suite...pansion-dbl/Expansion-dbl.test     2.49      2.51     0.7%
>    test-suite...pansion-flt/Expansion-flt.test     1.56      1.57     0.5%
>    test-suite...lt/CrossingThresholds-flt.test     2.86      2.87     0.5%
>    test-suite...lFlow-dbl/ControlFlow-dbl.test     4.05      4.03    -0.5%
>    test-suite...ing-flt/Equivalencing-flt.test     1.13      1.14     0.3%
>    test-suite...ow-dbl/GlobalDataFlow-dbl.test     3.20      3.21     0.3%
>    test-suite.../Benchmarks/Ptrdist/bc/bc.test     0.40      0.40     0.3%
>    test-suite...C/Packing-flt/Packing-flt.test     2.90      2.91     0.3%
>    test-suite.../Benchmarks/Ptrdist/ft/ft.test     1.05      1.05     0.3%
>    Geomean difference                                                 0.1%
>          old-newpm  new-newpm       diff
>   count  41.000000  41.000000  41.000000
>   mean   2.761491   2.765264   0.001355 
>   std    1.382334   1.383351   0.009172 
>   min    0.400356   0.401544  -0.040124 
>   25%    2.158678   2.161544  -0.000219 
>   50%    2.836133   2.839078   0.000974 
>   75%    3.204467   3.214600   0.002932 
>   max    6.865056   6.881911   0.029819 
>
> @echristo or does that match what you were seeing for test-suite? is the `Ptrdist/anagram` regression the one?

The two tests in the llvm test-suite we're seeing the largest regressions on are:

llvm_multisource_tsvc.TSVC-ControlFlow-flt -0.34% -> -19.33%
llvm_multisource_misc.Ptrdist-yacr2 (I can't recall the difference here)

Those are run on a system similar to yours (but fully quieted, with no other code running), with -O3 -fexperimental-new-pass-manager -march=haswell -fno-strict-aliasing -fno-omit-frame-pointer and -mprefer-vector-width=128 as the option set.
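For reference, that option set corresponds to a test-suite configuration along these lines (a sketch only; the generator, paths, and the use of a single flags variable are assumptions, not our exact setup):

```
cmake -G Ninja \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DCMAKE_C_FLAGS="-O3 -fexperimental-new-pass-manager -march=haswell -fno-strict-aliasing -fno-omit-frame-pointer -mprefer-vector-width=128" \
  /path/to/llvm-test-suite
```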

In addition, we're seeing 10% regressions in benchmarks using https://github.com/google/snappy code (the benchmarks aren't public, unfortunately), and in other scalar loop-heavy code, whether or not it has some hand vectorization.

Looking at some other benchmarks (https://github.com/google/gipfeli) on Skylake (so without the -march=haswell from above) under perf, we see that the number of executed instructions is almost the same and the number of branches increases by 0.8%, but the number of branch mispredictions is 3.2 times larger. I don't think this is related to the pass manager at all; rather, the new code is causing significant regressions on the processor due to code layout and the branch predictor.
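Those counters come from something like `perf stat -e instructions,branches,branch-misses ./bench` on each binary. A minimal sketch of deriving the same kind of ratios from two such runs (the counter values below are made-up placeholders chosen to mirror the numbers above, not real measurements):

```python
# Compare perf counters between a baseline build and a new build.
# The counter values are illustrative placeholders, not measurements.

def compare(old: dict, new: dict) -> dict:
    """Return new/old ratios for each counter present in the baseline."""
    return {name: new[name] / old[name] for name in old}

old_counters = {"instructions": 1_000_000_000,
                "branches": 250_000_000,
                "branch-misses": 1_000_000}
new_counters = {"instructions": 1_001_000_000,
                "branches": 252_000_000,
                "branch-misses": 3_200_000}

for name, ratio in compare(old_counters, new_counters).items():
    print(f"{name}: {ratio:.2f}x")
```

With these placeholder numbers the instruction count is flat, branches grow by under 1%, and branch misses come out 3.2x, i.e. the shape of the regression described above.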

My proposal for now is that we revert this, continue the performance analysis, and try to figure out what we can do to make it more of a win outside of the vectorizer. :)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D84108/new/

https://reviews.llvm.org/D84108


