[llvm] 9c2469c - [PassManager] `buildModuleOptimizationPipeline()`: schedule `LoopDeletion` pass run before vectorization passes

Roman Lebedev via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 3 09:25:30 PDT 2021


Author: Roman Lebedev
Date: 2021-11-03T19:24:49+03:00
New Revision: 9c2469c1ddb34517de8dafd83d1940deada3fc22

URL: https://github.com/llvm/llvm-project/commit/9c2469c1ddb34517de8dafd83d1940deada3fc22
DIFF: https://github.com/llvm/llvm-project/commit/9c2469c1ddb34517de8dafd83d1940deada3fc22.diff

LOG: [PassManager] `buildModuleOptimizationPipeline()`: schedule `LoopDeletion` pass run before vectorization passes

Test thanks to Michael Kuklinski from `#llvm`: https://godbolt.org/z/bdrah5Goo
originally inspired by Daniel Lemire's https://lemire.me/blog/2021/10/26/in-c-is-empty-faster-than-comparing-the-size-with-zero/

We do manage to deduce that the answer does not require looping,
but only after the last `LoopDeletion` pass has already run,
so we end up stuck with a dead loop.
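
For readers without the godbolt link at hand, the pattern in question can be sketched roughly as below. This is a hypothetical reconstruction, not the exact test source: only the `node` struct with a leading `next` pointer and the `count_nodes`/`is_not_empty` naming are taken from the IR in the test file, everything else is assumed for illustration.

```cpp
#include <cstddef>

// Assumed layout: the IR only shows that `next` is the first field.
struct node {
  node *next;
};

// Walks the whole list and counts its nodes.
std::size_t count_nodes(const node *p) {
  std::size_t size = 0;
  while (p) {
    p = p->next;
    ++size;
  }
  return size;
}

// Only asks whether the count is non-zero. Once `count_nodes` is inlined,
// the result depends solely on `p != nullptr`, so the loop computes
// nothing that is still needed.
bool is_not_empty(const node *p) { return count_nodes(p) != 0; }
```

With this change, the `-O2`/`-O3` check lines for such functions in the updated test reduce to a single `icmp ne` against `null` followed by `ret`.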

Now, as with all things SCEV, this comes with
an entirely expected ~`+0.12%` compile-time regression:
https://llvm-compile-time-tracker.com/compare.php?from=0ae7bf124a9bca76dd9a91b2f7379168ff13f562&to=c2ae57c9b961aeb4a28c747266949340613a6d84&stat=instructions
(for comparison, doing this in the function simplification pipeline
would have been a ~`+0.5%` compile-time regression, see D112840)

Looking at the transformation stats over the vanilla test-suite, I think this is rather expected:
```
| statistic name                                   |  baseline |  proposed |     Δ |      % | abs(%) |
|--------------------------------------------------|----------:|----------:|------:|-------:|-------:|
| scalar-evolution.NumBruteForceTripCountsComputed |       789 |       888 |    99 | 12.55% | 12.55% |
| scalar-evolution.NumTripCountsNotComputed        |    105592 |    117900 | 12308 | 11.66% | 11.66% |
| loop-delete.NumBackedgesBroken                   |       542 |       559 |    17 |  3.14% |  3.14% |
| regalloc.numExtends                              |        81 |        79 |    -2 | -2.47% |  2.47% |
| indvars.NumFoldedUser                            |       408 |       400 |    -8 | -1.96% |  1.96% |
| indvars.NumElimCmp                               |      3831 |      3758 |   -73 | -1.91% |  1.91% |
| scalar-evolution.NumTripCountsComputed           |    299759 |    304278 |  4519 |  1.51% |  1.51% |
| loop-delete.NumDeleted                           |      8055 |      8128 |    73 |  0.91% |  0.91% |
| machine-cse.NumCommutes                          |       111 |       110 |    -1 | -0.90% |  0.90% |
| globaldce.NumFunctions                           |      1187 |      1192 |     5 |  0.42% |  0.42% |
| codegenprepare.NumSelectsExpanded                |       277 |       278 |     1 |  0.36% |  0.36% |
| loop-unroll.NumRuntimeUnrolled                   |     13841 |     13791 |   -50 | -0.36% |  0.36% |
| machinelicm.NumPostRAHoisted                     |      1168 |      1172 |     4 |  0.34% |  0.34% |
| phi-node-elimination.NumCriticalEdgesSplit       |     83054 |     82879 |  -175 | -0.21% |  0.21% |
| machine-cse.NumPREs                              |      3085 |      3079 |    -6 | -0.19% |  0.19% |
| branch-folder.NumBranchOpts                      |    108122 |    107942 |  -180 | -0.17% |  0.17% |
| loop-unroll.NumUnrolled                          |     40136 |     40067 |   -69 | -0.17% |  0.17% |
| branch-folder.NumDeadBlocks                      |    130818 |    130607 |  -211 | -0.16% |  0.16% |
| codegenprepare.NumBlocksElim                     |     92856 |     92714 |  -142 | -0.15% |  0.15% |
| instsimplify.NumSimplified                       |    103263 |    103129 |  -134 | -0.13% |  0.13% |
| instcombine.NumConstProp                         |     26070 |     26102 |    32 |  0.12% |  0.12% |
| instsimplify.NumExpand                           |      1716 |      1718 |     2 |  0.12% |  0.12% |
| loop-unroll.NumCompletelyUnrolled                |      9236 |      9225 |   -11 | -0.12% |  0.12% |
| branch-folder.NumHoist                           |      2773 |      2770 |    -3 | -0.11% |  0.11% |
| regalloc.NumReloadsRemoved                       |     10822 |     10834 |    12 |  0.11% |  0.11% |
| regalloc.NumSnippets                             |     11394 |     11406 |    12 |  0.11% |  0.11% |
| machine-cse.NumCrossBBCSEs                       |      1052 |      1053 |     1 |  0.10% |  0.10% |
| machinelicm.NumCSEed                             |     99887 |     99784 |  -103 | -0.10% |  0.10% |
| branch-folder.NumTailMerge                       |     72501 |     72435 |   -66 | -0.09% |  0.09% |
| codegenprepare.NumExtUses                        |     22007 |     21987 |   -20 | -0.09% |  0.09% |
| local.NumRemoved                                 |     68232 |     68294 |    62 |  0.09% |  0.09% |
| loop-vectorize.LoopsAnalyzed                     |     75483 |     75413 |   -70 | -0.09% |  0.09% |
```
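
As a reading aid, the derived columns can be recomputed from `baseline` and `proposed` as in the sketch below. The `StatChange` and `compareStat` names are made up for illustration and are not part of any LLVM tooling; the sketch also assumes a non-zero baseline. The rows appear to be sorted by the last column (the magnitude of the relative change), in descending order.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper type mirroring the last three table columns.
struct StatChange {
  std::int64_t delta;  // proposed - baseline
  double percent;      // signed relative change, in percent
  double absPercent;   // magnitude of the relative change, in percent
};

// Assumes baseline != 0; a zero baseline would need separate handling.
StatChange compareStat(std::int64_t baseline, std::int64_t proposed) {
  const std::int64_t delta = proposed - baseline;
  const double percent =
      100.0 * static_cast<double>(delta) / static_cast<double>(baseline);
  return {delta, percent, std::fabs(percent)};
}
```

For example, `compareStat(789, 888)` gives a delta of 99 and a relative change that rounds to 12.55%, matching the `NumBruteForceTripCountsComputed` row.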

Note that I'm only changing the current PM and not touching the obsolete PM.

This is an alternative to the function simplification pipeline variant
of the same change, D112840. It has less compile-time impact
(since the number of additional SCEV trip count calculations
is far lower than with D112840), and it is also
much more powerful/impactful (almost 2x more loops deleted).

I have checked, and doing this after loop rotation
is favorable (more loops deleted).

Reviewed By: mkazantsev

Differential Revision: https://reviews.llvm.org/D112851

Added: 
    

Modified: 
    llvm/lib/Passes/PassBuilderPipelines.cpp
    llvm/test/Other/new-pm-defaults.ll
    llvm/test/Other/new-pm-thinlto-defaults.ll
    llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
    llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
    llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll

Removed: 
    


################################################################################
diff  --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index 2009a687ae7d6..f0f7803ed3ae5 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -1093,11 +1093,16 @@ PassBuilder::buildModuleOptimizationPipeline(OptimizationLevel Level,
   for (auto &C : VectorizerStartEPCallbacks)
     C(OptimizePM, Level);
 
+  LoopPassManager LPM;
   // First rotate loops that may have been un-rotated by prior passes.
   // Disable header duplication at -Oz.
+  LPM.addPass(LoopRotatePass(Level != OptimizationLevel::Oz, LTOPreLink));
+  // Some loops may have become dead by now. Try to delete them.
+  // FIXME: see disscussion in https://reviews.llvm.org/D112851
+  //        this may need to be revisited once GVN is more powerful.
+  LPM.addPass(LoopDeletionPass());
   OptimizePM.addPass(createFunctionToLoopPassAdaptor(
-      LoopRotatePass(Level != OptimizationLevel::Oz, LTOPreLink),
-      /*UseMemorySSA=*/false, /*UseBlockFrequencyInfo=*/false));
+      std::move(LPM), /*UseMemorySSA=*/false, /*UseBlockFrequencyInfo=*/false));
 
   // Distribute loops to allow partial vectorization.  I.e. isolate dependences
   // into separate loop that would otherwise inhibit vectorization.  This is

diff  --git a/llvm/test/Other/new-pm-defaults.ll b/llvm/test/Other/new-pm-defaults.ll
index 5067b6fbdd188..b9f90dad82249 100644
--- a/llvm/test/Other/new-pm-defaults.ll
+++ b/llvm/test/Other/new-pm-defaults.ll
@@ -216,6 +216,7 @@
 ; CHECK-O-NEXT: Running pass: LoopSimplifyPass
 ; CHECK-O-NEXT: Running pass: LCSSAPass
 ; CHECK-O-NEXT: Running pass: LoopRotatePass
+; CHECK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-O-NEXT: Running pass: LoopDistributePass
 ; CHECK-O-NEXT: Running pass: InjectTLIMappings
 ; CHECK-O-NEXT: Running pass: LoopVectorizePass

diff  --git a/llvm/test/Other/new-pm-thinlto-defaults.ll b/llvm/test/Other/new-pm-thinlto-defaults.ll
index 1f52fe47ae73c..7836de5c6ccef 100644
--- a/llvm/test/Other/new-pm-thinlto-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-defaults.ll
@@ -196,6 +196,7 @@
 ; CHECK-POSTLINK-O-NEXT: Running pass: LoopSimplifyPass
 ; CHECK-POSTLINK-O-NEXT: Running pass: LCSSAPass
 ; CHECK-POSTLINK-O-NEXT: Running pass: LoopRotatePass
+; CHECK-POSTLINK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-POSTLINK-O-NEXT: Running pass: LoopDistributePass
 ; CHECK-POSTLINK-O-NEXT: Running pass: InjectTLIMappings
 ; CHECK-POSTLINK-O-NEXT: Running pass: LoopVectorizePass

diff  --git a/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll b/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
index 3a80efba3c565..e66e8672358c1 100644
--- a/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
@@ -167,6 +167,7 @@
 ; CHECK-O-NEXT: Running pass: LoopSimplifyPass on foo
 ; CHECK-O-NEXT: Running pass: LCSSAPass on foo
 ; CHECK-O-NEXT: Running pass: LoopRotatePass
+; CHECK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-O-NEXT: Running pass: LoopDistributePass
 ; CHECK-O-NEXT: Running pass: InjectTLIMappings
 ; CHECK-O-NEXT: Running pass: LoopVectorizePass

diff  --git a/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll b/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
index 2e822b21f8a11..410841124c8e7 100644
--- a/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
@@ -179,6 +179,7 @@
 ; CHECK-O-NEXT: Running pass: LoopSimplifyPass
 ; CHECK-O-NEXT: Running pass: LCSSAPass
 ; CHECK-O-NEXT: Running pass: LoopRotatePass
+; CHECK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-O-NEXT: Running pass: LoopDistributePass
 ; CHECK-O-NEXT: Running pass: InjectTLIMappings
 ; CHECK-O-NEXT: Running pass: LoopVectorizePass

diff  --git a/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll b/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll
index ec8db3cceeb11..99a52acd3b2b1 100644
--- a/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll
+++ b/llvm/test/Transforms/PhaseOrdering/deletion-of-loops-that-became-side-effect-free.ll
@@ -11,17 +11,8 @@
 define dso_local zeroext i1 @is_not_empty_variant1(%struct.node* %p) {
 ; ALL-LABEL: @is_not_empty_variant1(
 ; ALL-NEXT:  entry:
-; ALL-NEXT:    [[TOBOOL_NOT3_I:%.*]] = icmp eq %struct.node* [[P:%.*]], null
-; ALL-NEXT:    br i1 [[TOBOOL_NOT3_I]], label [[COUNT_NODES_VARIANT1_EXIT:%.*]], label [[WHILE_BODY_I:%.*]]
-; ALL:       while.body.i:
-; ALL-NEXT:    [[P_ADDR_04_I:%.*]] = phi %struct.node* [ [[TMP0:%.*]], [[WHILE_BODY_I]] ], [ [[P]], [[ENTRY:%.*]] ]
-; ALL-NEXT:    [[NEXT_I:%.*]] = getelementptr inbounds [[STRUCT_NODE:%.*]], %struct.node* [[P_ADDR_04_I]], i64 0, i32 0
-; ALL-NEXT:    [[TMP0]] = load %struct.node*, %struct.node** [[NEXT_I]], align 8
-; ALL-NEXT:    [[TOBOOL_NOT_I:%.*]] = icmp eq %struct.node* [[TMP0]], null
-; ALL-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT1_EXIT]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP0:![0-9]+]]
-; ALL:       count_nodes_variant1.exit:
-; ALL-NEXT:    [[TMP1:%.*]] = xor i1 [[TOBOOL_NOT3_I]], true
-; ALL-NEXT:    ret i1 [[TMP1]]
+; ALL-NEXT:    [[TOBOOL_NOT3_I:%.*]] = icmp ne %struct.node* [[P:%.*]], null
+; ALL-NEXT:    ret i1 [[TOBOOL_NOT3_I]]
 ;
 entry:
   %p.addr = alloca %struct.node*, align 8
@@ -113,39 +104,13 @@ while.end:
 define dso_local zeroext i1 @is_not_empty_variant3(%struct.node* %p) {
 ; O3-LABEL: @is_not_empty_variant3(
 ; O3-NEXT:  entry:
-; O3-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp eq %struct.node* [[P:%.*]], null
-; O3-NEXT:    br i1 [[TOBOOL_NOT4_I]], label [[COUNT_NODES_VARIANT3_EXIT:%.*]], label [[WHILE_BODY_I:%.*]]
-; O3:       while.body.i:
-; O3-NEXT:    [[SIZE_06_I:%.*]] = phi i64 [ [[INC_I:%.*]], [[WHILE_BODY_I]] ], [ 0, [[ENTRY:%.*]] ]
-; O3-NEXT:    [[P_ADDR_05_I:%.*]] = phi %struct.node* [ [[TMP0:%.*]], [[WHILE_BODY_I]] ], [ [[P]], [[ENTRY]] ]
-; O3-NEXT:    [[CMP_I:%.*]] = icmp ne i64 [[SIZE_06_I]], -1
-; O3-NEXT:    tail call void @llvm.assume(i1 [[CMP_I]]) #[[ATTR3:[0-9]+]]
-; O3-NEXT:    [[NEXT_I:%.*]] = getelementptr inbounds [[STRUCT_NODE:%.*]], %struct.node* [[P_ADDR_05_I]], i64 0, i32 0
-; O3-NEXT:    [[TMP0]] = load %struct.node*, %struct.node** [[NEXT_I]], align 8
-; O3-NEXT:    [[INC_I]] = add nuw i64 [[SIZE_06_I]], 1
-; O3-NEXT:    [[TOBOOL_NOT_I:%.*]] = icmp eq %struct.node* [[TMP0]], null
-; O3-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT3_EXIT]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP2:![0-9]+]]
-; O3:       count_nodes_variant3.exit:
-; O3-NEXT:    [[TMP1:%.*]] = xor i1 [[TOBOOL_NOT4_I]], true
-; O3-NEXT:    ret i1 [[TMP1]]
+; O3-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp ne %struct.node* [[P:%.*]], null
+; O3-NEXT:    ret i1 [[TOBOOL_NOT4_I]]
 ;
 ; O2-LABEL: @is_not_empty_variant3(
 ; O2-NEXT:  entry:
-; O2-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp eq %struct.node* [[P:%.*]], null
-; O2-NEXT:    br i1 [[TOBOOL_NOT4_I]], label [[COUNT_NODES_VARIANT3_EXIT:%.*]], label [[WHILE_BODY_I:%.*]]
-; O2:       while.body.i:
-; O2-NEXT:    [[SIZE_06_I:%.*]] = phi i64 [ [[INC_I:%.*]], [[WHILE_BODY_I]] ], [ 0, [[ENTRY:%.*]] ]
-; O2-NEXT:    [[P_ADDR_05_I:%.*]] = phi %struct.node* [ [[TMP0:%.*]], [[WHILE_BODY_I]] ], [ [[P]], [[ENTRY]] ]
-; O2-NEXT:    [[CMP_I:%.*]] = icmp ne i64 [[SIZE_06_I]], -1
-; O2-NEXT:    tail call void @llvm.assume(i1 [[CMP_I]]) #[[ATTR3:[0-9]+]]
-; O2-NEXT:    [[NEXT_I:%.*]] = getelementptr inbounds [[STRUCT_NODE:%.*]], %struct.node* [[P_ADDR_05_I]], i64 0, i32 0
-; O2-NEXT:    [[TMP0]] = load %struct.node*, %struct.node** [[NEXT_I]], align 8
-; O2-NEXT:    [[INC_I]] = add nuw i64 [[SIZE_06_I]], 1
-; O2-NEXT:    [[TOBOOL_NOT_I:%.*]] = icmp eq %struct.node* [[TMP0]], null
-; O2-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT3_EXIT]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP2:![0-9]+]]
-; O2:       count_nodes_variant3.exit:
-; O2-NEXT:    [[TMP1:%.*]] = xor i1 [[TOBOOL_NOT4_I]], true
-; O2-NEXT:    ret i1 [[TMP1]]
+; O2-NEXT:    [[TOBOOL_NOT4_I:%.*]] = icmp ne %struct.node* [[P:%.*]], null
+; O2-NEXT:    ret i1 [[TOBOOL_NOT4_I]]
 ;
 ; O1-LABEL: @is_not_empty_variant3(
 ; O1-NEXT:  entry:
@@ -160,7 +125,7 @@ define dso_local zeroext i1 @is_not_empty_variant3(%struct.node* %p) {
 ; O1-NEXT:    [[TMP0]] = load %struct.node*, %struct.node** [[NEXT_I]], align 8
 ; O1-NEXT:    [[INC_I]] = add i64 [[SIZE_06_I]], 1
 ; O1-NEXT:    [[TOBOOL_NOT_I:%.*]] = icmp eq %struct.node* [[TMP0]], null
-; O1-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT3_EXIT_LOOPEXIT:%.*]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP2:![0-9]+]]
+; O1-NEXT:    br i1 [[TOBOOL_NOT_I]], label [[COUNT_NODES_VARIANT3_EXIT_LOOPEXIT:%.*]], label [[WHILE_BODY_I]], !llvm.loop [[LOOP0:![0-9]+]]
 ; O1:       count_nodes_variant3.exit.loopexit:
 ; O1-NEXT:    [[PHI_CMP:%.*]] = icmp ne i64 [[INC_I]], 0
 ; O1-NEXT:    br label [[COUNT_NODES_VARIANT3_EXIT]]


        

