[llvm] 905083f - [LTO] Ensure LICM hoists expensive fdiv instructions introduced by InstCombine

David Sherwood via llvm-commits <llvm-commits at lists.llvm.org>
Fri Jul 7 04:06:33 PDT 2023


Author: David Sherwood
Date: 2023-07-07T11:06:24Z
New Revision: 905083f3c1a6c8dada42ad1c68553fa639f22859

URL: https://github.com/llvm/llvm-project/commit/905083f3c1a6c8dada42ad1c68553fa639f22859
DIFF: https://github.com/llvm/llvm-project/commit/905083f3c1a6c8dada42ad1c68553fa639f22859.diff

LOG: [LTO] Ensure LICM hoists expensive fdiv instructions introduced by InstCombine

In the LTO pipeline we run InstCombine after LICM, which is
different from what we normally do without LTO. This has the
effect of undoing the work done by LICM to reduce the cost of
the loop, i.e. hoisting the fdiv out of the loop and replacing
it with an fmul by the reciprocal. When InstCombine runs after
LICM it puts the fdiv straight back into the loop, which, on
AArch64 at least, is darn expensive. You can observe this
problem in the SPEC2017 benchmark parest if you build with
"-Ofast -flto" and the loop-vectoriser uses an unroll factor
of 1, which is what often happens when tail-folding is enabled.

This is also a problem for scalar loops, or indeed any loop
where there is only one use of the preheader fdiv result in
the loop.

See InstCombinerImpl::visitFMul for the code that sinks the fdiv.
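
To illustrate, here is a hand-written sketch (not taken from the
patch; the value names %recip, %v and %addr are made up). After
LICM the expensive divide is loop-invariant and the loop body
only contains a multiply by the hoisted reciprocal:

  entry:
    %recip = fdiv fast float 1.000000e+00, %b
    br label %loop
  loop:
    %v = load float, ptr %addr, align 4
    %mul = fmul fast float %v, %recip    ; cheap multiply per iteration
    ...

When InstCombine then sees that %recip has a single use inside
the loop, visitFMul folds the pair back into a divide executed on
every iteration:

  loop:
    %v = load float, ptr %addr, align 4
    %div = fdiv fast float %v, %b        ; expensive divide per iteration
    ...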

I've attempted to fix this by adding another LICM pass after
InstCombine in the Full LTO pipeline. The alternative would be
to stop InstCombine from sinking the fdiv into loops. See
D87479 for a previous discussion of this issue.

Differential Revision: https://reviews.llvm.org/D143631

Added: 
    

Modified: 
    llvm/lib/Passes/PassBuilderPipelines.cpp
    llvm/test/Other/new-pm-lto-defaults.ll
    llvm/test/Transforms/PhaseOrdering/lto-licm.ll

Removed: 
    


################################################################################
diff  --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index d3e625e5576b11..37565408fbede3 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -1245,19 +1245,24 @@ void PassBuilder::addVectorPasses(OptimizationLevel Level,
     // or SimplifyCFG passes scheduled after us, that would cleanup
     // the CFG mess this may created if allowed to modify CFG, so forbid that.
     FPM.addPass(SROAPass(SROAOptions::PreserveCFG));
-    FPM.addPass(InstCombinePass());
-    FPM.addPass(createFunctionToLoopPassAdaptor(
-        LICMPass(PTO.LicmMssaOptCap, PTO.LicmMssaNoAccForPromotionCap,
-                 /*AllowSpeculation=*/true),
-        /*UseMemorySSA=*/true, /*UseBlockFrequencyInfo=*/false));
   }
 
+  FPM.addPass(InstCombinePass());
+
+  // This is needed for two reasons:
+  //   1. It works around problems that instcombine introduces, such as sinking
+  //      expensive FP divides into loops containing multiplications using the
+  //      divide result.
+  //   2. It helps to clean up some loop-invariant code created by the loop
+  //      unroll pass when IsFullLTO=false.
+  FPM.addPass(createFunctionToLoopPassAdaptor(
+      LICMPass(PTO.LicmMssaOptCap, PTO.LicmMssaNoAccForPromotionCap,
+               /*AllowSpeculation=*/true),
+      /*UseMemorySSA=*/true, /*UseBlockFrequencyInfo=*/false));
+
   // Now that we've vectorized and unrolled loops, we may have more refined
   // alignment information, try to re-derive it here.
   FPM.addPass(AlignmentFromAssumptionsPass());
-
-  if (IsFullLTO)
-    FPM.addPass(InstCombinePass());
 }
 
 ModulePassManager

diff  --git a/llvm/test/Other/new-pm-lto-defaults.ll b/llvm/test/Other/new-pm-lto-defaults.ll
index 3c0f2bbdc06bcd..1b64760e42c1ef 100644
--- a/llvm/test/Other/new-pm-lto-defaults.ll
+++ b/llvm/test/Other/new-pm-lto-defaults.ll
@@ -128,8 +128,11 @@
 ; CHECK-O3-NEXT: Running pass: SLPVectorizerPass on foo
 ; CHECK-OS-NEXT: Running pass: SLPVectorizerPass on foo
 ; CHECK-O23SZ-NEXT: Running pass: VectorCombinePass on foo
-; CHECK-O23SZ-NEXT: Running pass: AlignmentFromAssumptionsPass on foo
 ; CHECK-O23SZ-NEXT: Running pass: InstCombinePass on foo
+; CHECK-O23SZ-NEXT: Running pass: LoopSimplifyPass
+; CHECK-O23SZ-NEXT: Running pass: LCSSAPass
+; CHECK-O23SZ-NEXT: Running pass: LICMPass
+; CHECK-O23SZ-NEXT: Running pass: AlignmentFromAssumptionsPass on foo
 ; CHECK-EP-Peephole-NEXT: Running pass: NoOpFunctionPass on foo
 ; CHECK-O23SZ-NEXT: Running pass: JumpThreadingPass on foo
 ; CHECK-O23SZ-NEXT: Running pass: LowerTypeTestsPass

diff  --git a/llvm/test/Transforms/PhaseOrdering/lto-licm.ll b/llvm/test/Transforms/PhaseOrdering/lto-licm.ll
index 8c080acaae61cc..1a5a67d8241a1b 100644
--- a/llvm/test/Transforms/PhaseOrdering/lto-licm.ll
+++ b/llvm/test/Transforms/PhaseOrdering/lto-licm.ll
@@ -4,6 +4,7 @@
 define void @hoist_fdiv(ptr %a, float %b) {
 ; CHECK-LABEL: @hoist_fdiv(
 ; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = fdiv fast float 1.000000e+00, [[B:%.*]]
 ; CHECK-NEXT:    br label [[FOR_COND:%.*]]
 ; CHECK:       for.cond:
 ; CHECK-NEXT:    [[I_0:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC:%.*]], [[FOR_INC:%.*]] ]
@@ -12,9 +13,9 @@ define void @hoist_fdiv(ptr %a, float %b) {
 ; CHECK:       for.inc:
 ; CHECK-NEXT:    [[IDXPROM:%.*]] = zext i32 [[I_0]] to i64
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds float, ptr [[A:%.*]], i64 [[IDXPROM]]
-; CHECK-NEXT:    [[TMP0:%.*]] = load float, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT:    [[TMP1:%.*]] = fdiv fast float [[TMP0]], [[B:%.*]]
-; CHECK-NEXT:    store float [[TMP1]], ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = fmul fast float [[TMP1]], [[TMP0]]
+; CHECK-NEXT:    store float [[TMP2]], ptr [[ARRAYIDX]], align 4
 ; CHECK-NEXT:    [[INC]] = add nuw nsw i32 [[I_0]], 1
 ; CHECK-NEXT:    br label [[FOR_COND]]
 ; CHECK:       for.end:


        

