[PATCH] D143631: [LTO] Don't let InstCombine re-sink the vastly more expensive fdiv

Thu Feb 9 01:12:30 PST 2023

david-arm created this revision.
david-arm added reviewers: sdesmalen, kmclaughlin, spatel, craig.topper, MattDevereau.
Herald added subscribers: ormris, StephenFan, asbirlea, steven_wu, hiraditya, kristof.beyls, inglorion.
Herald added a project: All.
david-arm requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

For some reason LTO decides to run InstCombine after LICM, which is
different to what we normally do without LTO. This has the effect of
undoing all the great work done by LICM to reduce the cost of the
loop when it hoists the fdiv out and replaces it with fmul. When
InstCombine runs after LICM it puts the fdiv straight back which,
on AArch64 at least, is darn expensive. You can observe this
problem in the SPEC2017 benchmark parest if you build with
"-Ofast -flto" and enable tail-folding in the vectoriser. This is
also a problem for scalar loops, or indeed any loop where there is
only one use of the preheader fdiv result in the loop.

See InstCombinerImpl::visitFMul for the code that sinks the fdiv.

I've attempted to fix this by adding another LICM pass for Full
LTO after InstCombine. The alternative is to stop InstCombine
from sinking the fdiv into loops. See D87479 <https://reviews.llvm.org/D87479> for a previous
discussion on this issue.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D143631

Files:
  llvm/lib/Passes/PassBuilderPipelines.cpp
  llvm/test/Other/new-pm-lto-defaults.ll
  llvm/test/Transforms/PhaseOrdering/lto-licm.ll


Index: llvm/test/Transforms/PhaseOrdering/lto-licm.ll
===================================================================

--- llvm/test/Transforms/PhaseOrdering/lto-licm.ll
+++ llvm/test/Transforms/PhaseOrdering/lto-licm.ll
@@ -4,6 +4,7 @@
 define void @hoist_fdiv(ptr %a, float %b) {
 ; CHECK-LABEL: @hoist_fdiv(
 ; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = fdiv fast float 1.000000e+00, [[B:%.*]]
 ; CHECK-NEXT:    br label [[FOR_COND:%.*]]
 ; CHECK:       for.cond:
 ; CHECK-NEXT:    [[I_0:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC:%.*]], [[FOR_INC:%.*]] ]
@@ -12,9 +13,9 @@
 ; CHECK:       for.inc:
 ; CHECK-NEXT:    [[IDXPROM:%.*]] = zext i32 [[I_0]] to i64
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds float, ptr [[A:%.*]], i64 [[IDXPROM]]
-; CHECK-NEXT:    [[TMP0:%.*]] = load float, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT:    [[TMP1:%.*]] = fdiv fast float [[TMP0]], [[B:%.*]]
-; CHECK-NEXT:    store float [[TMP1]], ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = fmul fast float [[TMP1]], [[TMP0]]
+; CHECK-NEXT:    store float [[TMP2]], ptr [[ARRAYIDX]], align 4
 ; CHECK-NEXT:    [[INC]] = add nuw nsw i32 [[I_0]], 1
 ; CHECK-NEXT:    br label [[FOR_COND]]
 ; CHECK:       for.end:
Index: llvm/test/Other/new-pm-lto-defaults.ll
===================================================================
--- llvm/test/Other/new-pm-lto-defaults.ll
+++ llvm/test/Other/new-pm-lto-defaults.ll
@@ -134,6 +134,9 @@
 ; CHECK-O23SZ-NEXT: Running pass: VectorCombinePass on foo
 ; CHECK-O23SZ-NEXT: Running pass: AlignmentFromAssumptionsPass on foo
 ; CHECK-O23SZ-NEXT: Running pass: InstCombinePass on foo
+; CHECK-O23SZ-NEXT: Running pass: LoopSimplifyPass
+; CHECK-O23SZ-NEXT: Running pass: LCSSAPass
+; CHECK-O23SZ-NEXT: Running pass: LICMPass
 ; CHECK-EP-Peephole-NEXT: Running pass: NoOpFunctionPass on foo
 ; CHECK-O23SZ-NEXT: Running pass: JumpThreadingPass on foo
 ; CHECK-O23SZ-NEXT: Running pass: LowerTypeTestsPass
Index: llvm/lib/Passes/PassBuilderPipelines.cpp
===================================================================
--- llvm/lib/Passes/PassBuilderPipelines.cpp
+++ llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -1234,8 +1234,19 @@
   // alignment information, try to re-derive it here.
   FPM.addPass(AlignmentFromAssumptionsPass());
 
-  if (IsFullLTO)
+  if (IsFullLTO) {
     FPM.addPass(InstCombinePass());
+
+    // This is needed to work around problems that instcombine introduces, such
+    // as sinking expensive FP divides into loops containing multiplications
+    // using the divide result.
+    // For normal compilation without LTO we never run the InstCombine pass
+    // after LICM so we avoid this problem.
+    FPM.addPass(createFunctionToLoopPassAdaptor(
+        LICMPass(PTO.LicmMssaOptCap, PTO.LicmMssaNoAccForPromotionCap,
+                 /*AllowSpeculation=*/true),
+        /*UseMemorySSA=*/true, /*UseBlockFrequencyInfo=*/true));
+  }
 }
 
 ModulePassManager


-------------- next part --------------
A non-text attachment was scrubbed...
Name: D143631.496042.patch
Type: text/x-patch
Size: 3024 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20230209/1bb20597/attachment.bin>