[llvm-dev] [RFC] SimplifyCFG block speculation is a horribly inconsistent mess

Thu Nov 15 08:23:06 PST 2018

Background
==========

This is part of my long-running quest to better optimise

  float min_element_example(float *data, int size) {
    return *std::min_element(data, data+size);
  }

which requires improvements to GVN, which first requires a side-quest to delay
phi-to-select generation until after it (see https://reviews.llvm.org/D50723 and
my earlier RFC "Delaying phi-to-select transformation until later in the pass
pipeline") which requires side-side-quests to fix everything that gets worse when
we do that.

There's some bad interaction between jump threading and simplifycfg phi-to-select
generation, where jump threading then phi-to-select gives worse results than
phi-to-select then jump threading, specifically in the function:

  define double @perlin_example(i32 %h, double %x, double %y, double %z) {
  entry:
    %cmp = icmp slt i32 %h, 8
    br i1 %cmp, label %cond.end, label %cond.false

  cond.false:
    br label %cond.end

  cond.end:
    %cond = phi double [ %y, %cond.false ], [ %x, %entry ]
    %cmp1 = icmp slt i32 %h, 4
    br i1 %cmp1, label %cond.end9, label %cond.false3

  cond.false3:
    %cmp4 = icmp eq i32 %h, 12
    br i1 %cmp4, label %cond.end9, label %cond.false6

  cond.false6:
    br label %cond.end9

  cond.end9:
    %cond10 = phi double [ %z, %cond.false6 ], [ %y, %cond.end ], [ %x, %cond.false3 ]
    %add = fadd double %cond, %cond10
    ret double %add
  }

(heavily reduced from SingleSource/Benchmarks/Misc/perlin.c) where
 opt -jump-threading -simplifycfg
gives worse results than
 opt -simplifycfg -jump-threading
and we also get these worse results for opt -O3 when D50723 is applied.

I think the correct way to fix this is to improve how SimplifyCFG does block speculation,
which is currently a horrible inconsistent mess and I think I need to fix the mess first
before I make any changes, and it's that side-side-side-quest that I'm talking about here.

Overview
========

Some of what SimplifyCFG does is speculate blocks, i.e. move instructions from
a block into one of its predecessors. There are five functions (that I've found)
that do this, with a lot of overlap in what they handle and also some restricted
in ways that others aren't without much consistency:

FoldBranchToCommonDest:
 * BB can have any number of predecessors, the first valid predecessor is
   chosen.
 * BB can have any number of successors, but it has to share a successor with
   one of its predecessors.
 * The condition must be an instruction in BB.
 * BB can contain at most 1+BonusInstThreshold (non-branch) instructions.
 * Phis are updated rather than turned into selects, so this function cannot
   happen if the phi cannot be updated in this way.
 * Branch condition in predecessor merged with BB's using logical operations.

HoistThenElseCodeToIf:
 * BB1 and BB2 must be the have the same predecessor, and be the only successors
   of that predecessor.
 * The successors of BB1 and BB2 do not matter (except see phi handling below).
 * No restriction on condition.
 * Only instructions that are common to BB1 and BB2 are speculated.
 * If the terminators of BB1 and BB2 is speculated (i.e. BB1 and BB2 have the
   same successors), then phis in BB1 and BB2's successors are turned into
   selects in their predecessor.
 * Branch in predecessor turned into that of BB1/BB2 if it was hoisted.

SpeculativelyExecuteBB:
 * BB must have one predecessor.
 * BB must have one successor, which is the other successor of its predecessor.
 * The condition being speculated cannot come from FPCmpInst.
 * BB can contain at most 1 (non-branch) instruction.
 * Phis in BB's successor are turned into selects in BB's predecessor.
 * BB's predecessor changed to branch unconditionally to its other successor.

SimplifyCondBranchToCondBranch:
 * BB can have any number of predecessors, the first valid predecessor is
   chosen.
 * BB can have any number of successors, but it has to share a successor with
   the predecessor.
 * No restriction on condition.
 * BB must be empty (except for the branch).
 * Phis in the shared successor are turned into selects in the predecessor.
 * Branch condition in predecessor merged with BB's using logical operations.

FoldTwoEntryPHINode:
 * BB1 and/or BB2 must have the same predecessor, and must form a simple if or if/else.
 * BB1 and BB2 must have the same successor, or we only have BB1 which has one successor
 * No restriction on condition.
 * Instructions with total cost PHINodeFoldingThreshold * TCC_Basic can be speculated.
 * Phis in the successor are turned into selects in the common predecessor.
 * Branch condition in predecessor changed to branch unconditionally.

Examples
========

What does all this means? Let's look at some examples:

Inconsistent fcmp / empty block handling
----------------------------------------

  define float @fn1(float %a, float %b, float %c) {
  entry:
    %cmp1 = fcmp une float %a, %b
    br i1 %cmp1, label %bb1, label %end
  bb1:
    %cmp2 = fcmp une float %a, %c
    br i1 %cmp2, label %bb2, label %end
  bb2:
    br label %end
  end:
    %phi = phi float [ 0.0, %entry ], [ 1.0, %bb1 ], [ 2.0, %bb2 ]
    ret float %phi
  }

  define float @fn2(float %a, float %b, float %c) {
  entry:
    %cmp1 = fcmp une float %a, %b
    %cmp2 = fcmp une float %a, %c
    br i1 %cmp1, label %bb1, label %end
  bb1:
    br i1 %cmp2, label %bb2, label %end
  bb2:
    br label %end
  end:
    %phi = phi float [ 0.0, %entry ], [ 1.0, %bb1 ], [ 2.0, %bb2 ]
    ret float %phi
  }

In @fn1 no speculation happens because the two function which could possibly do
it are SpeculativelyExecuteBB, which can't do it because %cmp2 is an fcmp, and
SimplifyCondBranchToCondBranch, which can't do it because bb1 isn't empty

However in @fn2 we can now speculate bb1 because %cmp2 has been moved so bb1 is
now empty and SimplifyCondBranchToCondBranch, which doesn't care about fcmp, can
speculate it.

Large impact due to which function speculating what
---------------------------------------------------

  define i32 @fn3(i1 %c, i32 %arg) {
  entry:
    br i1 %c, label %bb1, label %end
  bb1:
    %cmp = icmp eq i32 %arg, 0
    br i1 %cmp, label %bb2, label %end
  bb2:
    br label %end
  end:
    %phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
    ret i32 %phi
  }

  define i32 @fn4(i1 %c, i32 %arg) {
  entry:
    %cmp = icmp eq i32 %arg, 0
    br i1 %c, label %bb1, label %end
  bb1:
    br i1 %cmp, label %bb2, label %end
  bb2:
    br label %end
  end:
    %phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
    ret i32 %phi
  }

  define i32 @fn5(i1 %c, i32 %arg) {
  entry:
    %cmp = icmp eq i32 %arg, 0
    br i1 %c, label %bb1, label %end
  bb1:
    br i1 %cmp, label %bb2, label %end
  end:
    %phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
    ret i32 %phi
  bb2:
    br label %end
  }

In @fn3:
 * FoldBranchToCommonDest speculates %bb1 in %entry
 * FoldTwoEntryPHINode turn phi into a select in %entry
 * %bb2 now an empty block which serves no purpose, and is removed

In @fn4:
 * SpeculativelyExecuteBB speculates %bb2 in %bb1
 * SimplifyCondBranchToTwoReturns makes %bb1 select then return instead of conditionally
   branching to two returns

In @fn5:
 * SpeculativelyExecuteBB speculates %bb2 in %bb1, inserting a select
 * FoldTwoEntryPHINode turns the phi in %end into a select in %entry

What all this means is what we end up with is:
 * @fn3: cmp; and; select
 * @fn4: cmp; conditional branch to select
 * @fn5: cmp; select; select
of which @fn3 is the best.

Inconsistent thresholds
-----------------------

  define float @fn6(i32 %a, i32 %b, float %c) {
  entry:
    %cmp1 = icmp ne i32 %a, %b
    br i1 %cmp1, label %bb1, label %end
  bb1:
    %mul = fmul float %c, 3.0
    %add = fadd float %mul, 3.0
    br label %end
  end:
    %phi = phi float [ %add, %bb1 ], [ 0.0, %entry ]
    ret float %phi
  }

  define float @fn7(i32 %a, i32 %b, float %c) {
  entry:
    %cmp1 = icmp ne i32 %a, %b
    br i1 %cmp1, label %bb1, label %end
  bb1:
    %div = fdiv float %c, 3.0
    %add = fadd float %div, 3.0
    br label %end
  end:
    %phi = phi float [ %add, %bb1 ], [ 0.0, %entry ]
    ret float %phi
  }

  define void @fn8(i32 %a, i32 %b, float %c, i32* %p1, i32* %p2) {
  entry:
    %cmp1 = icmp ne i32 %a, %b
    br i1 %cmp1, label %bb1, label %bb2
  bb1:
    %div = fdiv float %c, 3.0
    %cmp2 = fcmp une float %div, 3.0
    br i1 %cmp2, label %bb2, label %end
  bb2:
    store i32 0, i32* %p1
    br label %end
  end:
    store i32 0, i32* %p2
    ret void
  }

FoldTwoEntryPHINode will fold the phi (and so speculated bb1) in @fn6 but not @fn7 as
fmul has cost TCC_Basic, so fmul + fadd is below the threshold, but fdiv has cost
TCC_Expensive, so fdiv + fadd is above the threshold.

In @fn8 however FoldBranchToCommonDest just checks the number of instructions so has no
problem at all speculating bb1.

Questions
=========

The questions arising from all of this:
 * Firstly, is there some alternate way of dealing with @perlin_example above (e.g doing
   something different in jump threading) that will magically solve my problems without
   having to deal with any of this?
 * The inconsistent FPCmpInst handling is really odd. Would anyone object if I removed
   that check from SpeculativelyExecuteBB?
 * What exactly is the right threshold for doing all of this stuff? FoldTwoEntryPHINode
   takes into accout the actual instruction cost, so that would seem like the most
   accurate.
 * It'd be really nice to cut down the number of functions doing this. I've got an attempt
   at making FoldBranchToCommonDest do everything that SimplifyCondBranchToCondBranch
   does so we can get rid of SimplifyCondBranchToCondBranch, as those two functions are
   the ones with the most overlap, but there's some major problems:
   * Doing this changes the order that blocks are speculated, giving worse results in
     some cases (but better in others).
   * SimplifyCondBranchToCondBranc must turn phi into select, FoldBranchToCommonDest cannot,
     and FoldBranchToCommonDest followed by the rest of the functions that can turn phi to
     select sometimes can't either.
   So: any ideas on how cutting down the number of functions could be done?
 * Alternatively instead of cutting down on the number of functions we could try to common
   out things into shared functions.
 * Any other ideas for how to make this less of a horrible mess?
 * Would people mind if I just ignore all of this and just make this into even more of a
   mess while trying to fix how we handle @perlin_example? (The correct answer is "No, don't
   do that" but it would certainly make my job easier.) Or alternatively does anyone else
   want to try sorting out this mess?

John