[llvm-dev] [RFC] SimplifyCFG block speculation is a horribly inconsistent mess
John Brawn via llvm-dev
llvm-dev at lists.llvm.org
Thu Nov 15 08:23:06 PST 2018
Background
==========
This is part of my long-running quest to better optimise
float min_element_example(float *data, int size) {
return *std::min_element(data, data+size);
}
which requires improvements to GVN, which first requires a side-quest to delay
phi-to-select generation until after it (see https://reviews.llvm.org/D50723 and
my earlier RFC "Delaying phi-to-select transformation until later in the pass
pipeline") which requires side-side-quests to fix everything that gets worse when
we do that.
There's some bad interaction between jump threading and simplifycfg phi-to-select
generation, where jump threading then phi-to-select gives worse results than
phi-to-select then jump threading, specifically in the function:
define double @perlin_example(i32 %h, double %x, double %y, double %z) {
entry:
%cmp = icmp slt i32 %h, 8
br i1 %cmp, label %cond.end, label %cond.false
cond.false:
br label %cond.end
cond.end:
%cond = phi double [ %y, %cond.false ], [ %x, %entry ]
%cmp1 = icmp slt i32 %h, 4
br i1 %cmp1, label %cond.end9, label %cond.false3
cond.false3:
%cmp4 = icmp eq i32 %h, 12
br i1 %cmp4, label %cond.end9, label %cond.false6
cond.false6:
br label %cond.end9
cond.end9:
%cond10 = phi double [ %z, %cond.false6 ], [ %y, %cond.end ], [ %x, %cond.false3 ]
%add = fadd double %cond, %cond10
ret double %add
}
(heavily reduced from SingleSource/Benchmarks/Misc/perlin.c) where
opt -jump-threading -simplifycfg
gives worse results than
opt -simplifycfg -jump-threading
and we also get these worse results for opt -O3 when D50723 is applied.
I think the correct way to fix this is to improve how SimplifyCFG does block speculation,
which is currently a horrible inconsistent mess and I think I need to fix the mess first
before I make any changes, and it's that side-side-side-quest that I'm talking about here.
Overview
========
Some of what SimplifyCFG does is speculate blocks, i.e. move instructions from
a block into one of its predecessors. There are five functions (that I've found)
that do this, with a lot of overlap in what they handle and also some restricted
in ways that others aren't without much consistency:
FoldBranchToCommonDest:
* BB can have any number of predecessors, the first valid predecessor is
chosen.
* BB can have any number of successors, but it has to share a successor with
one of its predecessors.
* The condition must be an instruction in BB.
* BB can contain at most 1+BonusInstThreshold (non-branch) instructions.
* Phis are updated rather than turned into selects, so this function cannot
happen if the phi cannot be updated in this way.
* Branch condition in predecessor merged with BB's using logical operations.
HoistThenElseCodeToIf:
* BB1 and BB2 must be the have the same predecessor, and be the only successors
of that predecessor.
* The successors of BB1 and BB2 do not matter (except see phi handling below).
* No restriction on condition.
* Only instructions that are common to BB1 and BB2 are speculated.
* If the terminators of BB1 and BB2 is speculated (i.e. BB1 and BB2 have the
same successors), then phis in BB1 and BB2's successors are turned into
selects in their predecessor.
* Branch in predecessor turned into that of BB1/BB2 if it was hoisted.
SpeculativelyExecuteBB:
* BB must have one predecessor.
* BB must have one successor, which is the other successor of its predecessor.
* The condition being speculated cannot come from FPCmpInst.
* BB can contain at most 1 (non-branch) instruction.
* Phis in BB's successor are turned into selects in BB's predecessor.
* BB's predecessor changed to branch unconditionally to its other successor.
SimplifyCondBranchToCondBranch:
* BB can have any number of predecessors, the first valid predecessor is
chosen.
* BB can have any number of successors, but it has to share a successor with
the predecessor.
* No restriction on condition.
* BB must be empty (except for the branch).
* Phis in the shared successor are turned into selects in the predecessor.
* Branch condition in predecessor merged with BB's using logical operations.
FoldTwoEntryPHINode:
* BB1 and/or BB2 must have the same predecessor, and must form a simple if or if/else.
* BB1 and BB2 must have the same successor, or we only have BB1 which has one successor
* No restriction on condition.
* Instructions with total cost PHINodeFoldingThreshold * TCC_Basic can be speculated.
* Phis in the successor are turned into selects in the common predecessor.
* Branch condition in predecessor changed to branch unconditionally.
Examples
========
What does all this means? Let's look at some examples:
Inconsistent fcmp / empty block handling
----------------------------------------
define float @fn1(float %a, float %b, float %c) {
entry:
%cmp1 = fcmp une float %a, %b
br i1 %cmp1, label %bb1, label %end
bb1:
%cmp2 = fcmp une float %a, %c
br i1 %cmp2, label %bb2, label %end
bb2:
br label %end
end:
%phi = phi float [ 0.0, %entry ], [ 1.0, %bb1 ], [ 2.0, %bb2 ]
ret float %phi
}
define float @fn2(float %a, float %b, float %c) {
entry:
%cmp1 = fcmp une float %a, %b
%cmp2 = fcmp une float %a, %c
br i1 %cmp1, label %bb1, label %end
bb1:
br i1 %cmp2, label %bb2, label %end
bb2:
br label %end
end:
%phi = phi float [ 0.0, %entry ], [ 1.0, %bb1 ], [ 2.0, %bb2 ]
ret float %phi
}
In @fn1 no speculation happens because the two function which could possibly do
it are SpeculativelyExecuteBB, which can't do it because %cmp2 is an fcmp, and
SimplifyCondBranchToCondBranch, which can't do it because bb1 isn't empty
However in @fn2 we can now speculate bb1 because %cmp2 has been moved so bb1 is
now empty and SimplifyCondBranchToCondBranch, which doesn't care about fcmp, can
speculate it.
Large impact due to which function speculating what
---------------------------------------------------
define i32 @fn3(i1 %c, i32 %arg) {
entry:
br i1 %c, label %bb1, label %end
bb1:
%cmp = icmp eq i32 %arg, 0
br i1 %cmp, label %bb2, label %end
bb2:
br label %end
end:
%phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
ret i32 %phi
}
define i32 @fn4(i1 %c, i32 %arg) {
entry:
%cmp = icmp eq i32 %arg, 0
br i1 %c, label %bb1, label %end
bb1:
br i1 %cmp, label %bb2, label %end
bb2:
br label %end
end:
%phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
ret i32 %phi
}
define i32 @fn5(i1 %c, i32 %arg) {
entry:
%cmp = icmp eq i32 %arg, 0
br i1 %c, label %bb1, label %end
bb1:
br i1 %cmp, label %bb2, label %end
end:
%phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
ret i32 %phi
bb2:
br label %end
}
In @fn3:
* FoldBranchToCommonDest speculates %bb1 in %entry
* FoldTwoEntryPHINode turn phi into a select in %entry
* %bb2 now an empty block which serves no purpose, and is removed
In @fn4:
* SpeculativelyExecuteBB speculates %bb2 in %bb1
* SimplifyCondBranchToTwoReturns makes %bb1 select then return instead of conditionally
branching to two returns
In @fn5:
* SpeculativelyExecuteBB speculates %bb2 in %bb1, inserting a select
* FoldTwoEntryPHINode turns the phi in %end into a select in %entry
What all this means is what we end up with is:
* @fn3: cmp; and; select
* @fn4: cmp; conditional branch to select
* @fn5: cmp; select; select
of which @fn3 is the best.
Inconsistent thresholds
-----------------------
define float @fn6(i32 %a, i32 %b, float %c) {
entry:
%cmp1 = icmp ne i32 %a, %b
br i1 %cmp1, label %bb1, label %end
bb1:
%mul = fmul float %c, 3.0
%add = fadd float %mul, 3.0
br label %end
end:
%phi = phi float [ %add, %bb1 ], [ 0.0, %entry ]
ret float %phi
}
define float @fn7(i32 %a, i32 %b, float %c) {
entry:
%cmp1 = icmp ne i32 %a, %b
br i1 %cmp1, label %bb1, label %end
bb1:
%div = fdiv float %c, 3.0
%add = fadd float %div, 3.0
br label %end
end:
%phi = phi float [ %add, %bb1 ], [ 0.0, %entry ]
ret float %phi
}
define void @fn8(i32 %a, i32 %b, float %c, i32* %p1, i32* %p2) {
entry:
%cmp1 = icmp ne i32 %a, %b
br i1 %cmp1, label %bb1, label %bb2
bb1:
%div = fdiv float %c, 3.0
%cmp2 = fcmp une float %div, 3.0
br i1 %cmp2, label %bb2, label %end
bb2:
store i32 0, i32* %p1
br label %end
end:
store i32 0, i32* %p2
ret void
}
FoldTwoEntryPHINode will fold the phi (and so speculated bb1) in @fn6 but not @fn7 as
fmul has cost TCC_Basic, so fmul + fadd is below the threshold, but fdiv has cost
TCC_Expensive, so fdiv + fadd is above the threshold.
In @fn8 however FoldBranchToCommonDest just checks the number of instructions so has no
problem at all speculating bb1.
Questions
=========
The questions arising from all of this:
* Firstly, is there some alternate way of dealing with @perlin_example above (e.g doing
something different in jump threading) that will magically solve my problems without
having to deal with any of this?
* The inconsistent FPCmpInst handling is really odd. Would anyone object if I removed
that check from SpeculativelyExecuteBB?
* What exactly is the right threshold for doing all of this stuff? FoldTwoEntryPHINode
takes into accout the actual instruction cost, so that would seem like the most
accurate.
* It'd be really nice to cut down the number of functions doing this. I've got an attempt
at making FoldBranchToCommonDest do everything that SimplifyCondBranchToCondBranch
does so we can get rid of SimplifyCondBranchToCondBranch, as those two functions are
the ones with the most overlap, but there's some major problems:
* Doing this changes the order that blocks are speculated, giving worse results in
some cases (but better in others).
* SimplifyCondBranchToCondBranc must turn phi into select, FoldBranchToCommonDest cannot,
and FoldBranchToCommonDest followed by the rest of the functions that can turn phi to
select sometimes can't either.
So: any ideas on how cutting down the number of functions could be done?
* Alternatively instead of cutting down on the number of functions we could try to common
out things into shared functions.
* Any other ideas for how to make this less of a horrible mess?
* Would people mind if I just ignore all of this and just make this into even more of a
mess while trying to fix how we handle @perlin_example? (The correct answer is "No, don't
do that" but it would certainly make my job easier.) Or alternatively does anyone else
want to try sorting out this mess?
John
More information about the llvm-dev
mailing list