[llvm-dev] Store unswitch
Hongbin Zheng via llvm-dev
llvm-dev at lists.llvm.org
Fri Apr 28 19:48:42 PDT 2017
Hi James,
Thanks a lot for the explanations, they make sense.
On Fri, Apr 28, 2017 at 7:42 AM, James Molloy <james at jamesmolloy.co.uk>
wrote:
> Hi Danny,
>
> Thanks for that :) However I've just updated the prototype patch to NewGVN
> and it didn't need any API changes - all I rely on is GVNExpression.
>
> Hongbin,
>
> I wanted to explain a little about what GVNSink can currently do, what it
> was designed for and hopefully how to make it handle your testcase.
>
> *Background*
>
> Common code sinking is more difficult to do efficiently than one might
> expect. Prior to October 2016, LLVM could handle simple cases of this
> within SimplifyCFG, for example:
>
> if (a) {
> b += 42;
> } else {
> b += 42;
> }
> // -> b += 42
>
> It could even sink operations where a phi was required:
>
> if (a) {
> b += 42;
> } else {
> b += 64;
> }
> // -> b += phi(42, 64)
>
> However it had a simplistic execution model and so:
> 1) Restricted itself to blocks with only TWO incoming arcs
> 2) Would only ever produce one PHI and then bail out
>
> Generally, the factors that make the search space for a good generic
> sinking optimization large are:
>
> 1) The number of incoming edges to a block.
> 2) Non-monotonicity of the heuristic - for example, it might not be
> worthwhile to sink 2 instructions but may be worthwhile to sink 3
> instructions.
> 3) Permutations of otherwise similar instructions in incoming blocks.
>
> I'll describe them all in a bit more detail:
>
> *1) The number of incoming edges to a block*
>
> There are many constructs that can create a block with more than two input
> edges. An obvious example is a switch:
>
> switch (a) {
> case 1: b += 32; break;
> case 2: b += 43; break;
> default: break;
> }
>
> Another is a chain of if-elses:
>
> if (a) {
> b += 32;
> } else if (c) {
> b += 43;
> } else {
> b += 65;
> }
>
> Note that those two examples are subtly different. In the latter example,
> every branch performs an addition to 'b'. We can therefore sink all the
> adds to the end and perform only one add:
>
> b += phi(32, 43, 65);
>
> However in the former case things aren't so simple. The default case never
> does any add, so we have to speculate the add instruction in that case. For
> adds, this is simple - adds don't have side effects and we can speculate by
> adding with an identity (0):
>
> b += phi(32, 43, 0);
>
> However what if 'b' was a global variable? In this case, each case of the
> switch will have a load/add/store sequence:
>
> switch(a) ---------------
> / \ \
> | | |
> [t1 = load @b] [t2 = load @b] |
> [t1 = add t1, 32] [t2 = add t2, 43] |
> [store @b, t1] [store @b, t2] |
> \ / /
> v v v-------------
> [ ]
>
> In most cases, we cannot be sure that in the default case @b is allowed to
> be written to. Speculating a store in this case could be invalid.
> Therefore, if we want to sink the store (and thus the add and load) we need
> to invent a new block to sink them to.
>
> switch(a) ---------------
> / \ \
> | | |
> [t1 = load @b] [t2 = load @b] |
> [t1 = add t1, 32] [t2 = add t2, 43] |
> \ / |
> [ p = phi(t1, t2) ] /
> [ store @b, p ] /
> | +------------
> v v
> [ ]
>
> Creating this new block allows the switch to be turned into a trivial one
> that can be lowered to a bit test or jump table.
>
> There is an even more complex case that appears quite a bit:
>
> if (a) {
> switch (b) {
> case 1: c += 42; break;
> case 2: c += 2; break;
> default: c += 3; break;
> }
> } else {
> switch (d) {
> case 1: e -= 2; break;
> case 2: e -= 6; break;
> default: e -= 32; break;
> }
> }
>
> We canonicalize this CFG to one cleanup block with six incoming arcs. In
> this case, it takes some nontrivial analysis to determine what to do, as it
> requires partitioning the incoming arcs into sets, performing different
> sinking operations on each (in this case two disjoint sets, but there are
> cases where the sets are not disjoint and some cost heuristic needs to be
> used).
>
> *2) Non-monotonicity of heuristic*
>
> What I'm trying to say here is that greedy algorithms work best when the
> heuristic they are computing is monotonic. Consider this:
>
> if (a) {
> tmp1 = *b;
> tmp2 = *c;
> tmp3 = tmp1 + tmp2;
> *d = tmp3;
> } else {
> tmp4 = *b;
> tmp5 = *c;
> tmp6 = tmp4 + tmp5;
> *e = tmp6;
> }
>
> If we proceed sequentially from the bottom of these blocks, trying to
> determine if sinking them is worthwhile:
>
> Step 1: [*e = tmp6], [*d = tmp3]
> - This is the same operation, a store, but it requires two PHIs - one
> for (e,d) and one for (tmp6, tmp3).
>    - This is NOT worthwhile. We are constraining register allocation for
> only one instruction commoned.
>
> Step 2: [tmp6 = tmp4 + tmp5], [tmp3 = tmp1 + tmp2]
> - Again, the operation is the same, but now we need another two PHIs,
> plus one PHI from Step 1 (the phi required for (tmp6, tmp3) is now no
> longer required).
> - Three PHIs required for two instructions sunk. NOT worth it.
>
> Step 3: [tmp5 = *c], [tmp2 = *c]
> - These can be sunk, and have the same arguments so no PHIs required. We
> still need two of the PHIs from step 2 (the (tmp5, tmp2) phi now is no
> longer required)
> - At this point we've sunk three instructions for two PHIs. Maybe it's
> worth it?
>
> Step 4: [tmp4 = *b], [tmp1 = *b]
> - Same as step 3, but now the (tmp4, tmp1) PHI from step 2 goes away too
> - So now we're sinking 4 instructions for one PHI.
> - Now it's definitely worth it!
>
> Whatever heuristic one chooses, in a case like this it will get worse,
> then maybe get better again. So we can't simply bail out when the heuristic
> reaches some threshold, because maybe it will improve as we analyze more
> instructions!
>
> *3) Permutations of otherwise similar instructions*
>
> This is the case you describe. Stores (for example) that look similar but
> have been permuted:
>
> if (a) {
> *b = 52;
> *c = 43;
> } else {
> *c = 87;
> *b = 54;
> }
>
> If we approach this bottom-up, we will compare [*b = 54], [*c = 43], then
> [*c = 87], [*b = 52] and conclude that two instructions can be sunk, but 4
> PHIs are required (because the target of the store and value are different
> in each case). If we looked at [*b = 54], [*b = 52], perhaps we would have
> seen that we can sink this with only *two* PHIs required and that might
> change our decision.
>
> This is generally quite difficult, and it especially shows up with memory
> instructions like this (if b and c alias, permuting them when sinking
> would be illegal, so we need alias analysis here).
>
> *What GVNSink attempts to solve*
>
> The current implementation of GVNSink was meant to solve 1) and 2), but
> NOT 3). 3) wasn't within scope when I was writing GVNSink, but I think 3)
> is doable for some cases within the framework of GVNSink.
>
> *How GVNSink works*
>
> GVNSink works by computing a value number for each IR instruction, similar
> to GVN. However, GVNSink's value numbering works slightly differently from
> GVN's.
>
> To recap, the value number (VN) of an instruction is a function of the
> instruction. If VN[I] == VN[J], I and J can be considered equivalent, for
> some definition of equivalence.
>
> In GVN, VN[I] is based upon I and all of I's operands, transitively.
> Therefore if VN[I] == VN[J], I and J were computed in the same way and J
> can be used instead of I.
>
> In GVNSink, we flip this on its head. VN[I] is based upon I's *uses*,
> transitively. Therefore if VN[I] == VN[J], all of the uses of I are the
> same as all of the uses of J. Therefore I and J can be merged. So this:
>
> [ %a = add %I, 1 ] [ %c = add %J, 1 ]
> [ %b = sub %a, 5 ] [ %d = sub %c, 5 ]
> [ ret %b ] [ ret %d ]
>
> Can be transformed into this:
>
> [] []
> \ /
> [ %IJ = phi %I, %J ]
> [ %a = add %IJ, 1 ]
> [ %b = sub %a, 5 ]
> [ ret %b ]
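As a rough illustration of the idea (a deliberately simplified Python sketch, not GVNSink's real hashing): number each instruction from its opcode, its constant operands, and the VNs of its users, walking backwards, so that the two chains above get identical VNs even though their non-constant operands (%I vs %J) differ:

```python
# Each block is a list of (name, opcode, operands) in program order;
# operands are instruction names (str) or integer constants.
block1 = [("a", "add", ["I", 1]),
          ("b", "sub", ["a", 5]),
          ("r", "ret", ["b"])]
block2 = [("c", "add", ["J", 1]),
          ("d", "sub", ["c", 5]),
          ("s", "ret", ["d"])]

def use_based_vns(block):
    """VN[I] is built from I's opcode, I's constant operands, and the
    VNs of I's users (plus the operand position I occupies in each).
    Non-constant operands are deliberately ignored: that is what lets
    %I and %J differ while the chains still match."""
    vn = {}
    for name, opcode, operands in reversed(block):
        uses = tuple(sorted(
            (vn[uname], pos)
            for uname, _, uops in block
            if uname in vn
            for pos, op in enumerate(uops) if op == name))
        consts = tuple(op for op in operands if isinstance(op, int))
        vn[name] = (opcode, consts, uses)
    return vn

vn1, vn2 = use_based_vns(block1), use_based_vns(block2)
# Corresponding instructions get equal VNs, so they can be merged.
assert vn1["a"] == vn2["c"] and vn1["b"] == vn2["d"]
```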
>
> In GVNSink, we step backwards across all instructions in all incoming
> basic blocks (see LockstepReverseIterator). This produces, conceptually, a
> sequence of sets of instructions.
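Conceptually, the iterator looks something like this (a minimal Python sketch; the real LockstepReverseIterator in LLVM also has to deal with debug intrinsics, instruction validity and so on):

```python
def lockstep_reverse(blocks):
    """Walk all blocks backwards in lockstep, yielding one set of
    instructions per step; stop when any block runs out."""
    iterators = [reversed(b) for b in blocks]
    while True:
        try:
            yield [next(it) for it in iterators]
        except StopIteration:
            return

left  = ["t1 = load @b", "t1 = add t1, 32", "store @b, t1"]
right = ["t2 = load @b", "t2 = add t2, 43", "store @b, t2"]
sets = list(lockstep_reverse([left, right]))
print(sets[0])   # ['store @b, t1', 'store @b, t2'] - stores come first
print(len(sets)) # 3
```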
>
> Each of these sets is analyzed in turn (see analyzeInstructionForSinking)
> - the value number of each contained instruction is calculated and used to
> determine if the entire set is mergeable or not (all values have the same
> VN).
>
> If not all instructions in the set have the same VN, the set is pruned to
> just those with the same VN (or we stop if all instructions had different
> VNs). If this set passes extra legality checks, a
> SinkingInstructionCandidate is created. This contains pertinent data for
> the cost heuristic:
>
> 1) Number of instructions from each incoming block being sunk
> 2) Number of blocks to pick instructions from (this can be a subset of
> all the incoming blocks!)
> 3) How many PHIs would be generated in total
>
> After all candidates have been examined, they are sorted based on a
> heuristic and the best candidate picked and applied.
>
> The use of value numbering in this algorithm is what allows us to do this
> in somewhat-linear time with respect to the number of instructions
> inspected.
>
> Now, to solve 3), I think the component that needs swapping out is
> LockstepReverseIterator. This currently iterates backwards in lockstep
> across a set of blocks. To handle permutations, I think we could use a
> priority queue that feeds the main algorithm instructions from the input
> blocks in a slightly sorted order.
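For the permuted-stores case from 3), such a priority queue might key the stores on their target operand. Everything below is a hypothetical sketch, and it assumes alias analysis has already proven the stores within each block independent:

```python
import heapq

def by_target(stores):
    """Feed a block's trailing stores out of a priority queue keyed on
    the target operand. Only legal if alias analysis has shown the
    stores within the block are independent of each other."""
    heap = [(target, value) for value, target in stores]
    heapq.heapify(heap)
    return [(value, target) for target, value in
            (heapq.heappop(heap) for _ in range(len(stores)))]

# The example from 3), as (value, target) pairs.
then_blk = [("52", "b"), ("43", "c")]   # *b = 52; *c = 43
else_blk = [("87", "c"), ("54", "b")]   # *c = 87; *b = 54
a, b = by_target(then_blk), by_target(else_blk)
# Position i in both blocks now stores to the same location, so only
# the stored values need PHIs: two PHIs instead of four.
assert [t for _, t in a] == [t for _, t in b] == ["b", "c"]
```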
>
To solve 3, I think what we need is not only the CSE from GVN, but also
some kind of instruction reordering, i.e. "scheduling". The idea of
introducing a "priority queue" and processing the instructions in a
different order than they appear in the BB is also a kind of "implicit
scheduling".
Maybe what we can do is introduce a "canonical order": by considering data
dependencies (the SSA def-use chain), memory dependencies (MemorySSA), and
other dependencies, order the instructions such that:
1. Load instructions are moved toward the beginning of a BB, so that
GVNHoist and/or SimplifyCFG have a better chance to hoist and merge the
loads.
2. Store instructions are moved toward the end of a BB, so that GVNSink
and InstCombine (which merges stores today) have a better chance to sink
and merge the stores.
3. Instructions of the same type, e.g. loads or stores, with no
dependencies between them are sorted/scheduled based on the value number
of the pointer operand.
In short, the "canonical ordering" will sort the instructions in the BBs
like this:
BB0:
br BB1 or BB2
BB1:
x = ld ptr1
st x ptr2
y = ld ptr3
st y ptr4
br BB3
BB2:
x1 = ld ptr3
st x1 ptr4
y1 = ld ptr1
st y1 ptr2
br BB3
BB3:
to the following:
BB0:
br BB1 or BB2
BB1:
x = ld ptr1
y = ld ptr3
st x ptr2
st y ptr4
br BB3
BB2:
y1 = ld ptr1
x1 = ld ptr3
st y1 ptr2
st x1 ptr4
br BB3
BB3:
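The reordering above can be sketched in a few lines of Python (a toy model: a real pass would have to consult MemorySSA/alias analysis before moving anything, and must never reorder dependent instructions):

```python
def canonical_order(block, ptr_vn):
    """Toy canonical order: loads first (so hoisting passes see them
    early), stores last (so sinking passes see them late), ties broken
    by the value number of the pointer operand. A real pass would use
    MemorySSA and alias analysis to prove each move legal."""
    def key(inst):
        kind, ptr, _ = inst
        return (0 if kind == "ld" else 1, ptr_vn[ptr])
    return sorted(block, key=key)

# Hypothetical value numbers for the pointers in the example above.
vn = {"ptr1": 1, "ptr2": 2, "ptr3": 3, "ptr4": 4}
bb1 = [("ld", "ptr1", "x"), ("st", "ptr2", "x"),
       ("ld", "ptr3", "y"), ("st", "ptr4", "y")]
bb2 = [("ld", "ptr3", "x1"), ("st", "ptr4", "x1"),
       ("ld", "ptr1", "y1"), ("st", "ptr2", "y1")]
c1, c2 = canonical_order(bb1, vn), canonical_order(bb2, vn)
# Both blocks now touch memory in the same order, so a lockstep
# iterator can pair them up position by position.
assert [(k, p) for k, p, _ in c1] == [(k, p) for k, p, _ in c2]
```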
Now we can use the lockstep iterator to merge these instructions:
BB0:
x = ld ptr1
y = ld ptr3
br BB1 or BB2
BB1:
br BB3
BB2:
br BB3
BB3:
st x ptr2
st y ptr4
We can make this "canonical ordering" independent from the other passes, so
that users can choose whether to pay the extra compile time to run the
"canonical ordering" pass and get more merging from GVN.
Of course, this "canonical ordering" may not always work, e.g.:
BB0:
br BB1 or BB2
BB1:
x = ld ptr1
st x ptr2
y = ld ptr3
st y ptr4
br BB3
BB2:
x1 = ld ptr0
st x1 ptr5
y1 = ld ptr1
st y1 ptr2
br BB3
BB3:
May be reordered to the following:
BB0:
br BB1 or BB2
BB1:
x = ld ptr1
y = ld ptr3
st x ptr2
st y ptr4
br BB3
BB2:
x1 = ld ptr0
y1 = ld ptr1
st y1 ptr2
st x1 ptr5
br BB3
BB3:
And now the lockstep iterator doesn't work.
So there may be no universal "canonical order", and we need to consider the
context when we reorder instructions or assign priorities in the priority
queue to solve problem 3.
Thanks
Hongbin
>
> For example, perhaps it might order sequences of stores by their target
> operand (assuming alias analysis says it can), ensuring sequences of stores
> in different blocks appear in the same order?
>
> Anyway, I hope this has helped understand a bit more about what GVNSink
> does, can do and how to modify it if you want to solve your problem.
>
> Cheers,
>
> James
>
>
>
> On Wed, 26 Apr 2017 at 18:16 Daniel Berlin <dberlin at dberlin.org> wrote:
>
>> I have a patch i'm working on to split newgvn into a proper analysis.
>> It works (and happy to share it), it's just the interface i'm using is a
>> little ugly (haven't decided what the right API is), but let me know what
>> GVNSink needs from life and i'll make sure it's there :)
>>
>>
>> On Wed, Apr 26, 2017 at 10:09 AM, James Molloy via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> It's basically ready to commit; the reviewers were fairly happy with it.
>>> It needs rebasing on top of NewGVN and any bugs that shakes out fixed, but
>>> that's about it.
>>>
>>> I want to get around to it soon-ish, but I've wanted that for a while!
>>>
>>> On Wed, 26 Apr 2017 at 16:50, Hongbin Zheng <etherzhhb at gmail.com> wrote:
>>>
>>>> Hi James,
>>>>
>>>> I have an ad-hoc solution in mind to solve this problem.
>>>> But if it can be solved under the framework of GVN"Sink", it is even
>>>> better.
>>>>
>>>> any plan on your https://reviews.llvm.org/D24805?
>>>>
>>>> Thanks
>>>> Hongbin
>>>>
>>>>
>>>> On Wed, Apr 26, 2017 at 2:13 AM, James Molloy <james at jamesmolloy.co.uk>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Yes, I can see why that would not work.
>>>>>
>>>>> The sinking algorithm in SimplifyCFG isn't particularly clever. In
>>>>> particular it can't reason about memory ordering and aliasing. In
>>>>> unswitch1(), it can identify that the stores correlate because the
>>>>> correlating stores appear in the same relative source order. In unswitch2()
>>>>> they have been permuted, and the algorithm cannot deal with this. This
>>>>> requires some kind of value numbering to do efficiently.
>>>>>
>>>>> The GVNSink pass that I really should get around to updating should
>>>>> solve this, eventually!
>>>>>
>>>>> James
>>>>>
>>>>>
>>>>> On Wed, 26 Apr 2017 at 08:19 Hongbin Zheng via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> Looks like this does not work:
>>>>>>
>>>>>> // Type your code here, or load an example.
>>>>>> int a[10];
>>>>>>
>>>>>> void unswitch2(int i, int x, int y0, int y1) {
>>>>>> if (x) {
>>>>>> a[i] = y0;
>>>>>> a[i + 1] = y1;
>>>>>> } else {
>>>>>> a[i + 1] = y0;
>>>>>> a[i] = y1;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> https://godbolt.org/g/Ldd5qV
>>>>>>
>>>>>> On Tue, Apr 25, 2017 at 10:22 PM, Hongbin Zheng <etherzhhb at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Looks like InstCombine does the job.
>>>>>>>
>>>>>>> On Tue, Apr 25, 2017 at 9:36 PM, Davide Italiano <davide at freebsd.org
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> On Tue, Apr 25, 2017 at 9:24 PM, Hongbin Zheng via llvm-dev
>>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > Is there a pass in LLVM that can optimize:
>>>>>>>> >
>>>>>>>> > if (x)
>>>>>>>> > a[i] = y0;
>>>>>>>> > else
>>>>>>>> > a[i] = y1;
>>>>>>>> >
>>>>>>>> > to
>>>>>>>> >
>>>>>>>> > a[i] = x ? y0 : y1?
>>>>>>>> >
>>>>>>>> > I tried opt (3.9) with -O3, but it looks like such an optimization
>>>>>>>> > does not happen.
>>>>>>>> >
>>>>>>>>
>>>>>>>> The same IR is generated at -O3 for both cases in this example.
>>>>>>>> https://godbolt.org/g/Tk2MM8
>>>>>>>>
>>>>>>>> --
>>>>>>>> Davide
>>>>>>>>
>>>>>>>> "There are no solved problems; there are only problems that are more
>>>>>>>> or less solved" -- Henri Poincare
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>
>>>>>
>>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>