[llvm-dev] Store unswitch

Daniel Berlin via llvm-dev llvm-dev at lists.llvm.org
Fri Apr 28 19:50:09 PDT 2017


This is normally known as global code motion, and there are numerous
algorithms that do it.

On Fri, Apr 28, 2017, 7:48 PM Hongbin Zheng <etherzhhb at gmail.com> wrote:

> Hi James,
>
> Thanks a lot for the explanations, they make sense.
>
>
> On Fri, Apr 28, 2017 at 7:42 AM, James Molloy <james at jamesmolloy.co.uk>
> wrote:
>
>> Hi Danny,
>>
>> Thanks for that :) However I've just updated the prototype patch to
>> NewGVN and it didn't need any API changes - all I rely on is GVNExpression.
>>
>> Hongbin,
>>
>> I wanted to explain a little about what GVNSink can currently do, what it
>> was designed for and hopefully how to make it handle your testcase.
>>
>> *Background*
>>
>> Common code sinking is more difficult to do efficiently than one might
>> expect. Prior to October 2016, LLVM was able to handle simple cases of this
>> within SimplifyCFG, for example:
>>
>>     if (a) {
>>       b += 42;
>>     } else {
>>       b += 42;
>>     }
>>     // -> b += 42
>>
>> It could even sink operations where a phi was required:
>>
>>     if (a) {
>>       b += 42;
>>     } else {
>>       b += 64;
>>     }
>>     // -> b += phi(42, 64)
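The tail-sinking rule above can be sketched as follows (a toy Python model, not SimplifyCFG's actual code): sink the common opcode and introduce a phi for each operand that differs between the two arms.

```python
# Toy sketch of SimplifyCFG-style tail sinking: if the last instruction
# in both arms has the same opcode, sink one copy and PHI any operands
# that differ. Instructions are modeled as (opcode, operand...) tuples.

def sink_tail(then_inst, else_inst):
    op1, *args1 = then_inst
    op2, *args2 = else_inst
    if op1 != op2 or len(args1) != len(args2):
        return None  # different operations: not sinkable this way
    merged = [a if a == b else ("phi", a, b)
              for a, b in zip(args1, args2)]
    return (op1, *merged)

# Identical arms sink with no phi; differing constants need one phi:
assert sink_tail(("add", "b", 42), ("add", "b", 42)) == ("add", "b", 42)
assert sink_tail(("add", "b", 42), ("add", "b", 64)) == \
       ("add", "b", ("phi", 42, 64))
```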
>>
>> However it had a simplistic execution model and so:
>>   1) Restricted itself to blocks with only TWO incoming arcs
>>   2) Would only ever produce one PHI and then bail out
>>
>> Generally, the factors that make the search space for a good generic
>> sinking optimization large are:
>>
>>   1) The number of incoming edges to a block.
>>   2) Non-monotonicity of heuristic - for example it might not be
>> worthwhile sinking 2 instructions but may be worthwhile sinking 3
>> instructions.
>>   3) Permutations of otherwise similar instructions in incoming blocks.
>>
>> I'll describe them all in a bit more detail:
>>
>> *1) The number of incoming edges to a block*
>>
>> There are many constructs that can create a block with more than two
>> input edges. An obvious example is a switch:
>>
>>     switch (a) {
>>       case 1: b += 32; break;
>>       case 2: b += 43; break;
>>       default: break;
>>     }
>>
>> Another is a chain of if-elses:
>>
>>     if (a) {
>>       b += 32;
>>     } else if (c) {
>>       b += 43;
>>     } else {
>>       b += 65;
>>     }
>>
>> Note that those two examples are subtly different. In the latter example,
>> every branch performs an addition to 'b'. We can therefore sink all the
>> adds to the end and perform only one add:
>>
>>     b += phi(32, 43, 65);
>>
>> However in the former case things aren't so simple. The default case
>> never does any add, so we have to speculate the add instruction in that
>> case. For adds, this is simple - adds don't have side effects and we can
>> speculate by adding with an identity (0):
>>
>>     b += phi(32, 43, 0);
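A minimal sketch of the speculate-with-identity trick (toy Python, hypothetical names): arms of the switch that perform no add contribute the operation's identity element, so the add can run unconditionally.

```python
# Sketch: merge per-arm "b += k" updates into one add fed by a phi.
# Arms that do nothing contribute the identity element (0 for add),
# which is what makes speculating the add legal.

ADD_IDENTITY = 0

def merge_adds(arm_increments, num_arms):
    """arm_increments: {arm_index: constant added to b}.
    Returns the phi operand list, one entry per incoming arm."""
    return [arm_increments.get(i, ADD_IDENTITY) for i in range(num_arms)]

# switch with cases 1 and 2, plus an empty default (arm 2):
phi_operands = merge_adds({0: 32, 1: 43}, num_arms=3)
# -> b += phi(32, 43, 0)
```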
>>
>> However what if 'b' was a global variable? In this case, each case of the
>> switch will have a load/add/store sequence:
>>
>>                switch(a) ---------------
>>               /        \                \
>>               |         |                |
>>     [t1 = load @b]    [t2 = load @b]     |
>>     [t1 = add t1, 32] [t2 = add t2, 43]  |
>>     [store @b, t1]    [store @b, t2]     |
>>                \        /               /
>>                 v       v v-------------
>>              [             ]
>>
>> In most cases, we cannot be sure that in the default case @b is allowed
>> to be written to. Speculating a store in this case could be invalid.
>> Therefore, if we want to sink the store (and thus the add and load) we need
>> to invent a new block to sink them to.
>>
>>                switch(a) ---------------
>>               /        \                \
>>               |         |                |
>>     [t1 = load @b]    [t2 = load @b]     |
>>     [t1 = add t1, 32] [t2 = add t2, 43]  |
>>                \        /                |
>>              [ p = phi(t1, t2) ]        /
>>              [ store @b, p ]           /
>>                    |      +------------
>>                    v      v
>>              [             ]
>>
>> Creating this new block allows the switch to be turned into a trivial one
>> that can be lowered to a bit test or jump table.
>>
>> There is an even more complex case that appears quite a bit:
>>
>>     if (a) {
>>       switch (b) {
>>       case 1: c += 42; break;
>>       case 2: c += 2; break;
>>       default: c += 3; break;
>>       }
>>     } else {
>>       switch (d) {
>>       case 1: e -= 2; break;
>>       case 2: e -= 6; break;
>>       default: e -= 32; break;
>>       }
>>     }
>>
>> We canonicalize this CFG to one cleanup block with six incoming arcs. In
>> this case, it takes some nontrivial analysis to determine what to do, as it
>> requires partitioning the incoming arcs into sets, performing different
>> sinking operations on each (in this case two disjoint sets, but there are
>> cases where the sets are not disjoint and some cost heuristic needs to be
>> used).
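The arc-partitioning step might be sketched like this (a simplification: as noted above, a real pass must also handle overlapping sets and apply a cost heuristic; here the sets are disjoint by construction):

```python
# Sketch: partition the incoming arcs of a cleanup block into groups
# that can be sunk together, by grouping arcs whose instruction
# sequences update the same set of values.

from collections import defaultdict

def partition_arcs(arc_ops):
    """arc_ops: arc name -> frozenset of values the arc updates."""
    groups = defaultdict(list)
    for arc, touched in arc_ops.items():
        groups[touched].append(arc)
    return list(groups.values())

# Six incoming arcs: three update 'c' (first switch), three update 'e'.
arcs = {"a1": frozenset({"c"}), "a2": frozenset({"c"}), "a3": frozenset({"c"}),
        "b1": frozenset({"e"}), "b2": frozenset({"e"}), "b3": frozenset({"e"})}
parts = partition_arcs(arcs)
# Two disjoint sets, each getting its own sinking operation.
```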
>>
>> *2) Non-monotonicity of heuristic*
>>
>> What I'm trying to say here is that greedy algorithms work best when the
>> heuristic they are computing is monotonic. Consider this:
>>
>>     if (a) {
>>       tmp1 = *b;
>>       tmp2 = *c;
>>       tmp3 = tmp1 + tmp2;
>>       *d = tmp3;
>>     } else {
>>       tmp4 = *b;
>>       tmp5 = *c;
>>       tmp6 = tmp4 + tmp5;
>>       *e = tmp6;
>>     }
>>
>> If we proceed sequentially from the bottom of these blocks, trying to
>> determine if sinking them is worthwhile:
>>
>> Step 1: [*e = tmp6], [*d = tmp3]
>>   - This is the same operation, a store, but it requires two PHIs - one
>> for (e,d) and one for (tmp6, tmp3).
>>   - This is NOT worthwhile. We are constraining register allocation for
>> only one instruction commoned.
>>
>> Step 2: [tmp6 = tmp4 + tmp5], [tmp3 = tmp1 + tmp2]
>>   - Again, the operation is the same, but now we need another two PHIs,
>> plus one PHI from Step 1 (the phi required for (tmp6, tmp3) is now no
>> longer required).
>>   - Three PHIs required for two instructions sunk. NOT worth it.
>>
>> Step 3: [tmp5 = *c], [tmp2 = *c]
>>   - These can be sunk, and have the same arguments so no PHIs required.
>> We still need two of the PHIs from step 2 (the (tmp5, tmp2) phi now is no
>> longer required)
>>   - At this point we've sunk three instructions for two PHIs. Maybe it's
>> worth it?
>>
>> Step 4: [tmp4 = *b], [tmp1 = *b]
>>   - Same as step 3, but now the (tmp4, tmp1) PHI from step 2 goes away too
>>   - So now we're sinking 4 instructions for one PHI.
>>   - Now it's definitely worth it!
>>
>> Whatever heuristic one chooses, in a case like this it will get worse,
>> then maybe get better again. So we can't simply bail out when the heuristic
>> reaches some threshold, because maybe it will improve as we analyze more
>> instructions!
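The four steps can be replayed with a toy cost model (the weights below are illustrative, not GVNSink's actual formula) to show why a greedy cutoff fails:

```python
# Toy replay of the four steps above: at each depth we track how many
# instructions would be sunk and how many PHIs the sink would need.
# The score dips before it recovers - a greedy algorithm that bails at
# step 2 would miss the profitable 4-instruction sink.

# (instructions sunk so far, PHIs required at this depth), per the text:
steps = [(1, 2), (2, 3), (3, 2), (4, 1)]

def score(insts, phis):
    # Hypothetical weighting: each sunk instruction helps, each PHI hurts.
    return insts - phis

scores = [score(i, p) for i, p in steps]
# scores == [-1, -1, 1, 3]: non-monotonic, so we must examine every
# depth and pick the best candidate afterwards.
best_depth = max(range(len(steps)), key=lambda d: scores[d])
```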
>>
>> *3) Permutations of otherwise similar instructions*
>>
>> This is the case you describe. Stores (for example) that look similar but
>> have been permuted:
>>
>>     if (a) {
>>       *b = 52;
>>       *c = 43;
>>     } else {
>>       *c = 87;
>>       *b = 54;
>>     }
>>
>> If we approach this bottom-up, we will compare [*b = 54], [*c = 43], then
>> [*c = 87], [*b = 52] and conclude that two instructions can be sunk, but 4
>> PHIs are required (because the target of the store and value are different
>> in each case). If we looked at [*b = 54], [*b = 52], perhaps we would have
>> seen that we can sink this with only *two* PHIs required and that might
>> change our decision.
>>
>> This is generally quite difficult and especially shows up with memory
>> instructions like this (if c and b alias, then permuting them when sinking
>> would be illegal. So we need alias analysis here).
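One way to see the pairing problem (toy Python sketch; it assumes alias analysis has already blessed the reordering): match stores across the two arms by pointer operand instead of by position. Pairing by pointer needs a PHI only for the stored value; pairing by position needs PHIs for both pointer and value.

```python
# Sketch: pair stores across two blocks by pointer operand rather than
# by position. Each store is modeled as (pointer, value).

then_stores = [("b", 52), ("c", 43)]
else_stores = [("c", 87), ("b", 54)]

def pair_by_pointer(a, b):
    by_ptr = {ptr: val for ptr, val in b}
    return [(ptr, v1, by_ptr[ptr]) for ptr, v1 in a if ptr in by_ptr]

pairs = pair_by_pointer(then_stores, else_stores)
# [('b', 52, 54), ('c', 43, 87)]
# -> *b = phi(52, 54); *c = phi(43, 87): two PHIs instead of four.
```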
>>
>> *What GVNSink attempts to solve*
>>
>> The current implementation of GVNSink was meant to solve 1) and 2), but
>> NOT 3). 3) wasn't within scope when I was writing GVNSink, but I think 3)
>> is doable for some cases within the framework of GVNSink.
>>
>> *How GVNSink works*
>>
>> GVNSink works by computing a value number for each IR instruction,
>> similar to GVN. However, GVNSink's value numbering works slightly
>> differently from GVN's.
>>
>> To recap, the value number (VN) of an instruction is a function of the
>> instruction. If VN[I] == VN[J], I and J can be considered equivalent, for
>> some definition of equivalence.
>>
>> In GVN, VN[I] is based upon I and all of I's operands, transitively.
>> Therefore if VN[I] == VN[J], I and J were computed in the same way and J
>> can be used instead of I.
>>
>> In GVNSink, we flip this on its head. VN[I] is based upon I's *uses*,
>> transitively. Therefore if VN[I] == VN[J], all of the uses of I are the
>> same as all of the uses of J. Therefore I and J can be merged. So this:
>>
>>    [ %a = add %I, 1 ]   [ %c = add %J, 1 ]
>>    [ %b = sub %a, 5 ]   [ %d = sub %c, 5 ]
>>    [ ret %b ]           [ ret %d ]
>>
>> Can be transformed into this:
>>
>>    []     []
>>       \ /
>>    [ %IJ = phi %I, %J ]
>>    [ %a = add %IJ, 1 ]
>>    [ %b = sub %a, 5 ]
>>    [ ret %b ]
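A toy sketch of this use-based numbering (a hypothetical representation; the real GVNExpression machinery is richer): hash each instruction from its opcode, its constant operands, and the numbers of its users, computed transitively starting from the uses.

```python
# Sketch of use-based value numbering: an instruction's number derives
# from its opcode, constant operands, and the value numbers of its
# *users* (not its operands, as in classic GVN).

insts = {  # name -> (opcode, constant operands)
    "a": ("add", (1,)), "b": ("sub", (5,)), "retL": ("ret", ()),
    "c": ("add", (1,)), "d": ("sub", (5,)), "retR": ("ret", ()),
}
users = {"a": ["b"], "b": ["retL"], "c": ["d"], "d": ["retR"]}

def vn(name, memo):
    if name in memo:
        return memo[name]
    opcode, consts = insts[name]
    use_vns = tuple(sorted(vn(u, memo) for u in users.get(name, [])))
    memo[name] = hash((opcode, consts, use_vns))
    return memo[name]

# %a and %c are used identically, transitively, so they can be merged:
assert vn("a", {}) == vn("c", {})
```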
>>
>> In GVNSink, we step backwards across all instructions in all incoming
>> basic blocks (see LockstepReverseIterator). This produces, conceptually, a
>> sequence of sets of instructions.
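A minimal model of LockstepReverseIterator (toy Python; the real iterator also handles legality filtering and early termination):

```python
# Sketch of LockstepReverseIterator: walk a set of blocks backwards in
# lockstep, yielding one set of instructions per step - one instruction
# from each block, counted from the end. Stops when any block runs out.

def lockstep_reverse(blocks):
    for insts in zip(*(reversed(b) for b in blocks)):
        yield list(insts)

bb1 = ["t1 = load b", "t1 = add t1 32", "store b t1"]
bb2 = ["t2 = load b", "t2 = add t2 43", "store b t2"]
sets = list(lockstep_reverse([bb1, bb2]))
# First set pairs the two stores, then the two adds, then the two loads.
```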
>>
>> Each of these sets is analyzed in turn (see analyzeInstructionForSinking)
>> - the value number of each contained instruction is calculated and used to
>> determine if the entire set is mergeable or not (all values have the same
>> VN).
>>
>> If not all instructions in the set have the same VN, the set is pruned to
>> just those with the same VN (or we stop if all instructions had different
>> VNs). If this set passes extra legality checks, a
>> SinkingInstructionCandidate is created. This contains pertinent data for
>> the cost heuristic:
>>
>>   1) Number of instructions from each incoming block being sunk
>>   2) Number of blocks to pick instructions from (this can be a subset of
>> all the incoming blocks!)
>>   3) How many PHIs would be generated in total
>>
>> After all candidates have been examined, they are sorted based on a
>> heuristic and the best candidate picked and applied.
>>
>> The use of value numbering in this algorithm is what allows us to do this
>> in somewhat-linear time with respect to the number of instructions
>> inspected.
>>
>> Now, to solve 3), I think the component that needs swapping out is
>> LockstepReverseIterator. This currently iterates backwards in lockstep
>> across a set of blocks. To handle permutations, I think we could use a
>> priority queue that feeds the main algorithm instructions from the input
>> blocks in a slightly sorted order.
>>
>
> To solve 3, I think what we need is not only the CSE from GVN, but also
> some kind of instruction reordering, i.e. "scheduling". I think the idea of
> introducing a "priority queue" and processing the instructions in a
> different order than they appear in the BB is also a kind of "implicit
> scheduling".
>
> Maybe what we can do is introduce a "canonical order" by considering data
> dependencies (SSA def-use chains), memory dependencies (MemorySSA), and
> other dependencies, and order the instructions such that:
> 1. Load instructions are moved toward the beginning of a BB, so that
> GVNHoist and/or SimplifyCFG have a better chance to hoist and merge the
> loads.
> 2. Store instructions are moved toward the end of a BB, so that GVNSink
> and InstCombine (which merges stores today) have a better chance to sink
> and merge the stores.
> 3. For instructions of the same kind, i.e. loads/stores, without
> dependencies between them, we can sort/schedule them based on the GVN of
> the pointer operand.
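The three rules might be sketched as follows (toy Python; it assumes the memory operations within each block are independent, which a real pass must verify against def-use chains and MemorySSA before reordering):

```python
# Sketch of the "canonical order" idea: within a block, hoist loads to
# the front, push stores to the back, and order same-kind memory ops by
# a value number for their pointer operand - all assuming no dependency
# forbids the reordering.

def canonical_order(block, ptr_vn):
    """block: list of ('load'|'store', pointer, ssa-name) tuples,
    assumed independent. ptr_vn: pointer operand -> value number."""
    loads = sorted((i for i in block if i[0] == "load"),
                   key=lambda i: ptr_vn[i[1]])
    stores = sorted((i for i in block if i[0] == "store"),
                    key=lambda i: ptr_vn[i[1]])
    return loads + stores

bb1 = [("load", "ptr1", "x"), ("store", "ptr2", "x"),
       ("load", "ptr3", "y"), ("store", "ptr4", "y")]
bb2 = [("load", "ptr3", "x1"), ("store", "ptr4", "x1"),
       ("load", "ptr1", "y1"), ("store", "ptr2", "y1")]
ptr_vn = {"ptr1": 1, "ptr2": 2, "ptr3": 3, "ptr4": 4}
# After ordering, both blocks touch pointers in the same sequence, so a
# lockstep iterator can pair their instructions.
assert [i[1] for i in canonical_order(bb1, ptr_vn)] == \
       [i[1] for i in canonical_order(bb2, ptr_vn)]
```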
>
> In short, the "canonical ordering" will sort instructions in BBs like:
>
> BB0:
> br BB1 or BB2
>
> BB1:
> x = ld ptr1
> st x ptr2
> y = ld ptr3
> st y ptr4
> br BB3
>
> BB2:
> x1 = ld ptr3
> st x1 ptr4
> y1 = ld ptr1
> st y1 ptr2
> br BB3
>
> BB3:
>
> to the following:
> BB0:
> br BB1 or BB2
>
> BB1:
> x = ld ptr1
> y = ld ptr3
> st x ptr2
> st y ptr4
> br BB3
>
> BB2:
> y1 = ld ptr1
> x1 = ld ptr3
> st y1 ptr2
> st x1 ptr4
> br BB3
>
> BB3:
>
> Now we can use the lockstep iterator to merge these instructions:
> BB0:
> x = ld ptr1
> y = ld ptr3
> br BB1 or BB2
>
> BB1:
> br BB3
>
> BB2:
> br BB3
>
> BB3:
> st y ptr2
> st x ptr4
>
> And we can make this "canonical ordering" independent of the other
> passes, so that users can choose whether to pay the extra compile time to
> run the "canonical ordering" pass and get more merging from GVN.
>
> Of course, this "canonical ordering" may not always work, e.g.:
>
> BB0:
> br BB1 or BB2
>
> BB1:
> x = ld ptr1
> st x ptr2
> y = ld ptr3
> st y ptr4
> br BB3
>
> BB2:
> x1 = ld ptr0
> st x1 ptr5
> y1 = ld ptr1
> st y1 ptr2
> br BB3
>
> BB3:
>
> May be reordered to the following:
>
> BB0:
> br BB1 or BB2
>
> BB1:
> x = ld ptr1
> y = ld ptr3
> st x ptr2
> st y ptr4
> br BB3
>
> BB2:
> x1 = ld ptr0
> y1 = ld ptr1
> st y1 ptr2
> st x1 ptr5
> br BB3
>
> BB3:
>
> And now the lock step iterator doesn't work.
>
> So there may be no universal "canonical order", and we need to consider
> the context when we reorder/assign priorities in the priority queue to
> solve problem 3.
>
> Thanks
> Hongbin
>
>
>>
>> For example, perhaps it might order sequences of stores by their target
>> operand (assuming alias analysis says it can), ensuring sequences of stores
>> in different blocks appear in the same order?
>>
>> Anyway, I hope this has helped understand a bit more about what GVNSink
>> does, can do and how to modify it if you want to solve your problem.
>>
>> Cheers,
>>
>> James
>>
>>
>>
>> On Wed, 26 Apr 2017 at 18:16 Daniel Berlin <dberlin at dberlin.org> wrote:
>>
>>> I have a patch i'm working on to split newgvn into a proper analysis.
>>> It works (and happy to share it), it's just the interface i'm using is a
>>> little ugly (haven't decided what the right API is), but let me know what
>>> GVNSink needs from life and i'll make sure it's there :)
>>>
>>>
>>> On Wed, Apr 26, 2017 at 10:09 AM, James Molloy via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> It's basically ready to commit; the reviewers were fairly happy with
>>>> it. It needs rebasing on top of NewGVN and any bugs that shake out fixed,
>>>> but that's about it.
>>>>
>>>> I want to get around to it soon-ish, but I've wanted that for a while!
>>>>
>>>> On Wed, 26 Apr 2017 at 16:50, Hongbin Zheng <etherzhhb at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi James,
>>>>>
>>>>> I have an ad-hoc solution in mind to solve this problem.
>>>>> But if it can be solved under the framework of GVN"Sink", it is even
>>>>> better.
>>>>>
>>>>> any plan on your https://reviews.llvm.org/D24805?
>>>>>
>>>>> Thanks
>>>>> Hongbin
>>>>>
>>>>>
>>>>> On Wed, Apr 26, 2017 at 2:13 AM, James Molloy <james at jamesmolloy.co.uk
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Yes, I can see why that would not work.
>>>>>>
>>>>>> The sinking algorithm in SimplifyCFG isn't particularly clever. In
>>>>>> particular it can't reason about memory ordering and aliasing. In
>>>>>> unswitch1(), it can identify that the stores correlate because the
>>>>>> correlating stores appear in the same relative source order. In unswitch2()
>>>>>> they have been permuted, and the algorithm cannot deal with this. This
>>>>>> requires some kind of value numbering to do efficiently.
>>>>>>
>>>>>> The GVNSink pass that I really should get around to updating should
>>>>>> solve this, eventually!
>>>>>>
>>>>>> James
>>>>>>
>>>>>>
>>>>>> On Wed, 26 Apr 2017 at 08:19 Hongbin Zheng via llvm-dev <
>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>>> Looks like this does not work:
>>>>>>>
>>>>>>> // Type your code here, or load an example.
>>>>>>> int a[10];
>>>>>>>
>>>>>>> void unswitch2(int i, int x, int y0, int y1) {
>>>>>>>   if (x) {
>>>>>>>     a[i] = y0;
>>>>>>>     a[i + 1] = y1;
>>>>>>>   } else {
>>>>>>>     a[i + 1] = y0;
>>>>>>>     a[i] = y1;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> https://godbolt.org/g/Ldd5qV
>>>>>>>
>>>>>>> On Tue, Apr 25, 2017 at 10:22 PM, Hongbin Zheng <etherzhhb at gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Looks like InstCombine does the job.
>>>>>>>>
>>>>>>>> On Tue, Apr 25, 2017 at 9:36 PM, Davide Italiano <
>>>>>>>> davide at freebsd.org> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Apr 25, 2017 at 9:24 PM, Hongbin Zheng via llvm-dev
>>>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>> > Hi,
>>>>>>>>> >
>>>>>>>>> > Is there a pass in LLVM that can optimize:
>>>>>>>>> >
>>>>>>>>> > if (x)
>>>>>>>>> >   a[i] = y0;
>>>>>>>>> > else
>>>>>>>>> >   a[i] = y1;
>>>>>>>>> >
>>>>>>>>> > to
>>>>>>>>> >
>>>>>>>>> > a[i] = x ? y0 : y1?
>>>>>>>>> >
>>>>>>>>> > I tried opt (3.9) with -O3, but it looks like such an
>>>>>>>>> > optimization did not happen.
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>> The same IR is generated at -O3 for both cases in this example.
>>>>>>>>> https://godbolt.org/g/Tk2MM8
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Davide
>>>>>>>>>
>>>>>>>>> "There are no solved problems; there are only problems that are
>>>>>>>>> more
>>>>>>>>> or less solved" -- Henri Poincare
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> llvm-dev at lists.llvm.org
>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>
>>>>>>
>>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>

