[llvm-dev] Store unswitch

Hongbin Zheng via llvm-dev llvm-dev at lists.llvm.org
Fri Apr 28 19:56:11 PDT 2017


On Fri, Apr 28, 2017 at 7:50 PM, Daniel Berlin <dberlin at dberlin.org> wrote:

> This is normally known as global code motion, and there are numerous
> algorithms that do it

Yes.

Thanks
Hongbin


>
>
> On Fri, Apr 28, 2017, 7:48 PM Hongbin Zheng <etherzhhb at gmail.com> wrote:
>
>> Hi James,
>>
>> Thanks a lot for the explanations, they make sense.
>>
>>
>> On Fri, Apr 28, 2017 at 7:42 AM, James Molloy <james at jamesmolloy.co.uk>
>> wrote:
>>
>>> Hi Danny,
>>>
>>> Thanks for that :) However I've just updated the prototype patch to
>>> NewGVN and it didn't need any API changes - all I rely on is GVNExpression.
>>>
>>> Hongbin,
>>>
>>> I wanted to explain a little about what GVNSink can currently do, what
>>> it was designed for and hopefully how to make it handle your testcase.
>>>
>>> *Background*
>>>
>>> Common code sinking is more difficult to efficently do than one might
>>> expect. Prior to October 2016, LLVM was able to handle simple cases of this
>>> within SimplifyCFG, for example:
>>>
>>>     if (a) {
>>>       b += 42;
>>>     } else {
>>>       b += 42;
>>>     }
>>>     // -> b += 42
>>>
>>> It could even sink operations where a phi was required:
>>>
>>>     if (a) {
>>>       b += 42;
>>>     } else {
>>>       b += 64;
>>>     }
>>>     // -> b += phi(42, 64)
>>>
>>> However it had a simplistic execution model and so:
>>>   1) Restricted itself to blocks with only TWO incoming arcs
>>>   2) Would only ever produce one PHI and then bail out
>>>
>>> Generally, the factors that make the search space for a good generic
>>> sinking optimization large are:
>>>
>>>   1) The number of incoming edges to a block.
>>>   2) Non-monotonicity of heuristic - for example it might not be
>>> worthwhile sinking 2 instructions but may be worthwhile sinking 3
>>> instructions.
>>>   3) Permutations of otherwise similar instructions in incoming blocks.
>>>
>>> I'll describe them all in a bit more detail:
>>>
>>> *1) The number of incoming edges to a block*
>>>
>>> There are many constructs that can create a block with more than two
>>> input edges. An obvious example is a switch:
>>>
>>>     switch (a) {
>>>       case 1: b += 32; break;
>>>       case 2: b += 43; break;
>>>       default: break;
>>>     }
>>>
>>> Another is a chain of if-elses:
>>>
>>>     if (a) {
>>>       b += 32;
>>>     } else if (c) {
>>>       b += 43;
>>>     } else {
>>>       b += 65;
>>>     }
>>>
>>> Note that those two examples are subtly different. In the latter
>>> example, every branch performs an addition to 'b'. We can therefore sink
>>> all the adds to the end and perform only one add:
>>>
>>>     b += phi(32, 43, 65);
>>>
>>> However in the former case things aren't so simple. The default case
>>> never does any add, so we have to speculate the add instruction in that
>>> case. For adds, this is simple - adds don't have sideeffects and we can
>>> speculate by adding with an identity (0):
>>>
>>>     b += phi(32, 43, 0);
>>>
>>> However what if 'b' was a global variable? In this case, each case of
>>> the switch will have a load/add/store sequence:
>>>
>>>                switch(a) ---------------
>>>               /        \                \
>>>               |         |                |
>>>     [t1 = load @b]    [t2 = load @b]     |
>>>     [t1 = add t1, 32] [t2 = add t2, 43]  |
>>>     [store @b, t1]    [store @b, t2]     |
>>>                \        /               /
>>>                 v       v v-------------
>>>              [             ]
>>>
>>> In most cases, we cannot be sure that in the default case @b is allowed
>>> to be written to. Speculating a store in this case could be invalid.
>>> Therefore, if we want to sink the store (and thus the add and load) we need
>>> to invent a new block to sink them to.
>>>
>>>                switch(a) ---------------
>>>               /        \                \
>>>               |         |                |
>>>     [t1 = load @b]    [t2 = load @b]     |
>>>     [t1 = add t1, 32] [t2 = add t2, 43]  |
>>>                \        /                |
>>>              [ p = phi(t1, t2) ]        /
>>>              [ store @b, p ]           /
>>>                    |      +------------
>>>                    v      v
>>>              [             ]
>>>
>>> Creating this new block allows the switch to be turned into a trivial
>>> one that can be lowered to a bit test or jump table.
>>>
>>> There is an even more complex case that appears quite a bit:
>>>
>>>     if (a) {
>>>       switch (b) {
>>>       case 1: c += 42; break;
>>>       case 2: c += 2; break;
>>>       default: c += 3; break;
>>>       }
>>>     } else {
>>>       switch (d) {
>>>       case 1: e -= 2; break;
>>>       case 2: e -= 6; break;
>>>       default: e -= 32; break;
>>>       }
>>>     }
>>>
>>> We canonicalize this CFG to one cleanup block with six incoming arcs. In
>>> this case, it takes some nontrivial analysis to determine what to do, as it
>>> requires partitioning the incoming arcs into sets, performing different
>>> sinking operations on each (in this case two disjoint sets, but there are
>>> cases where the sets are not disjoint and some cost heuristic needs to be
>>> used).
>>>
>>> *2) Non-monotonicity of heuristic*
>>>
>>> What I'm trying to say here is the greedy algorithms work best when the
>>> heuristic they are computing is monotonic. Consider this:
>>>
>>>     if (a) {
>>>       tmp1 = *b;
>>>       tmp2 = *c;
>>>       tmp3 = tmp1 + tmp2;
>>>       *d = tmp3;
>>>     } else {
>>>       tmp4 = *b;
>>>       tmp5 = *c;
>>>       tmp6 = tmp4 + tmp5;
>>>       *e = tmp6;
>>>     }
>>>
>>> If we proceed sequentially from the bottom of these blocks, trying to
>>> determine if sinking them is worthwhile:
>>>
>>> Step 1: [*e = tmp6], [*d = tmp3]
>>>   - This is the same operation, a store, but it requires two PHIs - one
>>> for (e,d) and one for (tmp6, tmp3).
>>>   - This is NOT worthwhile. We are constraining register allocation for
>>> only one instruction commonned.
>>>
>>> Step 2: [tmp6 = tmp4 + tmp5], [tmp3 = tmp1 + tmp2]
>>>   - Again, the operation is the same, but now we need another two PHIs,
>>> plus one PHI from Step 1 (the phi required for (tmp6, tmp3) is now no
>>> longer required).
>>>   - Three PHIs required for two instructions sunk. NOT worth it.
>>>
>>> Step 3: [tmp5 = *c], [tmp2 = *c]
>>>   - These can be sunk, and have the same arguments so no PHIs required.
>>> We still need two of the PHIs from step 2 (the (tmp5, tmp2) phi now is no
>>> longer required)
>>>   - At this point we've sunk three instructions for two PHIs. Maybe it's
>>> worth it?
>>>
>>> Step 4: [tmp4 = *b], [tmp1 = *b]
>>>   - Same as step 3, but now the (tmp4, tmp1) PHI from step 2 goes away
>>> too
>>>   - So now we're sinking 4 instructions for one PHI.
>>>   - Now it's definitely worth it!
>>>
>>> Whatever heuristic one chooses, in a case like this it will get worse,
>>> then maybe get better again. So we can't simply bail out when the heuristic
>>> reaches some threshold, because maybe it will improve as we analyze more
>>> instructions!
>>>
>>> *3) Permutations of otherwise similar instructions*
>>>
>>> This is the case you describe. Stores (for example) that look similar
>>> but have been permuted:
>>>
>>>     if (a) {
>>>       *b = 52;
>>>       *c = 43;
>>>     } else {
>>>       *c = 87;
>>>       *b = 54;
>>>     }
>>>
>>> If we approach this bottom-up, we will compare [*b = 54], [*c = 43],
>>> then [*c = 87], [*b = 52] and conclude that two instructions can be sunk,
>>> but 4 PHIs are required (because the target of the store and value are
>>> different in each case). If we looked at [*b = 54], [*b = 52], perhaps we
>>> would have seen that we can sink this with only *two* PHIs required and
>>> that might change our decision.
>>>
>>> This is generally quite difficult and especially shows up with memory
>>> instructions like this (if c and b alias, then permuting them when sinking
>>> would be illegal. So we need alias analysis here).
>>>
>>> *What GVNSink attempts to solve*
>>>
>>> The current implementation of GVNSink was meant to solve 1) and 2), but
>>> NOT 3). 3) wasn't within scope when I was writing GVNSink, but I think 3)
>>> is doable for some cases within the framework of GVNSink.
>>>
>>> *How GVNSink works*
>>>
>>> GVNSink works by computing a value number for each IR instruction,
>>> similar to GVN. However, GVNSink's value numbering works slightly from
>>> GVN's.
>>>
>>> To recap, the value number (VN) of an instruction is a function of the
>>> instruction. If VN[I] == VN[J], I and J can be considered equivalent, for
>>> some definition of equivalence.
>>>
>>> In GVN, VN[I] is based upon I and all of I's operands, transitively.
>>> Therefore if VN[I] == VN[J], I and J were computed in the same way and J
>>> can be used instead of I.
>>>
>>> In GVNSink, we flip this on its head. VN[I] is based upon I's *uses*,
>>> transitively. Therefore if VN[I] == VN[J], all of the uses of I are the
>>> same as all of the uses of J. Therefore I and J can be merged. So this:
>>>
>>>    [ %a = add %I, 1 ]   [ %c = add %J, 1 ]
>>>    [ %b = sub %a, 5 ]   [ %d = sub %c, 5 ]
>>>    [ ret %b ]           [ ret %d ]
>>>
>>> Can be transformed into this:
>>>
>>>    []     []
>>>       \ /
>>>    [ %IJ = phi %I, %J ]
>>>    [ %a = add %IJ, 1 ]
>>>    [ %b = sub %a, 5 ]
>>>    [ ret %b ]
>>>
>>> In GVNSink, we step backwards across all instructions in all incoming
>>> basic blocks (see LockstepReverseIterator). This produces, conceptually, a
>>> sequence of sets of instructions.
>>>
>>> Each of these sets is analyzed in turn (see
>>> analyzeInstructionForSinking) - the value number of each contained
>>> instruction is calculated and used to determine if the entire set is
>>> mergeable or not (all values have the same VN).
>>>
>>> If not all instructions in the set have the same VN, the set is pruned
>>> to just those with the same VN (or we stop if all instructions had
>>> different VNs). If this set passes extra legality checks, a
>>> SinkingInstructionCandidate is created. This contains pertinent data for
>>> the cost heuristic:
>>>
>>>   1) Number of instructions from each incoming block being sunk
>>>   2) Number of blocks to pick instructions from (this can be a subset of
>>> all the incoming blocks!)
>>>   3) How many PHIs would be generated in total
>>>
>>> After all candidates have been examined, they are sorted based on a
>>> heuristic and the best candidate picked and applied.
>>>
>>> The use of value numbering in this algorithm is what allows us to do
>>> this in somewhat-linear time with respect to the number of instructions
>>> inspected.
>>>
>>> Now, to solve 3), I think the component that needs swapping out is
>>> LockstepReverseIterator. This currently iterates backwards in lockstep
>>> across a set of blocks. To handle permutations, I think we could use a
>>> priority queue that feeds the main algorithm instructions from the input
>>> blocks in a slightly sorted order.
>>>
>>
>> To solve 3, I think what we need are not only the CSE from GVN, but also
>> some kind of instruction reordering, i.e. "scheduling". I think the idea of
>> introducing a "priority queue" and processing the instruction in a
>> different order than they appear in the BB, is also a kind of "implicit
>> scheduling".
>>
>> maybe what we can do is introduce a "canonical order" by considering data
>> dependency (SSA def-use chain), memory dependencies (Memory SSA), and other
>> dependencies to order the instruction such that:
>> 1. Load instructions are move toward the beginning of a BB, such that
>> GVNHosit and/or CFGSimplify have better chance to hoist and merge the loads
>> 2. Store instructions are move toward the end of a BB, such that GVNSink
>> and InstructionCombine (it is merging the stores today) have a better
>> chance to sink and merge the stores.
>> 3. With the same type of instructions, i.e. load/stores, without
>> dependencies between them, we can sort/schedule them based on the GVN of
>> the pointer operand.
>>
>> In sort, the "canonical ordering" will sort instructions in BBs like:
>>
>> BB0:
>> br BB1 or BB2
>>
>> BB1:
>> x = ld ptr1
>> st x ptr2
>> y = ld ptr3
>> st y ptr4
>> br BB3
>>
>> BB2:
>> x1 = ld ptr3
>> st x1 ptr4
>> y1 = ld ptr1
>> st y1 ptr2
>> br BB3
>>
>> BB3:
>>
>> to the following:
>> BB0:
>> br BB1 or BB2
>>
>> BB1:
>> x = ld ptr1
>> y = ld ptr3
>> st x ptr2
>> st y ptr4
>> br BB3
>>
>> BB2:
>> y1 = ld ptr1
>> x1 = ld ptr3
>> st y1 ptr2
>> st x1 ptr4
>> br BB3
>>
>> BB3:
>>
>> Now we can use the lock step iterator to merge these instructions.:
>> BB0:
>> x = ld ptr1
>> y = ld ptr3
>> br BB1 or BB2
>>
>> BB1:
>> br BB3
>>
>> BB2:
>> br BB3
>>
>> BB3:
>> st y ptr2
>> st x ptr4
>>
>> And we can make this "canonical ordering" independent from the other
>> passes.
>>
>> Such that the users can choose to pay extra compile time run the
>> "canonical ordering" pass and get more merge from GVN or not.
>>
>> Of course, this "canonical ordering" may not always work, e.g.:
>>
>> BB0:
>> br BB1 or BB2
>>
>> BB1:
>> x = ld ptr1
>> st x ptr2
>> y = ld ptr3
>> st y ptr4
>> br BB3
>>
>> BB2:
>> x1 = ld ptr0
>> st x1 ptr5
>> y1 = ld ptr1
>> st y1 ptr2
>> br BB3
>>
>> BB3:
>>
>> May be reordered to the following:
>>
>> BB0:
>> br BB1 or BB2
>>
>> BB1:
>> x = ld ptr1
>> y = ld ptr3
>> st x ptr2
>> st y ptr4
>> br BB3
>>
>> BB2:
>> x1 = ld ptr0
>> y1 = ld ptr1
>> st y1 ptr2
>> st x1 ptr5
>> br BB3
>>
>> BB3:
>>
>> And now the lock step iterator doesn't work.
>>
>> So there may be no universal "canonical order", and we need to consider
>> the context when we reorder/assign priority in the priority queue to solve
>> problem 3.
>>
>> Thanks
>> Hongbin
>>
>>
>>>
>>> For example, perhaps it might order sequences of stores by their target
>>> operand (assuming alias analysis says it can), ensuring sequences of stores
>>> in different blocks appear in the same order?
>>>
>>> Anyway, I hope this has helped understand a bit more about what GVNSink
>>> does, can do and how to modify it if you want to solve your problem.
>>>
>>> Cheers,
>>>
>>> James
>>>
>>>
>>>
>>> On Wed, 26 Apr 2017 at 18:16 Daniel Berlin <dberlin at dberlin.org> wrote:
>>>
>>>> I have a patch i'm working on to split newgvn into a proper analysis.
>>>> It works (and happy to share it), it's just the interface i'm using is
>>>> a little ugly (haven't decided what the right API is), but let me know what
>>>> GVNSink needs from life and i'll make sure it's there :)
>>>>
>>>>
>>>> On Wed, Apr 26, 2017 at 10:09 AM, James Molloy via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> It's basically ready to commit; the reviewers were fairly happy with
>>>>> it. It needs rebasing on top of NewGVN and any bugs that shakes out fixed,
>>>>> but that's about it.
>>>>>
>>>>> I want to get around to it soon-ish, but I've wanted that for a while!
>>>>>
>>>>> On Wed, 26 Apr 2017 at 16:50, Hongbin Zheng <etherzhhb at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi James,
>>>>>>
>>>>>> I have an ad-hoc solution in mind to solve this problem.
>>>>>> But if it can be solved under the framework of GVN"Sink", it is even
>>>>>> better.
>>>>>>
>>>>>> any plan on your https://reviews.llvm.org/D24805?
>>>>>>
>>>>>> Thanks
>>>>>> Hongbin
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 26, 2017 at 2:13 AM, James Molloy <
>>>>>> james at jamesmolloy.co.uk> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Yes, I can see why that would not work.
>>>>>>>
>>>>>>> The sinking algorithm in SimplifyCFG isn't particularly clever. In
>>>>>>> particular it can't reason about memory ordering and aliasing. In
>>>>>>> unswitch1(), it can identify that the stores correlate because the
>>>>>>> correlating stores appear in the same relative source order. In unswitch2()
>>>>>>> they have been permuted, and the algorithm cannot deal with this. This
>>>>>>> requires some kind of value numbering to do efficiently.
>>>>>>>
>>>>>>> The GVNSink pass that I really should get around to updating should
>>>>>>> solve this, eventually!
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 26 Apr 2017 at 08:19 Hongbin Zheng via llvm-dev <
>>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>>
>>>>>>>> Looks like this do not work:
>>>>>>>>
>>>>>>>> // Type your code here, or load an example.
>>>>>>>> int a[10];
>>>>>>>>
>>>>>>>> void unswitch2(int i, int x, int y0, int y1) {
>>>>>>>>   if (x) {
>>>>>>>>     a[i] = y0;
>>>>>>>>     a[i + 1] = y1;
>>>>>>>>   } else {
>>>>>>>>     a[i + 1] = y0;
>>>>>>>>     a[i] = y1;
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> https://godbolt.org/g/Ldd5qV
>>>>>>>>
>>>>>>>> On Tue, Apr 25, 2017 at 10:22 PM, Hongbin Zheng <
>>>>>>>> etherzhhb at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Looks like inst combine do the job
>>>>>>>>>
>>>>>>>>> On Tue, Apr 25, 2017 at 9:36 PM, Davide Italiano <
>>>>>>>>> davide at freebsd.org> wrote:
>>>>>>>>>
>>>>>>>>>> On Tue, Apr 25, 2017 at 9:24 PM, Hongbin Zheng via llvm-dev
>>>>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>>> > Hi,
>>>>>>>>>> >
>>>>>>>>>> > Is there a pass in LLVM that can optimize:
>>>>>>>>>> >
>>>>>>>>>> > if (x)
>>>>>>>>>> >   a[i] = y0;
>>>>>>>>>> > else
>>>>>>>>>> >   a[i] = y1;
>>>>>>>>>> >
>>>>>>>>>> > to
>>>>>>>>>> >
>>>>>>>>>> > a[i] = x ? y0 : y1?
>>>>>>>>>> >
>>>>>>>>>> > I tried opt (3.9) with -O3 but looks like such an optimization
>>>>>>>>>> do not
>>>>>>>>>> > happened.
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>> The same IR at -O3 for both cases on this example.
>>>>>>>>>> https://godbolt.org/g/Tk2MM8
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Davide
>>>>>>>>>>
>>>>>>>>>> "There are no solved problems; there are only problems that are
>>>>>>>>>> more
>>>>>>>>>> or less solved" -- Henri Poincare
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>
>>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>>
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170428/77c27baa/attachment-0001.html>


More information about the llvm-dev mailing list