[llvm-dev] Store unswitch

James Molloy via llvm-dev llvm-dev at lists.llvm.org
Fri Apr 28 07:42:23 PDT 2017


Hi Danny,

Thanks for that :) However I've just updated the prototype patch to NewGVN
and it didn't need any API changes - all I rely on is GVNExpression.

Hongbin,

I wanted to explain a little about what GVNSink can currently do, what it
was designed for and hopefully how to make it handle your testcase.

*Background*

Common code sinking is more difficult to do efficiently than one might
expect. Prior to October 2016, LLVM was able to handle simple cases of this
within SimplifyCFG, for example:

    if (a) {
      b += 42;
    } else {
      b += 42;
    }
    // -> b += 42

It could even sink operations where a phi was required:

    if (a) {
      b += 42;
    } else {
      b += 64;
    }
    // -> b += phi(42, 64)

However, it had a simplistic execution model, and so it:
  1) Restricted itself to blocks with only TWO incoming arcs.
  2) Would only ever produce one PHI before bailing out.

Generally, the factors that make the search space for a good generic
sinking optimization large are:

  1) The number of incoming edges to a block.
  2) Non-monotonicity of the heuristic - for example, it might not be
worthwhile to sink 2 instructions but may be worthwhile to sink 3.
  3) Permutations of otherwise similar instructions in incoming blocks.

I'll describe them all in a bit more detail:

*1) The number of incoming edges to a block*

There are many constructs that can create a block with more than two input
edges. An obvious example is a switch:

    switch (a) {
      case 1: b += 32; break;
      case 2: b += 43; break;
      default: break;
    }

Another is a chain of if-elses:

    if (a) {
      b += 32;
    } else if (c) {
      b += 43;
    } else {
      b += 65;
    }

Note that those two examples are subtly different. In the latter example,
every branch performs an addition to 'b'. We can therefore sink all the
adds to the end and perform only one add:

    b += phi(32, 43, 65);

However, in the former case things aren't so simple. The default case never
does any add, so we have to speculate the add instruction in that case. For
adds this is simple - adds don't have side effects, and we can speculate by
adding with an identity element (0):

    b += phi(32, 43, 0);

However, what if 'b' were a global variable? In that case, each arm of the
switch will have a load/add/store sequence:

               switch(a) ---------------
              /        \                \
              |         |                |
    [t1 = load @b]    [t2 = load @b]     |
    [t1 = add t1, 32] [t2 = add t2, 43]  |
    [store @b, t1]    [store @b, t2]     |
               \        /               /
                v       v v-------------
             [             ]

In general, we cannot be sure that @b may legally be written to on the
default path, so speculating the store could be invalid. Therefore, if we
want to sink the store (and thus the add and load), we need to invent a new
block to sink them to.

               switch(a) ---------------
              /        \                \
              |         |                |
    [t1 = load @b]    [t2 = load @b]     |
    [t1 = add t1, 32] [t2 = add t2, 43]  |
               \        /                |
             [ p = phi(t1, t2) ]        /
             [ store @b, p ]           /
                   |      +------------
                   v      v
             [             ]

Creating this new block allows the switch to be turned into a trivial one
that can be lowered to a bit test or jump table.

There is an even more complex case that appears quite a bit:

    if (a) {
      switch (b) {
      case 1: c += 42; break;
      case 2: c += 2; break;
      default: c += 3; break;
      }
    } else {
      switch (d) {
      case 1: e -= 2; break;
      case 2: e -= 6; break;
      default: e -= 32; break;
      }
    }

We canonicalize this CFG to one cleanup block with six incoming arcs. In
this case it takes some nontrivial analysis to determine what to do, as it
requires partitioning the incoming arcs into sets and performing a different
sinking operation on each (in this case the two sets are disjoint, but there
are cases where the sets are not disjoint and some cost heuristic needs to
be used).

*2) Non-monotonicity of heuristic*

What I'm trying to say here is that greedy algorithms work best when the
heuristic they are computing is monotonic. Consider this:

    if (a) {
      tmp1 = *b;
      tmp2 = *c;
      tmp3 = tmp1 + tmp2;
      *d = tmp3;
    } else {
      tmp4 = *b;
      tmp5 = *c;
      tmp6 = tmp4 + tmp5;
      *e = tmp6;
    }

If we proceed sequentially from the bottom of these blocks, trying to
determine if sinking them is worthwhile:

Step 1: [*e = tmp6], [*d = tmp3]
  - This is the same operation, a store, but it requires two PHIs - one for
(e, d) and one for (tmp6, tmp3).
  - This is NOT worthwhile. We would be constraining register allocation for
only one commoned instruction.

Step 2: [tmp6 = tmp4 + tmp5], [tmp3 = tmp1 + tmp2]
  - Again, the operation is the same, but now we need another two PHIs; the
(tmp6, tmp3) PHI from Step 1 is no longer required.
  - Three PHIs required for two instructions sunk. NOT worth it.

Step 3: [tmp5 = *c], [tmp2 = *c]
  - These can be sunk, and they have the same operands, so no PHIs are
required. We still need two of the PHIs from Step 2; the (tmp5, tmp2) PHI is
no longer required.
  - At this point we've sunk three instructions for two PHIs. Maybe it's
worth it?

Step 4: [tmp4 = *b], [tmp1 = *b]
  - Same as Step 3, but now the (tmp4, tmp1) PHI from Step 2 goes away too.
  - So now we're sinking four instructions for one PHI.
  - Now it's definitely worth it!

Whatever heuristic one chooses, in a case like this it will get worse, then
maybe get better again. So we can't simply bail out when the heuristic
reaches some threshold, because maybe it will improve as we analyze more
instructions!

*3) Permutations of otherwise similar instructions*

This is the case you describe. Stores (for example) that look similar but
have been permuted:

    if (a) {
      *b = 52;
      *c = 43;
    } else {
      *c = 87;
      *b = 54;
    }

If we approach this bottom-up, we will compare [*b = 54], [*c = 43], then
[*c = 87], [*b = 52], and conclude that two instructions can be sunk but
four PHIs are required (because the target of the store and the stored value
differ in each pair). If we had instead compared [*b = 54], [*b = 52],
perhaps we would have seen that we can sink this pair with only *two* PHIs
and that might have changed our decision.

This is generally quite difficult, and it especially shows up with memory
instructions like these (if b and c alias, then permuting them when sinking
would be illegal, so we need alias analysis here).

*What GVNSink attempts to solve*

The current implementation of GVNSink was meant to solve 1) and 2), but NOT
3). 3) wasn't within scope when I was writing the pass, but I think it is
doable for some cases within GVNSink's framework.

*How GVNSink works*

GVNSink works by computing a value number for each IR instruction, similar
to GVN. However, GVNSink's value numbering differs slightly from GVN's.

To recap, the value number (VN) of an instruction is a function of the
instruction. If VN[I] == VN[J], I and J can be considered equivalent, for
some definition of equivalence.

In GVN, VN[I] is based upon I and all of I's operands, transitively.
Therefore if VN[I] == VN[J], I and J were computed in the same way and J
can be used instead of I.

In GVNSink, we flip this on its head. VN[I] is based upon I's *uses*,
transitively. Therefore if VN[I] == VN[J], all of the uses of I are the
same as all of the uses of J. Therefore I and J can be merged. So this:

   [ %a = add %I, 1 ]   [ %c = add %J, 1 ]
   [ %b = sub %a, 5 ]   [ %d = sub %c, 5 ]
   [ ret %b ]           [ ret %d ]

Can be transformed into this:

   []     []
      \ /
   [ %IJ = phi %I, %J ]
   [ %a = add %IJ, 1 ]
   [ %b = sub %a, 5 ]
   [ ret %b ]

In GVNSink, we step backwards across all instructions in all incoming basic
blocks (see LockstepReverseIterator). This produces, conceptually, a
sequence of sets of instructions.

Each of these sets is analyzed in turn (see analyzeInstructionForSinking):
the value number of each contained instruction is calculated and used to
determine whether the entire set is mergeable (all instructions share the
same VN).

If not all instructions in the set have the same VN, the set is pruned to
just those with the same VN (or we stop if all instructions had different
VNs). If this set passes extra legality checks, a
SinkingInstructionCandidate is created. This contains pertinent data for
the cost heuristic:

  1) Number of instructions from each incoming block being sunk
  2) Number of blocks to pick instructions from (this can be a subset of
all the incoming blocks!)
  3) How many PHIs would be generated in total

After all candidates have been examined, they are sorted by a heuristic and
the best candidate is picked and applied.

The use of value numbering in this algorithm is what allows us to do this
in somewhat-linear time with respect to the number of instructions
inspected.

Now, to solve 3), I think the component that needs swapping out is
LockstepReverseIterator. This currently iterates backwards in lockstep
across a set of blocks. To handle permutations, I think we could use a
priority queue that feeds the main algorithm instructions from the input
blocks in a slightly sorted order.

For example, perhaps it might order sequences of stores by their target
operand (assuming alias analysis says it can), ensuring sequences of stores
in different blocks appear in the same order?

Anyway, I hope this has helped you understand a bit more about what GVNSink
does, what it can do, and how to modify it if you want to solve your problem.

Cheers,

James



On Wed, 26 Apr 2017 at 18:16 Daniel Berlin <dberlin at dberlin.org> wrote:

> I have a patch i'm working on to split newgvn into a proper analysis.
> It works (and happy to share it), it's just the interface i'm using is a
> little ugly (haven't decided what the right API is), but let me know what
> GVNSink needs from life and i'll make sure it's there :)
>
>
> On Wed, Apr 26, 2017 at 10:09 AM, James Molloy via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> It's basically ready to commit; the reviewers were fairly happy with it.
>> It needs rebasing on top of NewGVN and any bugs that shakes out fixed, but
>> that's about it.
>>
>> I want to get around to it soon-ish, but I've wanted that for a while!
>>
>> On Wed, 26 Apr 2017 at 16:50, Hongbin Zheng <etherzhhb at gmail.com> wrote:
>>
>>> Hi James,
>>>
>>> I have an ad-hoc solution in mind to solve this problem.
>>> But if it can be solved under the framework of GVN"Sink", it is even
>>> better.
>>>
>>> any plan on your https://reviews.llvm.org/D24805?
>>>
>>> Thanks
>>> Hongbin
>>>
>>>
>>> On Wed, Apr 26, 2017 at 2:13 AM, James Molloy <james at jamesmolloy.co.uk>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Yes, I can see why that would not work.
>>>>
>>>> The sinking algorithm in SimplifyCFG isn't particularly clever. In
>>>> particular it can't reason about memory ordering and aliasing. In
>>>> unswitch1(), it can identify that the stores correlate because the
>>>> correlating stores appear in the same relative source order. In unswitch2()
>>>> they have been permuted, and the algorithm cannot deal with this. This
>>>> requires some kind of value numbering to do efficiently.
>>>>
>>>> The GVNSink pass that I really should get around to updating should
>>>> solve this, eventually!
>>>>
>>>> James
>>>>
>>>>
>>>> On Wed, 26 Apr 2017 at 08:19 Hongbin Zheng via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> Looks like this does not work:
>>>>>
>>>>> // Type your code here, or load an example.
>>>>> int a[10];
>>>>>
>>>>> void unswitch2(int i, int x, int y0, int y1) {
>>>>>   if (x) {
>>>>>     a[i] = y0;
>>>>>     a[i + 1] = y1;
>>>>>   } else {
>>>>>     a[i + 1] = y0;
>>>>>     a[i] = y1;
>>>>>   }
>>>>> }
>>>>>
>>>>> https://godbolt.org/g/Ldd5qV
>>>>>
>>>>> On Tue, Apr 25, 2017 at 10:22 PM, Hongbin Zheng <etherzhhb at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Looks like instcombine does the job.
>>>>>>
>>>>>> On Tue, Apr 25, 2017 at 9:36 PM, Davide Italiano <davide at freebsd.org>
>>>>>> wrote:
>>>>>>
>>>>>>> On Tue, Apr 25, 2017 at 9:24 PM, Hongbin Zheng via llvm-dev
>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > Is there a pass in LLVM that can optimize:
>>>>>>> >
>>>>>>> > if (x)
>>>>>>> >   a[i] = y0;
>>>>>>> > else
>>>>>>> >   a[i] = y1;
>>>>>>> >
>>>>>>> > to
>>>>>>> >
>>>>>>> > a[i] = x ? y0 : y1?
>>>>>>> >
>>>>>>> > I tried opt (3.9) with -O3, but it looks like such an
>>>>>>> > optimization does not happen.
>>>>>>> >
>>>>>>>
>>>>>>> You get the same IR at -O3 for both cases in this example.
>>>>>>> https://godbolt.org/g/Tk2MM8
>>>>>>>
>>>>>>> --
>>>>>>> Davide
>>>>>>>
>>>>>>> "There are no solved problems; there are only problems that are more
>>>>>>> or less solved" -- Henri Poincare
>>>>>>>
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>
>>>
>>
>>
>

