<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Realized that the detailed nature of my response may have missed

      the high order bit.  I generally think that doing this

      transformation in GVN is entirely reasonable, and would be a

      useful enhancement.  That's why I spent the time to give the

      detailed feedback on implementation.  <br>

    </p>

    <p>Philip<br>

    </p>

    <p>p.s. It's worth noting that what we're implementing here is a

      restricted form of "very busy expressions" which basically the

      dual of PRE.  There's a bunch of existing literature on the

      general problem.  <br>

    </p>

    <p>p.p.s. Once we're through with the hoisting variant, there's a

      restricted sinking variant too.   If we structure the hoisting

      code right, the sinking should be straight forward.  <br>

    </p>

    <div class="moz-cite-prefix">On 9/14/21 12:12 PM, Philip Reames

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:ad35373c-7529-dfd4-8834-f94938e1790a@philipreames.com">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <p>Ok, this response is focused on the proposed transform.  See

        inline.  <br>

      </p>

      <div class="moz-cite-prefix">On 9/14/21 7:19 AM, Momchil Velikov

        via llvm-dev wrote:<br>

      </div>

      <blockquote type="cite"

cite="mid:AS8PR08MB594391C09044E3578342FD05E0DA9@AS8PR08MB5943.eurprd08.prod.outlook.com">

        <pre class="moz-quote-pre" wrap="">Looking at a particularly hot function in the SPEC/x264, that LLVM fails to

vectorise.

typedef short int16_t;

typedef unsigned short uint16_t;

int q(int16_t d[], uint16_t m[], uint16_t b[]) {

  int n = 0;

  for (int i = 0; i < 16; i++) {

    if (d[i] > 0)

      d[i] = (b[i] + d[i]) * m[i] >> 16;

    else

      d[i] = -((b[i] - d[i]) * m[i] >> 16);

    n |= d[i];

  }

  return n;

}

As it turns out, LLVM adds runtime alias checks for this function and then it

considers it legal to vectorise with if-conversion.  However, the vectorisation

cost model effectively bans that particular if-conversion by assigning a

ridiculous cost to emulated masked loads and stores.

Originally, each branch of the if statement in the loop body contains three

identical loads.  Manually hoisting these allows the loop to be vectorised at

`-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation).

There's a subroutine in `SimplifyCFG` that does a rudimentary hoisting of

instructions from the two successors of a block, which subroutine does indeed

hoist two of the three loads, but gives up before hoisting the third one.

We'd need a way to make LLVM hoist all three of the loads by itself. `GVNHoist`

can do that, but that pass is disabled by default and has been disabled for a

long, long time.</pre>

      </blockquote>

      <p>I haven't looked at this pass recently, but I would strongly

        recommend distrusting the current implementation.  The software

        engineering process that went into writing it doesn't give me

        any confidence.  I remember a long string of

        enable-revert-bugfix without obvious justification as to why the

        bugs being found had to be found with such an expensive

        approach.  If you do decide to pursue this, expect the need for

        a) targeted review and b) a bunch of dedicated fuzzing.  <br>

      </p>

      <p>Having said that, *conceptually* this pass takes the proper

        approach by using MemorySSA.  Your current implementation has a

        bunch of complexity to basically do a local renumbering of

        memory instructions.  Some of that can be improved (see below),

        but if you were doing this with MemorySSA, you wouldn't need to

        do anything fancy as GVN would have numbered the loads

        correctly.  NewGVN is a MemorySSA version of GVN which I'd

        nominally say is the right place for this, but it has never been

        productionized.  A huge blocker was MemorySSA itself which was

        recently productionized, but I don't know what's left for NewGVN

        itself.  (Alina might be able to add context here?)  <br>

      </p>

      <blockquote type="cite"

cite="mid:AS8PR08MB594391C09044E3578342FD05E0DA9@AS8PR08MB5943.eurprd08.prod.outlook.com">

        <pre class="moz-quote-pre" wrap="">As an alternative, I was thinking of a simpler hoisting transformation, that

just handles moving instructions from two single-predecessor blocks to their

common predecessor. That could be made reasonably fast, by pairing instructions

by their GVN value number. Limiting hoisting to a predecessor block (for the

most part) would also avoid excessive increase of lifetimes (for the majority of

the case) and would also simplify correctness checks.

I've written such a transformation as a subroutine to `GVN`, it seemed like a

good place for it and is an a similar spirit as various PREs the GVN does. The

Phabricator review is at <a class="moz-txt-link-freetext" href="https://reviews.llvm.org/D109760" moz-do-not-send="true">https://reviews.llvm.org/D109760</a>.</pre>

      </blockquote>

      <p>The current structure of algorithm you proposed can probably be

        simplified.  Here's what I'd suggest:</p>

      <p>Start by handling all of the non-memory instructions.  This

        should be as simple as constructing a hash map from value number

        to instruction for one block, and then scanning the other

        block.  If you find the same value number in both blocks, the

        (by assumption scalar) instruction must exist in both blocks and

        can be hoisted.  This can be reviewed, tested, submitted, etc..<br>

      </p>

      <p>Then add a single pre-pass which "renumbers" all of the memory

        instructions.  You can either use a trivial "no clobbers seen"

        like EarlyCSE does, or only handle loads where

        MD->getDependency() is nonlocal.  (We already do this query

        in processLoad, so it should be cached.)  The same algorithm can

        be used as above, all that changes is where the valuenumber

        comes from.  (The "pre-pass" can probably be implemented as a

        helper function which renumbers memory instructions on demand.) 

        <br>

      </p>

      <p>I would specifically *not* add an extra layer of iteration

        here.  If you find you need iteration to handle the cases you

        care about, I suspect we should actually be adjusting the value

        re-numbering scheme for loads described just above. <br>

      </p>

      <p>A couple of assorted notes:</p>

      <ul>

        <li>MemDep should be responsible for handling all aliasing

          barriers.  If it returns a non-local dependence, reordering

          that instruction to the beginning of the block should be legal

          w.r.t. memory.  Your use of isHoistBarrier is confusing to

          me.  You may be trying to handle speculation safety (i.e.

          avoid introducing faults), but if so, the special casing

          around volatile and atomics seems odd?  (Also, why allow

          hosting of the barrier instruction?)  I suspect there's a

          particular case you're trying to handle.  I would strongly

          advice dropping that from the initial patch, then add back the

          complexity in a separate change which is well motivated with

          tests, etc..  <br>

        </li>

        <li>In terms of the actual merge logic, I'd suggest implementing

          it as hoist-one, then CSE.  That would allow you to reuse the

          existing patching logic and avoid reimplementing a bunch of

          complexity.  <br>

        </li>

      </ul>

      <p><br>

      </p>

      <blockquote type="cite"

cite="mid:AS8PR08MB594391C09044E3578342FD05E0DA9@AS8PR08MB5943.eurprd08.prod.outlook.com">

        <pre class="moz-quote-pre" wrap="">Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):

500.perlbench_r    1.13%

502.gcc_r          0.00%

505.mcf_r         -1.89%

520.omnetpp_r      0.00%

523.xalancbmk_r    0.00%

525.x264_r         7.67%

531.deepsjeng_r    0.60%

541.leela_r        0.24%

548.exchange2_r    0.00%

557.xz_r           0.75%</pre>

      </blockquote>

      This looks generally positive.  Do you know why MCF regresses?  If

      not, taking a glance to make sure you understand the interaction

      is probably warranted.  <br>

      <blockquote type="cite"

cite="mid:AS8PR08MB594391C09044E3578342FD05E0DA9@AS8PR08MB5943.eurprd08.prod.outlook.com">

        <pre class="moz-quote-pre" wrap="">Comments?

~chill

_______________________________________________

LLVM Developers mailing list

<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org" moz-do-not-send="true">llvm-dev@lists.llvm.org</a>

<a class="moz-txt-link-freetext" href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" moz-do-not-send="true">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>

</pre>

      </blockquote>

    </blockquote>

  </body>

</html>