<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Initial response focused on different potential ways of solving

      this problem.  I'll reply separately on your proposed GVN patch.</p>

    <p>I see a number of ways to tackle this:</p>

    <ul>

      <li>The vectorize already has to do a forward walk through the

        function to compute predicate masks.  As we do that walk, we

        could keep track for each address of the union of predicate

        masks for each access.  In this case, the union would be all

        ones, meaning we could use a unconditional access.  In general,

        we can do a single predicated access using the union of the

        predicate masks.  This is a non-trivial costing question of

        whether explicitly forming the union is worthwhile, but I

        suspect we can handle obvious cases cheaply.</li>

      <li>Still in the vectorizer, we can do a forward walk and track

        accesses encountered along each path.  Any address which is

        accessed along each path to the latch can be done

        unconditionally.  (This is essentially a restricted subset of

        the former without the generalization to predicate masks.)</li>

      <li>We could extend SimplifyCFG.  You mention this, but don't get

        into *why* we don't handle the last load.  We really should in

        this case.  (Though, after CSE, there should only be two

        conditional loads in your inner loop?  Maybe I'm missing

        something?)</li>

      <li>You don't mention your target processor, but one question to

        ask is the cost model for a predicated load reasonable?  If not,

        would reducing it to match the actual target cost fix your

        problem?  In particular, we have *multiple* memory accesses with

        the *same* mask here.  Does accounting for that difference in

        lowering cost side step the issue?  <br>

      </li>

    </ul>

    <p>Your in-passing mention that -O3 and unrolling breaks

      vectorization also concerns me.  It really shouldn't.  That sounds

      like a probably issue in the SLP vectorizer, and maybe a pass

      order issue.  <br>

    </p>

    <p>All of the above is simply me brainstorming.  :)</p>

    <p>Philip<br>

    </p>

    <div class="moz-cite-prefix">On 9/14/21 7:19 AM, Momchil Velikov via

      llvm-dev wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:AS8PR08MB594391C09044E3578342FD05E0DA9@AS8PR08MB5943.eurprd08.prod.outlook.com">

      <pre class="moz-quote-pre" wrap="">Looking at a particularly hot function in the SPEC/x264, that LLVM fails to

vectorise.

typedef short int16_t;

typedef unsigned short uint16_t;

int q(int16_t d[], uint16_t m[], uint16_t b[]) {

  int n = 0;

  for (int i = 0; i < 16; i++) {

    if (d[i] > 0)

      d[i] = (b[i] + d[i]) * m[i] >> 16;

    else

      d[i] = -((b[i] - d[i]) * m[i] >> 16);

    n |= d[i];

  }

  return n;

}

As it turns out, LLVM adds runtime alias checks for this function and then it

considers it legal to vectorise with if-conversion.  However, the vectorisation

cost model effectively bans that particular if-conversion by assigning a

ridiculous cost to emulated masked loads and stores.

Originally, each branch of the if statement in the loop body contains three

identical loads.  Manually hoisting these allows the loop to be vectorised at

`-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation).

There's a subroutine in `SimplifyCFG` that does a rudimentary hoisting of

instructions from the two successors of a block, which subroutine does indeed

hoist two of the three loads, but gives up before hoisting the third one.

We'd need a way to make LLVM hoist all three of the loads by itself. `GVNHoist`

can do that, but that pass is disabled by default and has been disabled for a

long, long time.

As an alternative, I was thinking of a simpler hoisting transformation, that

just handles moving instructions from two single-predecessor blocks to their

common predecessor. That could be made reasonably fast, by pairing instructions

by their GVN value number. Limiting hoisting to a predecessor block (for the

most part) would also avoid excessive increase of lifetimes (for the majority of

the case) and would also simplify correctness checks.

I've written such a transformation as a subroutine to `GVN`, it seemed like a

good place for it and is an a similar spirit as various PREs the GVN does. The

Phabricator review is at <a class="moz-txt-link-freetext" href="https://reviews.llvm.org/D109760">https://reviews.llvm.org/D109760</a>.

Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):

500.perlbench_r    1.13%

502.gcc_r          0.00%

505.mcf_r         -1.89%

520.omnetpp_r      0.00%

523.xalancbmk_r    0.00%

525.x264_r         7.67%

531.deepsjeng_r    0.60%

541.leela_r        0.24%

548.exchange2_r    0.00%

557.xz_r           0.75%

Comments?

~chill

_______________________________________________

LLVM Developers mailing list

<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>

<a class="moz-txt-link-freetext" href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>

</pre>

    </blockquote>

  </body>

</html>