[llvm-dev] [RFC] Simple GVN hoist

Tue Sep 14 07:19:41 PDT 2021

I've been looking at a particularly hot function in SPEC/x264 that LLVM fails
to vectorise:

typedef short int16_t;
typedef unsigned short uint16_t;

int q(int16_t d[], uint16_t m[], uint16_t b[]) {
  int n = 0;
  for (int i = 0; i < 16; i++) {
    if (d[i] > 0)
      d[i] = (b[i] + d[i]) * m[i] >> 16;
    else
      d[i] = -((b[i] - d[i]) * m[i] >> 16);
    n |= d[i];
  }
  return n;
}

As it turns out, LLVM adds runtime alias checks for this function and then it
considers it legal to vectorise with if-conversion.  However, the vectorisation
cost model effectively bans that particular if-conversion by assigning a
ridiculous cost to emulated masked loads and stores.

Originally, each branch of the if statement in the loop body contains three
identical loads.  Manually hoisting these allows the loop to be vectorised at
`-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation).
There's a subroutine in `SimplifyCFG` that does rudimentary hoisting of
instructions from the two successors of a block. It does indeed hoist two of
the three loads, but gives up before hoisting the third one.
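
For illustration, a manually hoisted version of the loop might look like the
sketch below. The name `q_hoisted` and the locals `di`, `bi` and `mi` are made
up for this example, and `<stdint.h>` stands in for the typedefs above. The
three loads are done once before the branch instead of once in each arm.

```c
#include <stdint.h>

/* Sketch only: same computation as q(), with the loads of d[i], b[i] and
   m[i] hoisted above the if. Names are hypothetical. */
int q_hoisted(int16_t d[], uint16_t m[], uint16_t b[]) {
  int n = 0;
  for (int i = 0; i < 16; i++) {
    /* The three loads that were duplicated in both arms of the if. */
    int di = d[i];
    int bi = b[i];
    int mi = m[i];
    if (di > 0)
      d[i] = (bi + di) * mi >> 16;
    else
      d[i] = -((bi - di) * mi >> 16);
    n |= d[i];
  }
  return n;
}
```

Each arm of the if is then left with only arithmetic and the store to `d[i]`.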

We'd need a way to make LLVM hoist all three of the loads by itself. `GVNHoist`
can do that, but that pass is disabled by default and has been disabled for a
long, long time.

As an alternative, I was thinking of a simpler hoisting transformation that
just handles moving instructions from two single-predecessor blocks to their
common predecessor. That could be made reasonably fast by pairing instructions
by their GVN value number. Limiting hoisting to a predecessor block (for the
most part) would also avoid excessive increase of lifetimes (in the majority of
cases) and would simplify correctness checks.
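
To make the pairing idea concrete, here is a toy model in plain C. It is not
the code from the patch and uses no LLVM APIs: `struct toy_inst` and
`pair_by_vn` are hypothetical names, an "instruction" is reduced to its value
number, and all the real legality checks (operand availability, side effects,
memory dependences) are assumed away.

```c
#include <stddef.h>

/* Hypothetical toy instruction: just a value number and a "hoisted" mark. */
struct toy_inst {
  unsigned vn;
  int hoisted;
};

/* Pair up instructions with equal value numbers, one from each of the two
   sibling blocks, and mark both as hoisted. Returns the number of pairs.
   Quadratic for simplicity; a real pass would more likely use a hash map. */
size_t pair_by_vn(struct toy_inst *a, size_t na,
                  struct toy_inst *b, size_t nb) {
  size_t pairs = 0;
  for (size_t i = 0; i < na; i++) {
    for (size_t j = 0; j < nb; j++) {
      if (!b[j].hoisted && a[i].vn == b[j].vn) {
        a[i].hoisted = 1;
        b[j].hoisted = 1;
        pairs++;
        break; /* each instruction takes part in at most one pair */
      }
    }
  }
  return pairs;
}
```

In this model, the three loads in each arm of the example loop would carry
the same three value numbers, so all three would be paired regardless of the
order in which they appear in the two blocks.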

I've written such a transformation as a subroutine of `GVN`. It seemed like a
good place for it, and it is in a similar spirit to the various PREs that GVN
does. The Phabricator review is at https://reviews.llvm.org/D109760.

Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):

500.perlbench_r    1.13%
502.gcc_r          0.00%
505.mcf_r         -1.89%
520.omnetpp_r      0.00%
523.xalancbmk_r    0.00%
525.x264_r         7.67%
531.deepsjeng_r    0.60%
541.leela_r        0.24%
548.exchange2_r    0.00%
557.xz_r           0.75%

Comments?

~chill