[LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?

Mon Oct 13 17:56:24 PDT 2014

I've added a straw-man set of extra optimization passes that help specific
benchmarks here and there, either by preparing code better on the way into
the vectorizer or by cleaning up afterward. These are off by default until
there is some consensus on the right path forward, but this way we can all
test out the same set of flags, and collaborate on any tweaks to them.

The primary principle here is that the vectorizer expects the IR input to
be in a certain canonical form, and produces IR output that may not yet be
in that form. The primary alternative to this is to make the vectorizers
both extra powerful (able to recognize many variations on things like loop
structure) and extra cautious about their emitted code (so that it is
always already optimized). I much prefer running extra passes over that
alternative, unless compile time is hurt too drastically. Separate passes
make it much easier to test, validate, and compose all of the various
components of the core optimizer.

Here is the structural diff (lines marked '+' are the added passes):

+ loop-rotate
  loop-vectorize
+ early-cse
+ correlated-propagation
+ instcombine
+ licm
+ loop-unswitch
+ simplifycfg
+ instcombine
  slp-vectorize
+ early-cse

The rationale I have for this:

1) Zinovy pointed out that the loop vectorizer really needs the input loops
to still be rotated. One counterpoint is that perhaps we should prevent
any pass from un-rotating loops?

2) I cherry-picked the core of the scalar optimization pipeline that seems
like it would be relevant to code which looks like runtime checks. Things
like correlated values for overlap predicates, loop invariant code, or
predicates that can be unswitched out of loops. Then I added the
canonicalizing passes that might be relevant given those passes.

3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it isn't
relevant for SLP vectorize, no idea. I did say this was a straw man. =D
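
To make points 1 and 2 concrete, here is a rough source-level sketch in C++.
It is only an illustration: the vectorizer works on IR, not on source, and
every function name below is hypothetical rather than anything in LLVM. The
first two functions show the same loop in unrotated and rotated form. The
last one models, under my own simplifying assumptions, the shape of code a
loop vectorizer tends to produce: a runtime overlap check guarding a fast
path, with the original scalar loop as the fallback. A loop-invariant check
of that kind is what the cleanup passes are meant to simplify or hoist.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Unrotated ("while") form: the exit test sits at the top of the loop.
void add_unrotated(float *dst, const float *src, std::size_t n) {
  std::size_t i = 0;
  while (i < n) {
    dst[i] += src[i];
    ++i;
  }
}

// Rotated form: one guard up front, then a do-while whose exit test
// sits at the bottom of the loop body.
void add_rotated(float *dst, const float *src, std::size_t n) {
  if (n == 0)
    return;
  std::size_t i = 0;
  do {
    dst[i] += src[i];
    ++i;
  } while (i < n);
}

// Hypothetical overlap predicate: true when the two n-element ranges
// share no bytes. It depends only on loop-invariant values.
bool ranges_disjoint(const float *a, const float *b, std::size_t n) {
  auto lo_a = reinterpret_cast<std::uintptr_t>(a);
  auto lo_b = reinterpret_cast<std::uintptr_t>(b);
  std::uintptr_t bytes = n * sizeof(float);
  return lo_a + bytes <= lo_b || lo_b + bytes <= lo_a;
}

// Hand-written model of a versioned loop. The "wide" path handles four
// elements per step (standing in for a vector body) plus a scalar
// remainder; the fallback is the plain rotated loop.
void add_versioned(float *dst, const float *src, std::size_t n) {
  if (ranges_disjoint(dst, src, n)) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
      for (std::size_t j = 0; j < 4; ++j)
        dst[i + j] += src[i + j];
    for (; i < n; ++i)
      dst[i] += src[i];
  } else {
    add_rotated(dst, src, n);
  }
}

// Small self-check: all three forms must agree on disjoint inputs.
void check_forms_agree() {
  std::vector<float> src{1, 1, 1, 1, 1, 1, 1};
  std::vector<float> a{1, 2, 3, 4, 5, 6, 7}, b = a, c = a;
  add_unrotated(a.data(), src.data(), a.size());
  add_rotated(b.data(), src.data(), b.size());
  add_versioned(c.data(), src.data(), c.size());
  assert(a == b);
  assert(b == c);
}
```

When the two ranges overlap, add_versioned takes the scalar fallback, so it
keeps the original loop's element-by-element semantics.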

My benchmarking shows modest improvements on some benchmarks, but nothing
huge. However, it also shows only a 2% slowdown for building the 'opt'
binary, which I'm actually happy with: it means we can work to reduce the
loop vectorizer's overhead *knowing* that these passes will clean up after it.
Thoughts? I'm currently OK with this, but it's pretty borderline so I just
wanted to start the discussion and see what other folks observe in their
benchmarking.

-Chandler