[LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?

Tue Oct 14 09:00:28 PDT 2014

----- Original Message -----
> From: "Arnold Schwaighofer" <aschwaighofer at apple.com>
> To: "Chandler Carruth" <chandlerc at gmail.com>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>, "James Molloy" <james at jamesmolloy.co.uk>, "Zinovy Nis"
> <zinovy.nis at gmail.com>, "Andy Trick" <atrick at apple.com>, "Hal Finkel" <hfinkel at anl.gov>, "Gerolf Hoflehner"
> <ghoflehner at apple.com>
> Sent: Tuesday, October 14, 2014 10:53:49 AM
> Subject: Re: RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
> 
> 
> > On Oct 13, 2014, at 5:56 PM, Chandler Carruth <chandlerc at gmail.com>
> > wrote:
> > 
> > I've added a straw-man of some extra optimization passes that help
> > specific benchmarks here or there by either preparing code better
> > on the way into the vectorizer or cleaning up afterward. These are
> > off by default until there is some consensus on the right path
> > forward, but this way we can all test out the same set of flags,
> > and collaborate on any tweaks to them.
> > 
> > The primary principle here is that the vectorizer expects the IR
> > input to be in a certain canonical form, and produces IR output
> > that may not yet be in that form. The primary alternative to this
> > is to make the vectorizers both extra powerful (able to recognize
> > many variations on things like loop structure) and extra cautious
> > about their emitted code (so that it is always already optimized).
> > I much prefer the solution of using passes rather than this unless
> > compile time is hurt too drastically. It makes it much easier to
> > test, validate, and compose all of the various components of the
> > core optimizer.
> > 
> > Here is the structural diff:
> > 
> > + loop-rotate
> >   loop-vectorize
> > + early-cse
> > + correlated-propagation
> > + instcombine
> > + licm
> > + loop-unswitch
> > + simplifycfg
> > + instcombine
> >   slp-vectorize
> > + early-cse
> > 
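The structural diff above can be read as one switch that wraps the two vectorizers in extra preparation and cleanup passes. The following is a minimal sketch of that ordering only, not LLVM's pass-manager code; the pass names are copied from the diff, while the function name `vectorizer_pipeline` and the `extra_passes` parameter are hypothetical and chosen just for illustration.

```python
def vectorizer_pipeline(extra_passes):
    """Return the pass order around the vectorizers as a list of names.

    Hypothetical helper: it models the ordering in the diff and is not
    an LLVM API.
    """
    pipeline = []
    if extra_passes:
        # Re-rotate loops so the loop vectorizer sees the form it expects.
        pipeline.append("loop-rotate")
    pipeline.append("loop-vectorize")
    if extra_passes:
        # Scalar cleanup aimed at the vectorizer's output.
        pipeline += [
            "early-cse",
            "correlated-propagation",
            "instcombine",
            "licm",
            "loop-unswitch",
            "simplifycfg",
            "instcombine",
        ]
    pipeline.append("slp-vectorize")
    if extra_passes:
        pipeline.append("early-cse")
    return pipeline


if __name__ == "__main__":
    for name in vectorizer_pipeline(extra_passes=True):
        print(name)
```

With the switch off, only the two vectorizers remain, which matches the unmarked lines of the diff.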
> 
> I think a late loop optimization (vectorization) pipeline makes
> sense. I think we just have to carefully weigh the benefit against
> the compile-time cost.
> 
> Running loop rotation makes sense. Critical edge splitting can
> transform loops into a form that prevents loop vectorization.
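As a rough illustration of what "rotated" means here, the sketch below writes the same loop twice by hand: once with the exit test at the top, and once in rotated form, where a guard checks the condition a single time and the loop then tests at the bottom. This is a source-level analogy in Python, not the IR that loop-rotate actually produces, and both function names are hypothetical.

```python
def sum_top_tested(xs):
    """Exit test at the top of the loop (unrotated form)."""
    i = 0
    total = 0
    while i < len(xs):
        total += xs[i]
        i += 1
    return total


def sum_rotated(xs):
    """Guard plus bottom-tested loop (rotated form)."""
    i = 0
    total = 0
    if i < len(xs):  # guard: run the body only if there is a first iteration
        while True:
            total += xs[i]
            i += 1
            if not (i < len(xs)):  # exit test now sits at the bottom
                break
    return total


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(sum_top_tested(data), sum_rotated(data))
```

Both versions compute the same result; the difference is only where the exit test sits relative to the loop body.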
> 
> Both the loop vectorizer and the SLPVectorizer perform limited
> (restricted in region) forms of CSE to clean up. EarlyCSE runs across
> the whole function and so might catch more opportunities.

In my experience, running a late EarlyCSE produces generic speedups across the board.

 -Hal

> 
> The downside of always running passes is that we pay the cost
> irrespective of benefit. There might not be much to clean up if we
> don’t vectorize a loop, but we still have to pay for running the
> cleanup passes. This has been the motivation for “pass local” CSE,
> but it also stems from a time when we ran within the inlining
> pass manager, which meant running over and over again.
> 
> I think we will just have to look at compile time and decide what
> makes sense.
> 
> 
> > The rationale I have for this:
> > 
> > 1) Zinovy pointed out that the loop vectorizer really needs the
> > input loops to still be rotated. One counterpoint is that perhaps
> > we should prevent any pass from un-rotating loops?
> > 
> > 2) I cherry-picked the core of the scalar optimization pipeline that
> > seems like it would be relevant to code which looks like runtime
> > checks. Things like correlated values for overlap predicates, loop
> > invariant code, or predicates that can be unswitched out of loops.
> > Then I added the canonicalizing passes that might be relevant
> > given those passes.
> > 
> > 3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it
> > isn't relevant for SLP vectorize, no idea. I did say this was a
> > straw man. =D
> > 
> > 
> > My benchmarking has shown some modest improvements to benchmarks,
> > but nothing huge. However, it shows only a 2% slowdown for
> > building the 'opt' binary, which I'm actually happy with so that
> > we can work to improve the loop vectorizer's overhead *knowing*
> > that these passes will clean up stuff. Thoughts? I'm currently OK
> > with this, but it's pretty borderline so I just wanted to start
> > the discussion and see what other folks observe in their
> > benchmarking.
> > 
> > -Chandler
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory