[LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?

Tue Oct 14 10:28:40 PDT 2014

----- Original Message -----
> From: "Andrew Trick" <atrick at apple.com>
> To: "Arnold Schwaighofer" <aschwaighofer at apple.com>
> Cc: "Chandler Carruth" <chandlerc at gmail.com>, "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>, "James Molloy"
> <james at jamesmolloy.co.uk>, "Zinovy Nis" <zinovy.nis at gmail.com>, "Hal Finkel" <hfinkel at anl.gov>, "Gerolf Hoflehner"
> <ghoflehner at apple.com>
> Sent: Tuesday, October 14, 2014 12:11:43 PM
> Subject: Re: RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
> 
> 
> > On Oct 14, 2014, at 8:53 AM, Arnold Schwaighofer
> > <aschwaighofer at apple.com> wrote:
> > 
> > 
> >> On Oct 13, 2014, at 5:56 PM, Chandler Carruth
> >> <chandlerc at gmail.com> wrote:
> >> 
> >> I've added a straw-man of some extra optimization passes that help
> >> specific benchmarks here or there by either preparing code better
> >> on the way into the vectorizer or cleaning up afterward. These
> >> are off by default until there is some consensus on the right
> >> path forward, but this way we can all test out the same set of
> >> flags, and collaborate on any tweaks to them.
> >> 
> >> The primary principle here is that the vectorizer expects the IR
> >> input to be in a certain canonical form, and produces IR output
> >> that may not yet be in that form. The primary alternative to this
> >> is to make the vectorizers both extra powerful (able to recognize
> >> many variations on things like loop structure) and extra cautious
> >> about their emitted code (so that it is always already
> >> optimized). I much prefer the solution of using passes rather
> >> than this unless compile time is hurt too drastically. It makes
> >> it much easier to test, validate, and compose all of the various
> >> components of the core optimizer.
> >> 
> >> Here is the structural diff:
> >> 
> >> + loop-rotate
> >>  loop-vectorize
> >> + early-cse
> >> + correlated-propagation
> >> + instcombine
> >> + licm
> >> + loop-unswitch
> >> + simplifycfg
> >> + instcombine
> >>  slp-vectorize
> >> + early-cse
> >> 
> > 
> > I think a late loop optimization (vectorization) pipeline makes
> > sense. I think we just have to carefully evaluate benefit over
> > compile time.
> > 
> > Running loop rotation makes sense. Critical edge splitting can
> > transform loops into a form that prevents loop vectorization.
> > 
> > Both the loop vectorizer and the SLPVectorizer perform limited
> > (restricted in region) forms of CSE to clean up. EarlyCSE runs
> > across the whole function and so might catch more opportunities.
> > 
> > The downside of always running passes is that we pay the cost
> > irrespective of benefit. There might not be much to clean up if we
> > don’t vectorize a loop but we still have to pay for running the
> > cleanup passes. This has been the motivator to have “pass local”
> > CSE but this also stems from a time when we ran within the
> > inlining pass manager which meant running over and over again.
> > 
> > I think we will just have to look at compile time and decide what
> > makes sense.
> 
> It’s great that we’re running the vectorizers late, outside CGSCC.
> Regarding the set of passes that we rerun, I completely agree with
> Arnold. Naturally, iterating over the pass pipeline produces
> speedups, and I understand the engineering advantage. But rerunning
> several expensive function passes on the slim chance that a loop was
> transformed is an awful design for compile time.
> 
> >> + loop-rotate
> 
> I have no concern about loop-rotate. It should be very fast.
> 
> >>  loop-vectorize
> >> + early-cse
> 
> Passes like loop-vectorize should be able to do their own CSE without
> much engineering effort.
> 
> >> + correlated-propagation
> 
> A little worried about this.
> 
> >> + instcombine
> 
> I'm *very* concerned about rerunning instcombine,

Why? I understand that it is not cheap (especially because it calls into ValueTracking a lot), but how expensive is it when it has nothing to do?

> but understand it
> may help clean up the vectorized preheader.
> 
> >> + licm
> >> + loop-unswitch
> 
> These should be limited to the relevant loop nest.
> 
> >> + simplifycfg
> 
> OK if the CFG actually changed.
> 
> >> + instcombine
> 
> instcombine again! This can’t be good.
> 
> >>  slp-vectorize
> >> + early-cse
> 
> SLP should do its own CSE.

I'm not sure how much of this is reasonable. Obviously, it can do its own CSE within each vectorization tree. But across trees (where multiple independent parts of the function are vectorized), finding and reusing gather sequences, etc. is a general CSE problem, and I'm not sure how much of that we want to replicate in the SLP vectorizer.
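
To illustrate the within-tree versus across-tree distinction, here is a minimal sketch in Python. It is a toy model, not SLP vectorizer or EarlyCSE code: the instruction tuples, the opcode names (gather, vadd, vmul) and the run_cse helper are all hypothetical, and it assumes straight-line, side-effect-free code so that dominance can be ignored.

```python
from typing import Dict, List, Tuple

# Toy instruction: (result name, opcode, operand names). Hypothetical format.
Inst = Tuple[str, str, Tuple[str, ...]]


def run_cse(regions: List[List[Inst]], per_region: bool) -> List[List[Inst]]:
    """Remove recomputations of an already-available expression.

    With per_region=True the table of available expressions is reset at
    each region boundary (a stand-in for "CSE within one vectorization
    tree"). With per_region=False one table covers the whole function.
    Assumes straight-line code and pure instructions.
    """
    available: Dict[Tuple[str, Tuple[str, ...]], str] = {}
    renamed: Dict[str, str] = {}  # removed result -> surviving result
    out: List[List[Inst]] = []
    for region in regions:
        if per_region:
            available = {}
        kept: List[Inst] = []
        for dest, op, operands in region:
            # Rewrite operands that referred to a removed instruction.
            operands = tuple(renamed.get(o, o) for o in operands)
            key = (op, operands)
            if key in available:
                renamed[dest] = available[key]
            else:
                available[key] = dest
                kept.append((dest, op, operands))
        out.append(kept)
    return out


def count(regions: List[List[Inst]]) -> int:
    return sum(len(r) for r in regions)


# Two independently vectorized trees that build the same gather sequence.
trees = [
    [("g0", "gather", ("a", "b")),
     ("g0dup", "gather", ("a", "b")),
     ("v0", "vadd", ("g0", "g0dup"))],
    [("g1", "gather", ("a", "b")),
     ("v1", "vmul", ("g1", "y"))],
]

local = run_cse(trees, per_region=True)    # keeps the gather in tree 2
whole = run_cse(trees, per_region=False)   # reuses g0 in tree 2
```

In this toy input the per-tree variant removes only the duplicate inside the first tree, while the function-wide variant also lets the second tree reuse g0. That second case is the part I would rather leave to a general CSE pass.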

When I switched my internal builds from using the BBVectorizer by default to using the SLP vectorizer by default, I saw a number of performance regressions (mostly not from the vectorization, but from the lack of the 'cleanup' passes, EarlyCSE and InstCombine, that were generally being run afterward). My general impression is that running these passes late in the pipeline brings general benefits.

> 
> —
> 
> I think it’s generally useful to have an “extreme” level of
> optimization without much regard for compile time, and in that
> scenario this pipeline makes sense. But this is hardly something
> that should happen at -O2/-Os, unless real data shows otherwise.

Doing all this only at >= -O3 does not seem unreasonable to me.

> 
> If the pass manager were designed to conditionally invoke late passes
> triggered by certain transformation passes, that would solve my
> immediate concern.
> 
> Long term, I think a much better design is for function
> transformations to be conditionally rerun within a scope/region. For
> example, loop-vectorize should be able to trigger instcombine on the
> loop preheader, which I think is the real problem here.

As Chandler might recall ;) -- I've made several requests that the new pass manager design specifically support this.
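
As a rough illustration of the "conditionally invoke late passes" idea, here is a minimal sketch in Python. It is a toy model only and does not reflect the API of any existing or planned LLVM pass manager: run_conditional, toy_loop_vectorize, toy_instcombine and the dictionary used as a "function" are all hypothetical names.

```python
from typing import Callable, Dict, List

Function = Dict[str, bool]           # toy stand-in for a function's IR state
Pass = Callable[[Function], bool]    # a pass returns True if it changed fn


def run_conditional(fn: Function, trigger: Pass,
                    cleanups: List[Pass], log: List[str]) -> bool:
    """Run `trigger`; run the cleanup passes only if it changed `fn`."""
    log.append(trigger.__name__)
    if not trigger(fn):
        return False                 # nothing transformed, no cleanup cost
    for cleanup in cleanups:
        log.append(cleanup.__name__)
        cleanup(fn)
    return True


def toy_loop_vectorize(fn: Function) -> bool:
    # Hypothetical: "vectorizes" once, leaving code that wants cleanup.
    if fn.get("has_vectorizable_loop", False):
        fn["has_vectorizable_loop"] = False
        fn["needs_cleanup"] = True
        return True
    return False


def toy_instcombine(fn: Function) -> bool:
    # Hypothetical cleanup: clears the flag set by the vectorizer.
    changed = fn.get("needs_cleanup", False)
    fn["needs_cleanup"] = False
    return changed


scalar_fn: Function = {"has_vectorizable_loop": False}
vector_fn: Function = {"has_vectorizable_loop": True}

scalar_log: List[str] = []
vector_log: List[str] = []
run_conditional(scalar_fn, toy_loop_vectorize, [toy_instcombine], scalar_log)
run_conditional(vector_fn, toy_loop_vectorize, [toy_instcombine], vector_log)
```

Here the function with nothing to vectorize never pays for the cleanup pass, while the transformed one does. Restricting the cleanup to a scope such as the loop preheader would need the trigger to report what it changed, not just whether it changed something.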

 -Hal

> 
> -Andy
> 
> >> The rationale I have for this:
> >> 
> >> 1) Zinovy pointed out that the loop vectorizer really needs the
> >> input loops to still be rotated. One counterpoint is that
> >> perhaps we should prevent any pass from un-rotating loops?
> >> 
> >> 2) I cherry-picked the core of the scalar optimization pipeline
> >> that seems like it would be relevant to code which looks like
> >> runtime checks. Things like correlated values for overlap
> >> predicates, loop invariant code, or predicates that can be
> >> unswitched out of loops. Then I added the canonicalizing passes
> >> that might be relevant given those passes.
> >> 
> >> 3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it
> >> isn't relevant for SLP vectorize, no idea. I did say this was a
> >> straw man. =D
> >> 
> >> 
> >> My benchmarking has shown some modest improvements to benchmarks,
> >> but nothing huge. However, it shows only a 2% slowdown for
> >> building the 'opt' binary, which I'm actually happy with so that
> >> we can work to improve the loop vectorizer's overhead *knowing*
> >> that these passes will clean up stuff. Thoughts? I'm currently OK
> >> with this, but it's pretty borderline so I just wanted to start
> >> the discussion and see what other folks observe in their
> >> benchmarking.
> >> 
> >> -Chandler
> > 
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory