[LLVMdev] Vectorization: Next Steps

Mon Feb 13 15:30:10 PST 2012

On Wed, 2012-02-08 at 17:26 -0800, Chris Lattner wrote:
> On Feb 7, 2012, at 12:10 PM, Hal Finkel wrote:
> >>> 1. "Target Data" for vectorization - I think that in order to improve
> >>> the vectorization quality, the vectorizer will need more information
> >>> about the target. This information could be provided in the form of a
> >>> kind of extended target data. This extended target data might contain:
> >>> - What basic types can be vectorized, and how many of them will fit
> >>> into (the largest) vector registers
> >>> - What classes of operations can be vectorized (division, conversions /
> >>> sign extension, etc. are not always supported)
> >>> - What alignment is necessary for loads and stores
> >>> - Is scalar-to-vector free?
> >> 
> >> I think that this will be a really important API, but I strongly advocate that you model this after TargetLoweringInfo instead of TargetData.  First, TargetData isn't actually a target API (it should be fixed, I filed PR11936 to track this).  Second, targets will have to implement imperative code to return precise answers to questions.  For example, you'll want something like "what is the cost of a shuffle with this mask" which will be extremely target specific, will depend on what CPU subfeatures are enabled, etc.
> > 
> > This makes sense. What do you think will be the best way of
> > synchronizing things like CPU subfeatures between this API and the
> > backend target libraries? They could be linked directly, although I
> > don't know if we want to do that. tablegen could extract a bunch of this
> > information into separate objects that get linked into opt.
> 
> The best model we have at the moment is TargetLoweringInfo, as used by LoopStrengthReduction.  The details of this interface aren't a great example to follow for a few reasons (e.g. it has SelectionDAG-specific stuff in it, which is a layering violation), but the idea is sound.  This does mean that running "opt -vectorize foo.bc" would not get the same optimization as running clang with the target you want enabled.  We already have this problem with -loop-reduce, though.
> 

LoopStrengthReduction is currently created in
TargetPassConfig::addIRPasses (CodeGen/Passes.cpp). Currently the
vectorization pass is created in
PassManagerBuilder::populateModulePassManager (which is used by opt).
Are you suggesting that I move the vectorization pass creation into
CodeGen? Or are you saying that TLI will sometimes be available to the
pass, as it is now, when called from a full-compilation driver (like
clang)? Or are you suggesting that I propose some object like TLI that
might be available in 'opt' even though TLI itself is not available
there?
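
To make the third option concrete, here is a minimal sketch of what
such an object might look like. All of the names in it
(VectorTargetInfo, getMaxVectorElements, getShuffleCost,
Toy128BitTarget, chooseVectorWidth) are hypothetical and are not
existing LLVM APIs; the only point is that the pass would query an
imperative interface and fall back to a generic default when no
target object is available (as might be the case in 'opt').

```cpp
#include <cstddef>
#include <vector>

// Hypothetical interface; none of these names exist in LLVM.
class VectorTargetInfo {
public:
  virtual ~VectorTargetInfo() = default;

  // How many elements of the given bit width fit into the largest
  // vector register (0 or 1 means "not vectorizable").
  virtual unsigned getMaxVectorElements(unsigned ElemBits) const = 0;

  // Relative cost of a shuffle with the given mask. A real target would
  // answer this imperatively, depending on enabled CPU subfeatures.
  virtual unsigned getShuffleCost(const std::vector<int> &Mask) const = 0;
};

// Toy target with 128-bit vector registers, for illustration only.
class Toy128BitTarget : public VectorTargetInfo {
public:
  unsigned getMaxVectorElements(unsigned ElemBits) const override {
    return ElemBits == 0 ? 0 : 128 / ElemBits;
  }

  unsigned getShuffleCost(const std::vector<int> &Mask) const override {
    // Assume the identity mask is free and anything else costs one unit.
    for (std::size_t I = 0; I < Mask.size(); ++I)
      if (Mask[I] != static_cast<int>(I))
        return 1;
    return 0;
  }
};

// Pass-side query. VTI may be null when no target object is available,
// in which case an assumed conservative generic width is used.
unsigned chooseVectorWidth(const VectorTargetInfo *VTI, unsigned ElemBits) {
  const unsigned GenericWidth = 2;
  if (!VTI)
    return GenericWidth;
  unsigned N = VTI->getMaxVectorElements(ElemBits);
  return N > 1 ? N : 1;
}
```

With something like this, the pass would behave the same way in both
drivers and differ only in the quality of the answers it gets.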

Thanks again,
Hal

> >> I think that a loop vectorizer and a basic block vectorizer both make perfect sense and are important for different classes of code.  However, I don't think that we should go down the path of trying to make a "basic block vectorizer + loop unrolling" serve the purpose of a loop vectorizer.  Trying to make a BBVectorizer and a loop unroller play together will be really fragile, because they'll both have to duplicate the same metrics (otherwise, for example, you'd unroll a loop that isn't vectorizable).  This will also be a huge hit to compile time.
> > 
> > The only problem with this comes from loops for which unrolling is
> > necessary to expose vectorization because the memory access pattern is
> > too complicated to model in more-traditional loop vectorization. This
> > generally is useful only in cases with a large number of flops per
> > memory operation (or maybe integer ops too, but I have less experience
> > with those), so maybe we can design a useful heuristic to handle those
> > cases. That having been said, unroll+(failed vectorize)+rollback is not
> > really any more expensive at compile time than unroll+(failed vectorize)
> > except that the resulting code would run faster (actually it is cheaper
> > to compile because the optimization/compilation of the unvectorized
> > unrolled loop code takes longer than the non-unrolled loop). There might
> > be a clean way of doing this; I'll think about it.
> 
> I don't really understand the issue here; can you elaborate on when this might be a win?  I really don't like "speculatively unroll, try to do something, then reroll".  That is terrible for compile time and just strikes me as poor design :-)
> 
> -Chris
> 
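
As a rough illustration of the kind of heuristic mentioned above
(unroll for basic-block vectorization only when there are many flops
per memory operation), a sketch might look like the following. The
struct, the function name, and the default threshold of 4.0 are all
assumptions made up for this example; nothing here has been tuned or
measured.

```cpp
// Hypothetical per-iteration operation counts for a loop body.
struct LoopBodyCounts {
  unsigned FloatOps; // floating-point arithmetic operations
  unsigned MemOps;   // loads and stores
};

// Returns true when the loop body has at least MinOpsPerMemOp flops per
// memory operation. The default threshold is an arbitrary placeholder.
bool worthUnrollingForBBVectorize(const LoopBodyCounts &C,
                                  double MinOpsPerMemOp = 4.0) {
  if (C.FloatOps == 0)
    return false; // nothing arithmetic to vectorize
  if (C.MemOps == 0)
    return true;  // compute-only body: the ratio is unbounded
  return static_cast<double>(C.FloatOps) / C.MemOps >= MinOpsPerMemOp;
}
```

A check like this would run before unrolling, so loops that fail it
are never unrolled speculatively in the first place.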

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory