[llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Tue Oct 25 19:51:28 PDT 2011

On Tue, 2011-10-25 at 16:23 -0500, Hal Finkel wrote:
> I've attached an improved version of the autovectorization pass. This
> version will also vectorize loads and stores, casts, and some intrinsics
> (fma and trig. functions).
> 
> There are, correspondingly, a few new options:
> -bb-vectorize-no-casts -- Don't try to vectorize casting (conversion)
> operations
> -bb-vectorize-no-math -- Don't try to vectorize floating-point math
> intrinsics (this is just the trig. functions right now)
> -bb-vectorize-no-fma -- Don't try to vectorize the fused-multiply-add
> intrinsic
> -bb-vectorize-no-mem-ops -- Don't try to vectorize loads and stores
> -bb-vectorize-aligned-only -- Only generate aligned loads and stores
> 
> To make this really useful, there are some improvements necessary to
> InstCombine (and a few other things).

As it turns out, the situation with instruction combination is not bad;
you just need to make sure that instcombine is run after the
vectorizer. In other words, run "opt -bb-vectorize -std-compile-opts"
instead of "opt -std-compile-opts -bb-vectorize". [Is there currently a
way that a pass can request that another pass be run after it?]

 -Hal

> But the autovectorization process
> itself now seems to work well. Please review this patch; adding the
> vectorization pass itself should not affect any other code (although it
> does touch some common files to add support for the pass into opt). If
> it looks okay, please let me know, and I'll commit it.
> 
> Thanks in advance,
> Hal
> 
> On Fri, 2011-10-21 at 16:04 -0500, Hal Finkel wrote:
> > I've attached an initial version of a basic-block autovectorization
> > pass. It works by searching a basic block for pairable (independent)
> > instructions, and, using a chain-seeking heuristic, selects pairings
> > likely to provide an overall speedup (if such pairings can be found).
> > The selected pairs are then fused and, if necessary, other instructions
> > are moved in order to maintain data-flow consistency. This works only
> > within one basic block, but can do loop vectorization in combination
> > with (partial) unrolling. The basic idea was inspired by the Vienna MAP
> > Vectorizer, which has been used to vectorize FFT kernels, but the
> > algorithm used here is different.
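
A rough way to picture the search described above, as a toy Python model
rather than the code in the patch: instructions are (result, opcode,
operands) tuples, two are pairable when they share an opcode and neither
depends on the other, and a pair is kept only if it lies on a chain of
lane-wise connected pairs at least req_chain_depth long. The block
contents, the function names, and the exact chain rule are assumptions
made for illustration. The real pass must also resolve conflicting pairs
and move instructions, which this sketch ignores.

```python
from functools import lru_cache
from itertools import combinations

# Toy basic block in program order: (result, opcode, operands).
# Every name here is hypothetical.
BLOCK = [
    ("a0", "fmul", ("x0", "y0")),
    ("a1", "fmul", ("x1", "y1")),
    ("b0", "fadd", ("a0", "z0")),
    ("b1", "fadd", ("a1", "z1")),
    ("c0", "fsub", ("b0", "w0")),
    ("c1", "fsub", ("b1", "w1")),
    ("d0", "fadd", ("x0", "x1")),  # pairable, but on no chain
    ("d1", "fadd", ("y0", "y1")),
]


def transitive_deps(block):
    """Map each result to the in-block results it depends on."""
    deps = {}
    for name, _, operands in block:
        seen = set()
        for op in operands:
            if op in deps:  # operand was defined earlier in this block
                seen.add(op)
                seen |= deps[op]
        deps[name] = seen
    return deps


def candidate_pairs(block):
    """Same-opcode pairs where neither instruction depends on the other."""
    deps = transitive_deps(block)
    return [
        (a[0], b[0])
        for a, b in combinations(block, 2)
        if a[1] == b[1] and a[0] not in deps[b[0]] and b[0] not in deps[a[0]]
    ]


def select_pairs(block, req_chain_depth=3):
    """Keep pairs lying on a chain of at least req_chain_depth pairs."""
    ops = {name: operands for name, _, operands in block}
    pairs = candidate_pairs(block)

    def feeds(src, dst):
        # Lane-wise data flow: each half of dst uses the matching half of src.
        return src[0] in ops[dst[0]] and src[1] in ops[dst[1]]

    @lru_cache(maxsize=None)
    def up(pair):  # longest chain ending at this pair
        return 1 + max((up(p) for p in pairs if feeds(p, pair)), default=0)

    @lru_cache(maxsize=None)
    def down(pair):  # longest chain starting at this pair
        return 1 + max((down(p) for p in pairs if feeds(pair, p)), default=0)

    return [p for p in pairs if up(p) + down(p) - 1 >= req_chain_depth]
```

With the default depth of 3, select_pairs(BLOCK) keeps only the
fmul/fadd/fsub chain and drops the isolated (d0, d1) pair, even though
that pair is independent and has matching opcodes.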
> > 
> > To try it, use -bb-vectorize with opt. There are a few options:
> > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the chain of
> > instruction pairs necessary in order to consider the pairs that compose
> > the chain worthy of vectorization.
> > -bb-vectorize-vector-bits: default: 128 -- The size of the target vector
> > registers
> > -bb-vectorize-no-ints -- Don't consider integer instructions
> > -bb-vectorize-no-floats -- Don't consider floating-point instructions  
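
One plausible reading of -bb-vectorize-vector-bits (an assumption, since
the message does not spell out the check) is that two values may be fused
only when the combined vector still fits in a target register. The helper
below is a hypothetical Python sketch of that arithmetic; its name and
parameters do not come from the patch.

```python
def fits_in_register(elem_bits, lanes_a, lanes_b, vector_bits=128):
    """Return True if fusing two values yields a vector no wider than
    vector_bits. A scalar counts as one lane."""
    fused_bits = elem_bits * (lanes_a + lanes_b)
    return fused_bits <= vector_bits
```

Under this reading, with the default of 128 bits, two floats or two
<2 x float> values can be fused, but two <2 x double> values cannot.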
> > 
> > The vectorizer generates a lot of insertelement/extractelement pairs;
> > the assumption is that other passes will turn these into shuffles when
> > possible (it looks like some work is necessary here). It will also
> > vectorize vector instructions, and generates shuffles in this case
> > (again, other passes should combine these as appropriate).
> > 
> > Currently, it does not fuse load or store instructions, but that is a
> > feature that I'd like to add. Of course, alignment information is an
> > issue for load/store vectorization (or maybe I should just fuse them
> > anyway and let isel deal with unaligned cases?).
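
To make the alignment question concrete, here is a hypothetical Python
sketch of one way to decide whether two loads could be fused. It assumes
both loads are at known byte offsets from a base pointer whose alignment
is known, and that a fused load counts as aligned when it is naturally
aligned to its own size. The aligned_only switch mirrors the
-bb-vectorize-aligned-only option from the follow-up message. What the
patch actually checks is not shown in this thread.

```python
def can_fuse_loads(offset_a, offset_b, elem_bytes, base_align,
                   aligned_only=False):
    """Decide whether two scalar loads may become one two-element load.

    offset_a, offset_b: byte offsets from a common base pointer.
    base_align: known alignment of that base pointer, in bytes.
    """
    # The second load must start exactly where the first one ends.
    if offset_b - offset_a != elem_bytes:
        return False
    fused_bytes = 2 * elem_bytes
    # Natural alignment of the fused access (an assumed criterion).
    aligned = base_align % fused_bytes == 0 and offset_a % fused_bytes == 0
    # Unaligned fusion is allowed unless the caller forbids it.
    return aligned or not aligned_only
```

For two 4-byte loads from a 16-byte-aligned base, offsets 0 and 4 fuse in
either mode, while offsets 4 and 8 fuse only when aligned_only is off and
the unaligned access is left for instruction selection to handle.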
> > 
> > Also, support needs to be added for fusing known intrinsics (fma, etc.),
> > and, as has been discussed on llvmdev, we should add some intrinsics to
> > allow the generation of addsub-type instructions.
> > 
> > I've included a few tests, but more are needed. Please review (I'll commit
> > if and when everyone is happy).
> > 
> > Thanks in advance,
> > Hal
> > 
> > P.S. There is another option (not so useful right now, but could be):
> > -bb-vectorize-fast-dep -- Don't do a full inter-instruction dependency
> > analysis; instead stop looking for instruction pairs after the first use
> > of an instruction's value. [This makes the pass faster, but would
> > require a data-dependence-based reordering pass in order to be
> > effective].
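
The trade-off behind -bb-vectorize-fast-dep can be illustrated with a
small Python sketch. It is a toy model with hypothetical names, not the
patch's dependency analysis: the full mode tracks every value derived
from the instruction and keeps scanning, while the fast mode gives up at
the first use of the instruction's value.

```python
# Toy block in program order: (result, opcode, operands). Names are made up.
EXAMPLE = [
    ("a", "fmul", ("x", "y")),
    ("b", "fadd", ("a", "z")),  # first use of a
    ("c", "fmul", ("u", "v")),  # independent of a
    ("d", "fmul", ("b", "c")),  # depends on a through b
]


def partner_candidates(block, i, fast_dep=False):
    """Later instructions that could be paired with block[i]."""
    name, opcode, _ = block[i]
    tainted = {name}  # values that depend on block[i]
    found = []
    for other, other_op, operands in block[i + 1:]:
        if any(op in tainted for op in operands):
            if fast_dep:
                break  # fast mode: stop at the first use
            tainted.add(other)  # full mode: remember it and keep scanning
            continue
        if other_op == opcode:
            found.append(other)
    return found
```

In the full mode, a can pair with c. In the fast mode the scan stops at
b and finds nothing, which is why a reordering pass that moves
independent instructions next to each other would be needed first.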
> > 
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory