[llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Ralf Karrenberg karrenberg at cdl.uni-saarland.de
Fri Oct 28 05:30:02 PDT 2011


Hi Hal,

Those numbers look very promising. Great work! :)

Best,
Ralf

----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com>
> Cc: llvm-commits at cs.uiuc.edu
> Sent: Freitag, 28. Oktober 2011 13:50:00
> Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
> 
> Bruno, et al.,
> 
> I've attached a new version of the patch that contains improvements
> and a critical bug fix compared to the previous versions I've posted
> (the generated code is no more correct than before; rather, the pass
> in the older patch would crash in certain cases and now does not).
> 
> First, these are preliminary results, because I did not take the
> steps necessary to make them rigorous (explicitly quieting the
> machine, binding the processes to one cpu, etc.). But they should be
> good enough for discussion.
> 
> I'm using LLVM head r143101, with the attached patch applied, and
> clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> For the gcc comparison, I'm using gcc 4.4.3-4ubuntu5 (the Ubuntu
> build). gcc was run with -O3 and no other optimization flags. opt
> was run with -vectorize -unroll-allow-partial -O3 and no other
> optimization flags (the patch adds the -vectorize option). llc was
> just given -O3.
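[Editor's note: for concreteness, the pipeline described above might look
like the following. Only the opt and llc flags come from the message; the
clang invocation and file names are illustrative assumptions, and the
-vectorize option exists only with the patch applied.]

```shell
# Hypothetical invocation of the pipeline described above.
clang -O3 -emit-llvm -c test.c -o test.bc   # compile to LLVM bitcode (assumed flags)
opt -vectorize -unroll-allow-partial -O3 \
    test.bc -o test.opt.bc                  # run -O3 plus the patch's new pass
llc -O3 test.opt.bc -o test.s               # generate native assembly
```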
> 
> It is not difficult to construct an example in which vectorization
> would be useful: take a loop that does more computation than
> loads/stores, and (partially) unroll it. Here is a simple case:
> 
> #define ITER 5000
> #define NUM 200
> double a[NUM][NUM];
> double b[NUM][NUM];
> 
> ...
> 
> int main()
> {
> ...
> 
>   for (int i = 0; i < ITER; ++i) {
>     for (int x = 0; x < NUM; ++x)
>       for (int y = 0; y < NUM; ++y) {
>       double v = a[x][y], w = b[x][y];
>       double z1 = v*w;
>       double z2 = v+w;
>       double z3 = z1*z2;
>       double z4 = z3+v;
>       double z5 = z2+w;
>       double z6 = z4*z5;
>       double z7 = z4+z5;
>       a[x][y] = v*v-z6;
>       b[x][y] = w-z7;
>     }
>   }
> 
>  ...
> 
>   return 0;
> }
> 
> Results:
> gcc -O3: 0m1.790s
> llvm -vectorize: 0m2.360s
> llvm: 0m2.780s
> gcc -fno-tree-vectorize: 0m2.810s
> (these are the user times, after enough runs for the values to
> settle to three decimal places)
> 
> So the vectorization gives a ~15% improvement in the running time.
> gcc's vectorization still does a much better job, however (yielding
> a ~36% improvement). So there is still work to do ;)
> 
> Additionally, I've checked the autovectorization on some classic
> numerical benchmarks from netlib. On these benchmarks, clang/llvm
> already do a good job compared to gcc (gcc is only about 10% better,
> and this is true regardless of whether gcc's vectorization is on or
> off). For these benchmarks, autovectorization provides an
> insignificant speedup in most cases (it does not tend to make things
> worse, just not really any better either). Because gcc's
> vectorization also did not really help gcc in these cases, I'm not
> surprised. A good collection of these benchmarks is available here:
> http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz
> 
> I've yet to run the test suite using the pass to validate it; that
> is something that I plan to do. Actually, the "Livermore Loops" test
> in the aforementioned archive contains checksums to validate the
> results, and it looks like 1 or 2 of the loop results are wrong with
> vectorization turned on, so I'll have to investigate that.
> 
>  -Hal
> 
> On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> > Hi Hal,
> > 
> > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov>
> > wrote:
> > > I've attached an initial version of a basic-block
> > > autovectorization pass. It works by searching a basic block for
> > > pairable (independent) instructions and, using a chain-seeking
> > > heuristic, selecting pairings likely to provide an overall
> > > speedup (if such pairings can be found). The selected pairs are
> > > then fused and, if necessary, other instructions are moved in
> > > order to maintain data-flow consistency. This works only within
> > > one basic block, but it can do loop vectorization in combination
> > > with (partial) unrolling. The basic idea was inspired by the
> > > Vienna MAP Vectorizer, which has been used to vectorize FFT
> > > kernels, but the algorithm used here is different.
> > >
> > > To try it, use -bb-vectorize with opt. There are a few options:
> > > -bb-vectorize-req-chain-depth (default: 3) -- The depth of the
> > > chain of instruction pairs required in order to consider the
> > > pairs that compose the chain worthy of vectorization.
> > > -bb-vectorize-vector-bits (default: 128) -- The size of the
> > > target vector registers.
> > > -bb-vectorize-no-ints -- Don't consider integer instructions.
> > > -bb-vectorize-no-floats -- Don't consider floating-point
> > > instructions.
> > >
> > > The vectorizer generates a lot of insert_element/extract_element
> > > pairs; the assumption is that other passes will turn these into
> > > shuffles when possible (it looks like some work is necessary
> > > here). It will also vectorize vector instructions, generating
> > > shuffles in that case (again, other passes should combine these
> > > as appropriate).
> > >
> > > Currently, it does not fuse load or store instructions, but that
> > > is a feature that I'd like to add. Of course, alignment
> > > information is an issue for load/store vectorization (or maybe I
> > > should just fuse them anyway and let isel deal with unaligned
> > > cases?).
> > >
> > > Also, support needs to be added for fusing known intrinsics
> > > (fma, etc.), and, as has been discussed on llvmdev, we should
> > > add some intrinsics to allow the generation of addsub-type
> > > instructions.
> > >
> > > I've included a few tests, but it needs more. Please review
> > > (I'll commit if and when everyone is happy).
> > >
> > > Thanks in advance,
> > > Hal
> > >
> > > P.S. There is another option (not so useful right now, but it
> > > could be): -bb-vectorize-fast-dep -- Don't do a full
> > > inter-instruction dependency analysis; instead, stop looking for
> > > instruction pairs after the first use of an instruction's value.
> > > [This makes the pass faster, but it would require a
> > > data-dependence-based reordering pass in order to be effective.]
> > 
> > Cool! :)
> > Have you run this pass on any benchmarks or the llvm test-suite?
> > Does it present any regressions?
> > Do you have any performance results?
> > Cheers,
> > 
> 
> --
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 


