[llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Fri Oct 28 04:50:00 PDT 2011

Bruno, et al.,

I've attached a new version of the patch that contains improvements and
a critical bug fix compared to the previous versions that I've posted.
(The fix does not make the generated code any more correct; the pass in
the older patch would crash in certain cases and now does not.)

First, a caveat: these results are preliminary because I did not take
the steps necessary to make them rigorous (explicitly quieting the
machine, binding the processes to one CPU, etc.). But they should be
good enough for discussion.

I'm using LLVM head r143101, with the attached patch applied, and clang
head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
comparison, I'm using the Ubuntu 4.4.3-4ubuntu5 build. gcc was run with
-O3 and no other optimization flags. opt was run with -vectorize
-unroll-allow-partial -O3 and no other optimization flags (the patch
adds the -vectorize option). llc was just given -O3.

It is not difficult to construct an example in which vectorization would
be useful: take a loop that does more computation than loads and stores,
and (partially) unroll it. Here is a simple case:

#define ITER 5000
#define NUM 200
double a[NUM][NUM];
double b[NUM][NUM];

...

int main()
{
...

  for (int i = 0; i < ITER; ++i) {
    for (int x = 0; x < NUM; ++x)
      for (int y = 0; y < NUM; ++y) {
        double v = a[x][y], w = b[x][y];
        double z1 = v*w;
        double z2 = v+w;
        double z3 = z1*z2;
        double z4 = z3+v;
        double z5 = z2+w;
        double z6 = z4*z5;
        double z7 = z4+z5;
        a[x][y] = v*v-z6;
        b[x][y] = w-z7;
      }
  }

 ...

  return 0;
}
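
To make the pairing opportunity concrete, here is a hand-written sketch
(not output from the pass) of the inner loop body unrolled by two. The
function names kernel_rolled and kernel_unrolled2 are made up for
illustration, and the sketch assumes an even trip count and that a and b
do not overlap. After unrolling, each statement has an independent twin
operating on the neighboring element, and those twins are the kind of
pairs that could share one 128-bit (2 x double) vector operation:

```c
#include <assert.h>
#include <stddef.h>

/* One iteration of the loop body from the example above. */
static void body(double *a, double *b)
{
  double v = *a, w = *b;
  double z1 = v*w;
  double z2 = v+w;
  double z3 = z1*z2;
  double z4 = z3+v;
  double z5 = z2+w;
  double z6 = z4*z5;
  double z7 = z4+z5;
  *a = v*v-z6;
  *b = w-z7;
}

/* Rolled reference: one element per iteration. */
void kernel_rolled(double *a, double *b, size_t n)
{
  for (size_t y = 0; y < n; ++y)
    body(&a[y], &b[y]);
}

/* Hypothetical hand-unrolled version (factor 2). Assumes n is even and
 * that a and b do not overlap. Each line holds two independent
 * operations of the same kind, one per element. */
void kernel_unrolled2(double *a, double *b, size_t n)
{
  assert(n % 2 == 0);
  for (size_t y = 0; y < n; y += 2) {
    double v0 = a[y],        w0 = b[y];
    double v1 = a[y + 1],    w1 = b[y + 1];
    double z1_0 = v0*w0,     z1_1 = v1*w1;
    double z2_0 = v0+w0,     z2_1 = v1+w1;
    double z3_0 = z1_0*z2_0, z3_1 = z1_1*z2_1;
    double z4_0 = z3_0+v0,   z4_1 = z3_1+v1;
    double z5_0 = z2_0+w0,   z5_1 = z2_1+w1;
    double z6_0 = z4_0*z5_0, z6_1 = z4_1*z5_1;
    double z7_0 = z4_0+z5_0, z7_1 = z4_1+z5_1;
    a[y]     = v0*v0-z6_0;
    a[y + 1] = v1*v1-z6_1;
    b[y]     = w0-z7_0;
    b[y + 1] = w1-z7_1;
  }
}
```

Both functions should compute the same values for each element. The
timings that follow are for the original program, not for this sketch.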

Results:
gcc -O3: 0m1.790s
llvm -vectorize: 0m2.360s
llvm: 0m2.780s
gcc -fno-tree-vectorize: 0m2.810s
(these are the user times after I've run enough for the times to settle
to three decimal places)

So the vectorization gives a ~15% improvement in the running time
(2.780s down to 2.360s). gcc's vectorization still does a much better
job, however (yielding a ~36% improvement over gcc -fno-tree-vectorize).
So there is still work to do ;)
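
For reference, the two percentages are just the relative reduction in
user time against the matching non-vectorized run. A small helper
(improvement_pct is a name I made up for this sketch) reproduces them
from the timings listed above:

```c
#include <assert.h>

/* Percentage reduction in running time relative to a baseline run.
 * Both arguments are in seconds; the name is hypothetical. */
double improvement_pct(double baseline_s, double new_s)
{
  assert(baseline_s > 0.0);
  return 100.0 * (baseline_s - new_s) / baseline_s;
}
```

improvement_pct(2.780, 2.360) is about 15.1 (llvm with and without
-vectorize), and improvement_pct(2.810, 1.790) is about 36.3 (gcc with
and without its vectorizer).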

Additionally, I've checked the autovectorization on some classic
numerical benchmarks from netlib. On these benchmarks, clang/llvm
already do a good job compared to gcc (gcc is only about 10% better, and
this is true regardless of whether gcc's vectorization is on or off).
For these benchmarks, autovectorization provides no significant speedup
in most cases (it does not tend to make things worse, but it does not
really make them better either). Because gcc's vectorization also did
not really help gcc here, I'm not surprised. A good collection of these
benchmarks is available here:
http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz

I've yet to run the test suite using the pass to validate it. That is
something that I plan to do. Actually, the "Livermore Loops" test in the
aforementioned archive contains checksums to validate the results, and
it looks like 1 or 2 of the loop results are wrong with vectorization
turned on, so I'll have to investigate that.
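
As a rough illustration of that kind of check (this is not how the
Livermore Loops code computes its checksums; the function names and the
tolerance parameter are assumptions of mine), one can reduce a kernel's
output to a single number and compare it against the value from a
non-vectorized build:

```c
#include <assert.h>
#include <stddef.h>

/* Simple checksum for this sketch: the plain sum of all elements. */
double checksum(const double *x, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; ++i)
    sum += x[i];
  return sum;
}

/* Returns 1 if got is within rel_tol of expected (relative to the
 * magnitude of expected), 0 otherwise. With expected == 0 only an
 * exact match passes. */
int checksum_matches(double got, double expected, double rel_tol)
{
  assert(rel_tol >= 0.0);
  double diff = got - expected;
  if (diff < 0.0)
    diff = -diff;
  double scale = expected < 0.0 ? -expected : expected;
  return diff <= rel_tol * scale;
}
```

Since pairing element-wise operations should not reorder the arithmetic
within any one element, a mismatch well beyond rounding noise would
point at a bug in the pass rather than at floating-point reassociation.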

 -Hal

On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> Hi Hal,
> 
> On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > I've attached an initial version of a basic-block autovectorization
> > pass. It works by searching a basic block for pairable (independent)
> > instructions, and, using a chain-seeking heuristic, selects pairings
> > likely to provide an overall speedup (if such pairings can be found).
> > The selected pairs are then fused and, if necessary, other instructions
> > are moved in order to maintain data-flow consistency. This works only
> > within one basic block, but can do loop vectorization in combination
> > with (partial) unrolling. The basic idea was inspired by the Vienna MAP
> > Vectorizer, which has been used to vectorize FFT kernels, but the
> > algorithm used here is different.
> >
> > To try it, use -bb-vectorize with opt. There are a few options:
> > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the chain of
> > instruction pairs necessary in order to consider the pairs that compose
> > the chain worthy of vectorization.
> > -bb-vectorize-vector-bits: default: 128 -- The size of the target vector
> > registers
> > -bb-vectorize-no-ints -- Don't consider integer instructions
> > -bb-vectorize-no-floats -- Don't consider floating-point instructions
> >
> > The vectorizer generates a lot of insert_element/extract_element pairs;
> > the assumption is that other passes will turn these into shuffles when
> > possible (it looks like some work is necessary here). It will also
> > vectorize vector instructions, and generates shuffles in this case
> > (again, other passes should combine these as appropriate).
> >
> > Currently, it does not fuse load or store instructions, but that is a
> > feature that I'd like to add. Of course, alignment information is an
> > issue for load/store vectorization (or maybe I should just fuse them
> > anyway and let isel deal with unaligned cases?).
> >
> > Also, support needs to be added for fusing known intrinsics (fma, etc.),
> > and, as has been discussed on llvmdev, we should add some intrinsics to
> > allow the generation of addsub-type instructions.
> >
> > I've included a few tests, but it needs more. Please review (I'll commit
> > if and when everyone is happy).
> >
> > Thanks in advance,
> > Hal
> >
> > P.S. There is another option (not so useful right now, but could be):
> > -bb-vectorize-fast-dep -- Don't do a full inter-instruction dependency
> > analysis; instead stop looking for instruction pairs after the first use
> > of an instruction's value. [This makes the pass faster, but would
> > require a data-dependence-based reordering pass in order to be
> > effective].
> 
> Cool! :)
> Have you run this pass with any benchmark or the llvm testsuite? Does
> it present any regressions?
> Do you have any performance results?
> Cheers,
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_bb_vectorize-20111028.diff
Type: text/x-patch
Size: 64748 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20111028/3181c531/attachment.bin>