[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Hal Finkel hfinkel at anl.gov
Sat Oct 29 12:02:03 PDT 2011


On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote:
> Ralf, et al.,
> 
> Attached is the latest version of my autovectorization patch. llvmdev
> has been CC'd (as had been suggested to me); this e-mail contains
> additional benchmark results.
> 
> First, these are preliminary results because I did not do the things
> necessary to make them real (explicitly quiet the machine, bind the
> processes to one cpu, etc.). But they should be good enough for
> discussion.
> 
> I'm using LLVM head r143101, with the attached patch applied, and clang
> head r143100 on an x86_64 machine (some kind of Intel Xeon). For the
> gcc comparison, I'm using gcc 4.4.3 (build Ubuntu 4.4.3-4ubuntu5). gcc
> was run with -O3 and no other optimization flags. opt was run with
> -vectorize -unroll-allow-partial -O3 and no other optimization flags
> (the patch adds the -vectorize option).

And opt had also been given the flag: -bb-vectorize-vector-bits=256

 -Hal

> llc was just given -O3.
> 
> Below I've included results using the benchmark program by Maleki, et
> al. See:
> An Evaluation of Vectorizing Compilers - PACT'11
> (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> their benchmark program was retrieved from:
> http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
> 
> Also, when using clang, I had to pass -Dinline= on the command line:
> when using -emit-llvm, clang appears not to emit code for functions
> declared inline. This is a bug, but I've not yet tracked it down. There
> are two such small functions in the benchmark program, and the regular
> inliner *should* catch them anyway.
> 
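For reference, the pipeline described above can be sketched as follows. This is a reconstruction from the flags quoted in the text, not the exact commands used: the source file name, output names, and the final link step are assumptions.

```shell
# Assumed file name for the TSVC benchmark source; -Dinline= works
# around the inline-function emission bug described above.
clang -Dinline= -emit-llvm -c tsc.c -o tsc.bc

# Flags as quoted: the patch adds -vectorize, and
# -bb-vectorize-vector-bits=256 sets the assumed vector width.
opt -vectorize -unroll-allow-partial -O3 \
    -bb-vectorize-vector-bits=256 tsc.bc -o tsc-vec.bc

llc -O3 tsc-vec.bc -o tsc-vec.s

# Assemble and link with the system compiler (assumed step).
gcc tsc-vec.s -o tsc-vec
```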
> Results:
> 0. Name of the loop
> 1. Time using LLVM with vectorization
> 2. Time using LLVM without vectorization
> 3. Time using gcc with vectorization
> 4. Time using gcc without vectorization
> 
> Loop       llvm-v   llvm     gcc-v    gcc
> -------------------------------------------
> S000        9.59     9.49     4.55    10.04
> S111        7.67     7.37     7.68     7.83
> S1111      13.98    14.48    16.14    16.30
> S112       17.43    17.41    16.54    17.52
> S1112      13.87    14.21    14.83    14.84
> S113       22.97    22.88    22.05    22.05
> S1113      11.46    11.42    11.03    11.01
> S114       13.47    13.75    13.53    13.48
> S115       33.06    33.24    49.98    49.99
> S1115      13.91    14.18    13.65    13.66
> S116       48.74    49.40    49.54    48.11
> S118       11.04    11.26    10.79    10.50
> S119        8.97     9.07    11.83    11.82
> S1119       9.04     9.14     4.31    11.87
> S121       18.06    18.78    14.84    17.31
> S122        7.58     7.54     6.11     6.11
> S123        7.02     7.38     7.42     7.41
> S124        9.62     9.77     9.42     9.33
> S125        7.14     7.22     4.67     7.81
> S126        2.32     2.53     2.57     2.37
> S127       12.87    12.97     7.06    14.50
> S128       12.58    12.43    12.42    11.52
> S131       29.91    29.91    25.17    28.94
> S132       17.04    17.04    15.53    21.03
> S141       12.59    12.26    12.38    12.05
> S151       28.92    29.43    24.89    28.95
> S152       15.68    16.03    11.19    15.63
> S161        6.06     6.06     5.52     5.46
> S1161      14.46    14.40     8.80     8.79
> S162        8.31     9.05     5.36     8.18
> S171       15.47     7.94     2.81     5.70
> S172        5.92     5.89     2.75     5.70
> S173       31.59    30.92    18.15    30.13
> S174       31.16    31.66    18.51    30.16
> S175        5.80     6.18     4.94     5.77
> S176        5.69     5.83     4.41     7.65
> S211       16.56    17.14    16.82    16.38
> S212       13.46    14.28    13.34    13.18
> S1213      13.12    13.46    12.80    12.43
> S221       10.88    11.09     8.65     8.63
> S1221       5.80     6.04     5.40     6.05
> S222        6.01     6.26     5.70     5.72
> S231       23.78    22.94    22.36    22.11
> S232        6.88     6.88     6.89     6.89
> S1232      16.00    15.34    15.05    15.10
> S233       57.48    58.55    54.21    49.56
> S2233      27.65    29.77    29.68    28.40
> S235       46.40    44.92    46.94    43.93
> S241       31.62    31.35    32.53    31.01
> S242        7.20     7.20     7.20     7.20
> S243       16.78    17.09    17.69    16.84
> S244       14.64    14.83    16.91    16.82
> S1244      14.98    14.83    14.77    14.40
> S2244      10.47    10.62    10.40    10.06
> S251       35.10    35.75    19.70    34.38
> S1251      56.65    57.84    41.77    56.11
> S2251      15.96    15.87    17.02    15.70
> S3251      16.41    16.21    19.60    15.34
> S252        7.24     6.32     7.72     7.26
> S253       12.55    11.38    14.40    14.40
> S254       19.08    18.70    28.23    28.06
> S255        5.94     6.09     9.96     9.95
> S256        3.14     3.42     3.10     3.09
> S257        2.18     2.25     2.21     2.20
> S258        1.80     1.82     1.84     1.84
> S261       12.00    12.08    10.98    10.95
> S271       32.93    33.04    33.25    33.01
> S272       15.48    15.82    15.39    15.26
> S273       13.99    14.04    16.86    16.80
> S274       18.38    18.31    18.15    17.89
> S275        3.02     3.02     3.36     2.98
> S2275      33.71    33.50     8.97    33.60
> S276       39.52    39.44    40.80    40.55
> S277        4.81     4.80     4.81     4.80
> S278       14.43    14.42    14.70    14.66
> S279        8.10     8.29     7.25     7.27
> S1279       9.77    10.06     9.34     9.25
> S2710       7.85     8.04     7.86     7.56
> S2711      35.54    35.55    36.56    36.00
> S2712      33.16    33.17    34.24    33.47
> S281       10.97    11.09    12.46    12.02
> S1281      79.37    77.55    57.78    68.06
> S291       11.94    11.78    14.03    14.03
> S292        7.88     7.78     9.94     9.96
> S293       15.90    15.87    19.32    19.33
> S2101       2.59     2.58     2.59     2.60
> S2102      17.63    17.53    16.68    16.75
> S2111       5.63     5.60     5.85     5.85
> S311       72.07    72.03    72.23    72.03
> S31111      7.49     6.00     6.00     6.00
> S312       96.06    96.04    96.05    96.03
> S313       36.50    36.13    36.03    36.02
> S314       36.10    36.07    74.67    72.42
> S315        9.00     8.99     9.35     9.30
> S316       36.11    36.06    72.08    74.87
> S317      444.92   444.94   451.82   451.78
> S318        9.04     9.07     7.30     7.30
> S319       34.76    36.53    34.42    34.19
> S3110       8.53     8.57     4.11     4.11
> S13110      5.76     5.77    12.12    12.12
> S3111       3.60     3.62     3.60     3.60
> S3112       7.20     7.30     7.21     7.20
> S3113      35.12    35.47    60.21    60.20
> S321       16.81    16.81    16.80    16.80
> S322       12.42    12.60    12.60    12.60
> S323       10.93    11.02     8.48     8.51
> S331        4.23     4.23     7.20     7.20
> S332        7.21     7.21     5.21     5.31
> S341        4.74     4.85     7.23     7.20
> S342        6.02     6.09     7.25     7.20
> S343        2.14     2.06     2.16     2.01
> S351       49.26    47.34    21.82    46.46
> S1351      50.85    50.35    33.68    49.06
> S352       58.14    58.04    57.68    57.64
> S353        8.35     8.38     8.34     8.19
> S421       43.13    43.34    20.62    22.46
> S1421      25.25    25.81    15.85    24.76
> S422       88.36    87.53    79.22    78.99
> S423      155.13   155.29   154.56   154.38
> S424       37.11    37.51    11.42    22.36
> S431       58.22    60.66    27.59    57.16
> S441       14.05    13.29    12.88    12.81
> S442        6.08     6.00     6.96     6.90
> S443       17.60    17.77    17.15    16.95
> S451       48.95    49.08    49.03    49.14
> S452       42.98    39.32    14.64    96.03
> S453       28.06    28.06    14.60    14.40
> S471        8.53     8.65     8.39     8.43
> S481       10.98    11.15    12.04    12.00
> S482        9.31     9.31     9.19     9.17
> S491       11.54    11.38    11.37    11.28
> S4112       8.21     8.36     9.13     8.94
> S4113       8.77     8.81     8.86     8.85
> S4114      12.32    12.15    12.18    11.77
> S4115       8.48     8.46     8.95     8.59
> S4116       3.21     3.23     6.02     5.94
> S4117      14.08     9.61    10.16     9.98
> S4121       8.53     8.26     4.04     8.17
> va         30.09    28.58    23.58    48.46
> vag        12.35    12.36    13.58    13.20
> vas        13.74    13.49    13.03    12.47
> vif         4.49     4.57     5.06     4.92
> vpv        58.59    57.22    28.28    57.24
> vtv        59.15    57.83    28.40    57.63
> vpvtv      33.18    32.84    16.35    32.73
> vpvts       5.99     5.83     2.99     6.38
> vpvpv      33.25    32.89    16.54    32.85
> vtvtv      32.83    32.80    16.84    35.97
> vsumr      72.03    72.03    72.20    72.04
> vdotr      72.05    72.05    72.42    72.04
> vbor      205.22   380.81    99.80   372.05
> 
> I've yet to go through these in detail (they just finished running 5
> minutes ago). But for the curious (and I've had several requests for
> benchmarks), here you go. There is obviously more work to do.
> 
>  -Hal
> 
> On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote:
> > Hi Hal,
> > 
> > those numbers look very promising, great work! :)
> > 
> > Best,
> > Ralf
> > 
> > ----- Original Message -----
> > > From: "Hal Finkel" <hfinkel at anl.gov>
> > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com>
> > > Cc: llvm-commits at cs.uiuc.edu
> > > Sent: Freitag, 28. Oktober 2011 13:50:00
> > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
> > > 
> > > Bruno, et al.,
> > > 
> > > I've attached a new version of the patch that contains improvements
> > > and a critical bug fix (the fix does not make the generated code any
> > > more correct, but the pass in the older patch would crash in certain
> > > cases and now does not), compared to previous versions that I've
> > > posted.
> > > 
> > > First, these are preliminary results because I did not do the things
> > > necessary to make them real (explicitly quiet the machine, bind the
> > > processes to one cpu, etc.). But they should be good enough for
> > > discussion.
> > > 
> > > I'm using LLVM head r143101, with the attached patch applied, and
> > > clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> > > For the gcc comparison, I'm using gcc 4.4.3 (build Ubuntu
> > > 4.4.3-4ubuntu5). gcc was run with -O3 and no other optimization
> > > flags. opt was run with -vectorize -unroll-allow-partial -O3 and no
> > > other optimization flags (the patch adds the -vectorize option).
> > > llc was just given -O3.
> > > 
> > > It is not difficult to construct an example in which vectorization
> > > would be useful: take a loop that does more computation than
> > > loads/stores, and (partially) unroll it. Here is a simple case:
> > > 
> > > #define ITER 5000
> > > #define NUM 200
> > > double a[NUM][NUM];
> > > double b[NUM][NUM];
> > > 
> > > ...
> > > 
> > > int main()
> > > {
> > > ...
> > > 
> > >   for (int i = 0; i < ITER; ++i) {
> > >     for (int x = 0; x < NUM; ++x)
> > >     for (int y = 0; y < NUM; ++y) {
> > >       double v = a[x][y], w = b[x][y];
> > >       double z1 = v*w;
> > >       double z2 = v+w;
> > >       double z3 = z1*z2;
> > >       double z4 = z3+v;
> > >       double z5 = z2+w;
> > >       double z6 = z4*z5;
> > >       double z7 = z4+z5;
> > >       a[x][y] = v*v-z6;
> > >       b[x][y] = w-z7;
> > >     }
> > >   }
> > > 
> > >  ...
> > > 
> > >   return 0;
> > > }
> > > 
> > > Results:
> > > gcc -O3: 0m1.790s
> > > llvm -vectorize: 0m2.360s
> > > llvm: 0m2.780s
> > > gcc -fno-tree-vectorize: 0m2.810s
> > > (these are the user times after I've run enough trials for the
> > > times to settle to three decimal places)
> > > 
> > > So the vectorization gives a ~15% improvement in the running time.
> > > gcc's vectorization still does a much better job, however, yielding
> > > a ~36% improvement. So there is still work to do ;)
> > > 
> > > Additionally, I've checked the autovectorization on some classic
> > > numerical benchmarks from netlib. On these benchmarks, clang/llvm
> > > already do a good job compared to gcc (gcc is only about 10%
> > > better, and this is true regardless of whether gcc's vectorization
> > > is on or off). For these cases, autovectorization provides an
> > > insignificant speedup in most cases (but does not tend to make
> > > things worse, just not really any better either). Because gcc's
> > > vectorization also did not really help gcc in these cases, I'm not
> > > surprised. A good collection of these is available here:
> > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz
> > > 
> > > I've yet to run the test suite using the pass to validate it. That
> > > is something that I plan to do. Actually, the "Livermore Loops"
> > > test in the aforementioned archive contains checksums to validate
> > > the results, and it looks like 1 or 2 of the loop results are wrong
> > > with vectorization turned on, so I'll have to investigate that.
> > > 
> > >  -Hal
> > > 
> > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> > > > Hi Hal,
> > > > 
> > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > > > > I've attached an initial version of a basic-block
> > > > > autovectorization pass. It works by searching a basic block for
> > > > > pairable (independent) instructions and, using a chain-seeking
> > > > > heuristic, selects pairings likely to provide an overall
> > > > > speedup (if such pairings can be found). The selected pairs are
> > > > > then fused and, if necessary, other instructions are moved in
> > > > > order to maintain data-flow consistency. This works only within
> > > > > one basic block, but can do loop vectorization in combination
> > > > > with (partial) unrolling. The basic idea was inspired by the
> > > > > Vienna MAP Vectorizer, which has been used to vectorize FFT
> > > > > kernels, but the algorithm used here is different.
> > > > >
> > > > > To try it, use -bb-vectorize with opt. There are a few options:
> > > > > -bb-vectorize-req-chain-depth (default: 3) -- The depth of the
> > > > >   chain of instruction pairs necessary in order to consider the
> > > > >   pairs that compose the chain worthy of vectorization.
> > > > > -bb-vectorize-vector-bits (default: 128) -- The size of the
> > > > >   target vector registers.
> > > > > -bb-vectorize-no-ints -- Don't consider integer instructions.
> > > > > -bb-vectorize-no-floats -- Don't consider floating-point
> > > > >   instructions.
> > > > >
> > > > > The vectorizer generates a lot of
> > > > > insert_element/extract_element pairs; the assumption is that
> > > > > other passes will turn these into shuffles when possible (it
> > > > > looks like some work is necessary here). It will also vectorize
> > > > > vector instructions, and generates shuffles in this case
> > > > > (again, other passes should combine these as appropriate).
> > > > >
> > > > > Currently, it does not fuse load or store instructions, but
> > > > > that is a feature that I'd like to add. Of course, alignment
> > > > > information is an issue for load/store vectorization (or maybe
> > > > > I should just fuse them anyway and let isel deal with unaligned
> > > > > cases?).
> > > > >
> > > > > Also, support needs to be added for fusing known intrinsics
> > > > > (fma, etc.), and, as has been discussed on llvmdev, we should
> > > > > add some intrinsics to allow the generation of addsub-type
> > > > > instructions.
> > > > >
> > > > > I've included a few tests, but it needs more. Please review
> > > > > (I'll commit if and when everyone is happy).
> > > > >
> > > > > Thanks in advance,
> > > > > Hal
> > > > >
> > > > > P.S. There is another option (not so useful right now, but
> > > > > could be): -bb-vectorize-fast-dep -- Don't do a full
> > > > > inter-instruction dependency analysis; instead, stop looking
> > > > > for instruction pairs after the first use of an instruction's
> > > > > value. (This makes the pass faster, but would require a
> > > > > data-dependence-based reordering pass in order to be
> > > > > effective.)
> > > > 
> > > > Cool! :)
> > > > Have you run this pass with any benchmark or the LLVM test
> > > > suite? Does it show any regressions? Do you have any performance
> > > > results?
> > > > Cheers,
> > > > 
> > > > 
> > > 
> > > --
> > > Hal Finkel
> > > Postdoctoral Appointee
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > > 
> > > _______________________________________________
> > > llvm-commits mailing list
> > > llvm-commits at cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > 
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
1-630-252-0023
hfinkel at anl.gov



