[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
Hal Finkel
hfinkel at anl.gov
Sat Oct 29 12:02:03 PDT 2011
On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote:
> Ralf, et al.,
>
> Attached is the latest version of my autovectorization patch. llvmdev
> has been CC'd (as had been suggested to me); this e-mail contains
> additional benchmark results.
>
> First, these are preliminary results because I did not do the things
> necessary to make them real (explicitly quiet the machine, bind the
> processes to one cpu, etc.). But they should be good enough for
> discussion.
>
> I'm using LLVM head r143101, with the attached patch applied, and clang
> head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
> comparison, I'm using the Ubuntu build of gcc 4.4.3 (4.4.3-4ubuntu5).
> gcc was run with -O3 and no other optimization flags. opt was run with
> -vectorize -unroll-allow-partial -O3 and no other optimization flags
> (the patch adds the -vectorize option).
And opt had also been given the flag: -bb-vectorize-vector-bits=256
-Hal
> llc was just given -O3.
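[For reference, here is a sketch of the full pipeline described above. The file names, and the exact placement of -Dinline= (the workaround mentioned further down), are my reconstruction, not the commands actually used.]

```shell
# Hypothetical reconstruction of the pipeline described above;
# file names are illustrative, not from the original run.
clang -Dinline= -emit-llvm -c tsc.c -o tsc.bc
opt -vectorize -unroll-allow-partial -O3 \
    -bb-vectorize-vector-bits=256 tsc.bc -o tsc.opt.bc
llc -O3 tsc.opt.bc -o tsc.s
```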
>
> Below I've included results using the benchmark program by Maleki, et
> al. See:
> An Evaluation of Vectorizing Compilers - PACT'11
> (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> their benchmark program was retrieved from:
> http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
>
> Also, when using clang, I had to pass -Dinline= on the command line:
> when using -emit-llvm, clang appears not to emit code for functions
> declared inline. This is a bug, but I've not yet tracked it down. There
> are two such small functions in the benchmark program, and the regular
> inliner *should* catch them anyway.
>
> Results:
> 0. Name of the loop
> 1. Time using LLVM with vectorization
> 2. Time using LLVM without vectorization
> 3. Time using gcc with vectorization
> 4. Time using gcc without vectorization
>
> Loop llvm-v llvm gcc-v gcc
> -------------------------------------------
> S000 9.59 9.49 4.55 10.04
> S111 7.67 7.37 7.68 7.83
> S1111 13.98 14.48 16.14 16.30
> S112 17.43 17.41 16.54 17.52
> S1112 13.87 14.21 14.83 14.84
> S113 22.97 22.88 22.05 22.05
> S1113 11.46 11.42 11.03 11.01
> S114 13.47 13.75 13.53 13.48
> S115 33.06 33.24 49.98 49.99
> S1115 13.91 14.18 13.65 13.66
> S116 48.74 49.40 49.54 48.11
> S118 11.04 11.26 10.79 10.50
> S119 8.97 9.07 11.83 11.82
> S1119 9.04 9.14 4.31 11.87
> S121 18.06 18.78 14.84 17.31
> S122 7.58 7.54 6.11 6.11
> S123 7.02 7.38 7.42 7.41
> S124 9.62 9.77 9.42 9.33
> S125 7.14 7.22 4.67 7.81
> S126 2.32 2.53 2.57 2.37
> S127 12.87 12.97 7.06 14.50
> S128 12.58 12.43 12.42 11.52
> S131 29.91 29.91 25.17 28.94
> S132 17.04 17.04 15.53 21.03
> S141 12.59 12.26 12.38 12.05
> S151 28.92 29.43 24.89 28.95
> S152 15.68 16.03 11.19 15.63
> S161 6.06 6.06 5.52 5.46
> S1161 14.46 14.40 8.80 8.79
> S162 8.31 9.05 5.36 8.18
> S171 15.47 7.94 2.81 5.70
> S172 5.92 5.89 2.75 5.70
> S173 31.59 30.92 18.15 30.13
> S174 31.16 31.66 18.51 30.16
> S175 5.80 6.18 4.94 5.77
> S176 5.69 5.83 4.41 7.65
> S211 16.56 17.14 16.82 16.38
> S212 13.46 14.28 13.34 13.18
> S1213 13.12 13.46 12.80 12.43
> S221 10.88 11.09 8.65 8.63
> S1221 5.80 6.04 5.40 6.05
> S222 6.01 6.26 5.70 5.72
> S231 23.78 22.94 22.36 22.11
> S232 6.88 6.88 6.89 6.89
> S1232 16.00 15.34 15.05 15.10
> S233 57.48 58.55 54.21 49.56
> S2233 27.65 29.77 29.68 28.40
> S235 46.40 44.92 46.94 43.93
> S241 31.62 31.35 32.53 31.01
> S242 7.20 7.20 7.20 7.20
> S243 16.78 17.09 17.69 16.84
> S244 14.64 14.83 16.91 16.82
> S1244 14.98 14.83 14.77 14.40
> S2244 10.47 10.62 10.40 10.06
> S251 35.10 35.75 19.70 34.38
> S1251 56.65 57.84 41.77 56.11
> S2251 15.96 15.87 17.02 15.70
> S3251 16.41 16.21 19.60 15.34
> S252 7.24 6.32 7.72 7.26
> S253 12.55 11.38 14.40 14.40
> S254 19.08 18.70 28.23 28.06
> S255 5.94 6.09 9.96 9.95
> S256 3.14 3.42 3.10 3.09
> S257 2.18 2.25 2.21 2.20
> S258 1.80 1.82 1.84 1.84
> S261 12.00 12.08 10.98 10.95
> S271 32.93 33.04 33.25 33.01
> S272 15.48 15.82 15.39 15.26
> S273 13.99 14.04 16.86 16.80
> S274 18.38 18.31 18.15 17.89
> S275 3.02 3.02 3.36 2.98
> S2275 33.71 33.50 8.97 33.60
> S276 39.52 39.44 40.80 40.55
> S277 4.81 4.80 4.81 4.80
> S278 14.43 14.42 14.70 14.66
> S279 8.10 8.29 7.25 7.27
> S1279 9.77 10.06 9.34 9.25
> S2710 7.85 8.04 7.86 7.56
> S2711 35.54 35.55 36.56 36.00
> S2712 33.16 33.17 34.24 33.47
> S281 10.97 11.09 12.46 12.02
> S1281 79.37 77.55 57.78 68.06
> S291 11.94 11.78 14.03 14.03
> S292 7.88 7.78 9.94 9.96
> S293 15.90 15.87 19.32 19.33
> S2101 2.59 2.58 2.59 2.60
> S2102 17.63 17.53 16.68 16.75
> S2111 5.63 5.60 5.85 5.85
> S311 72.07 72.03 72.23 72.03
> S31111 7.49 6.00 6.00 6.00
> S312 96.06 96.04 96.05 96.03
> S313 36.50 36.13 36.03 36.02
> S314 36.10 36.07 74.67 72.42
> S315 9.00 8.99 9.35 9.30
> S316 36.11 36.06 72.08 74.87
> S317 444.92 444.94 451.82 451.78
> S318 9.04 9.07 7.30 7.30
> S319 34.76 36.53 34.42 34.19
> S3110 8.53 8.57 4.11 4.11
> S13110 5.76 5.77 12.12 12.12
> S3111 3.60 3.62 3.60 3.60
> S3112 7.20 7.30 7.21 7.20
> S3113 35.12 35.47 60.21 60.20
> S321 16.81 16.81 16.80 16.80
> S322 12.42 12.60 12.60 12.60
> S323 10.93 11.02 8.48 8.51
> S331 4.23 4.23 7.20 7.20
> S332 7.21 7.21 5.21 5.31
> S341 4.74 4.85 7.23 7.20
> S342 6.02 6.09 7.25 7.20
> S343 2.14 2.06 2.16 2.01
> S351 49.26 47.34 21.82 46.46
> S1351 50.85 50.35 33.68 49.06
> S352 58.14 58.04 57.68 57.64
> S353 8.35 8.38 8.34 8.19
> S421 43.13 43.34 20.62 22.46
> S1421 25.25 25.81 15.85 24.76
> S422 88.36 87.53 79.22 78.99
> S423 155.13 155.29 154.56 154.38
> S424 37.11 37.51 11.42 22.36
> S431 58.22 60.66 27.59 57.16
> S441 14.05 13.29 12.88 12.81
> S442 6.08 6.00 6.96 6.90
> S443 17.60 17.77 17.15 16.95
> S451 48.95 49.08 49.03 49.14
> S452 42.98 39.32 14.64 96.03
> S453 28.06 28.06 14.60 14.40
> S471 8.53 8.65 8.39 8.43
> S481 10.98 11.15 12.04 12.00
> S482 9.31 9.31 9.19 9.17
> S491 11.54 11.38 11.37 11.28
> S4112 8.21 8.36 9.13 8.94
> S4113 8.77 8.81 8.86 8.85
> S4114 12.32 12.15 12.18 11.77
> S4115 8.48 8.46 8.95 8.59
> S4116 3.21 3.23 6.02 5.94
> S4117 14.08 9.61 10.16 9.98
> S4121 8.53 8.26 4.04 8.17
> va 30.09 28.58 23.58 48.46
> vag 12.35 12.36 13.58 13.20
> vas 13.74 13.49 13.03 12.47
> vif 4.49 4.57 5.06 4.92
> vpv 58.59 57.22 28.28 57.24
> vtv 59.15 57.83 28.40 57.63
> vpvtv 33.18 32.84 16.35 32.73
> vpvts 5.99 5.83 2.99 6.38
> vpvpv 33.25 32.89 16.54 32.85
> vtvtv 32.83 32.80 16.84 35.97
> vsumr 72.03 72.03 72.20 72.04
> vdotr 72.05 72.05 72.42 72.04
> vbor 205.22 380.81 99.80 372.05
>
> I've yet to go through these in detail (they just finished running 5
> minutes ago). But for the curious (and I've had several requests for
> benchmarks), here you go. There is obviously more work to do.
>
> -Hal
>
> On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote:
> > Hi Hal,
> >
> > those numbers look very promising, great work! :)
> >
> > Best,
> > Ralf
> >
> > ----- Original Message -----
> > > From: "Hal Finkel" <hfinkel at anl.gov>
> > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com>
> > > Cc: llvm-commits at cs.uiuc.edu
> > > Sent: Freitag, 28. Oktober 2011 13:50:00
> > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
> > >
> > > Bruno, et al.,
> > >
> > > I've attached a new version of the patch that contains improvements
> > > (and a critical bug fix: the generated code is no more correct than
> > > before, but the pass in the older patch would crash in certain cases
> > > and now does not) compared to previous versions that I've posted.
> > >
> > > First, these are preliminary results because I did not do the things
> > > necessary to make them real (explicitly quiet the machine, bind the
> > > processes to one cpu, etc.). But they should be good enough for
> > > discussion.
> > >
> > > I'm using LLVM head r143101, with the attached patch applied, and
> > > clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> > > For the gcc comparison, I'm using the Ubuntu build of gcc 4.4.3
> > > (4.4.3-4ubuntu5). gcc was run with -O3 and no other optimization
> > > flags. opt was run with -vectorize -unroll-allow-partial -O3 and no
> > > other optimization flags (the patch adds the -vectorize option). llc
> > > was just given -O3.
> > >
> > > It is not difficult to construct an example in which vectorization
> > > would be useful: take a loop that does more computation than
> > > loads/stores, and (partially) unroll it. Here is a simple case:
> > >
> > > #define ITER 5000
> > > #define NUM 200
> > > double a[NUM][NUM];
> > > double b[NUM][NUM];
> > >
> > > ...
> > >
> > > int main()
> > > {
> > >   ...
> > >
> > >   for (int i = 0; i < ITER; ++i) {
> > >     for (int x = 0; x < NUM; ++x)
> > >       for (int y = 0; y < NUM; ++y) {
> > >         double v = a[x][y], w = b[x][y];
> > >         double z1 = v*w;
> > >         double z2 = v+w;
> > >         double z3 = z1*z2;
> > >         double z4 = z3+v;
> > >         double z5 = z2+w;
> > >         double z6 = z4*z5;
> > >         double z7 = z4+z5;
> > >         a[x][y] = v*v-z6;
> > >         b[x][y] = w-z7;
> > >       }
> > >   }
> > >
> > >   ...
> > >
> > >   return 0;
> > > }
> > >
> > > Results:
> > > gcc -O3: 0m1.790s
> > > llvm -vectorize: 0m2.360s
> > > llvm: 0m2.780s
> > > gcc -fno-tree-vectorize: 0m2.810s
> > > (these are the user times after running enough trials for the times
> > > to settle to three decimal places)
> > >
> > > So vectorization gives a ~15% improvement in the running time.
> > > gcc's vectorization still does a much better job, however (yielding
> > > a ~36% improvement). So there is still work to do ;)
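[As a sanity check, the quoted percentages follow directly from the user times listed above:]

```python
# Recompute the quoted speedups from the user times above.
def speedup(before, after):
    """Fractional reduction in running time."""
    return (before - after) / before

# LLVM: without vectorization vs. with -vectorize
llvm = speedup(2.780, 2.360)
# gcc: -fno-tree-vectorize vs. vectorizing -O3
gcc = speedup(2.810, 1.790)

print(f"llvm: {llvm:.0%}, gcc: {gcc:.0%}")  # → llvm: 15%, gcc: 36%
```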
> > >
> > > Additionally, I've checked the autovectorization on some classic
> > > numerical benchmarks from netlib. On these benchmarks, clang/llvm
> > > already do a good job compared to gcc (gcc is only about 10% better,
> > > and this is true regardless of whether gcc's vectorization is on or
> > > off). For these cases, autovectorization provides an insignificant
> > > speedup in most cases (it does not tend to make things worse, just
> > > not really any better either). Because gcc's vectorization also did
> > > not really help gcc in these cases, I'm not surprised. A good
> > > collection of these is available here:
> > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz
> > >
> > > I've yet to run the test suite using the pass to validate it. That is
> > > something that I plan to do. Actually, the "Livermore Loops" test in
> > > the
> > > aforementioned archive contains checksums to validate the results,
> > > and
> > > it looks like 1 or 2 of the loop results are wrong with vectorization
> > > turned on, so I'll have to investigate that.
> > >
> > > -Hal
> > >
> > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> > > > Hi Hal,
> > > >
> > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov>
> > > > wrote:
> > > > > I've attached an initial version of a basic-block
> > > > > autovectorization
> > > > > pass. It works by searching a basic block for pairable
> > > > > (independent)
> > > > > instructions, and, using a chain-seeking heuristic, selects
> > > > > pairings
> > > > > likely to provide an overall speedup (if such pairings can be
> > > > > found).
> > > > > The selected pairs are then fused and, if necessary, other
> > > > > instructions
> > > > > are moved in order to maintain data-flow consistency. This works
> > > > > only
> > > > > within one basic block, but can do loop vectorization in
> > > > > combination
> > > > > with (partial) unrolling. The basic idea was inspired by the
> > > > > Vienna MAP
> > > > > Vectorizor, which has been used to vectorize FFT kernels, but the
> > > > > algorithm used here is different.
> > > > >
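
[The core pairing idea described above can be sketched roughly as follows. This is a toy illustrative model with made-up instruction tuples, not the actual pass: two instructions are pairable if neither depends on the other, and a candidate pair is worth fusing when it heads a long-enough chain of similarly pairable successors.]

```python
# Toy model of the pairing idea: an instruction is a (name, inputs)
# tuple within one "basic block". Illustration only, not the LLVM pass.

def depends_on(block, a, b):
    """True if instruction a (transitively) uses the result of b."""
    inputs = dict(block)
    seen, work = set(), list(inputs[a])
    while work:
        v = work.pop()
        if v == b:
            return True
        if v in inputs and v not in seen:
            seen.add(v)
            work.extend(inputs[v])
    return False

def pairable(block, a, b):
    """Two distinct instructions are pairable if they are independent."""
    return a != b and not depends_on(block, a, b) and not depends_on(block, b, a)

# Two isomorphic computation chains that could be fused lane-by-lane:
block = [
    ("l0", []), ("l1", []),          # loads
    ("m0", ["l0"]), ("m1", ["l1"]),  # multiplies
    ("s0", ["m0"]), ("s1", ["m1"]),  # dependent follow-on ops
]

chain = [("l0", "l1"), ("m0", "m1"), ("s0", "s1")]
assert all(pairable(block, a, b) for a, b in chain)   # a depth-3 chain
assert not pairable(block, "l0", "m0")                # same-chain: dependent
```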
> > > > > To try it, use -bb-vectorize with opt. There are a few options:
> > > > > -bb-vectorize-req-chain-depth (default: 3) -- The depth of the
> > > > > chain of instruction pairs necessary in order to consider the
> > > > > pairs that compose the chain worthy of vectorization.
> > > > > -bb-vectorize-vector-bits (default: 128) -- The size, in bits,
> > > > > of the target vector registers.
> > > > > -bb-vectorize-no-ints -- Don't consider integer instructions.
> > > > > -bb-vectorize-no-floats -- Don't consider floating-point
> > > > > instructions.
> > > > >
> > > > > The vectorizer generates a lot of insert_element/extract_element
> > > > > pairs; the assumption is that other passes will turn these into
> > > > > shuffles when possible (it looks like some work is necessary
> > > > > here). It will also
> > > > > vectorize vector instructions, and generates shuffles in this
> > > > > case
> > > > > (again, other passes should combine these as appropriate).
> > > > >
> > > > > Currently, it does not fuse load or store instructions, but that
> > > > > is a
> > > > > feature that I'd like to add. Of course, alignment information is
> > > > > an
> > > > > issue for load/store vectorization (or maybe I should just fuse
> > > > > them
> > > > > anyway and let isel deal with unaligned cases?).
> > > > >
> > > > > Also, support needs to be added for fusing known intrinsics (fma,
> > > > > etc.),
> > > > > and, as has been discussed on llvmdev, we should add some
> > > > > intrinsics to
> > > > > allow the generation of addsub-type instructions.
> > > > >
> > > > > I've included a few tests, but it needs more. Please review (I'll
> > > > > commit
> > > > > if and when everyone is happy).
> > > > >
> > > > > Thanks in advance,
> > > > > Hal
> > > > >
> > > > > P.S. There is another option (not so useful right now, but could
> > > > > be):
> > > > > -bb-vectorize-fast-dep -- Don't do a full inter-instruction
> > > > > dependency
> > > > > analysis; instead stop looking for instruction pairs after the
> > > > > first use
> > > > > of an instruction's value. [This makes the pass faster, but would
> > > > > require a data-dependence-based reordering pass in order to be
> > > > > effective].
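[The -bb-vectorize-fast-dep cutoff can be illustrated like this, again with a toy instruction model rather than the pass's real data structures: candidate partners for an instruction are drawn only from the window between it and the first later use of its value.]

```python
# Toy illustration of the -bb-vectorize-fast-dep search cutoff:
# instead of full inter-instruction dependency analysis, stop scanning
# for a partner once the instruction's value is first used.

def candidate_window(block, i):
    """Indices after i that precede the first use of block[i]'s value."""
    name, _ = block[i]
    out = []
    for j in range(i + 1, len(block)):
        if name in block[j][1]:   # first use of the value: stop here
            break
        out.append(j)
    return out

block = [("a", []), ("b", []), ("c", ["a"]), ("d", ["b"])]
# "a" may only be paired with "b": "c" uses a's value, so the scan stops.
assert candidate_window(block, 0) == [1]
# "b" can look past "c" (which doesn't use b), but not past "d".
assert candidate_window(block, 1) == [2]
```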
> > > >
> > > > Cool! :)
> > > > Have you run this pass with any benchmark or the LLVM test suite?
> > > > Does it present any regressions?
> > > > Do you have any performance results?
> > > > Cheers,
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Postdoctoral Appointee
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > >
> > > _______________________________________________
> > > llvm-commits mailing list
> > > llvm-commits at cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > >
>
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
1-630-252-0023
hfinkel at anl.gov