[llvm-commits] [PATCH] BasicBlock Autovectorization Pass
Hal Finkel
hfinkel at anl.gov
Tue Nov 1 11:32:36 PDT 2011
Any objections to my committing this? [And some relevant docs changes.] I
think that it is ready at this point.
Thanks in advance,
Hal
On Mon, 2011-10-31 at 19:50 -0500, Hal Finkel wrote:
> I've attached the latest version of my autovectorization patch. This
> version is significantly faster (in compile time) than the version I
> posted a couple of days ago, and generally produces better output.
>
> At this point, next steps in enhancing the vectorization include:
> 1. Add an add/sub and/or alternating-negation vector intrinsic to
> provide for generating add-subtract, and more generally, asymmetric fma
> instructions.
> 2. Make -vectorize imply -unroll-allow-partial [Is there an easy way to
> do this?]
> 3. Add a -fvectorize flag to clang along the same lines.
>
> Updated vectorization benchmark:
> Loop llvm-v llvm gcc-v gcc
> -------------------------------------------
> S000 9.00 9.59 4.55 10.04
> S111 7.25 7.65 7.68 7.83
> S1111 13.63 14.72 16.14 16.30
> S112 16.60 17.45 16.54 17.52
> S1112 12.99 13.87 14.83 14.84
> S113 22.03 22.98 22.05 22.05
> S1113 11.01 11.48 11.03 11.01
> S114 13.14 13.81 13.53 13.48
> S115 32.92 33.36 49.98 49.99
> S1115 13.61 14.23 13.65 13.66
> S116 46.90 49.43 49.54 48.11
> S118 10.76 11.25 10.79 10.50
> S119 8.68 9.09 11.83 11.82
> S1119 8.75 9.15 4.31 11.87
> S121 17.17 18.06 14.84 17.31
> S122 7.53 7.70 6.11 6.11
> S123 6.92 7.10 7.42 7.41
> S124 9.60 9.84 9.42 9.33
> S125 6.89 7.10 4.67 7.81
> S126 2.33 2.55 2.57 2.37
> S127 12.18 12.68 7.06 14.50
> S128 11.66 12.41 12.42 11.52
> S131 28.59 30.11 25.17 28.94
> S132 17.04 17.04 15.53 21.03
> S141 12.18 12.85 12.38 12.05
> S151 28.61 30.11 24.89 28.95
> S152 15.47 16.03 11.19 15.63
> S161 6.00 6.12 5.52 5.46
> S1161 14.40 14.50 8.80 8.79
> S162 8.18 8.41 5.36 8.18
> S171 14.05 7.96 2.81 5.70
> S172 5.67 5.97 2.75 5.70
> S173 30.17 31.69 18.15 30.13
> S174 30.12 31.53 18.51 30.16
> S175 5.75 6.04 4.94 5.77
> S176 5.57 5.83 4.41 7.65
> S211 16.23 16.89 16.82 16.38
> S212 13.19 13.50 13.34 13.18
> S1213 12.83 13.35 12.80 12.43
> S221 10.86 11.09 8.65 8.63
> S1221 5.71 6.03 5.40 6.05
> S222 6.00 6.29 5.70 5.72
> S231 22.23 24.22 22.36 22.11
> S232 6.89 6.94 6.89 6.89
> S1232 15.23 16.43 15.05 15.10
> S233 55.17 59.98 54.21 49.56
> S2233 27.07 29.71 29.68 28.40
> S235 43.79 47.85 46.94 43.93
> S241 31.00 31.72 32.53 31.01
> S242 7.20 7.21 7.20 7.20
> S243 16.48 16.99 17.69 16.84
> S244 14.47 14.93 16.91 16.82
> S1244 14.75 15.02 14.77 14.40
> S2244 9.97 10.60 10.40 10.06
> S251 34.20 35.55 19.70 34.38
> S1251 55.09 57.11 41.77 56.11
> S2251 15.64 16.26 17.02 15.70
> S3251 15.55 16.52 19.60 15.34
> S252 6.14 6.46 7.72 7.26
> S253 11.18 11.52 14.40 14.40
> S254 17.72 18.98 28.23 28.06
> S255 5.93 6.14 9.96 9.95
> S256 3.06 3.39 3.10 3.09
> S257 2.12 2.31 2.21 2.20
> S258 1.79 1.87 1.84 1.84
> S261 12.01 12.22 10.98 10.95
> S271 32.76 33.76 33.25 33.01
> S272 14.93 15.52 15.39 15.26
> S273 13.92 14.10 16.86 16.80
> S274 17.77 18.53 18.15 17.89
> S275 2.90 3.14 3.36 2.98
> S2275 32.65 34.95 8.97 33.60
> S276 41.38 41.97 40.80 40.55
> S277 4.81 4.93 4.81 4.80
> S278 14.41 14.76 14.70 14.66
> S279 8.04 8.24 7.25 7.27
> S1279 9.71 9.92 9.34 9.25
> S2710 7.68 8.07 7.86 7.56
> S2711 35.53 37.10 36.56 36.00
> S2712 32.91 33.96 34.24 33.47
> S281 10.75 11.32 12.46 12.02
> S1281 104.13 78.11 57.78 68.06
> S291 11.75 12.27 14.03 14.03
> S292 6.70 6.91 9.94 9.96
> S293 15.38 16.24 19.32 19.33
> S2101 2.50 2.67 2.59 2.60
> S2102 16.56 18.45 16.68 16.75
> S2111 5.59 5.63 5.85 5.85
> S311 72.04 72.27 72.23 72.03
> S31111 6.37 6.01 6.00 6.00
> S312 96.04 96.17 96.05 96.03
> S313 36.03 36.61 36.03 36.02
> S314 36.02 36.12 74.67 72.42
> S315 9.11 9.21 9.35 9.30
> S316 36.02 36.12 72.08 74.87
> S317 444.96 444.94 451.82 451.78
> S318 9.07 9.12 7.30 7.30
> S319 34.49 36.46 34.42 34.19
> S3110 8.53 8.61 4.11 4.11
> S13110 5.75 5.78 12.12 12.12
> S3111 3.60 3.64 3.60 3.60
> S3112 7.21 7.30 7.21 7.20
> S3113 33.68 34.18 60.21 60.20
> S321 16.80 16.87 16.80 16.80
> S322 12.42 12.64 12.60 12.60
> S323 10.89 11.24 8.48 8.51
> S331 4.23 4.36 7.20 7.20
> S332 7.21 7.28 5.21 5.31
> S341 4.76 5.04 7.23 7.20
> S342 6.02 6.24 7.25 7.20
> S343 2.02 2.16 2.16 2.01
> S351 46.33 48.65 21.82 46.46
> S1351 49.07 51.28 33.68 49.06
> S352 57.65 58.44 57.68 57.64
> S353 8.19 8.44 8.34 8.19
> S421 24.17 25.29 20.62 22.46
> S1421 25.09 26.16 15.85 24.76
> S422 79.95 81.51 79.22 78.99
> S423 154.93 155.21 154.56 154.38
> S424 22.61 23.35 11.42 22.36
> S431 56.88 59.82 27.59 57.16
> S441 14.05 14.23 12.88 12.81
> S442 5.99 6.13 6.96 6.90
> S443 17.33 17.77 17.15 16.95
> S451 48.94 48.99 49.03 49.14
> S452 43.01 39.57 14.64 96.03
> S453 28.07 28.07 14.60 14.40
> S471 8.20 8.56 8.39 8.43
> S481 10.89 11.23 12.04 12.00
> S482 9.20 9.42 9.19 9.17
> S491 11.25 11.60 11.37 11.28
> S4112 8.20 8.45 9.13 8.94
> S4113 8.64 8.95 8.86 8.85
> S4114 11.82 12.35 12.18 11.77
> S4115 8.27 8.51 8.95 8.59
> S4116 3.22 3.22 6.02 5.94
> S4117 13.96 9.69 10.16 9.98
> S4121 8.19 8.44 4.04 8.17
> va 28.39 29.33 23.58 48.46
> vag 12.26 12.93 13.58 13.20
> vas 13.36 14.15 13.03 12.47
> vif 4.50 4.79 5.06 4.92
> vpv 56.84 59.83 28.28 57.24
> vtv 57.58 60.42 28.40 57.63
> vpvtv 32.78 33.77 16.35 32.73
> vpvts 5.78 6.07 2.99 6.38
> vpvpv 32.78 33.84 16.54 32.85
> vtvtv 32.76 33.75 16.84 35.97
> vsumr 72.04 72.28 72.20 72.04
> vdotr 72.05 73.22 72.42 72.04
> vbor 227.55 381.18 99.80 372.05
>
> -Hal
>
> On Sat, 2011-10-29 at 17:56 -0500, Hal Finkel wrote:
> > On Sat, 2011-10-29 at 15:16 -0500, Hal Finkel wrote:
> > > On Sat, 2011-10-29 at 14:02 -0500, Hal Finkel wrote:
> > > > On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote:
> > > > > Ralf, et al.,
> > > > >
> > > > > Attached is the latest version of my autovectorization patch. llvmdev
> > > > > has been CC'd (as had been suggested to me); this e-mail contains
> > > > > additional benchmark results.
> > > > >
> > > > > First, these are preliminary results because I did not do the things
> > > > > necessary to make them real (explicitly quiet the machine, bind the
> > > > > processes to one cpu, etc.). But they should be good enough for
> > > > > discussion.
> > > > >
> > > > > I'm using LLVM head r143101, with the attached patch applied, and clang
> > > > > head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
> > > > > comparison, I'm using the Ubuntu build 4.4.3-4ubuntu5. gcc was run
> > > > > with -O3 and no other optimization flags. opt was run with -vectorize
> > > > > -unroll-allow-partial -O3 and no other optimization flags (the patch
> > > > > adds the -vectorize option).
> > > >
> > > > And opt had also been given the flag: -bb-vectorize-vector-bits=256
> > >
> > > And this was a mistake (because the machine on which the benchmarks were
> > > run does not have AVX). I've rerun, see better results below...
> > >
> > > >
> > > > -Hal
> > > >
> > > > > llc was just given -O3.
> > > > >
> > > > > Below I've included results using the benchmark program by Maleki, et
> > > > > al. See:
> > > > > An Evaluation of Vectorizing Compilers - PACT'11
> > > > > (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> > > > > their benchmark program was retrieved from:
> > > > > http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
> > > > >
> > > > > Also, when using clang, I had to pass -Dinline= on the command line:
> > > > > when using -emit-llvm, clang appears not to emit code for functions
> > > > > declared inline. This is a bug, but I've not yet tracked it down. There
> > > > > are two such small functions in the benchmark program, and the regular
> > > > > inliner *should* catch them anyway.
> > > > >
> > > > > Results:
> > > > > 0. Name of the loop
> > > > > 1. Time using LLVM with vectorization
> > > > > 2. Time using LLVM without vectorization
> > > > > 3. Time using gcc with vectorization
> > > > > 4. Time using gcc without vectorization
> >
> > As Peter Collingbourne indirectly pointed out to me, clang's
> > optimizations are still important (even when it is only emitting LLVM
> > IR). I've rerun the llvm code generation steps, adding -O3 to clang.
> > Here are the results (they are significantly better):
> >
> > Loop llvm-v llvm gcc-v gcc
> > -------------------------------------------
> > S000 9.10 9.59 4.55 10.04
> > S111 7.29 7.65 7.68 7.83
> > S1111 13.87 14.72 16.14 16.30
> > S112 16.67 17.45 16.54 17.52
> > S1112 13.16 13.87 14.83 14.84
> > S113 22.14 22.98 22.05 22.05
> > S1113 11.06 11.48 11.03 11.01
> > S114 13.21 13.81 13.53 13.48
> > S115 32.82 33.36 49.98 49.99
> > S1115 13.67 14.23 13.65 13.66
> > S116 47.37 49.43 49.54 48.11
> > S118 10.81 11.25 10.79 10.50
> > S119 8.73 9.09 11.83 11.82
> > S1119 8.82 9.15 4.31 11.87
> > S121 17.29 18.06 14.84 17.31
> > S122 7.53 7.70 6.11 6.11
> > S123 6.93 7.10 7.42 7.41
> > S124 9.63 9.84 9.42 9.33
> > S125 6.94 7.10 4.67 7.81
> > S126 2.34 2.55 2.57 2.37
> > S127 12.23 12.68 7.06 14.50
> > S128 11.78 12.41 12.42 11.52
> > S131 28.79 30.11 25.17 28.94
> > S132 17.04 17.04 15.53 21.03
> > S141 12.26 12.85 12.38 12.05
> > S151 28.79 30.11 24.89 28.95
> > S152 15.53 16.03 11.19 15.63
> > S161 6.00 6.12 5.52 5.46
> > S1161 14.40 14.50 8.80 8.79
> > S162 8.19 8.41 5.36 8.18
> > S171 15.41 7.96 2.81 5.70
> > S172 5.70 5.97 2.75 5.70
> > S173 30.32 31.69 18.15 30.13
> > S174 30.20 31.53 18.51 30.16
> > S175 5.79 6.04 4.94 5.77
> > S176 5.59 5.83 4.41 7.65
> > S211 16.31 16.89 16.82 16.38
> > S212 13.23 13.50 13.34 13.18
> > S1213 12.82 13.35 12.80 12.43
> > S221 10.87 11.09 8.65 8.63
> > S1221 5.72 6.03 5.40 6.05
> > S222 6.01 6.29 5.70 5.72
> > S231 22.38 24.22 22.36 22.11
> > S232 6.89 6.94 6.89 6.89
> > S1232 15.31 16.43 15.05 15.10
> > S233 55.47 59.98 54.21 49.56
> > S2233 27.23 29.71 29.68 28.40
> > S235 44.08 47.85 46.94 43.93
> > S241 31.14 31.72 32.53 31.01
> > S242 7.20 7.21 7.20 7.20
> > S243 16.54 16.99 17.69 16.84
> > S244 14.51 14.93 16.91 16.82
> > S1244 14.72 15.02 14.77 14.40
> > S2244 10.09 10.60 10.40 10.06
> > S251 34.42 35.55 19.70 34.38
> > S1251 55.39 57.11 41.77 56.11
> > S2251 15.69 16.26 17.02 15.70
> > S3251 15.69 16.52 19.60 15.34
> > S252 6.18 6.46 7.72 7.26
> > S253 11.19 11.52 14.40 14.40
> > S254 18.00 18.98 28.23 28.06
> > S255 5.94 6.14 9.96 9.95
> > S256 3.09 3.39 3.10 3.09
> > S257 2.13 2.31 2.21 2.20
> > S258 1.80 1.87 1.84 1.84
> > S261 12.00 12.22 10.98 10.95
> > S271 32.81 33.76 33.25 33.01
> > S272 15.04 15.52 15.39 15.26
> > S273 13.93 14.10 16.86 16.80
> > S274 17.83 18.53 18.15 17.89
> > S275 2.92 3.14 3.36 2.98
> > S2275 32.81 34.95 8.97 33.60
> > S276 41.26 41.97 40.80 40.55
> > S277 4.80 4.93 4.81 4.80
> > S278 14.43 14.76 14.70 14.66
> > S279 8.05 8.24 7.25 7.27
> > S1279 9.72 9.92 9.34 9.25
> > S2710 7.73 8.07 7.86 7.56
> > S2711 36.49 37.10 36.56 36.00
> > S2712 32.96 33.96 34.24 33.47
> > S281 10.80 11.32 12.46 12.02
> > S1281 79.10 78.11 57.78 68.06
> > S291 11.79 12.27 14.03 14.03
> > S292 6.70 6.91 9.94 9.96
> > S293 15.50 16.24 19.32 19.33
> > S2101 2.56 2.67 2.59 2.60
> > S2102 16.74 18.45 16.68 16.75
> > S2111 5.59 5.63 5.85 5.85
> > S311 72.04 72.27 72.23 72.03
> > S31111 7.50 6.01 6.00 6.00
> > S312 96.04 96.17 96.05 96.03
> > S313 36.02 36.61 36.03 36.02
> > S314 36.01 36.12 74.67 72.42
> > S315 9.11 9.21 9.35 9.30
> > S316 36.01 36.12 72.08 74.87
> > S317 444.91 444.94 451.82 451.78
> > S318 9.07 9.12 7.30 7.30
> > S319 34.57 36.46 34.42 34.19
> > S3110 8.52 8.61 4.11 4.11
> > S13110 5.75 5.78 12.12 12.12
> > S3111 3.60 3.64 3.60 3.60
> > S3112 7.20 7.30 7.21 7.20
> > S3113 33.68 34.18 60.21 60.20
> > S321 16.80 16.87 16.80 16.80
> > S322 12.42 12.64 12.60 12.60
> > S323 10.88 11.24 8.48 8.51
> > S331 4.23 4.36 7.20 7.20
> > S332 7.20 7.28 5.21 5.31
> > S341 4.80 5.04 7.23 7.20
> > S342 6.01 6.24 7.25 7.20
> > S343 2.04 2.16 2.16 2.01
> > S351 46.63 48.65 21.82 46.46
> > S1351 49.37 51.28 33.68 49.06
> > S352 57.64 58.44 57.68 57.64
> > S353 8.21 8.44 8.34 8.19
> > S421 24.26 25.29 20.62 22.46
> > S1421 25.18 26.16 15.85 24.76
> > S422 80.08 81.51 79.22 78.99
> > S423 155.02 155.21 154.56 154.38
> > S424 22.62 23.35 11.42 22.36
> > S431 57.22 59.82 27.59 57.16
> > S441 13.27 14.23 12.88 12.81
> > S442 5.99 6.13 6.96 6.90
> > S443 17.37 17.77 17.15 16.95
> > S451 48.92 48.99 49.03 49.14
> > S452 42.97 39.57 14.64 96.03
> > S453 28.06 28.07 14.60 14.40
> > S471 8.27 8.56 8.39 8.43
> > S481 10.93 11.23 12.04 12.00
> > S482 9.21 9.42 9.19 9.17
> > S491 11.31 11.60 11.37 11.28
> > S4112 8.21 8.45 9.13 8.94
> > S4113 8.65 8.95 8.86 8.85
> > S4114 11.87 12.35 12.18 11.77
> > S4115 8.28 8.51 8.95 8.59
> > S4116 3.23 3.22 6.02 5.94
> > S4117 13.97 9.69 10.16 9.98
> > S4121 8.20 8.44 4.04 8.17
> > va 28.50 29.33 23.58 48.46
> > vag 12.37 12.93 13.58 13.20
> > vas 13.46 14.15 13.03 12.47
> > vif 4.55 4.79 5.06 4.92
> > vpv 57.21 59.83 28.28 57.24
> > vtv 57.92 60.42 28.40 57.63
> > vpvtv 32.84 33.77 16.35 32.73
> > vpvts 5.82 6.07 2.99 6.38
> > vpvpv 32.87 33.84 16.54 32.85
> > vtvtv 32.82 33.75 16.84 35.97
> > vsumr 72.03 72.28 72.20 72.04
> > vdotr 72.05 73.22 72.42 72.04
> > vbor 205.24 381.18 99.80 372.05
> >
> > I apologize for the multiple e-mails with a long list of numbers, but I
> > think that this was worth it (and I did not want to be unfair to the
> > clang developers).
> >
> > -Hal
> >
> > >
> > > Here are improved results where the correct (and default)
> > > vector-register size was used.
> > >
> > > Loop llvm-v llvm gcc-v gcc
> > > -------------------------------------------
> > > S000 9.09 9.49 4.55 10.04
> > > S111 7.28 7.37 7.68 7.83
> > > S1111 13.78 14.48 16.14 16.30
> > > S112 16.67 17.41 16.54 17.52
> > > S1112 13.12 14.21 14.83 14.84
> > > S113 22.12 22.88 22.05 22.05
> > > S1113 11.06 11.42 11.03 11.01
> > > S114 13.23 13.75 13.53 13.48
> > > S115 32.76 33.24 49.98 49.99
> > > S1115 13.68 14.18 13.65 13.66
> > > S116 47.42 49.40 49.54 48.11
> > > S118 10.84 11.26 10.79 10.50
> > > S119 8.74 9.07 11.83 11.82
> > > S1119 8.81 9.14 4.31 11.87
> > > S121 17.28 18.78 14.84 17.31
> > > S122 7.53 7.54 6.11 6.11
> > > S123 6.90 7.38 7.42 7.41
> > > S124 9.60 9.77 9.42 9.33
> > > S125 6.92 7.22 4.67 7.81
> > > S126 2.34 2.53 2.57 2.37
> > > S127 12.19 12.97 7.06 14.50
> > > S128 11.74 12.43 12.42 11.52
> > > S131 28.75 29.91 25.17 28.94
> > > S132 17.04 17.04 15.53 21.03
> > > S141 12.28 12.26 12.38 12.05
> > > S151 28.80 29.43 24.89 28.95
> > > S152 15.54 16.03 11.19 15.63
> > > S161 6.00 6.06 5.52 5.46
> > > S1161 14.39 14.40 8.80 8.79
> > > S162 8.19 9.05 5.36 8.18
> > > S171 15.41 7.94 2.81 5.70
> > > S172 5.71 5.89 2.75 5.70
> > > S173 30.31 30.92 18.15 30.13
> > > S174 30.18 31.66 18.51 30.16
> > > S175 5.78 6.18 4.94 5.77
> > > S176 5.59 5.83 4.41 7.65
> > > S211 16.27 17.14 16.82 16.38
> > > S212 13.21 14.28 13.34 13.18
> > > S1213 12.81 13.46 12.80 12.43
> > > S221 10.86 11.09 8.65 8.63
> > > S1221 5.72 6.04 5.40 6.05
> > > S222 6.02 6.26 5.70 5.72
> > > S231 22.33 22.94 22.36 22.11
> > > S232 6.88 6.88 6.89 6.89
> > > S1232 15.30 15.34 15.05 15.10
> > > S233 55.38 58.55 54.21 49.56
> > > S2233 27.08 29.77 29.68 28.40
> > > S235 44.00 44.92 46.94 43.93
> > > S241 31.09 31.35 32.53 31.01
> > > S242 7.19 7.20 7.20 7.20
> > > S243 16.52 17.09 17.69 16.84
> > > S244 14.45 14.83 16.91 16.82
> > > S1244 14.71 14.83 14.77 14.40
> > > S2244 10.04 10.62 10.40 10.06
> > > S251 34.15 35.75 19.70 34.38
> > > S1251 55.23 57.84 41.77 56.11
> > > S2251 15.73 15.87 17.02 15.70
> > > S3251 15.66 16.21 19.60 15.34
> > > S252 6.18 6.32 7.72 7.26
> > > S253 11.14 11.38 14.40 14.40
> > > S254 18.41 18.70 28.23 28.06
> > > S255 5.93 6.09 9.96 9.95
> > > S256 3.08 3.42 3.10 3.09
> > > S257 2.13 2.25 2.21 2.20
> > > S258 1.79 1.82 1.84 1.84
> > > S261 12.00 12.08 10.98 10.95
> > > S271 32.82 33.04 33.25 33.01
> > > S272 14.98 15.82 15.39 15.26
> > > S273 13.92 14.04 16.86 16.80
> > > S274 17.83 18.31 18.15 17.89
> > > S275 2.92 3.02 3.36 2.98
> > > S2275 32.80 33.50 8.97 33.60
> > > S276 39.43 39.44 40.80 40.55
> > > S277 4.80 4.80 4.81 4.80
> > > S278 14.41 14.42 14.70 14.66
> > > S279 8.03 8.29 7.25 7.27
> > > S1279 9.71 10.06 9.34 9.25
> > > S2710 7.71 8.04 7.86 7.56
> > > S2711 35.53 35.55 36.56 36.00
> > > S2712 32.94 33.17 34.24 33.47
> > > S281 10.79 11.09 12.46 12.02
> > > S1281 79.13 77.55 57.78 68.06
> > > S291 11.80 11.78 14.03 14.03
> > > S292 7.77 7.78 9.94 9.96
> > > S293 15.50 15.87 19.32 19.33
> > > S2101 2.56 2.58 2.59 2.60
> > > S2102 16.71 17.53 16.68 16.75
> > > S2111 5.60 5.60 5.85 5.85
> > > S311 72.03 72.03 72.23 72.03
> > > S31111 7.49 6.00 6.00 6.00
> > > S312 96.04 96.04 96.05 96.03
> > > S313 36.02 36.13 36.03 36.02
> > > S314 36.01 36.07 74.67 72.42
> > > S315 8.96 8.99 9.35 9.30
> > > S316 36.02 36.06 72.08 74.87
> > > S317 444.93 444.94 451.82 451.78
> > > S318 9.05 9.07 7.30 7.30
> > > S319 34.54 36.53 34.42 34.19
> > > S3110 8.51 8.57 4.11 4.11
> > > S13110 5.75 5.77 12.12 12.12
> > > S3111 3.60 3.62 3.60 3.60
> > > S3112 7.19 7.30 7.21 7.20
> > > S3113 35.13 35.47 60.21 60.20
> > > S321 16.79 16.81 16.80 16.80
> > > S322 12.42 12.60 12.60 12.60
> > > S323 10.86 11.02 8.48 8.51
> > > S331 4.23 4.23 7.20 7.20
> > > S332 7.20 7.21 5.21 5.31
> > > S341 4.79 4.85 7.23 7.20
> > > S342 6.01 6.09 7.25 7.20
> > > S343 2.04 2.06 2.16 2.01
> > > S351 46.61 47.34 21.82 46.46
> > > S1351 49.28 50.35 33.68 49.06
> > > S352 57.65 58.04 57.68 57.64
> > > S353 8.21 8.38 8.34 8.19
> > > S421 42.94 43.34 20.62 22.46
> > > S1421 25.15 25.81 15.85 24.76
> > > S422 87.39 87.53 79.22 78.99
> > > S423 155.01 155.29 154.56 154.38
> > > S424 36.51 37.51 11.42 22.36
> > > S431 57.10 60.66 27.59 57.16
> > > S441 14.04 13.29 12.88 12.81
> > > S442 6.00 6.00 6.96 6.90
> > > S443 17.28 17.77 17.15 16.95
> > > S451 48.92 49.08 49.03 49.14
> > > S452 42.98 39.32 14.64 96.03
> > > S453 28.05 28.06 14.60 14.40
> > > S471 8.24 8.65 8.39 8.43
> > > S481 10.88 11.15 12.04 12.00
> > > S482 9.21 9.31 9.19 9.17
> > > S491 11.26 11.38 11.37 11.28
> > > S4112 8.21 8.36 9.13 8.94
> > > S4113 8.65 8.81 8.86 8.85
> > > S4114 11.82 12.15 12.18 11.77
> > > S4115 8.28 8.46 8.95 8.59
> > > S4116 3.22 3.23 6.02 5.94
> > > S4117 13.95 9.61 10.16 9.98
> > > S4121 8.21 8.26 4.04 8.17
> > > va 28.46 28.58 23.58 48.46
> > > vag 12.35 12.36 13.58 13.20
> > > vas 13.45 13.49 13.03 12.47
> > > vif 4.55 4.57 5.06 4.92
> > > vpv 57.08 57.22 28.28 57.24
> > > vtv 57.81 57.83 28.40 57.63
> > > vpvtv 32.82 32.84 16.35 32.73
> > > vpvts 5.82 5.83 2.99 6.38
> > > vpvpv 32.87 32.89 16.54 32.85
> > > vtvtv 32.82 32.80 16.84 35.97
> > > vsumr 72.04 72.03 72.20 72.04
> > > vdotr 72.06 72.05 72.42 72.04
> > > vbor 205.24 380.81 99.80 372.05
> > >
> > > -Hal
> > >
> > > > >
> > > > > Loop llvm-v llvm gcc-v gcc
> > > > > -------------------------------------------
> > > > > S000 9.59 9.49 4.55 10.04
> > > > > S111 7.67 7.37 7.68 7.83
> > > > > S1111 13.98 14.48 16.14 16.30
> > > > > S112 17.43 17.41 16.54 17.52
> > > > > S1112 13.87 14.21 14.83 14.84
> > > > > S113 22.97 22.88 22.05 22.05
> > > > > S1113 11.46 11.42 11.03 11.01
> > > > > S114 13.47 13.75 13.53 13.48
> > > > > S115 33.06 33.24 49.98 49.99
> > > > > S1115 13.91 14.18 13.65 13.66
> > > > > S116 48.74 49.40 49.54 48.11
> > > > > S118 11.04 11.26 10.79 10.50
> > > > > S119 8.97 9.07 11.83 11.82
> > > > > S1119 9.04 9.14 4.31 11.87
> > > > > S121 18.06 18.78 14.84 17.31
> > > > > S122 7.58 7.54 6.11 6.11
> > > > > S123 7.02 7.38 7.42 7.41
> > > > > S124 9.62 9.77 9.42 9.33
> > > > > S125 7.14 7.22 4.67 7.81
> > > > > S126 2.32 2.53 2.57 2.37
> > > > > S127 12.87 12.97 7.06 14.50
> > > > > S128 12.58 12.43 12.42 11.52
> > > > > S131 29.91 29.91 25.17 28.94
> > > > > S132 17.04 17.04 15.53 21.03
> > > > > S141 12.59 12.26 12.38 12.05
> > > > > S151 28.92 29.43 24.89 28.95
> > > > > S152 15.68 16.03 11.19 15.63
> > > > > S161 6.06 6.06 5.52 5.46
> > > > > S1161 14.46 14.40 8.80 8.79
> > > > > S162 8.31 9.05 5.36 8.18
> > > > > S171 15.47 7.94 2.81 5.70
> > > > > S172 5.92 5.89 2.75 5.70
> > > > > S173 31.59 30.92 18.15 30.13
> > > > > S174 31.16 31.66 18.51 30.16
> > > > > S175 5.80 6.18 4.94 5.77
> > > > > S176 5.69 5.83 4.41 7.65
> > > > > S211 16.56 17.14 16.82 16.38
> > > > > S212 13.46 14.28 13.34 13.18
> > > > > S1213 13.12 13.46 12.80 12.43
> > > > > S221 10.88 11.09 8.65 8.63
> > > > > S1221 5.80 6.04 5.40 6.05
> > > > > S222 6.01 6.26 5.70 5.72
> > > > > S231 23.78 22.94 22.36 22.11
> > > > > S232 6.88 6.88 6.89 6.89
> > > > > S1232 16.00 15.34 15.05 15.10
> > > > > S233 57.48 58.55 54.21 49.56
> > > > > S2233 27.65 29.77 29.68 28.40
> > > > > S235 46.40 44.92 46.94 43.93
> > > > > S241 31.62 31.35 32.53 31.01
> > > > > S242 7.20 7.20 7.20 7.20
> > > > > S243 16.78 17.09 17.69 16.84
> > > > > S244 14.64 14.83 16.91 16.82
> > > > > S1244 14.98 14.83 14.77 14.40
> > > > > S2244 10.47 10.62 10.40 10.06
> > > > > S251 35.10 35.75 19.70 34.38
> > > > > S1251 56.65 57.84 41.77 56.11
> > > > > S2251 15.96 15.87 17.02 15.70
> > > > > S3251 16.41 16.21 19.60 15.34
> > > > > S252 7.24 6.32 7.72 7.26
> > > > > S253 12.55 11.38 14.40 14.40
> > > > > S254 19.08 18.70 28.23 28.06
> > > > > S255 5.94 6.09 9.96 9.95
> > > > > S256 3.14 3.42 3.10 3.09
> > > > > S257 2.18 2.25 2.21 2.20
> > > > > S258 1.80 1.82 1.84 1.84
> > > > > S261 12.00 12.08 10.98 10.95
> > > > > S271 32.93 33.04 33.25 33.01
> > > > > S272 15.48 15.82 15.39 15.26
> > > > > S273 13.99 14.04 16.86 16.80
> > > > > S274 18.38 18.31 18.15 17.89
> > > > > S275 3.02 3.02 3.36 2.98
> > > > > S2275 33.71 33.50 8.97 33.60
> > > > > S276 39.52 39.44 40.80 40.55
> > > > > S277 4.81 4.80 4.81 4.80
> > > > > S278 14.43 14.42 14.70 14.66
> > > > > S279 8.10 8.29 7.25 7.27
> > > > > S1279 9.77 10.06 9.34 9.25
> > > > > S2710 7.85 8.04 7.86 7.56
> > > > > S2711 35.54 35.55 36.56 36.00
> > > > > S2712 33.16 33.17 34.24 33.47
> > > > > S281 10.97 11.09 12.46 12.02
> > > > > S1281 79.37 77.55 57.78 68.06
> > > > > S291 11.94 11.78 14.03 14.03
> > > > > S292 7.88 7.78 9.94 9.96
> > > > > S293 15.90 15.87 19.32 19.33
> > > > > S2101 2.59 2.58 2.59 2.60
> > > > > S2102 17.63 17.53 16.68 16.75
> > > > > S2111 5.63 5.60 5.85 5.85
> > > > > S311 72.07 72.03 72.23 72.03
> > > > > S31111 7.49 6.00 6.00 6.00
> > > > > S312 96.06 96.04 96.05 96.03
> > > > > S313 36.50 36.13 36.03 36.02
> > > > > S314 36.10 36.07 74.67 72.42
> > > > > S315 9.00 8.99 9.35 9.30
> > > > > S316 36.11 36.06 72.08 74.87
> > > > > S317 444.92 444.94 451.82 451.78
> > > > > S318 9.04 9.07 7.30 7.30
> > > > > S319 34.76 36.53 34.42 34.19
> > > > > S3110 8.53 8.57 4.11 4.11
> > > > > S13110 5.76 5.77 12.12 12.12
> > > > > S3111 3.60 3.62 3.60 3.60
> > > > > S3112 7.20 7.30 7.21 7.20
> > > > > S3113 35.12 35.47 60.21 60.20
> > > > > S321 16.81 16.81 16.80 16.80
> > > > > S322 12.42 12.60 12.60 12.60
> > > > > S323 10.93 11.02 8.48 8.51
> > > > > S331 4.23 4.23 7.20 7.20
> > > > > S332 7.21 7.21 5.21 5.31
> > > > > S341 4.74 4.85 7.23 7.20
> > > > > S342 6.02 6.09 7.25 7.20
> > > > > S343 2.14 2.06 2.16 2.01
> > > > > S351 49.26 47.34 21.82 46.46
> > > > > S1351 50.85 50.35 33.68 49.06
> > > > > S352 58.14 58.04 57.68 57.64
> > > > > S353 8.35 8.38 8.34 8.19
> > > > > S421 43.13 43.34 20.62 22.46
> > > > > S1421 25.25 25.81 15.85 24.76
> > > > > S422 88.36 87.53 79.22 78.99
> > > > > S423 155.13 155.29 154.56 154.38
> > > > > S424 37.11 37.51 11.42 22.36
> > > > > S431 58.22 60.66 27.59 57.16
> > > > > S441 14.05 13.29 12.88 12.81
> > > > > S442 6.08 6.00 6.96 6.90
> > > > > S443 17.60 17.77 17.15 16.95
> > > > > S451 48.95 49.08 49.03 49.14
> > > > > S452 42.98 39.32 14.64 96.03
> > > > > S453 28.06 28.06 14.60 14.40
> > > > > S471 8.53 8.65 8.39 8.43
> > > > > S481 10.98 11.15 12.04 12.00
> > > > > S482 9.31 9.31 9.19 9.17
> > > > > S491 11.54 11.38 11.37 11.28
> > > > > S4112 8.21 8.36 9.13 8.94
> > > > > S4113 8.77 8.81 8.86 8.85
> > > > > S4114 12.32 12.15 12.18 11.77
> > > > > S4115 8.48 8.46 8.95 8.59
> > > > > S4116 3.21 3.23 6.02 5.94
> > > > > S4117 14.08 9.61 10.16 9.98
> > > > > S4121 8.53 8.26 4.04 8.17
> > > > > va 30.09 28.58 23.58 48.46
> > > > > vag 12.35 12.36 13.58 13.20
> > > > > vas 13.74 13.49 13.03 12.47
> > > > > vif 4.49 4.57 5.06 4.92
> > > > > vpv 58.59 57.22 28.28 57.24
> > > > > vtv 59.15 57.83 28.40 57.63
> > > > > vpvtv 33.18 32.84 16.35 32.73
> > > > > vpvts 5.99 5.83 2.99 6.38
> > > > > vpvpv 33.25 32.89 16.54 32.85
> > > > > vtvtv 32.83 32.80 16.84 35.97
> > > > > vsumr 72.03 72.03 72.20 72.04
> > > > > vdotr 72.05 72.05 72.42 72.04
> > > > > vbor 205.22 380.81 99.80 372.05
> > > > >
> > > > > I've yet to go through these in detail (they just finished running 5
> > > > > minutes ago). But for the curious (and I've had several requests for
> > > > > benchmarks), here you go. There is obviously more work to do.
> > > > >
> > > > > -Hal
> > > > >
> > > > > On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote:
> > > > > > Hi Hal,
> > > > > >
> > > > > > those numbers look very promising, great work! :)
> > > > > >
> > > > > > Best,
> > > > > > Ralf
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Hal Finkel" <hfinkel at anl.gov>
> > > > > > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com>
> > > > > > > Cc: llvm-commits at cs.uiuc.edu
> > > > > > > Sent: Freitag, 28. Oktober 2011 13:50:00
> > > > > > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
> > > > > > >
> > > > > > > Bruno, et al.,
> > > > > > >
> > > > > > > I've attached a new version of the patch that contains
> > > > > > > improvements (and a critical bug fix [the code output is no more
> > > > > > > correct than before, but the pass in the older patch would crash
> > > > > > > in certain cases and now does not]) compared to previous versions
> > > > > > > that I've posted.
> > > > > > >
> > > > > > > First, these are preliminary results because I did not do the things
> > > > > > > necessary to make them real (explicitly quiet the machine, bind the
> > > > > > > processes to one cpu, etc.). But they should be good enough for
> > > > > > > discussion.
> > > > > > >
> > > > > > > I'm using LLVM head r143101, with the attached patch applied, and
> > > > > > > clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> > > > > > > For the gcc comparison, I'm using the Ubuntu build 4.4.3-4ubuntu5.
> > > > > > > gcc was run with -O3 and no other optimization flags. opt was run
> > > > > > > with -vectorize -unroll-allow-partial -O3 and no other
> > > > > > > optimization flags (the patch adds the -vectorize option). llc
> > > > > > > was just given -O3.
> > > > > > >
> > > > > > > It is not difficult to construct an example in which
> > > > > > > vectorization would be useful: take a loop that does more
> > > > > > > computation than load/stores, and (partially) unroll it. Here is
> > > > > > > a simple case:
> > > > > > >
> > > > > > > #define ITER 5000
> > > > > > > #define NUM 200
> > > > > > > double a[NUM][NUM];
> > > > > > > double b[NUM][NUM];
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > int main()
> > > > > > > {
> > > > > > >   ...
> > > > > > >
> > > > > > >   for (int i = 0; i < ITER; ++i) {
> > > > > > >     for (int x = 0; x < NUM; ++x)
> > > > > > >       for (int y = 0; y < NUM; ++y) {
> > > > > > >         double v = a[x][y], w = b[x][y];
> > > > > > >         double z1 = v*w;
> > > > > > >         double z2 = v+w;
> > > > > > >         double z3 = z1*z2;
> > > > > > >         double z4 = z3+v;
> > > > > > >         double z5 = z2+w;
> > > > > > >         double z6 = z4*z5;
> > > > > > >         double z7 = z4+z5;
> > > > > > >         a[x][y] = v*v-z6;
> > > > > > >         b[x][y] = w-z7;
> > > > > > >       }
> > > > > > >   }
> > > > > > >
> > > > > > >   ...
> > > > > > >
> > > > > > >   return 0;
> > > > > > > }
> > > > > > >
> > > > > > > Results:
> > > > > > > gcc -O3: 0m1.790s
> > > > > > > llvm -vectorize: 0m2.360s
> > > > > > > llvm: 0m2.780s
> > > > > > > gcc -fno-tree-vectorize: 0m2.810s
> > > > > > > (these are the user times after I've run enough for the times to
> > > > > > > settle to three decimal places)
> > > > > > >
> > > > > > > So the vectorization gives a ~15% improvement in the running
> > > > > > > time. gcc's vectorization still does a much better job, however
> > > > > > > (yielding a ~36% improvement). So there is still work to do ;)
> > > > > > >
> > > > > > > Additionally, I've checked the autovectorization on some classic
> > > > > > > numerical benchmarks from netlib. On these benchmarks, clang/llvm
> > > > > > > already do a good job compared to gcc (gcc is only about 10%
> > > > > > > better, and this is true regardless of whether gcc's
> > > > > > > vectorization is on or off). For these cases, autovectorization
> > > > > > > provides an insignificant speedup in most cases (but does not
> > > > > > > tend to make things worse, just not really any better either).
> > > > > > > Because gcc's vectorization also did not really help gcc in
> > > > > > > these cases, I'm not surprised. A good collection of these is
> > > > > > > available here:
> > > > > > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz
> > > > > > >
> > > > > > > I've yet to run the test suite using the pass to validate it.
> > > > > > > That is something that I plan to do. Actually, the "Livermore
> > > > > > > Loops" test in the aforementioned archive contains checksums to
> > > > > > > validate the results, and it looks like 1 or 2 of the loop
> > > > > > > results are wrong with vectorization turned on, so I'll have to
> > > > > > > investigate that.
> > > > > > >
> > > > > > > -Hal
> > > > > > >
> > > > > > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> > > > > > > > Hi Hal,
> > > > > > > >
> > > > > > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov>
> > > > > > > > wrote:
> > > > > > > > > I've attached an initial version of a basic-block
> > > > > > > > > autovectorization pass. It works by searching a basic block
> > > > > > > > > for pairable (independent) instructions, and, using a
> > > > > > > > > chain-seeking heuristic, selects pairings likely to provide
> > > > > > > > > an overall speedup (if such pairings can be found). The
> > > > > > > > > selected pairs are then fused and, if necessary, other
> > > > > > > > > instructions are moved in order to maintain data-flow
> > > > > > > > > consistency. This works only within one basic block, but can
> > > > > > > > > do loop vectorization in combination with (partial)
> > > > > > > > > unrolling. The basic idea was inspired by the Vienna MAP
> > > > > > > > > Vectorizer, which has been used to vectorize FFT kernels,
> > > > > > > > > but the algorithm used here is different.
> > > > > > > > >
> > > > > > > > > To try it, use -bb-vectorize with opt. There are a few
> > > > > > > > > options:
> > > > > > > > > -bb-vectorize-req-chain-depth: default: 3 -- The depth of
> > > > > > > > > the chain of instruction pairs necessary in order to
> > > > > > > > > consider the pairs that compose the chain worthy of
> > > > > > > > > vectorization.
> > > > > > > > > -bb-vectorize-vector-bits: default: 128 -- The size of the
> > > > > > > > > target vector registers.
> > > > > > > > > -bb-vectorize-no-ints -- Don't consider integer
> > > > > > > > > instructions.
> > > > > > > > > -bb-vectorize-no-floats -- Don't consider floating-point
> > > > > > > > > instructions.
> > > > > > > > >
> > > > > > > > > The vectorizer generates a lot of
> > > > > > > > > insert_element/extract_element pairs; the assumption is that
> > > > > > > > > other passes will turn these into shuffles when possible (it
> > > > > > > > > looks like some work is necessary here). It will also
> > > > > > > > > vectorize vector instructions, and generates shuffles in
> > > > > > > > > this case (again, other passes should combine these as
> > > > > > > > > appropriate).
> > > > > > > > >
> > > > > > > > > Currently, it does not fuse load or store instructions, but
> > > > > > > > > that is a feature that I'd like to add. Of course, alignment
> > > > > > > > > information is an issue for load/store vectorization (or
> > > > > > > > > maybe I should just fuse them anyway and let isel deal with
> > > > > > > > > unaligned cases?).
> > > > > > > > >
> > > > > > > > > Also, support needs to be added for fusing known intrinsics
> > > > > > > > > (fma, etc.), and, as has been discussed on llvmdev, we
> > > > > > > > > should add some intrinsics to allow the generation of
> > > > > > > > > addsub-type instructions.
> > > > > > > > >
> > > > > > > > > I've included a few tests, but it needs more. Please review
> > > > > > > > > (I'll commit if and when everyone is happy).
> > > > > > > > >
> > > > > > > > > Thanks in advance,
> > > > > > > > > Hal
> > > > > > > > >
> > > > > > > > > P.S. There is another option (not so useful right now, but
> > > > > > > > > could be):
> > > > > > > > > -bb-vectorize-fast-dep -- Don't do a full inter-instruction
> > > > > > > > > dependency analysis; instead stop looking for instruction
> > > > > > > > > pairs after the first use of an instruction's value. [This
> > > > > > > > > makes the pass faster, but would require a
> > > > > > > > > data-dependence-based reordering pass in order to be
> > > > > > > > > effective].
> > > > > > > >
> > > > > > > > Cool! :)
> > > > > > > > Have you run this pass with any benchmark or the llvm
> > > > > > > > testsuite? Does it present any regressions?
> > > > > > > > Do you have any performance results?
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Hal Finkel
> > > > > > > Postdoctoral Appointee
> > > > > > > Leadership Computing Facility
> > > > > > > Argonne National Laboratory
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > llvm-commits mailing list
> > > > > > > llvm-commits at cs.uiuc.edu
> > > > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > > > > >
> > > > >
> > > >
> > >
> >
>
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory