[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Hal Finkel hfinkel at anl.gov
Tue Nov 8 02:45:09 PST 2011


I've attached the latest version of my autovectorization patch.

Working through the test suite has proved to be a productive
experience ;) -- and almost all of the bugs that it revealed have now
been fixed. There are still two programs that don't compile with
vectorization turned on, and I'm working on those now, but in case
anyone feels like playing with vectorization, this patch will probably
work for you.

The largest three performance speedups are:
SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup

The largest three performance slowdowns are:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
114% slowdown
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
slowdown
SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown

(from these, I've excluded tests that took less than 0.1 seconds to
run).

The largest three compile-time slowdowns are:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
1276% slowdown
SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown

Not everything slows down; MultiSource/Benchmarks/Prolangs-C++/city/city,
for example, compiles 10% faster with vectorization enabled. For the most
part, though, things certainly take longer to compile with vectorization
enabled: the average compile-time slowdown over all tests was 29%, and the
median was 11%. On the other hand, the average run-time speedup over all
tests was 5.2%, and the median was 1.3%.

Compared to previous patches, which had a minimum required chain length
of 3 or 4, I've now made the default 6. While using a chain length of 4
worked well for targeted benchmarks, it caused an overall slowdown on
almost all test-suite programs. Using a minimum length of 6 causes, on
average, a speedup; so I think that is a better default choice.
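
For concreteness, here is a minimal sketch (my own illustrative example,
not code from the patch or from the test suite) of the kind of
straight-line code the pass targets; corresponding operations in the
statements below can be paired, and the chain of connected pairs must
meet the minimum length before it is kept:

// Hypothetical example: eight independent multiply-add statements in one
// basic block. The vectorizer can pair corresponding operations from
// adjacent statements (load with load, multiply with multiply, and so on),
// and the use-def connections between those pairs form the "chain" whose
// minimum length is being tuned here.
void axpy8(double *y, const double *x, double a) {
  y[0] += a * x[0];
  y[1] += a * x[1];
  y[2] += a * x[2];
  y[3] += a * x[3];
  y[4] += a * x[4];
  y[5] += a * x[5];
  y[6] += a * x[6];
  y[7] += a * x[7];
}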

 -Hal

On Tue, 2011-11-01 at 18:54 -0500, Hal Finkel wrote:
> On Tue, 2011-11-01 at 16:59 -0500, Hal Finkel wrote:
> > On Tue, 2011-11-01 at 19:19 +0000, Tobias Grosser wrote:
> > > On 11/01/2011 06:32 PM, Hal Finkel wrote:
> > > > Any objections to me committing this? [And some relevant docs changes] I
> > > > think that it is ready at this point.
> > > 
> > > First of all, I think it is great to see work starting on an 
> > > autovectorizer for LLVM. Unfortunately I did not have time to test your 
> > > vectorizer pass intensively, but here are my first comments:
> > > 
> > > 1. This patch breaks the --enable-shared/BUILD_SHARED_LIBS build. The
> > >     following patch fixes this for cmake:
> > >     0001-Add-vectorizer-to-libraries-used-by-Transforms-IPO.patch
> > > 
> > 
> > Thanks!
> > 
> > >     Can you check the autoconf build with --enable-shared?
> > 
> > I will check.
> 
> This appears to work as it should.
> 
> > 
> > > 
> > > 2. Did you run this pass on the llvm test-suite? Does your vectorizer
> > >     introduce any correctness regressions? What are the top 10 compile
> > >     time increases/decreases? How about run time?
> > > 
> > 
> > I'll try to get this setup and post the results.
> > 
> > > 3. I did not really test this intensively, but I had the feeling the
> > >     compile-time increase for large basic blocks is quite large.
> > >     I still need to extract a test case. Any comments on the complexity
> > >     of your vectorizer?
> > 
> > This may very well be true. As is, I would not recommend activating this
> > pass by default (at -O3) because it is fairly slow and the resulting
> > performance increase, while significant in many cases, is not large
> > enough to, IMHO, justify the extra base compile-time increase. Ideally,
> > this kind of vectorization should be the "vectorizer of last resort" --
> > the pass that tries really hard to squeeze the last little bit of
> > vectorization possible out of the code. At the moment, it is all that we
> > have, but I hope that will change. I've not yet done any real profiling,
> > so I'll hold off on commenting about future performance improvements.
> > 
> > Base complexity is a bit difficult to pin down: there are certainly a few
> > stages, including that initial one, that are O(n^2), where n is the number
> > of instructions in the block. The "connection-finding" stage should also
> > be O(n^2) in practice, but it really iterates over instruction-user pairs
> > and so could be worse in pathological cases. Note, however, that in the
> > latter stages, the n in that n^2 is not the number of instructions in the
> > block, but rather the number of (unordered) candidate instruction pairs
> > (which is going to be much less than the n^2 from just the number of
> > instructions in the block). It should be possible to generate a
> > compile-time scaling plot by taking a loop and compiling it with partial
> > unrolling, then looking at how the compile time changes with the
> > unrolling limit; I'll try to do that.
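> > 
> > To make the shape of that cost concrete, here is a minimal schematic
> > sketch (my own illustration, not code from the patch) of why a pairing
> > stage like the initial one is O(n^2) in the number of instructions in
> > the block:
> > 
> > #include <utility>
> > #include <vector>
> > 
> > struct Instr { int Opcode; };  // stand-in for a real IR instruction
> > 
> > // Assumed predicate: whether two instructions may be fused into a pair.
> > static bool canPair(const Instr &A, const Instr &B) {
> >   return A.Opcode == B.Opcode;
> > }
> > 
> > // Every unordered pair of instructions in the block is examined once,
> > // so the work grows quadratically with the block size.
> > static std::vector<std::pair<const Instr *, const Instr *>>
> > findCandidatePairs(const std::vector<Instr> &Block) {
> >   std::vector<std::pair<const Instr *, const Instr *>> Pairs;
> >   for (size_t I = 0; I < Block.size(); ++I)        // n iterations
> >     for (size_t J = I + 1; J < Block.size(); ++J)  // up to n - 1 each
> >       if (canPair(Block[I], Block[J]))
> >         Pairs.push_back({&Block[I], &Block[J]});
> >   return Pairs;  // O(n^2) pair checks in total
> > }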
> 
> So for this test, I ran:
> time opt -S -O3 -unroll-allow-partial -vectorize -o /dev/null q.ll
> where q.ll contains the output from clang -O3 of the vbor function from
> the benchmarks I've been posting recently. The first column is the value
> of -unroll-threshold, the second column is the time with vectorization,
> and the third column is the time without vectorization (time in seconds
> for a release build).
> 
> 100    0.030  0.000
> 200    0.130  0.030
> 300    0.770  0.030
> 400    1.240  0.040
> 500    1.280  0.050
> 600    9.450  0.060
> 700   29.300  0.060
> 
> I am not sure why the 400 and 500 times are so close. Obviously, it is
> not linear ;) I am not sure that enumerating the possible pairings can
> be done in a sub-quadratic way, but I will do some profiling and see if
> I can make things better. To be fair, this test creates a kind of
> worst-case scenario: an increasingly large block of instructions, almost
> all of which are potentially fusable.
> 
> It may also be possible to design additional heuristics to help the
> situation. For example, we might introduce a target chain length such
> that if the vectorizer finds a chain of a given length, it selects it,
> foregoing the remainder of the search for the selected starting
> instruction. This kind of thing will require further research and
> testing.
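> 
> As a rough illustration of that idea (my own sketch under stated
> assumptions, not part of the patch), the search from a given starting
> instruction could simply stop once a sufficiently long chain has been
> found:
> 
> #include <vector>
> 
> struct Chain { std::vector<int> PairIndices; };  // hypothetical stand-in
> 
> // Hypothetical helper: a real implementation would grow a chain of
> // connected pairs by walking use-def edges; here it is only a stub.
> static Chain extendChain(int Start, int Candidate) {
>   return Chain{{Start, Candidate}};
> }
> 
> // Proposed cutoff: once a chain at least TargetChainLen pairs long is
> // found for this starting instruction, select it and forgo the
> // remainder of the search.
> static Chain searchFrom(int Start, const std::vector<int> &Candidates,
>                         unsigned TargetChainLen) {
>   Chain Best;
>   for (int C : Candidates) {
>     Chain Cur = extendChain(Start, C);
>     if (Cur.PairIndices.size() > Best.PairIndices.size())
>       Best = Cur;
>     if (Best.PairIndices.size() >= TargetChainLen)
>       return Best;  // good enough: stop searching early
>   }
>   return Best;
> }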
> 
>  -Hal
> 
> > 
> > I'm writing a paper on the vectorizer, so within a few weeks there will
> > be a very good description (complete with diagrams) :)
> > 
> > > 
> > > I plan to look into your vectorizer during the next couple of 
> > > days/weeks, but will most probably not have the time to do this tonight. 
> > > Sorry. :-(
> > 
> > Not a problem; it seems that I have some homework to do first ;)
> > 
> > Thanks,
> > Hal
> > 
> > > 
> > > Cheers
> > > Tobi
> > 
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_bb_vectorize-20111107.diff
Type: text/x-patch
Size: 79455 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111108/3b39956c/attachment.bin>

