[llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Hal Finkel hfinkel at anl.gov
Tue Nov 1 14:58:59 PDT 2011


On Tue, 2011-11-01 at 19:19 +0000, Tobias Grosser wrote:
> On 11/01/2011 06:32 PM, Hal Finkel wrote:
> > Any objections to me committing this? [And some relevant docs changes] I
> > think that it is ready at this point.
> 
> First of all, I think it is great to see work starting on an 
> autovectorizer for LLVM. Unfortunately I did not have time to test your 
> vectorizer pass intensively, but here are my first comments:
> 
> 1. This patch breaks the --enable-shared/BUILD_SHARED_LIBS build. The
>     following patch fixes this for cmake:
>     0001-Add-vectorizer-to-libraries-used-by-Transforms-IPO.patch
> 

Thanks!

>     Can you check the autoconf build with --enable-shared?

I will check.

> 
> 2. Did you run this pass on the llvm test-suite? Does your vectorizer
>     introduce any correctness regressions? What are the top 10 compile
>     time increases/decreases? How about run time?
> 

I'll try to get this set up and post the results.

> 3. I did not really test this intensively, but I had the feeling the
>     compile-time increase for large basic blocks is quite large.
>     I still need to extract a test case. Any comments on the complexity
>     of your vectorizer?

This may very well be true. As is, I would not recommend activating this
pass by default (at -O3) because it is fairly slow and the resulting
performance increase, while significant in many cases, is not large
enough, IMHO, to justify the extra compile-time cost. Ideally, this kind
of vectorization should be the "vectorizer of last resort": the pass
that tries really hard to squeeze the last little bit of vectorization
out of the code. At the moment, it is all that we have, but I hope that
will change. I've not yet done any real profiling, so I'll hold off on
commenting about future performance improvements.

The base complexity is a bit difficult to pin down; there are certainly
a few stages, including the initial one, that are O(n^2), where n is the
number of instructions in the block. The "connection-finding" stage
should also be O(n^2) in practice, but it really iterates over
instruction-user pairs and so could be worse in pathological cases.
Note, however, that in the later stages, n is not the number of
instructions in the block, but rather the number of (unordered)
candidate instruction pairs (which is going to be much less than the
square of the number of instructions in the block). It should be
possible to generate a compile-time scaling plot by taking a loop,
compiling it with partial unrolling, and looking at how the compile time
changes with the unrolling limit; I'll try to do that.
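
To make the quadratic part concrete, here is a rough sketch of the kind
of all-pairs scan the candidate-selection stage performs over a basic
block. This is not the actual pass code: areCompatible and
collectCandidatePairs are hypothetical stand-ins for the real legality
checks and driver, and the real pass does considerably more work per
pair.

    // Hypothetical sketch of the O(n^2) candidate-pair search; not the
    // actual BasicBlock vectorizer code.
    #include <vector>
    #include <utility>
    #include "llvm/BasicBlock.h"
    #include "llvm/Instruction.h"
    using namespace llvm;

    typedef std::pair<Instruction *, Instruction *> CandidatePair;

    // Placeholder for the real legality checks (vectorizable types, no
    // conflicting memory dependence between I and J, etc.).
    static bool areCompatible(Instruction *I, Instruction *J) {
      return I->getOpcode() == J->getOpcode();
    }

    static void collectCandidatePairs(BasicBlock &BB,
                                      std::vector<CandidatePair> &Pairs) {
      for (BasicBlock::iterator I = BB.begin(), E = BB.end(); I != E; ++I) {
        BasicBlock::iterator J = I;
        // Every later instruction J is checked against I, so this loop
        // is O(n^2) in the number of instructions in the block.
        for (++J; J != E; ++J)
          if (areCompatible(&*I, &*J))
            Pairs.push_back(CandidatePair(&*I, &*J));
      }
    }

The later stages then work over the contents of Pairs (and connections
between those pairs), which is why their n is the candidate-pair count
rather than the instruction count.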

I'm writing a paper on the vectorizer, so within a few weeks there will
be a very good description (complete with diagrams) :)

> 
> I plan to look into your vectorizer during the next couple of 
> days/weeks, but will most probably not have the time to do this tonight. 
> Sorry. :-(

Not a problem; it seems that I have some homework to do first ;)

Thanks,
Hal

> 
> Cheers
> Tobi

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
