[llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Tue Nov 1 16:54:54 PDT 2011

On Tue, 2011-11-01 at 16:59 -0500, Hal Finkel wrote:
> On Tue, 2011-11-01 at 19:19 +0000, Tobias Grosser wrote:
> > On 11/01/2011 06:32 PM, Hal Finkel wrote:
> > > Any objections to me committing this? [And some relevant docs changes] I
> > > think that it is ready at this point.
> > 
> > First of all. I think it is great to see work starting on an 
> > autovectorizer for LLVM. Unfortunately I did not have time to test your 
> > vectorizer pass intensively, but here my first comments:
> > 
> > 1. This patch breaks the --enable-shared/BUILD_SHARED_LIBS build. The
> >     following patch fixes this for cmake:
> >     0001-Add-vectorizer-to-libraries-used-by-Transforms-IPO.patch
> > 
> 
> Thanks!
> 
> >     Can you check the autoconf build with --enable-shared?
> 
> I will check.

This appears to work as it should.

> 
> > 
> > 2. Did you run this pass on the llvm test-suite? Does your vectorizer
> >     introduce any correctness regressions? What are the top 10 compile
> >     time increases/decreases. How about run time?
> > 
> 
> I'll try to get this setup and post the results.
> 
> > 3. I did not really test this intensively, but I had the feeling the
> >     compile time increase for large basic blocks is quite a lot.
> >     I still need to extract a test case. Any comments on the complexity
> >     of your vectorizer?
> 
> This may very will be true. As is, I would not recommend activating this
> pass by default (at -O3) because it is fairly slow and the resulting
> performance increase, while significant in many cases, is not large
> enough to, IMHO, justify the extra base compile-time increase. Ideally,
> this kind of vectorization should be the "vectorizer of last resort" --
> the pass that tries really hard to squeeze the last little bit of
> vectorization possible out of the code. At the moment, it is all that we
> have, but I hope that will change. I've not yet done any real profiling,
> so I'll hold off on commenting about future performance improvements.
> 
> Base complexity is a bit difficult, there are certainly a few stages,
> including that initial one, that are O(n^2), where n is the number of
> instructions in the block. The "connection-finding" stage should also be
> O(n^2) in practice, but is really iterating over instruction-user pairs
> and so could be worse in pathological cases. Note, however, that in the
> latter stages, that n^2 is not the number of instructions in the block,
> but rather the number of (unordered) candidate instruction pairs (which
> is going to be must less than the n^2 from just the number of
> instructions in the block). It should be possible to generate a
> compile-time scaling plot by taking a loop and compiling it with partial
> unrolling, looking at how the compile time changes with the unrolling
> limit; I'll try and so that.

So for this test, I ran:
time opt -S -O3 -unroll-allow-partial -vectorize -o /dev/null q.ll
where q.ll contains the output from clang -O3 of the vbor function from
the benchmarks I've been posting recently. The first column is the value
of -unroll-threshold, the second column is the time with vectorization,
and the third column is the time without vectorization (time in seconds
for a release build).

100    0.030  0.000
200    0.130  0.030
300    0.770  0.030
400    1.240  0.040
500    1.280  0.050
600    9.450  0.060
700   29.300  0.060

I am not sure why the 400 and 500 times are so close. Obviously, it is
not linear ;) I am not sure that enumerating the possible pairings can
be done in a sub-quadratic way, but I will do some profiling and see if
I can make things better. To be fair, this test creates a kind of a
worse-case scenario: an increasingly large block of instructions, almost
all of which are potentially fusable.

It may also be possible to design additional heuristics to help the
situation. For example, we might introduce a target chain length such
that if the vectorizer finds a chain of a given length, it selects it,
foregoing the remainder of the search for the selected starting
instruction. This kind of thing will require further research and
testing.

 -Hal

> 
> I'm writing a paper on the vectorizer, so within a few weeks there will
> be a very good description (complete with diagrams) :)
> 
> > 
> > I plan to look into your vectorizer during the next couple of 
> > days/weeks, but will most probably not have the time to do this tonight. 
> > Sorry. :-(
> 
> Not a problem; it seems that I have some homework to do first ;)
> 
> Thanks,
> Hal
> 
> > 
> > Cheers
> > Tobi
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory