[LLVMdev] Vectorization: Next Steps

Thu Feb 2 19:56:12 PST 2012

As some of you may know, I committed my basic-block autovectorization
pass a few days ago. I encourage anyone interested to try it out (pass
-vectorize to opt or -mllvm -vectorize to clang) and provide feedback.
Especially in combination with -unroll-allow-partial, I have observed
some significant benchmark speedups, but I have also observed some
significant slowdowns. I would like to share my thoughts, and hopefully
get feedback, on next steps.
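
To illustrate why partial unrolling matters for a basic-block
vectorizer, here is a minimal hand-written sketch. The function names
are made up for illustration, the unrolling is done by hand rather than
by the unroller, and the arrays are assumed not to overlap. After
unrolling by 4, the loop body is one basic block with four independent
adds of the same shape, which is the kind of pattern a basic-block
vectorizer can try to fuse into a single vector add.

```cpp
#include <cassert>
#include <cstddef>

// The loop as written: one scalar add per iteration, so a single
// iteration offers nothing to pair up inside the basic block.
void addLoop(const float *A, const float *B, float *C, std::size_t N) {
  for (std::size_t I = 0; I < N; ++I)
    C[I] = A[I] + B[I];
}

// The same loop partially unrolled by 4 (by hand, for illustration).
// The body now holds four independent, identically shaped adds on
// adjacent elements.
void addLoopUnrolledBy4(const float *A, const float *B, float *C,
                        std::size_t N) {
  assert(N % 4 == 0 && "sketch assumes the trip count is a multiple of 4");
  for (std::size_t I = 0; I < N; I += 4) {
    C[I + 0] = A[I + 0] + B[I + 0];
    C[I + 1] = A[I + 1] + B[I + 1];
    C[I + 2] = A[I + 2] + B[I + 2];
    C[I + 3] = A[I + 3] + B[I + 3];
  }
}
```

Both functions compute the same result; only the shape of the loop body
differs.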

1. "Target Data" for vectorization - I think that in order to improve
the vectorization quality, the vectorizer will need more information
about the target. This information could be provided in the form of a
kind of extended target data. This extended target data might contain:
 - What basic types can be vectorized, and how many of them will fit
into (the largest) vector registers
 - What classes of operations can be vectorized (division, conversions /
sign extension, etc. are not always supported)
 - What alignment is necessary for loads and stores
 - Is scalar-to-vector free?
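
As a rough sketch of what such extended target data could look like,
the list above might map onto a small structure like the following.
Everything here is hypothetical: VectorTargetInfo, OpClass, and
makeExampleTarget are invented for illustration and are not existing
LLVM interfaces, and the example target is imaginary.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classes of operations a target may or may not vectorize.
enum class OpClass : std::uint32_t {
  Add = 1u << 0,
  Mul = 1u << 1,
  Div = 1u << 2,
  Convert = 1u << 3,
};

// Hypothetical "extended target data" for a vectorizer.
struct VectorTargetInfo {
  unsigned MaxVectorRegBits;   // width of the largest vector register
  std::uint32_t SupportedOps;  // bitmask of OpClass values
  unsigned RequiredAlignBytes; // alignment needed for vector loads/stores
  bool ScalarToVectorIsFree;   // is moving a scalar into a vector free?

  // How many elements of the given width fit into the largest register.
  unsigned elementsPerRegister(unsigned ElemBits) const {
    assert(ElemBits != 0 && "element width must be nonzero");
    return MaxVectorRegBits / ElemBits;
  }

  // Whether the given class of operation can be vectorized.
  bool supports(OpClass Op) const {
    return (SupportedOps & static_cast<std::uint32_t>(Op)) != 0;
  }
};

// An imaginary 128-bit target with vector add and multiply but no
// vector division or conversions.
inline VectorTargetInfo makeExampleTarget() {
  return VectorTargetInfo{
      128,
      static_cast<std::uint32_t>(OpClass::Add) |
          static_cast<std::uint32_t>(OpClass::Mul),
      16,
      false};
}
```

With this imaginary target, elementsPerRegister(32) is 4 and
supports(OpClass::Div) is false, so a vectorizer could decline to pair
divisions instead of emitting code the backend must scalarize again.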

2. Feedback between passes - We may need to implement a closer coupling
between optimization passes than currently exists. Specifically, I have
in mind two things:
 - The vectorizer should communicate more closely with the loop
unroller. First, the loop unroller should try to unroll to preserve
maximal load/store alignments. Second, I think it would make a lot of
sense to unroll speculatively and keep the unrolled version in
preference to the original only if it helps vectorization. With basic
block vectorization, it is often necessary to (partially) unroll in
order to vectorize. Even when we also have real loop vectorization,
however, I still think that it will be important for the loop unroller
to communicate with the vectorizer.
 - After vectorization, it would make sense for the vectorization pass
to request further simplification, but only on those parts of the code
that it modified. 
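
The two unroller ideas above can be sketched as a pair of small helper
functions. This is only an illustration under stated assumptions: both
function names are hypothetical, the cost numbers would have to come
from some cost model that does not exist yet, and a contiguous
unit-stride access is assumed for the alignment calculation.

```cpp
#include <cassert>

// Smallest unroll factor for which one unrolled iteration advances a
// unit-stride pointer by a multiple of the required alignment, so that
// an access aligned on entry stays aligned on every iteration.
unsigned alignmentPreservingUnrollFactor(unsigned ElemBytes,
                                         unsigned AlignBytes) {
  assert(ElemBytes != 0 && AlignBytes != 0 && "sizes must be nonzero");
  unsigned Factor = 1;
  // Terminates at the latest when Factor == AlignBytes.
  while ((Factor * ElemBytes) % AlignBytes != 0)
    ++Factor;
  return Factor;
}

// Keep the unrolled-and-vectorized loop only if its estimated cost per
// original iteration is lower than that of the original loop.
bool keepUnrolled(double OriginalCostPerIter, double UnrolledVectorizedCost,
                  unsigned Factor) {
  assert(Factor != 0 && "unroll factor must be nonzero");
  return UnrolledVectorizedCost / Factor < OriginalCostPerIter;
}
```

For 4-byte elements and 16-byte alignment the first helper returns 4.
The second helper expresses the "unroll, try to vectorize, then decide"
feedback loop as a single comparison; a real implementation would need
the unroller to keep the original loop around until the vectorizer
reports back.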

3. Loop vectorization - It would be nice to have, in addition to
basic-block vectorization, a more-traditional loop vectorization pass. I
think that we'll need a better loop analysis pass in order for this to
happen. Some of this was started in LoopDependenceAnalysis, but that
pass is not yet finished. We'll need something like this to recognize
affine memory references, etc.
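
For readers less familiar with the term, here is a small sketch of the
distinction such an analysis would have to draw. The function names are
made up for illustration. In the first loop the index is an affine
function of the induction variable (2*I + 1), so the accessed addresses
are known to be distinct and evenly strided. In the second loop the
index is loaded from memory, so nothing can be said about the stride or
about repeated indices without further information.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Affine reference: A[2*I + 1]. The stride (2 elements) and offset (1)
// are compile-time constants.
void scaleOddElements(std::vector<float> &A, float K) {
  for (std::size_t I = 0; 2 * I + 1 < A.size(); ++I)
    A[2 * I + 1] *= K;
}

// Non-affine reference: A[Idx[I]]. The index depends on data, so two
// iterations may touch the same element.
void scaleIndirect(std::vector<float> &A, const std::vector<std::size_t> &Idx,
                   float K) {
  for (std::size_t I = 0; I < Idx.size(); ++I) {
    assert(Idx[I] < A.size() && "index out of range");
    A[Idx[I]] *= K;
  }
}
```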

I look forward to hearing everyone's thoughts.

 -Hal

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory