[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short

Sat Feb 4 12:00:48 PST 2012

On Sat, 2012-02-04 at 16:32 +0200, Pekka Jääskeläinen wrote:
> Hello,
> 
> Thanks for your work on the bb-vectorizer. It looks like a
> promising pass to be used for multi-work-item-vectorization in
> pocl.
> 
> On 02/04/2012 06:21 AM, Hal Finkel wrote:
> > Try it now (after r149761). If this "solution" causes other problems,
> > then we may need to think of something more sophisticated.
> 
> I wonder if the case where a store is the last user of the value could be
> treated as a special case. The case where the code reads, computes
> and writes values in a fully parallelizable (unrolled) loop is an
> optimal case for vectorizing as it might not lead to any unpack/pack
> overheads at all.

The fix that I implemented was to give loads and stores each half of the
required effective chain depth to trigger vectorization. I believe that
this reflects a useful bias toward vectorizing memory operations over
other things, although we'll need to see how it interacts with other
aspects of the heuristic. Some more context-sensitive rule, such as the
one you proposed, might be better, but we'd need to think about it
carefully. For example, is it better to prioritize chains that end with
a store and not those that start with a load? Maybe we should do this
only for aligned loads and stores?

> 
> In case of the bb-vectorizer (if I understood the parameters correctly),
> if the final store (or actually, any final consumer of a value) is weighed
> more heavily in the "chain length computation" it could allow using a
> large chain length threshold and still pick up these "embarrassingly parallel
> loop cases" where there are potentially many updates to different variables
> in memory, but with short preceding computation lengths.

Yes.

> This type of
> embarrasingly parallel loop cases are the basic case when vectorizing
> multiple instances of OpenCL C kernels which are parallel by definition.
> 
> E.g. a case where the kernel does something like:
> 
> A = read mem
> B = read mem
> C = add A, B
> write C to mem
> 
> D = read mem
> E = read mem
> F = mul D, E
> write F to mem
> 
> When this is parallelized N times in the work group, the vectorizer
> might fail to vectorize multiple "kernel iterations" properly as the
> independent computation chains/live ranges (e.g. from D to F) are quite
> short. Still, vectorization is very beneficial here as, like we know, fully
> parallel loops vectorize perfectly without unpack/pack overheads in case
> all the operations can be vectorized (such is the case here when one can
> scale the work-group size to match the vector width).

Indeed, and I agree that these cases should be vectorized by default
(and it now should do this by default). If it does not, please send a
test case.

The tricky part comes in figuring how exactly how to penalize chains
that appear to require permutations in them. This is tricky because it
is really target dependent, the relative costs are different on
different hardware, and some kinds of permutations can be folded "for
free" into certain other operations on some targets. So currently I
ignore this problem, but it will need to be dealt with soon.

I think that in practice we'll need to experiment with different
approaches in order to see what works best. So try it out and if it is
doing something that is detrimental or not doing something obviously
beneficial, then we'll fix it.

Thanks for look at this,
Hal

> 
> BR,

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory