[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Pekka Jääskeläinen pekka.jaaskelainen at tut.fi
Thu Jan 31 09:15:43 PST 2013

Hi Ralf,

On 01/31/2013 05:44 PM, Ralf Karrenberg wrote:
> As for the current status, the loop vectorizer is only able to vectorize
> inner loops and (I think) does not handle function calls and memory
> operations well. This will prevent it from vectorizing a large group of
> OpenCL kernels, and certainly all "interesting", more complex ones.

Agreed -- but not being able to handle function calls/intrinsics is
not an OpenCL-specific limitation. Any vectorizable input suffers from
that. Also, an inner loop vectorizer might be able to handle outer loops
e.g. via loop interchange. I'm planning to look into that if time allows.
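
To illustrate the loop interchange idea, here is a minimal sketch in plain C.
It is not taken from LLVM or pocl; the function names, array sizes and the
loop body are assumptions chosen for illustration. The outer i loop is
parallel, while the inner k loop carries a dependency through acc[i], so an
inner-loop-only vectorizer cannot use the original nest. After interchange the
parallel loop is innermost:

```c
#include <stddef.h>

enum { N_ITEMS = 8, N_STEPS = 3 };   /* hypothetical sizes */

/* Original nest: i is parallel, k carries a dependency through acc[i]. */
void nest_outer_parallel(int m[N_ITEMS][N_STEPS], int acc[N_ITEMS])
{
    for (size_t i = 0; i < N_ITEMS; ++i) {
        acc[i] = 0;
        for (size_t k = 0; k < N_STEPS; ++k)
            acc[i] = acc[i] * 2 + m[i][k];
    }
}

/* Interchanged nest: the parallel i loop is now the inner loop, so an
 * inner-loop vectorizer can process several i values per iteration. */
void nest_interchanged(int m[N_ITEMS][N_STEPS], int acc[N_ITEMS])
{
    for (size_t i = 0; i < N_ITEMS; ++i)
        acc[i] = 0;
    for (size_t k = 0; k < N_STEPS; ++k)
        for (size_t i = 0; i < N_ITEMS; ++i)
            acc[i] = acc[i] * 2 + m[i][k];
}
```

The interchange is legal here because iterations of the i loop are independent
of each other. Note that m[i][k] becomes a strided access once i is innermost,
which a real pass would have to weigh against the vectorization gain.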

> However, in the long run, I think the only difference between WFV-like
> approaches and classic loop vectorization a la LoopVectorizer in an
> OpenCL context is the following:
> WFV assumes that there is at least one outer loop that has increments of
> one, runs a multiple of the SIMD width iterations, and that every
> iteration is independent (barriers can be handled by the OpenCL driver
> *after* WFV).

Yes, this is the case with the "wiloops" work group generation
method of pocl. The parallel outer loops iterate over the (at most three)
dimensions of the local space. The actual work-group barrier calls are
converted to no-ops (compiler barriers) for the current targets.
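
As a rough sketch of what such work-item loops look like (this is not pocl's
actual generated code; the struct, the function names and the vector-add
kernel body are assumptions made for illustration), the kernel body is wrapped
in up to three nested loops over the local space:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 3-D index/size type standing in for the local id space. */
typedef struct { size_t x, y, z; } wg_dim;

/* Kernel body for a single work-item: c[i] = a[i] + b[i]. */
static void vec_add_work_item(wg_dim lid, wg_dim lsize,
                              const float *a, const float *b, float *c)
{
    /* Linearized local id; a real kernel would call get_local_id(). */
    size_t i = (lid.z * lsize.y + lid.y) * lsize.x + lid.x;
    c[i] = a[i] + b[i];
}

/* Work-group function: one loop per local dimension. Every iteration is
 * independent and the increment is one, which matches the assumptions
 * quoted above and makes the innermost loop a vectorization candidate. */
void vec_add_work_group(wg_dim lsize,
                        const float *a, const float *b, float *c)
{
    assert(lsize.x > 0 && lsize.y > 0 && lsize.z > 0);
    wg_dim lid;
    for (lid.z = 0; lid.z < lsize.z; ++lid.z)
        for (lid.y = 0; lid.y < lsize.y; ++lid.y)
            for (lid.x = 0; lid.x < lsize.x; ++lid.x)
                vec_add_work_item(lid, lsize, a, b, c);
}
```

This example has no barriers; a kernel with barriers would need the loops to
be split at each barrier, which the sketch does not attempt.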

> On the other hand, LoopVectorizer may not be aimed at covering all kinds
> of code inside the body and/or instead focus more on things not required
> by WFV, such as handling reductions and other kinds of loop-carried
> dependencies.

It is true that the feature set of the LoopVectorizer goes beyond the
"embarrassingly parallel loops" that the implicit WI loops are. However,
I don't see this as a show-stopper for trying to provide a modularized
approach to work group vectorization.

Moreover, optimizations that help parallelization, such as "loop masking" for
diverging inner loops (kernel loops), are more generally useful and, IMHO,
should eventually be added to LLVM upstream as generic loop vectorization
routines rather than living only in an OpenCL implementation.
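
A minimal sketch of the loop masking idea, written as scalar C that emulates
SIMD lanes (the lane count, function names and the summation body are
assumptions for illustration, not an existing LLVM or pocl routine): each lane
has its own trip count for the kernel loop, so the vectorized loop runs until
no lane is active and uses a per-lane mask to discard updates from lanes that
have already finished.

```c
#include <assert.h>
#include <stdbool.h>

#define LANES 4   /* hypothetical SIMD width */

/* Scalar reference: lane l sums 0 .. n[l]-1 with its own trip count. */
void sum_scalar(const int n[LANES], int out[LANES])
{
    for (int l = 0; l < LANES; ++l) {
        int acc = 0;
        for (int k = 0; k < n[l]; ++k)
            acc += k;
        out[l] = acc;
    }
}

/* Masked version: one shared loop over k for all lanes. */
void sum_masked(const int n[LANES], int out[LANES])
{
    int acc[LANES] = {0};
    for (int k = 0; ; ++k) {
        bool mask[LANES];
        bool any_active = false;
        for (int l = 0; l < LANES; ++l) {
            assert(n[l] >= 0);
            mask[l] = k < n[l];              /* lane still in its loop? */
            any_active = any_active || mask[l];
        }
        if (!any_active)
            break;                           /* all lanes have exited */
        for (int l = 0; l < LANES; ++l)
            acc[l] += mask[l] ? k : 0;       /* select: masked update */
    }
    for (int l = 0; l < LANES; ++l)
        out[l] = acc[l];
}
```

The per-lane loops over l stand in for single vector instructions (compare,
any-reduction, select), so nothing in them depends on an OpenCL runtime.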

> In any case, since our own OpenCL driver is more of a proof-of-concept
> implementation and not very robust, I'd be willing to give it a try to
> integrate the current libWFV into pocl. This should boost performance
> quite a bit for many kernels without too much effort ;). I just don't
> know (yet) where to start - can you give me a hint, Pekka?

I'm very glad to hear this! Luckily, the pocl code base has been modularized
so that the "work group function generation method" can be switched easily,
and I think your WFV work is exactly such a method.

Detailed instructions on how to start are probably off topic here, so
you might want to join the pocl-devel list (and #pocl), where the pocl
developers can give more hints. See http://pocl.sourceforge.net/discussion.html.

