[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Thu Jan 31 09:47:57 PST 2013

----- Original Message -----
> From: "Pekka Jääskeläinen" <pekka.jaaskelainen at tut.fi>
> To: "Ralf Karrenberg" <Chareos at gmx.de>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Thursday, January 31, 2013 11:15:43 AM
> Subject: Re: [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
> 
> Hi Ralf,
> 
> On 01/31/2013 05:44 PM, Ralf Karrenberg wrote:
> > As for the current status, the loop vectorizer is only able to
> > vectorize inner loops and (I think) does not handle function calls
> > and memory operations well. This will prevent it from vectorizing a
> > large group of OpenCL kernels, and certainly all "interesting", more
> > complex ones.
> 
> Agreed -- but not being able to handle function calls/intrinsics is
> not an OpenCL-specific limitation. Any vectorizable input suffers from
> that. Also, an inner loop vectorizer might be able to handle outer
> loops, e.g. via loop interchange. I'm planning to look into that if
> time allows.

This is also on my TODO list. Let's collaborate when you have time.
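
To make the interchange idea concrete, here is a minimal sketch in plain
C (not LLVM code; the array shape, the sizes M and N, and all function
names are made up for illustration). The outer i loop is parallel, while
the inner k loop carries a dependence, so an inner-loop vectorizer cannot
use it as written. After interchange the parallel loop is innermost:

```c
#include <assert.h>

#define M 4   /* hypothetical trip count of the dependent loop */
#define N 8   /* hypothetical trip count of the parallel loop  */

/* Before: the parallel loop (i) is outermost, so an inner-loop
   vectorizer only sees the k loop, which carries a dependence. */
void before_interchange(float x[M][N])
{
    for (int i = 0; i < N; ++i)
        for (int k = 1; k < M; ++k)
            x[k][i] += x[k - 1][i];
}

/* After: the dependence on k is carried by the outer loop, and the
   inner i loop has independent, unit-stride iterations. */
void after_interchange(float x[M][N])
{
    for (int k = 1; k < M; ++k)
        for (int i = 0; i < N; ++i)
            x[k][i] += x[k - 1][i];
}

/* Runs both versions on identical input and checks that they agree. */
void check_interchange(void)
{
    float p[M][N], q[M][N];
    for (int k = 0; k < M; ++k)
        for (int i = 0; i < N; ++i)
            p[k][i] = q[k][i] = (float)(k * N + i);
    before_interchange(p);
    after_interchange(q);
    for (int k = 0; k < M; ++k)
        for (int i = 0; i < N; ++i)
            assert(p[k][i] == q[k][i]);
}
```

The interchange is legal here because the only dependence has distance 0
in i and 1 in k, which stays lexicographically positive after swapping
the loops.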

> 
> > However, in the long run, I think the only difference between
> > WFV-like approaches and classic loop vectorization a la
> > LoopVectorizer in an OpenCL context is the following:
> > WFV assumes that there is at least one outer loop that has
> > increments of one, runs a multiple of the SIMD width iterations, and
> > that every iteration is independent (barriers can be handled by the
> > OpenCL driver *after* WFV).
> 
> Yes, this is the case with the "wiloops" work group generation method
> of pocl. The parallel outer loops are the max 3 dimensions of the
> local space. The actual wg barrier calls are converted to no-ops
> (compiler barriers) for the current targets.
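
For readers who have not seen the work-item loop form, the shape being
discussed is roughly the following. This is a hand-written C sketch, not
pocl's generated code: kernel_body, workgroup_fn and SIMD_WIDTH are
hypothetical names, the kernel is a trivial vector add, and barriers are
not modelled at all.

```c
#include <assert.h>
#include <stddef.h>

#define SIMD_WIDTH 4  /* hypothetical vector width, for illustration only */

/* Stand-in for one work-item's execution of a kernel such as
   out[id] = a[id] + b[id], with (x, y, z) as the local id. */
static void kernel_body(const float *a, const float *b, float *out,
                        size_t x, size_t y, size_t z,
                        size_t sx, size_t sy)
{
    size_t i = (z * sy + y) * sx + x;  /* flattened local id */
    out[i] = a[i] + b[i];
}

/* Work-group function: the implicit work-items become explicit loops
   over the (at most three) local dimensions sx, sy, sz. */
void workgroup_fn(const float *a, const float *b, float *out,
                  size_t sx, size_t sy, size_t sz)
{
    /* Assumption from the discussion: trip count is a multiple of
       the SIMD width. */
    assert(sx % SIMD_WIDTH == 0);
    for (size_t z = 0; z < sz; ++z)
        for (size_t y = 0; y < sy; ++y)
            for (size_t x = 0; x < sx; ++x)  /* unit step, independent */
                kernel_body(a, b, out, x, y, z, sx, sy);
}
```

The innermost x loop has exactly the properties listed above (increment
of one, a multiple of the SIMD width, independent iterations), which is
what makes it a candidate for either WFV or a loop vectorizer.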
> 
> > On the other hand, LoopVectorizer may not be aimed at covering all
> > kinds of code inside the body and/or instead focus more on things
> > not required by WFV, such as handling reductions and other kinds of
> > loop-carried dependencies.
> 
> It is true that the feature set of the LoopVectorizer goes beyond the
> "embarrassingly parallel loops" that the implicit WI loops are.
> However, I don't see this as a show-stopper for trying to provide a
> modularized approach to work group vectorization.
> 
> Moreover, parallelization-helping optimizations such as "loop masking"
> for the diverging inner loops (kernel loops) are more generally
> useful, and, IMHO, should be added to LLVM upstream (not to an OpenCL
> implementation only) eventually as generic loop vectorization
> routines.

I completely agree.
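
As a rough illustration of what "loop masking" means for a diverging
inner loop, here is a scalar C emulation. It is only a sketch: LANES,
scalar_ref and masked are hypothetical names, and the per-lane loop
stands in for what would be vector instructions with a mask register.
Each lane (work-item) has its own trip count n[l]; the masked form steps
all lanes together and disables lanes that have finished.

```c
#include <assert.h>
#include <stdbool.h>

#define LANES 4  /* hypothetical SIMD width */

/* Reference: each work-item runs its own inner loop n[l] times. */
void scalar_ref(const int *n, int *acc)
{
    for (int l = 0; l < LANES; ++l) {
        acc[l] = 0;
        for (int k = 0; k < n[l]; ++k)
            acc[l] += k;
    }
}

/* Masked form: one shared inner loop that runs while any lane is
   still active; finished lanes are switched off by their mask bit. */
void masked(const int *n, int *acc)
{
    bool mask[LANES];
    bool any = false;
    for (int l = 0; l < LANES; ++l) {
        assert(n[l] >= 0);         /* trip counts must be non-negative */
        acc[l] = 0;
        mask[l] = 0 < n[l];        /* lane enters the loop or not */
        any = any || mask[l];
    }
    for (int k = 0; any; ++k) {
        any = false;
        for (int l = 0; l < LANES; ++l) {   /* one "vector" step */
            if (mask[l])
                acc[l] += k;                /* masked update */
            mask[l] = mask[l] && (k + 1 < n[l]);
            any = any || mask[l];
        }
    }
}
```

The shared loop runs for the maximum trip count over all lanes, so the
cost of divergence is the work wasted in lanes that are already masked
off.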

> 
> > In any case, since our own OpenCL driver is more of a
> > proof-of-concept implementation and not very robust, I'd be willing
> > to give it a try to integrate the current libWFV into pocl. This
> > should boost performance quite a bit for many kernels without too
> > much effort ;). I just

Ralf, does this mean that you're close to releasing the new version?

Thanks again,
Hal

> > don't know (yet) where to start - can you give me a hint, Pekka?
> 
> I'm very glad to hear this! Luckily, the pocl code base has been
> modularized to allow easily switching the "work group function
> generation method", which I think your WFV work actually is.
> 
> Perhaps the detailed instructions on how to start are off-topic here
> and you might want to join the pocl-devel list (and #pocl) where the
> pocl developers can give more hints. See
> http://pocl.sourceforge.net/discussion.html.
> 
> BR,
> --
> Pekka