[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
Chareos at gmx.de
Thu Jan 31 07:44:47 PST 2013
Hi Pekka, hi Nadav,
I didn't find the time to read this thread until now, sorry for that.
I actually think you are both right :).
As for the current status, the loop vectorizer is only able to vectorize
inner loops and (I think) does not handle function calls and memory
operations well. This will prevent it from vectorizing a large group of
OpenCL kernels, and certainly all "interesting", more complex ones.
However, in the long run, I think the only difference between WFV-like
approaches and classic loop vectorization a la LoopVectorizer in an
OpenCL context is the following:
WFV assumes that there is at least one outer loop that has increments of
one, runs a multiple of the SIMD width iterations, and that every
iteration is independent (barriers can be handled by the OpenCL driver
On the other hand, LoopVectorizer may not be aimed at covering all kinds
of code inside the body and/or instead focus more on things not required
by WFV, such as handling reductions and other kinds of loop-carried
dependencies.
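
To make the distinction concrete, here is a rough C sketch. It is not taken
from pocl, libWFV, or LLVM; the names kernel_body, run_work_group,
run_work_group_widened, sum_reduce, and the SIMD_WIDTH value of 4 are all
made up for illustration. The first loop is the kind of work-item loop the
WFV assumptions describe (unit stride, trip count a multiple of the SIMD
width, independent iterations); the last one carries a value between
iterations, which is the kind of thing a classic loop vectorizer has to
reason about.

```c
#include <assert.h>
#include <stddef.h>

#define SIMD_WIDTH 4 /* assumed vector width, chosen only for illustration */

/* Hypothetical kernel body: work-item `id` touches only its own element. */
void kernel_body(size_t id, const float *a, const float *b, float *out) {
    out[id] = a[id] * b[id] + 1.0f;
}

/* Work-item loop: increments of one, every iteration independent. */
void run_work_group(size_t local_size, const float *a, const float *b,
                    float *out) {
    for (size_t id = 0; id < local_size; id += 1)
        kernel_body(id, a, b, out);
}

/* Same loop after conceptual widening. The inner lane loop stands in for
 * one vector instruction processing SIMD_WIDTH adjacent work-items. */
void run_work_group_widened(size_t local_size, const float *a,
                            const float *b, float *out) {
    assert(local_size % SIMD_WIDTH == 0); /* the "multiple of SIMD width" assumption */
    for (size_t id = 0; id < local_size; id += SIMD_WIDTH)
        for (size_t lane = 0; lane < SIMD_WIDTH; ++lane)
            kernel_body(id + lane, a, b, out);
}

/* Contrast: a reduction. `acc` is carried from one iteration to the next,
 * so the iterations are not independent as written. */
float sum_reduce(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}
```

Under these assumptions the widened loop needs no dependence analysis at
all, whereas vectorizing sum_reduce requires recognizing the reduction
first.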
In any case, since our own OpenCL driver is more of a proof-of-concept
implementation and not very robust, I'd be willing to give it a try to
integrate the current libWFV into pocl. This should boost performance
quite a bit for many kernels without too much effort ;). I just don't
know (yet) where to start - can you give me a hint, Pekka?
On 1/25/13 10:54 PM, Pekka Jääskeläinen wrote:
>> I am in favor of adding metadata to control different aspects of
>> vectorization, mainly for supporting user-level pragmas but also for
>> DSLs. Before we start adding metadata to the IR we need to define the
>> semantics of the tags. "Parallel_for" is too general. We also want to
>> control vectorization factor, unroll factor, cost model, etc.
> These are used to control *how* the loops are parallelized.
> The generic "parallel_for" lets the compiler (try to) make the actual
> parallelization decisions based on the target (aim for performance
> portability). So, both have their uses.
>> Doug Gregor suggested to add the metadata to the branch instruction of
>> latch block in the loop.
> OK that should work better. I'll look into it next week.
>> My main concern is that your approach for vectorizing OpenCL is wrong.
>> was designed for SPMD/outer-loop vectorization and any good OpenCL
>> should be able to vectorize 100% of the workloads. The Loop Vectorizer
>> vectorizes innermost loops only. It has a completely different cost
>> model and legality checks. You also have no use for reduction variables,
>> reverse
>> iterators, etc. If all you are interested in is the widening of
>> instructions, then you can easily implement it.
> Sorry, I still don't see the problem in the "modular" approach vs.
> vector instructions directly in pocl -- but then again, I'm not a
> expert. All I'm really trying to do is to delegate the "widening of
> instructions" and the related tasks to the loop vectorizer. If it doesn't
> need all of the vectorizer's features it should not be a problem AFAIU.
> I think it's better for me to just play a bit with it, and experience
> the possible
> in it.