[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Thu Jan 31 07:44:47 PST 2013

Hi Pekka, hi Nadav,

I didn't find the time to read this thread until now, sorry for that.

I actually think you are both right :).
As for the current status, the loop vectorizer is only able to vectorize 
inner loops and (I think) does not handle function calls and memory 
operations well. This will prevent it from vectorizing a large group of 
OpenCL kernels, and certainly all "interesting", more complex ones.
However, in the long run, I think the only difference between WFV-like 
approaches and classic loop vectorization a la LoopVectorizer in an 
OpenCL context is the following:
WFV assumes that there is at least one outer loop that has increments of 
one, runs a multiple of the SIMD width iterations, and that every 
iteration is independent (barriers can be handled by the OpenCL driver 
*after* WFV).
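
To make those three assumptions concrete, here is a minimal sketch in plain
C. It is not code from libWFV or pocl: the kernel body, the function names
and SIMD_WIDTH are all made up for illustration, and the inner "lane" loop
only stands in for what would really be a single sequence of vector
instructions.

```c
#include <stddef.h>

#define SIMD_WIDTH 4 /* assumed vector width, for illustration only */

/* Hypothetical kernel body for one work-item: out[id] = a[id] * b[id] + 1. */
static void kernel_body(size_t id, const float *a, const float *b, float *out)
{
    out[id] = a[id] * b[id] + 1.0f;
}

/* Scalar work-item loop as a driver might wrap it around the kernel:
 * the counter increments by one and no iteration depends on another. */
void run_work_group_scalar(size_t local_size, const float *a, const float *b,
                           float *out)
{
    for (size_t id = 0; id < local_size; ++id)
        kernel_body(id, a, b, out);
}

/* The same loop after widening: each outer step covers SIMD_WIDTH
 * consecutive work-items. Assumes local_size % SIMD_WIDTH == 0, so no
 * scalar remainder loop is needed. */
void run_work_group_widened(size_t local_size, const float *a, const float *b,
                            float *out)
{
    for (size_t id = 0; id < local_size; id += SIMD_WIDTH) {
        /* Stand-in for one vector multiply-add over SIMD_WIDTH lanes. */
        for (size_t lane = 0; lane < SIMD_WIDTH; ++lane)
            out[id + lane] = a[id + lane] * b[id + lane] + 1.0f;
    }
}
```

Because the iterations are independent, both versions write the same values;
the widened one simply processes SIMD_WIDTH work-items per step.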

On the other hand, LoopVectorizer may not aim to cover all kinds of code
inside the loop body, and may instead focus on things that WFV does not
need, such as handling reductions and other kinds of loop-carried
dependencies.
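
A reduction is the textbook case of such a loop-carried dependency. The
sketch below (plain C, with hypothetical names and an assumed vectorization
factor VF, not the LoopVectorizer's actual output) shows the usual way of
widening one: keep one partial accumulator per lane, combine them after the
loop, and finish any leftover iterations in a scalar tail.

```c
#include <stddef.h>

#define VF 4 /* assumed vectorization factor, for illustration only */

/* Scalar reduction: every iteration reads the acc written by the
 * previous one, which is a loop-carried dependency. */
int sum_scalar(const int *x, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Widened reduction: VF independent partial sums, merged at the end. */
int sum_widened(const int *x, size_t n)
{
    int partial[VF] = {0};
    size_t i = 0;

    /* Main loop: the lane loop stands in for one vector add. */
    for (; i + VF <= n; i += VF)
        for (size_t lane = 0; lane < VF; ++lane)
            partial[lane] += x[i + lane];

    /* Horizontal add of the per-lane partial sums. */
    int acc = 0;
    for (size_t lane = 0; lane < VF; ++lane)
        acc += partial[lane];

    /* Scalar tail for the n % VF leftover elements. */
    for (; i < n; ++i)
        acc += x[i];
    return acc;
}
```

Integers are used here so that reordering the additions cannot change the
result; with floating point the reassociation would have to be explicitly
allowed.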

In any case, since our own OpenCL driver is more of a proof-of-concept
implementation and not very robust, I'd be willing to try integrating
the current libWFV into pocl. This should boost performance quite a bit
for many kernels without too much effort ;). I just don't know (yet)
where to start - can you give me a hint, Pekka?

Cheers,
Ralf

On 1/25/13 10:54 PM, Pekka Jääskeläinen wrote:
>> I am in favor of adding metadata to control different aspects of
>> vectorization, mainly for supporting user-level pragmas [1] but also for
>> DSLs. Before we start adding metadata to the IR we need to define the
>> semantics of the tags. "Parallel_for" is too general. We also want to
>> control vectorization factor, unroll factor, cost model, etc.
>
> These are used to control *how* the loops are parallelized.
> The generic "parallel_for" lets the compiler (try to) make the actual
> parallelization decisions based on the target (aim for performance
> portability). So, both have their uses.
>
>> Doug Gregor suggested adding the metadata to the branch instruction of
>> the latch block in the loop.
>
> OK that should work better. I'll look into it next week.
>
>> My main concern is that your approach for vectorizing OpenCL is wrong.
>> OpenCL was designed for SPMD/outer-loop vectorization and any good OpenCL
>> vectorizer should be able to vectorize 100% of the workloads. The Loop
>> Vectorizer vectorizes innermost loops only. It has a completely different
>> cost model and legality checks. You also have no use for reduction
>> variables, reverse iterators, etc. If all you are interested in is the
>> widening of instructions then you can easily implement it.
>
> Sorry, I still don't see the problem in the "modular" approach vs.
> generating vector instructions directly in pocl -- but then again, I'm not
> a vectorization expert. All I'm really trying to do is to delegate the
> "widening of instructions" and the related tasks to the loop vectorizer.
> If it doesn't need all of the vectorizer's features it should not be a
> problem AFAIU. I think it's better for me to just play with it a bit and
> experience the possible problems myself.
>