[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Fri Jan 25 12:38:42 PST 2013

Pekka, 

I am in favor of adding metadata to control different aspects of vectorization, mainly to support user-level pragmas [1], but also for DSLs.
Before we start adding metadata to the IR, we need to define the semantics of the tags. "Parallel_for" is too general. We also want to control the vectorization factor, the unroll factor, the cost model, etc.

Doug Gregor suggested adding the metadata to the branch instruction of the loop's latch block.
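
To make the placement question concrete, here is a minimal C++ sketch contrasting the two lookups discussed in this thread: scanning every header instruction for the tag (as the quoted patch description below does) versus checking only the branch that ends the latch block. The Inst, Block and ToyLoop types and both function names are hypothetical stand-ins invented for illustration; this is not the LLVM API, and the tag name is simply the 'parallel_for' string from the quoted message.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy stand-ins for IR objects. These are NOT LLVM classes; they only
// model "an instruction can carry named metadata".
struct Inst {
  std::string opcode;
  std::set<std::string> metadata;
};
struct Block {
  std::vector<Inst> insts;  // last element plays the role of the terminator
};
struct ToyLoop {
  Block header;
  Block latch;
};

static bool hasTag(const Inst& inst, const std::string& tag) {
  return inst.metadata.count(tag) != 0;
}

// Header scan: the loop counts as parallel if ANY header instruction
// still carries the tag.
bool isParallelByHeaderScan(const ToyLoop& loop, const std::string& tag) {
  for (const Inst& inst : loop.header.insts)
    if (hasTag(inst, tag))
      return true;
  return false;
}

// Latch check: only the last instruction of the latch block (the
// back-edge branch in this toy model) is consulted.
bool isParallelByLatchBranch(const ToyLoop& loop, const std::string& tag) {
  if (loop.latch.insts.empty())
    return false;
  return hasTag(loop.latch.insts.back(), tag);
}
```

In this model the latch check depends on a single, well-defined instruction per loop, whereas the header scan depends on at least one of many tagged instructions surviving.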

My main concern is that your approach to vectorizing OpenCL is wrong. OpenCL was designed for SPMD/outer-loop vectorization, and any good OpenCL vectorizer should be able to vectorize 100% of the workloads. The Loop Vectorizer vectorizes innermost loops only. It has a completely different cost model and different legality checks. You also have no use for reduction variables, reverse iterators, etc. If all you are interested in is the widening of instructions, then you can easily implement it.
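
As a rough illustration of the difference, the C++ sketch below widens across work-items (the outer loop) instead of vectorizing the innermost loop. The kernel, the function names and the width W = 4 are assumptions made up for this example; they are not taken from pocl or from the Loop Vectorizer. Note that the inner accumulation stays a serial loop per lane, so no reduction handling is needed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: work-item `lid` sums n consecutive inputs.
// The outer loop plays the role of a work-item loop; the inner loop
// is an ordinary serial loop inside the kernel body.
void kernel_scalar(const std::vector<float>& in, std::vector<float>& out,
                   std::size_t n, std::size_t local_size) {
  for (std::size_t lid = 0; lid < local_size; ++lid) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < n; ++k)
      acc += in[lid * n + k];
    out[lid] = acc;
  }
}

constexpr std::size_t W = 4;  // assumed vector width, for illustration only

// Outer-loop (SPMD-style) widening: lane j of each "vector" value
// belongs to work-item lid + j. Every scalar operation of the body
// becomes one W-wide operation; the inner loop is left as it is.
void kernel_widened(const std::vector<float>& in, std::vector<float>& out,
                    std::size_t n, std::size_t local_size) {
  assert(local_size % W == 0);  // sketch only: no remainder handling
  for (std::size_t lid = 0; lid < local_size; lid += W) {
    std::array<float, W> acc{};  // one accumulator per lane
    for (std::size_t k = 0; k < n; ++k)
      for (std::size_t j = 0; j < W; ++j)  // models a single W-wide add
        acc[j] += in[(lid + j) * n + k];
    for (std::size_t j = 0; j < W; ++j)
      out[lid + j] = acc[j];
  }
}
```

Because each lane performs its additions in the same order as the scalar version, both functions produce identical results for the same input.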

- Nadav

[1] http://software.intel.com/en-us/articles/vectorization-with-the-intel-compilers-part-i

On Jan 25, 2013, at 9:16 AM, Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> wrote:

> On 01/25/2013 04:21 PM, Hal Finkel wrote:
>> My point is that I specifically think that you should try it. I'm curious
>> to see how what you come up with might apply to other use cases as well.
> 
> OK, attached is the first quick attempt towards this. I'm not
> proposing committing this, but would like to get comments
> to possibly move towards something committable.
> 
> It simply looks for metadata named 'parallel_for' in any of the
> instructions in the loop's header and assumes the loop is a parallel
> one if such metadata is found. This metadata is added by pocl's wiloops
> generation routine. It passes the pocl test suite when enabled but
> probably cannot vectorize many kernels (at least) due to the missing
> intra-kernel vector scalarizer.
> 
> Some known problems that need addressing:
> 
> - Metadata can only be attached to Instructions (not Loops or even
>  BasicBlocks), hence the brute-force approach of marking all
>  instructions in the header BB in the hope that optimizers
>  retain at least one of them. A special intrinsic call, for example,
>  might be a better solution.
> 
> - The loop header can potentially be shared by multilevel loops where the
>  outer or inner levels might not be parallel. This is not a problem in the
>  pocl use case, as the wiloops are fully parallel at all three levels, but
>  it needs to be sorted out in a general solution.
> 
>  Perhaps it would be better to attach the metadata to the iteration
>  count increment/check instruction(s) or similar to better identify the
>  parallel (for) loop in question.
> 
> - Are there optimizations that might push code *illegally* into the parallel
>  loop from outside of it? If there is, e.g., a non-parallel loop inside
>  a parallel loop, loop invariant code motion might move code from the
>  inner loop to the parallel loop's body. That should be a safe optimization,
>  to my understanding (it preserves the ordering semantics), but I wonder if
>  there are others that might cause breakage.
> 
> -- 
> Pekka
> <llvm-3.3-loopvectorizer-parallel_for-metadata-detection.patch>