[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Fri Jan 25 09:11:35 PST 2013

Hi Pekka, 

> How I see it, the data parallel input simply makes the vectorizer's job
> easier (skip some of the legality checks) while reusing most of the
> implementation (e.g. cost estimation, unrolling decisions, the
> vector instruction formation itself, predication/if-conversion,
> speculative execution+blend, etc.).
> 

What you need is outer loop vectorization while the loop vectorizer is an inner loop vectorizer. 
If you decide to use the Loop Vectorizer then you won't be able to vectorize kernels that have inner loops or kernels that have barriers in them. 
If you look at the AMD OpenCL SDK you will see that most of the workloads have barriers, inner loops. 

Another problem that you may run into is 'early exits'.  In many kernels you will see something like  " if (get_global_id(0) > N) return; "  

Not to mention that it will be very important for you to vectorize function calls. Vectorization needs to happen before inlining because you don't want to vectorize a cloud of instructions when you can convert a single function call.  Think about image samplers or slightly more complex builtins that have control flow in them. 

> Now pocl's kernel compiler detects the "parallel regions" (the
> regions between work group barriers) and generates a new function suitable
> for executing multiple work items (WI) in the work group. One method to
> generate such functions is to generate embarrassingly parallel "for-loops"
> (wiloops) that produce the multi-WI DLP execution. That is, the loop
> executes the code in the parallel regions for each work item in the work
> group.
> 
> This step is needed to make the multi-WI kernel executable on
> non-SIMD/SIMT platforms (read: CPUs). On the "SPMD-tailored" processors
> (many GPUs) this step is not always necessary as they can input the single
> kernel instructions and do the "spreading" on the fly. We have a different
> method to generate the WG functions for such targets.
> 
>> Moreover, OpenCL has lots of language specific APIs such as
>> "get_global_id" and builtin function calls, and without knowledge of these
>> calls it is impossible to vectorize OpenCL.
> 
> In pocl the whole kernel is "flattened", that is, the processed kernel code
> does not usually have function calls. Well, printf() and some intrisics
> calls might be exceptions. In such cases the vectorization could be
> simply not done and the parallelization can be attempted using some other
> method (e.g. pure unrolling), like usual.
> 
> get_local_id is converted to regular iteration variables (local id space x,
> y,z) in the wiloop.
> 
> I played yesterday a bit by kludge-hacking the LoopVectorizer code to
> skip the canVectorizeMemory() check for these wiloop constructs and it
> managed to vectorize a kernel as expected.
> 
>> You need to implement something like Whole Function Vectorization
>> (http://dl.acm.org/citation.cfm?id=2190061). The loop vectorizer can't
>> help you here. Ralf Karrenberg open sourced his implementation on github.
>> You should take a look.
> 
> I think the WFV paper has plenty of good ideas that could be applied to
> *improve* the vectorizability of DLP code/parallel loops (e.g. the mask
> generation for diverging branches where the traditional if-conversion won't
> do, especially intra kernel for-loops), but the actual vectorization
> could be modularized to generic passes to, e.g., allow the choice of target-specific parallelization methods later on.
> 
> -- 
> Pekka