[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
hfinkel at anl.gov
Fri Jan 25 06:00:54 PST 2013
----- Original Message -----
> From: "Pekka Jääskeläinen" <pekka.jaaskelainen at tut.fi>
> To: "Nadav Rotem" <nrotem at apple.com>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Friday, January 25, 2013 5:35:16 AM
> Subject: Re: [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization
> On 01/25/2013 09:56 AM, Nadav Rotem wrote:
> > Thanks for checking the Loop Vectorizer, I am interested in hearing
> > your
> > feedback. The Loop Vectorizer does not fit here. OpenCL
> > vectorization is
> > completely different because the language itself is data-parallel.
> > You
> > don't need all of the legality checks that the loop vectorizer has.
> I'm aware of this and it was my point in the original post.
> However, I do not see why the loop vectorizer wouldn't fit
> this use case given how pocl's "kernel compiler" is structured.
> How I see it, the data parallel input simply makes the vectorizer's
> job easier (it can skip some of the legality checks) while reusing
> most of the implementation (e.g. cost estimation, unrolling decisions,
> the vector instruction formation itself, predication/if-conversion,
> speculative execution+blend, etc.).
> Now pocl's kernel compiler detects the "parallel regions" (the
> regions between work group barriers) and generates a new function
> for executing multiple work items (WI) in the work group. One method
> to generate such functions is to generate embarrassingly parallel
> loops (wiloops) that produce the multi-WI DLP execution. That is, the
> loop executes the code in the parallel regions for each work item in
> the work group.
> This step is needed to make the multi-WI kernel executable on
> non-SIMD/SIMT platforms (read: CPUs). On the "SPMD-tailored"
> targets (many GPUs) this step is not always necessary as they can
> input the kernel instructions and do the "spreading" on the fly. We
> have a method to generate the WG functions for such targets.
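To make the wiloop idea concrete, here is a minimal sketch in plain C of what such a generated work group function might look like for a kernel with one barrier. The kernel, the function name rotate_scaled_workgroup, and its parameters are hypothetical and chosen only for illustration; this is not pocl's actual output.

```c
#include <stddef.h>

/*
 * Hypothetical OpenCL C kernel (illustration only, not taken from pocl):
 *
 *   __kernel void rotate_scaled(__global float *out, __local float *tmp) {
 *       size_t i = get_local_id(0);
 *       tmp[i] = out[i] * 2.0f;                      // parallel region 1
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *       out[i] = tmp[(i + 1) % get_local_size(0)];   // parallel region 2
 *   }
 *
 * Sketch of a work group function for a 1-D local id space. Each parallel
 * region becomes its own loop over the work items, and get_local_id(0) is
 * replaced by the loop variable x.
 */
void rotate_scaled_workgroup(float *out, float *tmp, size_t local_size)
{
    /* wiloop for parallel region 1: iterations are independent */
    for (size_t x = 0; x < local_size; ++x)
        tmp[x] = out[x] * 2.0f;

    /* The barrier is the boundary between the two loops: every work item
       has finished region 1 before any work item starts region 2. */

    /* wiloop for parallel region 2: iterations are independent */
    for (size_t x = 0; x < local_size; ++x)
        out[x] = tmp[(x + 1) % local_size];
}
```

Each of the two loops is a candidate for the loop vectorizer, and the independence of their iterations follows from the OpenCL execution model rather than from any dependence analysis.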
> > Moreover, OpenCL has lots of language specific APIs such as
> > "get_global_id" and builtin function calls, and without knowledge
> > of these
> > calls it is impossible to vectorize OpenCL.
> In pocl the whole kernel is "flattened", that is, the processed
> kernel code does not usually have function calls. Well, printf() and
> some calls might be exceptions. In such cases the vectorization could
> simply be skipped and the parallelization can be attempted using some
> other method (e.g. pure unrolling), like usual.
> get_local_id is converted to regular iteration variables (local id
> space x, y, z) in the wiloop.
> Yesterday I experimented a bit by kludge-hacking the LoopVectorizer
> code to skip the canVectorizeMemory() check for these wiloop
> constructs and managed to vectorize a kernel as expected.
Based on this experience, can you propose some metadata that would allow this to happen (so that the LoopVectorizer would be generally useful for POCL)? I suspect this same metadata might be useful in other contexts (such as implementing iteration-independence pragmas).
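As a rough source-level illustration of what an iteration-independence annotation could look like, here is a small C sketch. It assumes a Clang version that accepts the loop hint "#pragma clang loop vectorize(assume_safety)"; that pragma is not something proposed in this thread, and the function name scale_workgroup and its parameters are hypothetical. Compilers that do not know the pragma ignore it (possibly with a warning), so the code behaves the same either way.

```c
#include <stddef.h>

/*
 * Hypothetical wiloop over a 1-D local id space. Because dst and src are
 * plain pointers, a vectorizer cannot prove on its own that they do not
 * overlap. The pragma (an assumption about the toolchain, see above)
 * asserts that the iterations are independent, so the memory legality
 * check can be skipped for this loop.
 */
void scale_workgroup(float *dst, const float *src, size_t local_size)
{
#pragma clang loop vectorize(assume_safety)
    for (size_t x = 0; x < local_size; ++x)
        dst[x] = 2.0f * src[x];
}
```

Metadata attached to the loop in the IR would play the same role for a frontend such as pocl that generates the loop itself and already knows the iterations are independent.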
> > You need to implement something like Whole Function Vectorization
> > (http://dl.acm.org/citation.cfm?id=2190061). The loop vectorizer
> > can't
> > help you here. Ralf Karrenberg open sourced his implementation on
> > github.
> > You should take a look.
> I think the WFV paper has plenty of good ideas that could be applied
> to *improve* the vectorizability of DLP code/parallel loops (e.g. the
> generation for diverging branches where the traditional if-conversion
> won't do, especially intra-kernel for-loops), but the actual
> vectorization could be modularized into generic passes to, e.g., allow
> the choice of target-specific parallelization methods later on.