[LLVMdev] [cfe-dev] Function-level metadata for OpenCL (was Re: OpenCL support)

Tue Dec 21 01:18:01 PST 2010

David Neto wrote:
> However, a __kernel behaves differently when called from the user
> program vs. another function in the compilation unit.  In OpenCL the
> user program can invoke a kernel as an NDRange, i.e. with an implied
> loop around it to iterate over an index space of 1 to 3 dimensions.

I'd like to emphasize that the work-group loop cannot simply be wrapped
around the whole kernel function, because of work-group barriers. This is
what the passes I mentioned in my original email to this thread are about.
To comply with the barrier semantics, the loops need to be added separately
to each region between barriers, which is not trivial in some barrier
scenarios (e.g. barriers inside loops or conditional blocks). If the
work-group dimensions are known at kernel compilation time, these loops can
be vectorized or unrolled (the so-called "work item merging/chaining"
optimization) when that is beneficial on the target architecture.
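
To illustrate the point about barrier regions, here is a minimal sketch in
plain C (not the output of any actual pass). It emulates a hypothetical
kernel that copies a buffer to local memory, hits a barrier, and then reads
the copy back in reverse order. The names reverse_single_loop,
reverse_region_loops and WG_SIZE are made up for this example, and the tmp
array merely stands in for __local memory.

```c
#include <stddef.h>

#define WG_SIZE 4 /* hypothetical work-group size, known at compile time */

/* Incorrect: a single work-item loop wrapped around the whole kernel body.
 * Each "work item" runs past the barrier before the others have written
 * their slot of tmp, so it may read stale data. */
void reverse_single_loop(int *buf) {
    int tmp[WG_SIZE] = {0}; /* stands in for __local memory */
    for (size_t i = 0; i < WG_SIZE; ++i) {
        tmp[i] = buf[i];
        /* barrier(CLK_LOCAL_MEM_FENCE) would sit here in the kernel */
        buf[i] = tmp[WG_SIZE - 1 - i];
    }
}

/* One work-item loop per region between barriers. All work items finish
 * the first region before any of them starts the second one. */
void reverse_region_loops(int *buf) {
    int tmp[WG_SIZE]; /* stands in for __local memory */
    for (size_t i = 0; i < WG_SIZE; ++i) {
        tmp[i] = buf[i]; /* region before the barrier */
    }
    for (size_t i = 0; i < WG_SIZE; ++i) {
        buf[i] = tmp[WG_SIZE - 1 - i]; /* region after the barrier */
    }
}
```

Only the second variant reverses the buffer correctly. With WG_SIZE fixed at
compile time as assumed here, both region loops have a constant trip count,
which is what makes unrolling or vectorizing them possible.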

Of course, some architectures do not need the loops at all, because the
OpenCL "data parallel/threading semantics" are implemented in hardware by
some sort of work-item/thread-aware SIMD-style execution (AFAIU this is the
case with e.g. NVIDIA GPUs).

> But that implied loop is only applied when directly called from the
> user program.  When a kernel is called from another kernel, it behaves
> as a regular function call and just adopts the caller's index point.

I think in OpenCL kernel compilation it's common to fully inline everything
into the callable kernel. The loops would then be applied to the fully
inlined version, so you don't need separate versions of the kernel functions
with and without the loops.
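
A rough sketch of that idea in plain C, under the assumption that the work
item id is passed around explicitly: square_item plays the role of a kernel
that is called from another kernel, and square_then_add_one_wg is the entry
point after inlining and adding the work-item loop. Both names are
hypothetical and only serve this example.

```c
#include <stddef.h>

/* Callee: written as a kernel in the source, but when called from another
 * kernel it simply operates on the caller's index point (gid). */
static inline void square_item(float *data, size_t gid) {
    data[gid] = data[gid] * data[gid];
}

/* Entry point for the calling kernel. The callee is inlined into the body
 * and the single work-item loop is applied to the result, so no separate
 * looped version of square_item is needed. */
void square_then_add_one_wg(float *data, size_t wg_size) {
    for (size_t gid = 0; gid < wg_size; ++gid) {
        square_item(data, gid); /* inlined call, same index point */
        data[gid] += 1.0f;
    }
}
```

There is no barrier in this example, so one loop around the inlined body is
enough; with barriers the loop would again be split per region.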

-- 
--Pekka