[cfe-dev] OpenCL/CUDA Interop with PTX Back-End

Tue Oct 4 14:42:44 PDT 2011

On Tue, Oct 04, 2011 at 04:23:59PM -0400, Justin Holewinski wrote:
> > > I'm currently investigating the following issues/concerns:
> > >
> > >    1. What is the plan for language-specific functions and other
> > >    constructs, such as __syncthreads/barrier, get_local_id/threadIdx,
> > >    etc.?  Is it up to the back-end to define compatible definitions
> > >    of these, or is there a plan to introduce generic LLVM intrinsics
> > >    for these?  Since OpenCL has pre-defined functions that do not
> > >    require header files, it may be awkward to require OpenCL to
> > >    include a back-end specific header file when compiling with Clang.
> >
> > For OpenCL, the implementation should provide definitions of
> > the built-in functions described in section 6.11 of the OpenCL
> > specification.  For at least some of those functions, the definitions
> > would be the same for any OpenCL implementation.  (FWIW, I have
> > developed a set of generic implementations of section 6.11 built-ins
> > as part of an OpenCL implementation I have been working on, which I
> > will be open sourcing soon.)
> >
> > For the rest (e.g. work-item functions), the implementation would
> > need to be specific to the OpenCL implementation.  For example, on
> > a CPU, the exact implementation details of work-item functions would
> > be highly dependent on how the implementation stores work-item IDs,
> > so it would not be appropriate to use a generic intrinsic.
> >
> 
> Right.  I'm wondering what the implementation plan for this is with Clang.
> Are you going to expose the OpenCL functions as LLVM intrinsics, and let
> back-ends provide appropriate implementations?  Right now, I'm defining
> these functions in terms of PTX builtin functions, but this is obviously not
> optimal because you need to include an additional header in OpenCL code.

This is how I imagine the built-ins should be implemented:

The built-in functions would be declared by a header file that belongs
to an OpenCL C runtime library (not to be confused with the OpenCL
Platform Layer or OpenCL Runtime defined by sections 4 and 5 of the
OpenCL specification).  The runtime library in this case would consist
of a set of header files and (optionally) a static or shared library
file which together implement section 6.11 of the OpenCL specification.
The runtime library would be a separate project from Clang (though it
could potentially become an LLVM sub-project).
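
As a rough illustration of why the work-item functions belong in an
implementation-specific runtime library, here is a minimal sketch in
plain C of how a CPU implementation might provide them.  The
clc_work_item struct, the clc_current variable, the clc_run_1d launcher
and scale_kernel are all hypothetical names invented for this example;
only get_global_id and get_local_id come from section 6.11, and a real
runtime could store the IDs quite differently (thread-local storage,
registers, and so on).

```c
#include <stddef.h>

/* Hypothetical per-work-item state.  How this is stored is exactly the
   implementation detail that makes a generic intrinsic awkward. */
typedef struct {
    size_t global_id[3];
    size_t local_id[3];
} clc_work_item;

/* The work-item currently being executed (assumption: serial execution). */
static clc_work_item clc_current;

/* Work-item functions from section 6.11, valid only for this storage
   scheme.  Out-of-range dimensions return 0, as the specification says. */
size_t get_global_id(unsigned dim) {
    return dim < 3 ? clc_current.global_id[dim] : 0;
}

size_t get_local_id(unsigned dim) {
    return dim < 3 ? clc_current.local_id[dim] : 0;
}

/* A kernel body written as plain C; it only sees the section 6.11 API. */
void scale_kernel(float *data, float factor) {
    size_t i = get_global_id(0);
    data[i] *= factor;
}

/* Hypothetical serial launcher for a one-dimensional range. */
void clc_run_1d(size_t global_size, size_t local_size,
                float *data, float factor) {
    for (size_t g = 0; g < global_size; ++g) {
        clc_current.global_id[0] = g;
        clc_current.local_id[0] = g % local_size;
        scale_kernel(data, factor);
    }
}
```

The kernel never needs to know where the IDs live, so a different
implementation could swap in its own definitions of the two functions
without touching kernel code.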

The driver would be extended to do three things: locate the runtime
library's main header file (which could be installed in a known
location), pre-include that header using the -include command line
option to the frontend (so that the functions it declares are
available to every OpenCL program), and set linker options so that
the runtime library is linked into the final executable.

Since my implementation of OpenCL is slightly unconventional (it
is built into KLEE, a symbolic execution engine) I have not needed
to implement any of the driver functionality (KLEE calls into the
frontend directly, and the paths to the header and library files
are hardcoded paths into the KLEE source and build directories),
so I haven't thought too closely about the details.

> > For CUDA, the NVIDIA header files provide appropriate declarations,
> > but as far as I can tell, variables such as threadIdx are handled
> > specially by nvcc, and functions such as __syncthreads are treated
> > as builtins.  Clang does not currently implement the special handling
> > for these variables or functions.
> >
> 
> Are there any plans to implement any of these?

I doubt that I will have time to implement this myself, and I am
unaware of anyone else who is willing to.

> > On Tue, Oct 04, 2011 at 07:28:26PM +0100, Peter Collingbourne wrote:
> > > > and the OpenCL frontend seems to respect the address
> > > >    mapping but does not emit complete array definitions for
> > > >    locally-defined __local arrays.  Does the front-end currently not
> > > >    support __local arrays embedded in the code?  It seems to work if
> > > >    the __local arrays are passed as pointers to the kernel.
> > >
> > > Clang should support __local arrays, and this looks like a genuine
> > > bug in the IR generator.  I will investigate.
> >
> > This actually seems to be an optimisation.  Since only the first
> > element of the array is accessed, LLVM will only allocate storage for
> > that element.  If you compile your example with -O0 (OpenCL compiles
> > with optimisations turned on by default), you will see that the 64
> > element array is created.
> >
> 
> I'm not really convinced this is a legal optimization.  What if you
> purposely allocate arrays with extra padding to prevent bank conflicts in
> the kernel?

Preventing bank conflicts is a reasonable thing for one to want to do,
but allocating arrays with extra padding is not a standards-compliant
way to do it, given that (as far as I'm aware) the OpenCL specification
says nothing about how storage is allocated.  If you are willing
to go outside the requirements of the specification, Clang supports
the C1X _Alignas keyword as an extension in all languages, which lets
you request a specific alignment for the array.  So, for example, if
you know that the target bank size is 1024 bytes, you could write:

_Alignas(1024) __local float buffer[64];
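
The same keyword can be tried outside OpenCL.  The following is a
minimal sketch in plain C, assuming a compiler that accepts _Alignas;
the __local qualifier is dropped because it only exists in OpenCL C,
and BANK_ALIGNMENT and buffer_is_aligned are hypothetical names with an
arbitrary value chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical alignment in bytes; a real value depends on the target. */
#define BANK_ALIGNMENT 128

/* Request that the array start on a BANK_ALIGNMENT-byte boundary. */
static _Alignas(BANK_ALIGNMENT) float buffer[64];

/* Returns 1 if the array's start address honours the requested alignment. */
int buffer_is_aligned(void) {
    return ((uintptr_t)buffer % BANK_ALIGNMENT) == 0;
}
```

Note that this only controls where the array starts; sizeof(buffer)
is still 64 * sizeof(float), so no padding is added between elements.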

Thanks,
-- 
Peter