[cfe-dev] OpenCL/CUDA Interop with PTX Back-End

Wed Oct 5 08:28:40 PDT 2011

On Tue, Oct 4, 2011 at 5:42 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:

> On Tue, Oct 04, 2011 at 04:23:59PM -0400, Justin Holewinski wrote:
> > > > I'm currently investigating the following issues/concerns:
> > > >
> > > >    1. What is the plan for language-specific functions and other
> > > >    constructs, such as __syncthreads/barrier, get_local_id/threadIdx,
> > > >    etc.?  Is it up to the back-end to define compatible definitions
> > > >    of these, or is there a plan to introduce generic LLVM intrinsics
> > > >    for these?  Since OpenCL has pre-defined functions that do not
> > > >    require header files, it may be awkward to require OpenCL to
> > > >    include a back-end specific header file when compiling with Clang.
> > >
> > > For OpenCL, the implementation should provide definitions of
> > > the built-in functions described in section 6.11 of the OpenCL
> > > specification.  For at least some of those functions, the definitions
> > > would be the same for any OpenCL implementation.  (FWIW, I have
> > > developed a set of generic implementations of section 6.11 built-ins
> > > as part of an OpenCL implementation I have been working on, which I
> > > will be open sourcing soon.)
> > >
> > > For the rest (e.g. work-item functions), the implementation would
> > > need to be specific to the OpenCL implementation.  For example, on
> > > a CPU, the exact implementation details of work-item functions would
> > > be highly dependent on how the implementation stores work-item IDs,
> > > so it would not be appropriate to use a generic intrinsic.
> > >
> >
> > Right.  I'm wondering what the implementation plan for this is with
> > Clang.  Are you going to expose the OpenCL functions as LLVM intrinsics,
> > and let back-ends provide appropriate implementations?  Right now, I'm
> > defining these functions in terms of PTX builtin functions, but this is
> > obviously not optimal because you need to include an additional header
> > in OpenCL code.
>
> This is how I imagine the built-ins should be implemented:
>
> The built-in functions would be declared by a header file that belongs
> to an OpenCL C runtime library (not to be confused with the OpenCL
> Platform Layer or OpenCL Runtime defined by sections 4 and 5 of the
> OpenCL specification).  The runtime library in this case would consist
> of a set of header files and (optionally) a static or shared library
> file which together implement section 6.11 of the OpenCL specification.
> The runtime library as a project would be a separate project from Clang
> (but it may be a potential LLVM sub-project).
>
> The driver would be extended to support locating the runtime
> library's main header file, which could be installed in a known
> location, pre-including it using the -include command line option
> to the frontend (so that the functions declared by the header file
> are available to every OpenCL program), and setting linker options
> so that the runtime library is linked into the final executable.
>

This makes sense to me.  The run-time library for PTX would be fairly easy,
since it would mostly just be stubs that call into PTX builtin functions.
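
As a rough sketch of what such stubs could look like, here is a plain C
version that compiles on a host.  The ptx_read_tid, ptx_read_ntid and
ptx_read_ctaid functions are hypothetical stand-ins for the PTX builtin
functions (they are not actual builtin names), and the fixed values they
return are made up for illustration.  The sketch also assumes a zero
global work offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-ins for PTX special-register reads.
   On a PTX target these would be back-end builtins; here they return
   fixed, made-up values so the stubs can be exercised on a CPU. */
static const unsigned fake_tid[3]   = {3, 1, 0};   /* thread index in block */
static const unsigned fake_ntid[3]  = {64, 2, 1};  /* block dimensions */
static const unsigned fake_ctaid[3] = {2, 0, 0};   /* block index in grid */

static unsigned ptx_read_tid(unsigned dim)   { assert(dim < 3); return fake_tid[dim]; }
static unsigned ptx_read_ntid(unsigned dim)  { assert(dim < 3); return fake_ntid[dim]; }
static unsigned ptx_read_ctaid(unsigned dim) { assert(dim < 3); return fake_ctaid[dim]; }

/* OpenCL work-item functions as thin stubs over the stand-ins.
   Out-of-range dimensions yield 0 (or 1 for sizes). */
size_t get_local_id(unsigned dim)   { return dim < 3 ? ptx_read_tid(dim) : 0; }
size_t get_local_size(unsigned dim) { return dim < 3 ? ptx_read_ntid(dim) : 1; }
size_t get_group_id(unsigned dim)   { return dim < 3 ? ptx_read_ctaid(dim) : 0; }

/* Global ID derived from group and local IDs (assumes no global offset). */
size_t get_global_id(unsigned dim) {
    return get_group_id(dim) * get_local_size(dim) + get_local_id(dim);
}
```

With the made-up values above, get_global_id(0) is 2 * 64 + 3.  A real
run-time library would replace the stand-ins with whatever the back-end
actually exposes.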

>
> Since my implementation of OpenCL is slightly unconventional (it
> is built into KLEE, a symbolic execution engine) I have not needed
> to implement any of the driver functionality (KLEE calls into the
> frontend directly, and the paths to the header and library files
> are hardcoded paths into the KLEE source and build directories),
> so I haven't thought too closely about the details.
>
> > > For CUDA, the NVIDIA header files provide appropriate declarations,
> > > but as far as I can tell, variables such as threadIdx are handled
> > > specially by nvcc, and functions such as __syncthreads are treated
> > > as builtins.  Clang does not currently implement the special handling
> > > for these variables or functions.
> > >
> >
> > Are there any plans to implement any of these?
>
> I doubt that I will have time to implement this myself, and I am
> unaware of anyone else who is willing to.
>

I may take a look at the code to see what all would be involved.

>
> > > On Tue, Oct 04, 2011 at 07:28:26PM +0100, Peter Collingbourne wrote:
> > > > > and the OpenCL frontend seems to respect the address
> > > > >    mapping but does not emit complete array definitions for
> > > > >    locally-defined __local arrays.  Does the front-end currently
> > > > >    not support __local arrays embedded in the code?  It seems to
> > > > >    work if the __local arrays are passed as pointers to the kernel.
> > > >
> > > > Clang should support __local arrays, and this looks like a genuine
> > > > bug in the IR generator.  I will investigate.
> > >
> > > This actually seems to be an optimisation.  Since only the first
> > > element of the array is accessed, LLVM will only allocate storage for
> > > that element.  If you compile your example with -O0 (OpenCL compiles
> > > with optimisations turned on by default), you will see that the 64
> > > element array is created.
> > >
> >
> > I'm not really convinced this is a legal optimization.  What if you
> > purposely allocate arrays with extra padding to prevent bank conflicts in
> > the kernel?
>
> Preventing bank conflicts is a reasonable thing for one to want to do,
> but allocating arrays with extra padding is not a standards-compliant
> way to do it, given that (as far as I'm aware) the OpenCL specification
> says nothing about how storage is allocated.  If you are willing
> to go outside the requirements of the specification, Clang supports
> the C1X _Alignas keyword as an extension in all languages.  So for
> example if you know that the target bank size is 1024, you could write:
>
> _Alignas(1024) __local float buffer[64];
>
> Thanks,
> --
> Peter
>
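
To make the padding case concrete, here is a small host-side C sketch
(not kernel code) of why one might declare a tile as [32][33] instead of
[32][32].  It assumes 32 banks of 4-byte words with consecutive words in
consecutive banks; that layout is an assumption for illustration and is
not something the OpenCL specification defines.  The function names are
hypothetical.

```c
#include <assert.h>

enum { NUM_BANKS = 32, TILE = 32 };  /* assumed bank count and tile height */

/* Bank holding element [row][col] of a float array whose rows are
   `stride` elements long, under the assumed word-interleaved layout. */
static unsigned bank_of(unsigned row, unsigned col, unsigned stride) {
    return (row * stride + col) % NUM_BANKS;
}

/* Number of distinct banks touched when TILE work-items each read one
   element of column `col` (work-item i reads row i). */
unsigned banks_touched_by_column(unsigned col, unsigned stride) {
    unsigned char seen[NUM_BANKS] = {0};
    unsigned count = 0;
    assert(col < stride);  /* column must lie inside a row */
    for (unsigned row = 0; row < TILE; ++row) {
        unsigned b = bank_of(row, col, stride);
        if (!seen[b]) {
            seen[b] = 1;
            ++count;
        }
    }
    return count;
}
```

Under these assumptions, a row stride of 32 sends every access in a
column to the same bank, while a stride of 33 spreads the 32 accesses
over 32 different banks.  This only works if the compiler keeps the
extra column in the allocated storage.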

-- 

Thanks,

Justin Holewinski