[cfe-dev] OpenCL/CUDA Interop with PTX Back-End
Justin Holewinski
justin.holewinski at gmail.com
Wed Oct 5 09:18:39 PDT 2011
On Wed, Oct 5, 2011 at 11:28 AM, Justin Holewinski <justin.holewinski at gmail.com> wrote:
> On Tue, Oct 4, 2011 at 5:42 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>
>> On Tue, Oct 04, 2011 at 04:23:59PM -0400, Justin Holewinski wrote:
>> > > > I'm currently investigating the following issues/concerns:
>> > > >
>> > > > 1. What is the plan for language-specific functions and other
>> > > > constructs, such as __syncthreads/barrier, get_local_id/threadIdx,
>> > > > etc.? Is it up to the back-end to define compatible definitions of
>> > > > these, or is there a plan to introduce generic LLVM intrinsics for
>> > > > these? Since OpenCL has pre-defined functions that do not require
>> > > > header files, it may be awkward to require OpenCL to include a
>> > > > back-end specific header file when compiling with Clang.
>> > >
>> > > For OpenCL, the implementation should provide definitions of
>> > > the built-in functions described in section 6.11 of the OpenCL
>> > > specification. For at least some of those functions, the definitions
>> > > would be the same for any OpenCL implementation. (FWIW, I have
>> > > developed a set of generic implementations of section 6.11 built-ins
>> > > as part of an OpenCL implementation I have been working on, which I
>> > > will be open sourcing soon.)
>> > >
>> > > For the rest (e.g. work-item functions), the implementation would
>> > > need to be specific to the OpenCL implementation. For example, on
>> > > a CPU, the exact implementation details of work-item functions would
>> > > be highly dependent on how the implementation stores work-item IDs,
>> > > so it would not be appropriate to use a generic intrinsic.
>> > >
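For illustration, a CPU-side work-item built-in in such an implementation might do little more than read a value the runtime stashed away before launching the kernel. A minimal sketch, assuming the runtime keeps the IDs in thread-local storage (all names below are hypothetical, not taken from any existing implementation):

/* Hypothetical CPU implementation of a work-item built-in.  The host
 * runtime is assumed to fill in one of these records per worker thread
 * before invoking the kernel. */
#include <stddef.h>

struct _cl_workitem_state {
  size_t local_id[3];
  size_t local_size[3];
};

static __thread struct _cl_workitem_state _cl_wi_state;

size_t get_local_id(unsigned dimindx) {
  return dimindx < 3 ? _cl_wi_state.local_id[dimindx] : 0;
}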
>> >
>> > Right. I'm wondering what the implementation plan for this is with
>> > Clang. Are you going to expose the OpenCL functions as LLVM intrinsics,
>> > and let back-ends provide appropriate implementations? Right now, I'm
>> > defining these functions in terms of PTX builtin functions, but this is
>> > obviously not optimal because you need to include an additional header
>> > in OpenCL code.
>>
>> This is how I imagine the built-ins should be implemented:
>>
>> The built-in functions would be declared by a header file that belongs
>> to an OpenCL C runtime library (not to be confused with the OpenCL
>> Platform Layer or OpenCL Runtime defined by sections 4 and 5 of the
>> OpenCL specification). The runtime library in this case would consist
>> of a set of header files and (optionally) a static or shared library
>> file which together implement section 6.11 of the OpenCL specification.
>> The runtime library as a project would be a separate project from Clang
>> (but it may be a potential LLVM sub-project).
>>
>> The driver would be extended to locate the runtime library's main
>> header file (which could be installed in a known location), to
>> pre-include it via the -include command-line option to the frontend
>> so that the functions it declares are available to every OpenCL
>> program, and to set linker options so that the runtime library is
>> linked into the final executable.
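As a rough illustration, the runtime library's main header could declare the section 6.11 built-ins along the following lines; every name here, including the header's file name, is a placeholder rather than part of any existing project. The driver would then pass something like -include opencl_builtins.h, plus the corresponding linker options, when compiling OpenCL sources.

/* opencl_builtins.h -- hypothetical main header of the OpenCL C runtime
 * library described above.  Declarations only; the definitions live in
 * the static/shared library or in a target-specific header. */
#ifndef OPENCL_BUILTINS_H
#define OPENCL_BUILTINS_H

typedef __SIZE_TYPE__ size_t;          /* OpenCL C built-in scalar types,  */
typedef unsigned int uint;             /* spelled out here for brevity     */
typedef uint cl_mem_fence_flags;

#define CLK_LOCAL_MEM_FENCE  (1u << 0) /* flag values are illustrative     */
#define CLK_GLOBAL_MEM_FENCE (1u << 1)

/* Work-item functions (OpenCL 1.1, section 6.11.1). */
uint   get_work_dim(void);
size_t get_global_id(uint dimindx);
size_t get_local_id(uint dimindx);
size_t get_global_size(uint dimindx);
size_t get_local_size(uint dimindx);
size_t get_num_groups(uint dimindx);
size_t get_group_id(uint dimindx);

/* Synchronization functions (section 6.11.8). */
void barrier(cl_mem_fence_flags flags);

#endif /* OPENCL_BUILTINS_H */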
>>
>
> This makes sense to me. The run-time library for PTX would be fairly easy,
> since it would mostly just be stubs that call into PTX builtin functions.
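Reusing the declarations from the header sketch above, the PTX flavor of the library could be little more than the following; the exact builtin spellings should be read as assumptions about Clang's PTX builtin list, not a stable interface.

/* Hypothetical PTX stubs: each OpenCL work-item built-in just forwards
 * to the corresponding Clang PTX builtin. */
size_t get_local_id(uint dimindx) {
  switch (dimindx) {
  case 0:  return __builtin_ptx_read_tid_x();
  case 1:  return __builtin_ptx_read_tid_y();
  case 2:  return __builtin_ptx_read_tid_z();
  default: return 0;
  }
}

size_t get_group_id(uint dimindx) {
  switch (dimindx) {
  case 0:  return __builtin_ptx_read_ctaid_x();
  case 1:  return __builtin_ptx_read_ctaid_y();
  case 2:  return __builtin_ptx_read_ctaid_z();
  default: return 0;
  }
}

void barrier(cl_mem_fence_flags flags) {
  (void)flags;                /* bar.sync already covers the whole work-group */
  __builtin_ptx_bar_sync(0);
}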
>
>
>>
>> Since my implementation of OpenCL is slightly unconventional (it
>> is built into KLEE, a symbolic execution engine) I have not needed
>> to implement any of the driver functionality (KLEE calls into the
>> frontend directly, and the paths to the header and library files
>> are hardcoded paths into the KLEE source and build directories),
>> so I haven't thought too closely about the details.
>>
>> > > For CUDA, the NVIDIA header files provide appropriate declarations,
>> > > but as far as I can tell, variables such as threadIdx are handled
>> > > specially by nvcc, and functions such as __syncthreads are treated
>> > > as builtins. Clang does not currently implement the special handling
>> > > for these variables or functions.
>> > >
>> >
>> > Are there any plans to implement any of these?
>>
>> I doubt that I will have time to implement this myself, and I am
>> unaware of anyone else who is willing to.
>>
>
> I may take a look at the code to see what all would be involved.
>
>
>>
>> > > On Tue, Oct 04, 2011 at 07:28:26PM +0100, Peter Collingbourne wrote:
>> > > > > and the OpenCL frontend seems to respect the address mapping but
>> > > > > does not emit complete array definitions for locally-defined
>> > > > > __local arrays. Does the front-end currently not support __local
>> > > > > arrays embedded in the code? It seems to work if the __local
>> > > > > arrays are passed as pointers to the kernel.
>> > > >
>> > > > Clang should support __local arrays, and this looks like a genuine
>> > > > bug in the IR generator. I will investigate.
>> > >
>> > > This actually seems to be an optimisation. Since only the first
>> > > element of the array is accessed, LLVM will only allocate storage for
>> > > that element. If you compile your example with -O0 (OpenCL compiles
>> > > with optimisations turned on by default), you will see that the
>> > > 64-element array is created.
>> > >
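For reference, a kernel shaped roughly like this reproduces the behavior being described; it is a hypothetical stand-in for the example from the earlier mail, not the actual code.

/* Only buffer[0] is ever touched, so at the default optimization level
 * LLVM may shrink the 64-element __local allocation; compiling with
 * -O0 keeps the full array. */
__kernel void touch_first(__global float *out) {
  __local float buffer[64];
  buffer[0] = 1.0f;
  barrier(CLK_LOCAL_MEM_FENCE);
  out[get_global_id(0)] = buffer[0];
}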
>> >
>> > I'm not really convinced this is a legal optimization. What if you
>> > purposely allocate arrays with extra padding to prevent bank conflicts
>> > in the kernel?
>>
>> Preventing bank conflicts is a reasonable thing for one to want to do,
>> but allocating arrays with extra padding is not a standards-compliant
>> way to do it, given that (as far as I'm aware) the OpenCL specification
>> says nothing about how storage is allocated. If you are willing
>> to go outside the requirements of the specification, Clang supports
>> the C1X _Alignas keyword as an extension in all languages. So for
>> example if you know that the target bank size is 1024, you could write:
>>
>> _Alignas(1024) __local float buffer[64];
>>
>
Peter, one more question. Is the "opencl.kernels" metadata a permanent
thing, or is it a short-term hack? I ask because I'm working on how to
identify kernel vs. device functions in the PTX back-end. The way I see it,
I have two options:
1. Use a pass in the back-end to assign the proper calling convention to
each function, if the metadata is present (a rough sketch of that metadata
walk follows below).
2. Modify Clang (maybe through an extension of the CGOpenCLRuntime class)
to set the proper PTX calling convention when in OpenCL-mode.
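As a strawman for option 1, the metadata walk itself would be small. A sketch using the LLVM-C API follows; the layout of "opencl.kernels" (the kernel function as each node's first operand) and the numeric value of the PTX kernel calling convention are assumptions on my part.

/* Sketch: walk the "opencl.kernels" named metadata and give every
 * referenced function the PTX kernel calling convention.  The metadata
 * layout and the calling-convention value are not guaranteed stable. */
#include <stdlib.h>
#include <llvm-c/Core.h>

enum { PTX_KERNEL_CC = 71 };   /* assumed value of CallingConv::PTX_Kernel */

static void markOpenCLKernels(LLVMModuleRef M) {
  unsigned numKernels = LLVMGetNamedMetadataNumOperands(M, "opencl.kernels");
  if (numKernels == 0)
    return;

  LLVMValueRef *nodes = malloc(numKernels * sizeof(LLVMValueRef));
  LLVMGetNamedMetadataOperands(M, "opencl.kernels", nodes);

  for (unsigned i = 0; i < numKernels; ++i) {
    unsigned numOps = LLVMGetMDNodeNumOperands(nodes[i]);
    if (numOps == 0)
      continue;

    LLVMValueRef *ops = malloc(numOps * sizeof(LLVMValueRef));
    LLVMGetMDNodeOperands(nodes[i], ops);
    if (LLVMIsAFunction(ops[0]))
      LLVMSetFunctionCallConv(ops[0], PTX_KERNEL_CC);
    free(ops);
  }
  free(nodes);
}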
>
>> Thanks,
>> --
>> Peter
>>
>
>
>
> --
>
> Thanks,
>
> Justin Holewinski
>
>
--
Thanks,
Justin Holewinski