[cfe-dev] OpenCL/CUDA Interop with PTX Back-End

Tue Oct 4 11:28:26 PDT 2011

On Mon, Oct 03, 2011 at 04:45:28PM +0000, Justin Holewinski wrote:
> I'm looking into extending the PTX Clang target to support code generation
> from OpenCL/CUDA code, so I'm wondering about the current state of these two
> Clang implementations.  As a test, I've implemented the AddrSpaceMap map in
> the PTX target in lib/Basic/Targets.cpp, but I'm not sure what other hooks
> are required.  From what I can tell, basic functionality is working quite
> well!  I hope to commit a small patch soon to support the AddrSpaceMap for
> PTX.

The AddrSpaceMap you added for PTX seems to be correct.  The main
other hook for OpenCL is the CGOpenCLRuntime class, but PTX should
be able to use the default implementation.

> I'm currently investigating the following issues/concerns:
> 
>    1. What is the plan for language-specific functions and other constructs,
>    such as __syncthreads/barrier, get_local_id/threadIdx, etc.?  Is it up to
>    the back-end to define compatible definitions of these, or is there a plan
>    to introduce generic LLVM intrinsics for these?  Since OpenCL has
>    pre-defined functions that do not require header files, it may be awkward to
>    require OpenCL to include a back-end specific header file when compiling
>    with Clang.

For OpenCL, the implementation should provide definitions of
the built-in functions described in section 6.11 of the OpenCL
specification.  For at least some of those functions, the definitions
would be the same for any OpenCL implementation.  (FWIW, I have
developed a set of generic implementations of section 6.11 built-ins
as part of an OpenCL implementation I have been working on, which I
will be open sourcing soon.)

For the rest (e.g. work-item functions), the implementation would
need to be specific to the OpenCL implementation.  For example, on
a CPU, the exact implementation details of work-item functions would
be highly dependent on how the implementation stores work-item IDs,
so it would not be appropriate to use a generic intrinsic.

For CUDA, the NVIDIA header files provide appropriate declarations,
but as far as I can tell, variables such as threadIdx are handled
specially by nvcc, and functions such as __syncthreads are treated
as builtins.  Clang does not currently implement the special handling
for these variables or functions.

>    2. What is the status of the address space mapping?  The CUDA frontend
>    does not seem to respect the mapping (I get address-space-less alloca's for
>    __shared__ arrays)

Clang does not currently implement CUDA address spaces correctly.
The major challenge is that OpenCL (and LLVM) address spaces are, in
C/C++ language terms, type qualifiers, while CUDA address spaces are
declaration specifiers (despite being referred to as "type qualifiers"
in the documentation).  This means that in CUDA, pointers lack correct
type information.  nvcc performs type inference to determine whether
a pointer is to __shared__ or __device__ memory, which Clang does
not currently implement.

As an alternative to implementing type inference, we could require
CUDA users to use a board with a unified address space, but this
is sub-optimal, and still depends on some support from Clang.

> and the OpenCL frontend seems to respect the address
>    mapping but does not emit complete array definitions for locally-defined
>    __local arrays.  Does the front-end currently not support __local arrays
>    embedded in the code?  It seems to work if the __local arrays are passed as
>    pointers to the kernel.

Clang should support __local arrays, and this looks like a genuine
bug in the IR generator.  I will investigate.

Thanks,
-- 
Peter