[cfe-dev] [LLVMdev] OpenCL support

Tue Dec 7 13:02:50 PST 2010

On Mon, Dec 6, 2010 at 6:16 PM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
>> -----Original Message-----
>> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
>> On Behalf Of Peter Collingbourne
>> Sent: Monday, December 06, 2010 2:56 PM
>> To: David Neto
>> Cc: cfe-dev at cs.uiuc.edu; llvmdev at cs.uiuc.edu
>> Subject: Re: [LLVMdev] [cfe-dev] OpenCL support
>>
>> Hi David,
>>
>> On Mon, Dec 06, 2010 at 11:14:42AM -0500, David Neto wrote:
>> > What do I think your patch should look like?  It's true that the
>> > diag::err_as_qualified_auto_decl is inappropriate for OpenCL when it's
>> > the __local address space.
>> >
>> > But we need to implement the semantics somehow.  Conceptually I think
>> > of it as a CL source-to-source transformation that lowers
>> > function-scope-local-address-space variables into a more primitive
>> > form.
>> >
>> > I think I disagree that Clang is an inappropriate spot for
>> > implementing this type of transform: Clang "knows" the source language
>> > semantics, and has a lot of the machinery required for the transform.
>> > Clang also knows a lot about the target machine (e.g. type
>> > sizes, builtins, more?).
>> >
>> > So I believe the "auto var in different address space" case should be
>> > allowed in the AST in the OpenCL case, and the local-lowering
>> > transform should be applied in CodeGen.  Perhaps the lowering is
>> > target-specific, e.g. GPU-style, or more generic style as I proposed.
>> >
>> > Thoughts?
>>
>> I've been rethinking this and perhaps coming around to this way
>> of thinking.  Allocating variables in the __local address space
>> is really something that can't be represented at the LLVM level,
>> at least in a standard form.
> [Villmow, Micah] We ran across this problem in our OpenCL
> implementation. However, you can create a global variable with a
> '__local' address space and it works fine. There is an issue with
> collision between auto-arrays in different kernels, but that can be
> solved with a little name mangling. There are other ways to do this,
> for example, by converting local auto-arrays into kernel local pointer
> arguments with a known size.
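
To make those two alternatives concrete, here is a rough host-C sketch of
how I read them.  It is only an illustration under assumptions: CL_LOCAL
and CL_GLOBAL are hypothetical stand-ins for __local and __global (defined
away so a host compiler accepts the code), the kernel_a__/kernel_b__
prefixes are an invented mangling scheme, and SCRATCH_LEN is an arbitrary
size.  It is not a description of AMD's implementation.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the OpenCL qualifiers; empty on a host compiler. */
#define CL_LOCAL
#define CL_GLOBAL

enum { SCRATCH_LEN = 4 };  /* arbitrary size chosen for the sketch */

/* Option 1: each kernel's auto-array becomes a global in the local address
 * space.  Both kernels originally declared "scratch"; prefixing the kernel
 * name (an invented mangling scheme) keeps the two globals from colliding. */
static CL_LOCAL int kernel_a__scratch[SCRATCH_LEN];
static CL_LOCAL int kernel_b__scratch[SCRATCH_LEN];

void kernel_a(CL_GLOBAL int *out) {
    for (size_t i = 0; i < SCRATCH_LEN; ++i)
        kernel_a__scratch[i] = (int)i;
    out[0] = kernel_a__scratch[SCRATCH_LEN - 1];
}

void kernel_b(CL_GLOBAL int *out) {
    for (size_t i = 0; i < SCRATCH_LEN; ++i)
        kernel_b__scratch[i] = 10 * (int)i;
    out[0] = kernel_b__scratch[SCRATCH_LEN - 1];
}

/* Option 2: the auto-array becomes an extra local-pointer kernel argument.
 * The caller (the runtime) must supply storage for SCRATCH_LEN ints. */
void kernel_a_lowered(CL_GLOBAL int *out, CL_LOCAL int *scratch) {
    for (size_t i = 0; i < SCRATCH_LEN; ++i)
        scratch[i] = (int)i;
    out[0] = scratch[SCRATCH_LEN - 1];
}
```

With option 2 the size has to be known to whoever launches the kernel,
which is the "known size" part of the suggestion.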

Here's a little example to show the direction I was heading, illustrated
as a CL-to-C translation.  I believe it has no namespace issues, but
otherwise it is essentially the same as the global variable solution.

The idea is that the function-scope __local variables are like a stack
frame that is shared among the work items in a group.  So collect all
those variables in an anonymous struct, and then create a function-scope
private variable that points to the group's one copy of that struct.
The pointer is returned by a system-defined intrinsic function whose
result depends on the current work item.  (The system knows which work
groups are in flight, which is why you need a system-defined intrinsic.)

So a kernel function like this:

void foo(__global int*A) {
   __local int vint;
   __local int *vpint;
   __local int const *vcpint;
   __local int volatile vvint;
   int a = A[0];
   vint = a;
   vvint = a;
   int a2 = vint;
   int va2 = vvint;
   barrier(CLK_LOCAL_MEM_FENCE);
   A[0] = a2 + va2;
}

is translated to this, which does pass through Clang, with __local
meaning __attribute__((address_space(2))):

extern __local void * __get_work_group_local_base_addr(void); // intrinsic
void foo(__global int*A) {
   __local struct __local_vars_s {
      int vint;
      int *vpint;
      int const *vcpint;
      int volatile vvint;
   } * const __local_vars
            // this is a *private* variable, pointing to *local* addresses.
            // it's a const pointer because it shouldn't change; and
            // being const may expose optimizations
       = __get_work_group_local_base_addr();  // the new intrinsic
   int a = A[0];
   __local_vars->vint = a;   // l-values are translated as memory stores.
   __local_vars->vvint = a;
   int a2 = __local_vars->vint;   // r-values are translated as memory loads
   int va2 = __local_vars->vvint;
   barrier(CLK_LOCAL_MEM_FENCE);
   A[0] = a2 + va2;
}
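
For anyone who wants to poke at the translated form without an OpenCL
toolchain, the following is a minimal host-C emulation.  Everything in it
beyond the shape of foo is an assumption made for the sketch: CL_LOCAL and
CL_GLOBAL are empty stand-ins for the address space qualifiers, the
intrinsic is modelled by an ordinary function that indexes a per-group
table, the leading underscores are dropped to stay out of the reserved
identifier space, and the barrier is a no-op because the emulation runs
one work item at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the OpenCL qualifiers; empty on a host compiler. */
#define CL_LOCAL
#define CL_GLOBAL

enum { MAX_GROUPS = 4 };  /* arbitrary number of emulated work groups */

/* The collected function-scope local variables of foo. */
struct local_vars_s {
    int vint;
    int *vpint;
    int const *vcpint;
    int volatile vvint;
};

/* One "shared stack frame" per work group, owned by the emulated system. */
static struct local_vars_s group_frames[MAX_GROUPS];
static size_t current_group;  /* which group the running work item belongs to */

/* Emulated intrinsic: returns the frame of the current work group. */
static CL_LOCAL void *get_work_group_local_base_addr(void) {
    assert(current_group < MAX_GROUPS);
    return &group_frames[current_group];
}

/* Single-threaded emulation, so the barrier has nothing to wait for. */
static void barrier_stub(void) {}

void foo(CL_GLOBAL int *A) {
    CL_LOCAL struct local_vars_s *const local_vars =
        get_work_group_local_base_addr();
    int a = A[0];
    local_vars->vint = a;
    local_vars->vvint = a;
    int a2 = local_vars->vint;
    int va2 = local_vars->vvint;
    barrier_stub();
    A[0] = a2 + va2;
}

/* Runs foo as a work item of the given group and returns the new A[0]. */
int run_group(size_t group, int *A) {
    current_group = group;
    foo(A);
    return A[0];
}
```

Running foo for two different groups writes to two different frames, which
is the property the intrinsic is supposed to provide.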

As an extension, the backend ought to be able to use some smarts to
simplify this down in simple cases.  For example if the system only
ever allows one work group at a time, then the intrinsic could boil
down to returning a constant, and then link time optimization can
scrub away unneeded work.  Similarly if you have a GPU style
environment where (as Peter described) the "local" addresses are the
same integer value but in different groups point to different storage,
then again the intrinsic returns a constant and again LTO optimizes
the result.
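
A sketch of the one-work-group-at-a-time case, again in host C and again
with hypothetical names (the_only_frame, foo_sum, and the intrinsic written
as a plain inline function).  Because the function returns the address of
a single static object, an optimizer that can see its body is free to fold
the call into a constant; whether a given backend or LTO pipeline actually
does so is not something this sketch demonstrates.

```c
#include <stddef.h>

/* The collected local variables, trimmed to the two that foo reads back. */
struct local_vars_s {
    int vint;
    int volatile vvint;
};

/* Hypothetical: the system runs one work group at a time, so a single
 * statically allocated frame is enough. */
static struct local_vars_s the_only_frame;

/* The intrinsic degenerates to returning an address constant. */
static inline void *get_work_group_local_base_addr(void) {
    return &the_only_frame;
}

/* Same load/store pattern as foo, returning the sum instead of storing it. */
int foo_sum(int a) {
    struct local_vars_s *const local_vars = get_work_group_local_base_addr();
    local_vars->vint = a;
    local_vars->vvint = a;
    return local_vars->vint + local_vars->vvint;
}
```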

I haven't thought through the implications of a kernel having such
vars calling another kernel having such variables.  At least the
OpenCL spec says that the behaviour is implementation-defined for such
a case.  It would be nice to be able to represent any of the sane
possibilities.

@Anton:  Regarding ARM's open-sourcing:  I'm glad to see the
reaffirmation, and I look forward to the contribution.  Yes, I
understand the virtues of patience.  :-)
I assume you plan to commit a document describing how OpenCL is
supported.  (e.g. how details like the above are handled.)

thanks,
david