[cfe-dev] [LLVMdev] OpenCL support
David Neto
dneto.llvm at gmail.com
Tue Dec 7 13:02:50 PST 2010
On Mon, Dec 6, 2010 at 6:16 PM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
>> -----Original Message-----
>> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
>> On Behalf Of Peter Collingbourne
>> Sent: Monday, December 06, 2010 2:56 PM
>> To: David Neto
>> Cc: cfe-dev at cs.uiuc.edu; llvmdev at cs.uiuc.edu
>> Subject: Re: [LLVMdev] [cfe-dev] OpenCL support
>>
>> Hi David,
>>
>> On Mon, Dec 06, 2010 at 11:14:42AM -0500, David Neto wrote:
>> > What do I think your patch should look like? It's true that the
>> > diag::err_as_qualified_auto_decl is inappropriate for OpenCL when
>> > it's the __local address space.
>> >
>> > But we need to implement the semantics somehow. Conceptually I think
>> > of it as a CL source-to-source transformation that lowers
>> > function-scope-local-address-space variables into a more primitive
>> > form.
>> >
>> > I think I disagree that Clang is an inappropriate spot for
>> > implementing this type of transform: Clang "knows" the source
>> > language semantics, and has a lot of the machinery required for the
>> > transform. Clang also knows a lot about the target machine (e.g. type
>> > sizes, builtins, more?).
>> >
>> > So I believe the "auto var in different address space" case should be
>> > allowed in the AST in the OpenCL case, and the local-lowering
>> > transform should be applied in CodeGen. Perhaps the lowering is
>> > target-specific, e.g. GPU-style, or more generic style as I proposed.
>> >
>> > Thoughts?
>>
>> I've been rethinking this and perhaps coming around to this way
>> of thinking. Allocating variables in the __local address space
>> is really something that can't be represented at the LLVM level,
>> at least in a standard form.
> [Villmow, Micah] We ran across this problem in our OpenCL implementation. However, you can create a global variable with an '__local' address space and it works fine. There is an issue with collision between auto-arrays in different kernels, but that can be solved with a little name mangling. There are other ways to do this, for example, by converting local auto-arrays into kernel local pointer arguments with a known size.
Here's a little example to show the direction I was heading, illustrated
as a CL-to-C translation. I believe it has no namespace issues, but it is
otherwise essentially the same as the global variable solution.
The idea is that the function-scope __local address space variables are
like a stack frame that is shared between the different work items in a
group. So collect all those variables in an anonymous struct, and then
create a function-scope private variable that points to the one copy of
that struct. The pointer is returned by a system-defined intrinsic
function whose result depends on the current work item. (The system
knows what work groups are in flight, which is why you need a
system-defined intrinsic.)
So a kernel function like this:
void foo(__global int*A) {
  __local int vint;
  __local int *vpint;
  __local int const *vcpint;
  __local int volatile vvint;
  int a = A[0];
  vint = a;
  vvint = a;
  int a2 = vint;
  int va2 = vvint;
  barrier(CLK_LOCAL_MEM_FENCE);
  A[0] = a2 + va2;
}
is translated to this, which does pass through Clang, with __local
defined as __attribute__((address_space(2))):
extern __local void * __get_work_group_local_base_addr(void); // intrinsic

void foo(__global int*A) {
  __local struct __local_vars_s {
    int vint;
    int *vpint;
    int const *vcpint;
    int volatile vvint;
  } * const __local_vars
    // This is a *private* variable, pointing to *local* addresses.
    // It's a const pointer because it shouldn't change, and
    // being const may expose optimizations.
    = __get_work_group_local_base_addr(); // the new intrinsic
  int a = A[0];
  __local_vars->vint = a; // l-values are translated as memory stores.
  __local_vars->vvint = a;
  int a2 = __local_vars->vint; // r-values are translated as memory loads
  int va2 = __local_vars->vvint;
  barrier(CLK_LOCAL_MEM_FENCE);
  A[0] = a2 + va2;
}
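To make the lowering concrete, here is a rough host-side emulation in
plain C. It is only a sketch, not OpenCL and not what any implementation
actually does: MAX_GROUPS, group_storage, current_group and run_in_group
are names made up for illustration, the intrinsic is stubbed out as an
ordinary function, and the barrier is left as a comment because the
emulation runs one work item at a time.

```c
/* Host-side C emulation of the proposed lowering. All names here are
   hypothetical stand-ins; nothing below is an existing OpenCL or Clang API. */
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS 4 /* assumption: at most 4 work groups in flight */

/* The function-scope __local variables of foo, gathered into one struct. */
struct foo_local_vars {
    int vint;
    volatile int vvint;
};

/* One copy of the struct per in-flight work group. */
static struct foo_local_vars group_storage[MAX_GROUPS];

/* Stand-in for the system's knowledge of which group is executing. */
static size_t current_group;

/* Stand-in for __get_work_group_local_base_addr(). */
static void *get_work_group_local_base_addr(void) {
    assert(current_group < MAX_GROUPS);
    return &group_storage[current_group];
}

/* Lowered body of foo, as executed by a single work item. */
static void foo_work_item(int *A) {
    struct foo_local_vars *const local_vars = get_work_group_local_base_addr();
    int a = A[0];
    local_vars->vint = a;        /* store to group-shared memory */
    local_vars->vvint = a;
    int a2 = local_vars->vint;   /* load from group-shared memory */
    int va2 = local_vars->vvint;
    /* barrier(CLK_LOCAL_MEM_FENCE) would go here on a real device */
    A[0] = a2 + va2;
}

/* Run one work item of the given group against buffer A. */
static void run_in_group(size_t group, int *A) {
    current_group = group;
    foo_work_item(A);
}
```

Calling run_in_group(0, A0) and then run_in_group(1, A1) leaves each
group's variables in its own slot of group_storage, which is the sharing
behaviour the struct-plus-intrinsic scheme is meant to capture.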
As an extension, the backend ought to be able to use some smarts to
simplify this down in simple cases. For example if the system only
ever allows one work group at a time, then the intrinsic could boil
down to returning a constant, and then link time optimization can
scrub away unneeded work. Similarly if you have a GPU style
environment where (as Peter described) the "local" addresses are the
same integer value but in different groups point to different storage,
then again the intrinsic returns a constant and again LTO optimizes
the result.
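A small plain-C sketch of the "intrinsic returns a constant" case, under
the assumption that only one work group is ever in flight. The names
(bar_local_vars, only_group_storage, single_group_local_base_addr and the
two store_and_reload functions) are invented for illustration. The point
is just that once the stub is inlined, the pointer indirection in the
lowered form can fold away to direct accesses, which is the same shape as
the global variable solution.

```c
/* Sketch of the single-work-group case; all names are hypothetical. */
#include <assert.h>

/* The kernel's function-scope __local variables, gathered as before. */
struct bar_local_vars {
    int vint;
};

/* Only one group is ever in flight, so one static copy is enough. */
static struct bar_local_vars only_group_storage;

/* The intrinsic degenerates to returning a constant address. */
static inline void *single_group_local_base_addr(void) {
    return &only_group_storage;
}

/* Lowered form: accesses go through the pointer from the intrinsic. */
static int store_and_reload_lowered(int a) {
    struct bar_local_vars *const local_vars = single_group_local_base_addr();
    assert(local_vars == &only_group_storage); /* always the same address */
    local_vars->vint = a;
    return local_vars->vint;
}

/* What an optimizer could reduce the lowered form to after inlining. */
static int store_and_reload_folded(int a) {
    only_group_storage.vint = a;
    return only_group_storage.vint;
}
```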
I haven't thought through the implications of a kernel having such
vars calling another kernel having such variables. At least the
OpenCL spec says that the behaviour is implementation-defined for such
a case. It would be nice to be able to represent any of the sane
possibilities.
@Anton: Regarding ARM's open-sourcing: I'm glad to see the
reaffirmation, and I look forward to the contribution. Yes, I
understand the virtues of patience. :-)
I assume you plan to commit a document describing how OpenCL is
supported. (e.g. how details like the above are handled.)
thanks,
david