[LLVMdev] [cfe-dev] RFC: Representation of OpenCL Memory Spaces

Fri Oct 14 10:40:08 PDT 2011

On Thu, Oct 13, 2011 at 04:14:09PM -0400, Justin Holewinski wrote:
> On Thu, Oct 13, 2011 at 11:57 AM, Peter Collingbourne <peter at pcc.me.uk>wrote:
> 
> > Hi Justin,
> >
> > Thanks for bringing this up, I think it's important to discuss
> > these issues here.
> >
> > On Thu, Oct 13, 2011 at 09:46:28AM -0400, Justin Holewinski wrote:
> > > It is becoming increasingly clear to me that LLVM address spaces are not
> > the
> > > general solution to OpenCL/CUDA memory spaces. They are a convenient hack
> > to
> > > get things working in the short term, but I think a more long-term
> > approach
> > > should be discussed and decided upon now before the OpenCL and CUDA
> > > implementations in Clang/LLVM get too mature. To be clear, I am not
> > > advocating that *targets* change to a different method for representing
> > > device memory spaces. The current use of address spaces to represent
> > > different types of device memory is perfectly valid, IMHO. However, this
> > > knowledge should not be encoded in front-ends and pre-SelectionDAG
> > > optimization passes.
> >
> > I disagree.  The targets should expose all the address spaces they
> > provide, and the frontend should know about the various address spaces
> > it needs to know about.  It is incumbent on the frontend to deliver
> > a valid IR for a particular language implementation, and part of
> > that involves knowing about the ABI requirements for the language
> > implementation (which may involve using specific address spaces)
> > and the capabilities of each target (including the capabilities of
> > the target's address spaces), together with the language semantics.
> > It is not the job of the optimisers or backend to know the semantics
> > for a specific language, a specific implementation of that language
> > or a specific ABI.
> >
> 
> But this is assuming that a target's address spaces have a valid 1 to 1
> mapping between OpenCL memory spaces and back-end address spaces.  What
> happens for a target such as x86?  Do we introduce pseudo address spaces
> into the back-end just to satisfy the front-end OpenCL requirements?

I don't see how anything I wrote implies that.  For x86, there would
presumably be a many-to-one mapping.

> > This presupposes that we need a way of representing OpenCL address
> > spaces in IR targeting X86 (and targets which lack GPU-like address
> > spaces).  As far as I can tell, the only real representations of
> > OpenCL address spaces on such targets that we need are a way of
> > distinguishing the different address spaces for alias analysis
> > and a representation for __local variables allocated on the stack.
> > TBAA metadata would solve the first problem, and we already have
> > mechanisms in the frontend that could be used to solve the second.
> >
> 
> Which mechanisms could be used to differentiate between thread-private and
> __local data?

In OpenCL C, it is illegal to declare a variable with static storage
duration in the __private address space (section 6.5: "All program
scope variables must be declared in the __constant address space.";
section 6.8g: "The extern, static, auto and register storage-class
specifiers are not supported.").  This implies that there is no way
for pointers to the __private address space to be usefully shared
between work-items without invoking undefined behaviour, so the
question is moot (i.e. __private does not need to be implemented using
thread-local storage).

It is possible to write OpenCL C code which shares pointers to
__private memory using barrier synchronisation, but since there is no
way to queue a memory fence across __private memory (only __local and
__global), any access to that memory would invoke undefined behaviour.
For example, consider the following (2 work-items in a work-group):

__kernel void foo() {
  int x = 0;
  int *__local p;
  if (get_local_id(0) == 0) p = &x;
  barrier(CLK_LOCAL_MEM_FENCE);
  if (get_local_id(0) == 1) *p = 1;
  barrier(CLK_LOCAL_MEM_FENCE);
  // what is the value of x in work-item 0 here?
}

The value of x at the comment is undefined, because no fence across
__private memory was queued.

Perhaps more straightforwardly, referring to the following passage
in section 3.3 ("memory model") of the OpenCL specification:

"Private Memory: A region of memory private to a work-item. Variables
defined in one work-item's private memory are not visible to another
work-item."

We can interpret the term "not visible" here as meaning that accesses
across work-items invoke undefined behaviour, so in the example above,
the write to x via p would itself be undefined.

Thanks,
-- 
Peter