[LLVMdev] [RFC] "noclone" function attribute

James Molloy James.Molloy at arm.com
Sat Dec 1 08:02:30 PST 2012


Hi,

OpenCL has a "barrier" function with very specific semantics, and there is currently no analogue to model this in LLVM.

This has been touched on by the SPIR folks but I don't believe they put forward a proposal.

The barrier function is a special function that ensures that all workitems in a workgroup executing a kernel have executed up to that point before any workitem in that workgroup can continue.

The CL spec is specific about how user kernels can use barriers - the sequence of barriers that are hit by all workitems in a workgroup must be identical. An issue occurs when defining what "the same barrier" actually means, however. GPU hardware and CPU implementations such as Ralf Karrenberg's (http://llvm.org/devmtg/2012-04-12/Slides/Ralf_Karrenberg.pdf) key off the PC, so barrier call A and barrier call B are the same if and only if the PC value at A and B is the same, for some definition of PC.

Last time this was mentioned, Eli suggested that keying off the PC was a bit silly - it is my understanding that the next CL spec has "named barriers" proposed, which give the key to the barrier function explicitly as a parameter. However even if this is ratified, we (CL vendors) still need to support the old behaviour of keying off the PC.

This (keying off the PC) has advantages in terms of implementation for the CPU. For an example, and an example of how this can go wrong, see the end of this message.

This can go wrong if a barrier call is cloned, which can currently happen in loop unrolling, loop unswitching and jump threading. I believe multiple CL vendors have hacked ad-hoc checks into these three areas - it'd be nice to standardise this and reduce downstream hacks.

I'm proposing a new function attribute, "noclone", with the semantics that calls to functions marked "noclone" cannot be cloned or duplicated into the same function. That is, if I is a call to a function marked "noclone", it is illegal to call J = I->clone() and then attach J to any basic block in the same function as I.

This means that cloning whole functions (CloneFunction and CloneFunctionInto) will still work fine, but CloneBasicBlock with a new parent set equal to the old parent (i.e. cloning a block in the same function) will assert.
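To make the proposed rule concrete, here is a minimal sketch in plain C. It is a toy model only: the Function and CallInst structs and the may_clone_call_into and clone_call_into helpers are hypothetical stand-ins, not LLVM's real classes or APIs, and the actual patch may look quite different.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for IR objects; these are NOT LLVM's real classes. */
typedef struct Function {
    const char *name;
    bool noclone;             /* models the proposed attribute */
} Function;

typedef struct CallInst {
    const Function *callee;   /* function being called */
    const Function *parent;   /* function that contains this call */
} CallInst;

/* Proposed rule: a call to a noclone function may not be duplicated
 * into the function that already contains it. */
bool may_clone_call_into(const CallInst *call, const Function *dest)
{
    if (!call->callee->noclone)
        return true;              /* ordinary calls are unrestricted */
    return dest != call->parent;  /* copying into another function is fine */
}

/* Hypothetical clone helper that enforces the rule with an assert. */
CallInst clone_call_into(const CallInst *call, const Function *dest)
{
    assert(may_clone_call_into(call, dest) && "noclone call duplicated");
    CallInst copy = { call->callee, dest };
    return copy;
}
```

In this model, copying a barrier call from kernel k into a fresh copy of k is allowed (the whole-function cloning case), while copying it back into k itself trips the assert (the same-function CloneBasicBlock case).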

I have a proof of concept patch for this but it's slightly out of date, so I'll need to update it.

I'm envisaging a large group of people with torches and pitchforks walking menacingly towards me right now, so without further ado I'll hand over to them to tell me where I've gone wrong and why the idea is utterly braindead...

Cheers,

James

EXAMPLE
=======

Ralf Karrenberg proposed an algorithm which, given a kernel like this:

kernel void k() {
  if (x())
    y();
  barrier();
  if (x())
    z();
  else
    w();
}

splits it up into sub-functions and produces a state machine and a loop similar to this:

while (1) {
  switch (state) {
  case STATE_START:
    for (x...) for (y...) for (z...)
      state = kernel_START(x, y, z);
    break;

  case STATE_BARRIER1:
    for (x...) for (y...) for (z...)
      state = kernel_BARRIER1(x, y, z);
    break;

  case STATE_END:
    return;
  }
}

where every kernel sub-function (kernel_START and kernel_BARRIER1 in this example) returns a new state.

Notice this relies upon all calls to either kernel_START or kernel_BARRIER1 returning the *same* next state. This is guaranteed by the OpenCL spec.
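As a rough illustration of that driver, the following self-contained C sketch simulates it for a one-dimensional workgroup of four workitems. Everything beyond the state names and the kernel_START / kernel_BARRIER1 split is an assumption made for the example: the 1-D id, the x() predicate (true for even ids), the trace buffer standing in for y(), z() and w(), and the run_region helper that checks every workitem agrees on the next state.

```c
#include <assert.h>
#include <stdbool.h>

enum kernel_state { STATE_START, STATE_BARRIER1, STATE_END };

enum { WORKGROUP_SIZE = 4 };

/* Per-workitem log of which helper ran, purely for illustration. */
char trace[WORKGROUP_SIZE][4];
int trace_len[WORKGROUP_SIZE];

void record(int id, char c) { trace[id][trace_len[id]++] = c; }

/* Hypothetical x(): true for even workitem ids. */
bool x(int id) { return id % 2 == 0; }

/* Code before the barrier; every workitem reports BARRIER1 next. */
enum kernel_state kernel_START(int id)
{
    if (x(id))
        record(id, 'y');
    return STATE_BARRIER1;
}

/* Code after the barrier. */
enum kernel_state kernel_BARRIER1(int id)
{
    record(id, x(id) ? 'z' : 'w');
    return STATE_END;
}

/* Run one sub-function over the whole workgroup and check that every
 * workitem agrees on the next state. */
enum kernel_state run_region(enum kernel_state (*region)(int))
{
    enum kernel_state next = region(0);
    for (int id = 1; id < WORKGROUP_SIZE; ++id) {
        enum kernel_state s = region(id);
        assert(s == next && "workitems disagree on the next barrier");
    }
    return next;
}

/* The state-machine loop: one pass over the workgroup per region. */
void run_kernel(void)
{
    enum kernel_state state = STATE_START;
    while (state != STATE_END) {
        switch (state) {
        case STATE_START:    state = run_region(kernel_START);    break;
        case STATE_BARRIER1: state = run_region(kernel_BARRIER1); break;
        case STATE_END:      break;
        }
    }
}
```

Calling run_kernel() once runs every workitem up to the barrier before any of them runs the code after it, which is the ordering the barrier is meant to enforce.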

Let's apply jump threading to that kernel:

kernel void k() {
  if (x()) {
    y();
    barrier();
    z();
  } else {
    barrier();
    w();
  }
}

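Oh dear. Now we'd end up creating a state machine with four states - START, BARRIER1, BARRIER2 and END. Workitems for which x() is true reach one barrier call and the rest reach the other, so kernel_START no longer returns the same next state for every workitem. It is no longer guaranteed that all workitems will hit the same barrier, even though the user's code upheld that invariant: the optimisation has broken it.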
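The divergence can be shown with a small C sketch. It models only the pre-barrier region of the jump-threaded kernel; the 1-D workitem id, the x() predicate (true for even ids) and the threaded_kernel_START and workgroup_agrees helpers are all hypothetical names chosen for illustration, and the call to y() is omitted because it does not affect which barrier is reached.

```c
#include <assert.h>
#include <stdbool.h>

enum kernel_state { STATE_START, STATE_BARRIER1, STATE_BARRIER2, STATE_END };

/* Hypothetical x(): true for even workitem ids. */
bool x(int id) { return id % 2 == 0; }

/* Pre-barrier region of the jump-threaded kernel: the barrier a
 * workitem reaches now depends on which branch it took. */
enum kernel_state threaded_kernel_START(int id)
{
    return x(id) ? STATE_BARRIER1 : STATE_BARRIER2;
}

/* Returns true if all workitems in [0, n) report the same next state. */
bool workgroup_agrees(enum kernel_state (*region)(int), int n)
{
    assert(n > 0);
    enum kernel_state first = region(0);
    for (int id = 1; id < n; ++id)
        if (region(id) != first)
            return false;
    return true;
}
```

With this x(), workgroup_agrees(threaded_kernel_START, 4) is false: workitem 0 asks for BARRIER1 and workitem 1 asks for BARRIER2, so the state machine has no single next state to move to.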




