[llvm-commits] [LLVMdev] [RFC] "noclone" function attribute

Sun Dec 16 08:20:39 PST 2012

On Dec 16, 2012, at 8:05 AM, James Molloy <James.Molloy at arm.com> wrote:

> Hi Richard,
> 
> Thanks for the in-depth reply. A quick comment though:
> 
>> If foo() depends on values that vary across the threads actually being executed in parallel, duplicating the barrier is fatal on our hardware, and is prohibited by the OpenCL spec.
> 
> If your hardware requires force-inlining, how can you *not* duplicate the barrier call in this instance? What do you actually do in this case, without function calls?

We inline b()'s call to barrier() into the kernel function k(); we always inline everything down to the barrier() call.

As I said in an earlier paragraph, we routinely duplicate barrier calls, and this isn't a problem in the constrained situations in which the OpenCL spec allows barrier() to be used. The case where foo() produces a different boolean result across the work items actually executing in parallel is prohibited by OpenCL, or, more precisely, results in "undefined behavior", to which we assign "hang" semantics. ;-)

Richard

> 
> Cheers,
> 
> James
> ________________________________________
> From: Relph, Richard [Richard.Relph at amd.com]
> Sent: 16 December 2012 15:59
> To: Kuperstein, Michael M
> Cc: Chris Lattner; James Molloy; Zaks, Ayal; llvm-commits; Aboud, Amjad
> Subject: Re: [llvm-commits] [LLVMdev] [RFC] "noclone" function attribute
> 
> Hi,
>    I've reviewed this thread from beginning to end a few times. I'm not sure I understand all the vocabulary being used, but I do understand the problem that LLVM sometimes creates for AMD GPUs, which I believe "noduplicate" is intended to address. I'm sorry if this is merely a distraction, but I hope that making this abstract discussion a bit more concrete will help. I think we represent one reason for the "vagueness" observed in the OpenCL spec.
> 
>    We are compelled to inline everything for OpenCL because some GPUs don't have a useful stack. OpenCL permits this inlining (OpenCL doesn't require us to support indirect calls or recursion, and we don't). As a result of inlining, we deliberately duplicate barriers all the time: whenever a kernel function makes multiple calls to a function that contains a barrier, each call site gets its own copy. This is not generally a problem; it is merely how we get to "PC + call stack" semantics. Or maybe it is why we do NOT need PC + call stack semantics, since we've already flattened the call tree. You decide.
>    In any case, "outlining" (if I understand that term correctly) would be a problem for our backend if done after our forced inlining, regardless of whether "noduplicate" is transitive, because our backend doesn't support calls to user (i.e., non-backend-recognized) functions. If it is done before, we'll merely undo the outlining by inlining.
>    Because barrier() is a backend-recognized function, the calls to barrier() remain in the IR even after inlining, so there is still a call to hang the noduplicate attribute on.
>    All work items (threads, certainly, but maybe not "threading" as that term has been used in this email thread) executing simultaneously on the GPU must eventually reach precisely the same barrier instruction (determined by PC and, due to inlining, "call stack"), or the GPU will appear to hang waiting for that to happen. If some work items hit a barrier at PC N while other work items hit a barrier at a different PC, the GPU is well and truly hung. This synchronization aspect of OpenCL barrier() is outside the scope of the noduplicate proposal, but I think the hang created by duplication is the reason for the noduplicate proposal, or at least the reason we care about most. ;-)
>    If only one work item is executing on a GPU, duplication is never a problem.
> 
>    Inlining everything can be a problem if some of the paths to barriers are conditional:
> 
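> void b(void) {
>    barrier(CLK_LOCAL_MEM_FENCE);
> }
> 
> int foo(void);   /* defined elsewhere */
> 
> kernel void k(int arg)
> {
>    if (foo()) {
>        /* ... */
>        b();
>        /* ... */
>    } else {
>        /* ... */
>        b();
>        /* ... */
>    }
> }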
> 
>    Because of mandatory inlining, even though the source contains only one call to barrier(), there will be two barrier instructions at different PCs.
>    If foo() produces the same boolean result across all work items being executed in parallel on a SIMD, then we don't care if the barrier is duplicated, as all work items are guaranteed to reach the same barrier instance, as required by the OpenCL spec. If foo() depends on values that vary across the threads actually being executed in parallel, duplicating the barrier is fatal on our hardware, and is prohibited by the OpenCL spec.
> 
>    LLVM optimizations sometimes create the latter situation where it did not exist in the original user program. This is, I believe, the point of noduplicate.
> 
> Richard
> 
> On Dec 15, 2012, at 3:19 AM, "Kuperstein, Michael M" <michael.m.kuperstein at intel.com> wrote:
> 
>> I agree with Chris that we are not trying to express a "barrier" attribute, but a generic "noduplicate" attribute. Barriers have other properties which are not expressed by "noduplicate", and are completely outside the scope of this proposal.
>> 
>> However, I disagree on whether transitivity should be a part of the "noduplicate" property.
>> What we are trying to do here is to define useful semantics for "noduplicate". The motivation for introducing this attribute is to allow IR producers to signal to standard LLVM passes that some barrier-like function is used, so those passes won't break code that calls this function. This means the definition of "noduplicate" needs to be strictly stronger than what barrier-like constructs require in terms of duplication, for any reading of the (ambiguous) OpenCL spec. Part of this "strictly stronger" requirement is transitivity, since a module that has a non-"noduplicate" caller of a "noduplicate" callee can still easily be broken (w.r.t. worst-case assumptions on barrier behavior) by standard passes, even if the rest of this patch goes in.
>> 
>> Having the Verifier check transitivity achieves two goals:
>> 1) It makes sure IR producers are not allowed to produce "breakable" code.
>> 2) It makes sure transforming the IR cannot introduce "breakable" code. For example, consider a pass that outlines a piece of code containing a noduplicate call but does not mark the new function noduplicate.
>> 
>> I agree that (2) creates problems for devirtualization, but I would argue that code with an indirect call to a possibly-noduplicate function (where the callsite itself was not marked noduplicate) was broken to begin with, and the fact we catch it after devirtualization is a good thing.
>> 
>> In any case, this patch isn't bad without transitivity, and I support it either way; I'm just saying it would be more useful with the transitive semantics.
>> 
>> Michael
>> 
>> -----Original Message-----
>> From: Chris Lattner [mailto:clattner at apple.com]
>> Sent: Saturday, December 15, 2012 01:18
>> To: James Molloy
>> Cc: Dmitri Gribenko; llvm-commits; Kuperstein, Michael M
>> Subject: Re: [llvm-commits] [LLVMdev] [RFC] "noclone" function attribute
>> 
>> On Dec 14, 2012, at 9:40 AM, James Molloy <James.Molloy at arm.com> wrote:
>>> Hi Chris,
>>> 
>>> Thanks for the review. Replying now instead of with an updated patch
>>> as I won't be able to get around to it until Monday.
>> 
>> Ok, no problem.
>> 
>>>> +        Assert1((*I)->cannotDuplicate(), "All functions which may transitively call a "
>>>> +                "noduplicate function must themselves be noduplicate!", &F);
>>>> 
>>>> This doesn't make a lot of sense to me to enforce.  Can you explain the intuition for this limitation?  In practice, this will be difficult to handle, because devirtualization (and other things) can turn an indirect call to a direct call... if that direct call has the wrong noduplicate sense, we will get bad things happening.
>>> 
>>> I can try. It all comes down to the reading of the spec and the spirit
>>> of the spec, and "ambiguous" is a kind way to describe it...
>> 
>> This is what I was afraid of.  We're discussing here the semantics of "noduplicate", not the semantics of barrier.  "noduplicate" semantics are a big part of what it means to be a barrier, but barriers may have additional requirements on top of it.
>> 
>> The LLVM "noduplicate" concept is independent of barrier, and should not have this transitive property.
>> 
>> -Chris
>> 
>> 
>> 
>> 
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>> 
> 
> 
> 
> 
>