[Clang] Convergent Attribute

Mon May 9 14:45:04 PDT 2016

On Mon, May 9, 2016 at 2:43 PM, Richard Smith <richard at metafoo.co.uk> wrote:

> On Sun, May 8, 2016 at 12:43 PM, Matt Arsenault via cfe-commits <
> cfe-commits at lists.llvm.org> wrote:
>
>> On May 6, 2016, at 18:12, Richard Smith via cfe-commits <
>> cfe-commits at lists.llvm.org> wrote:
>>
>> On Fri, May 6, 2016 at 4:20 PM, Matt Arsenault via cfe-commits <
>> cfe-commits at lists.llvm.org> wrote:
>>
>>> On 05/06/2016 02:42 PM, David Majnemer via cfe-commits wrote:
>>>
>>>> This example looks wrong to me. It doesn't seem meaningful for a
>>>> function to be both readonly and convergent, because convergent means the
>>>> call has some side-effect visible to other threads and readonly means the
>>>> call has no side-effects visible outside the function.
>>>>
>>> This is not correct. It is valid for convergent operations to be
>>> readonly/readnone. Barriers are a common case which do have side effects,
>>> but there are also classes of GPU instructions which do not access memory
>>> and still need the convergent semantics.
>>>
>>
>> Can you give an example? It's not clear to me how a function could be
>> both convergent and satisfy the readnone requirement that it not
>> "access[...] any mutable state (e.g. memory, control registers, etc)
>> visible to caller functions". Synchronizing with other threads seems like
>> it would cause such a state change in an abstract sense. Is the critical
>> distinction here that the state mutation is visible to the code that
>> spawned the gang of threads, but not to other threads within the gang?
>> (This seems like a bug in the definition of readonly if so, because it
>> means that a readonly call whose result is unused cannot be deleted.)
>>
>> I care about this because Clang maps __attribute__((pure)) to LLVM
>> readonly, and -- irrespective of the LLVM semantics -- a call to a function
>> marked pure is permitted to be deleted if the return value is unused, or to
>> have multiple calls CSE'd. As a result, inside Clang, we use that attribute
>> to determine whether an expression has side effects, and Clang's reasoning
>> about these things may also lead to miscompiles if a call to a function
>> marked __attribute__((pure, convergent)) actually can have a side effect.
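A minimal C sketch of the optimizations described here, assuming GCC or Clang. Because `lookup` is declared `pure`, the compiler may compute `lookup(i)` once and reuse the value, and it may drop the call whose result is discarded. `table`, `lookup`, and `twice` are hypothetical names. The function's result is the same whether or not those optimizations fire, and that is the guarantee that would be in question if a function marked `pure, convergent` could have side effects.

```c
#include <stddef.h>

/* Hypothetical read-only data consulted by the pure function. */
static const int table[4] = {10, 20, 30, 40};

/* pure: the result depends only on the argument and on memory the
   function reads; it performs no stores visible to the caller. */
__attribute__((pure)) int lookup(size_t i) {
    return table[i % 4];
}

int twice(size_t i) {
    /* Since lookup is pure, the compiler may evaluate it once here
       and reuse the value for b (common subexpression elimination). */
    int a = lookup(i);
    int b = lookup(i);
    /* The compiler may delete this call entirely: its result is unused. */
    (void)lookup(i + 1);
    return a + b;
}
```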
>> _______________________________________________
>> cfe-commits mailing list
>> cfe-commits at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
>>
>>
>> These are communication operations between lanes that do not require
>> synchronization within the wavefront. These are mostly cross-lane
>> communication instructions. An example would be the amdgcn.mov.dpp
>> instruction, which reads a register from a neighboring lane, or the CUDA
>> warp vote functions.
>>
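A simplified, single-threaded model of the kind of cross-lane read described here, written in C for illustration only. A wavefront is represented as an array holding one register value per lane, and each lane receives the value held by the next lane. This is not the actual `amdgcn.mov.dpp` semantics (real DPP offers several row and lane-select modes); `WAVE_SIZE` and `shift_from_next_lane` are hypothetical. No lane's source value is modified; each lane only gains a result read from a neighbour.

```c
#include <stddef.h>

#define WAVE_SIZE 8  /* hypothetical; GCN wavefronts have 64 lanes */

/* Simplified model of a cross-lane read: lane i receives the value held
   in lane (i + 1) of the same wavefront, wrapping around at the end.
   In hardware the value comes from another lane's register, so the result
   depends on which lanes execute the instruction together; that is the
   property the convergent attribute protects. */
void shift_from_next_lane(const int src[WAVE_SIZE], int dst[WAVE_SIZE]) {
    for (size_t lane = 0; lane < WAVE_SIZE; ++lane)
        dst[lane] = src[(lane + 1) % WAVE_SIZE];
}
```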
>
> Those both appear to technically fail to satisfy the requirements of an
> __attribute__((pure)) function. If I understand correctly, the DPP function
> effectively stores a value into some state that is shared with another lane
> (from Clang and LLVM's perspectives, state that is visible to a function
> evaluation outside the current one), and then reads a value from another
> such shared storage location. The CUDA warp vote functions effectively
> store a value into some state that is shared with all other threads in the
> warp and then read some summary information about the values stored by all
> the threads. In both cases, the function mutates state that is visible to
> other functions running on other threads, and so is not
> __attribute__((pure)) / readonly, as far as I can see.
>

(And just to be clear, the fact that no actual storage is used for this is
irrelevant to the notional semantics of the operation. Note that the
definition of the pure attribute also covers "control registers, etc".)
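To make the notional store-then-read model concrete, here is a hypothetical C simulation of a warp ballot vote (loosely analogous to CUDA's ballot vote function): every lane writes its predicate bit into warp-wide state, and every lane then reads the combined mask. The type and function names are invented for illustration. The `shared_mask` field plays the role of the "control registers, etc." in the pure attribute's definition: nothing in hardware need correspond to it, but in the source model it is mutable state observed by other invocations.

```c
#include <stddef.h>
#include <stdint.h>

#define WARP_SIZE 32  /* CUDA warps have 32 threads */

/* Notional warp-wide state; no real storage need back it. */
typedef struct {
    uint32_t shared_mask;
} warp_state;

/* Phase 1: a lane contributes its predicate bit (the notional "store"). */
static void vote_store(warp_state *w, size_t lane, int pred) {
    if (pred)
        w->shared_mask |= (uint32_t)1 << lane;
}

/* Phase 2: a lane reads the combined result (the notional "load"). */
static uint32_t vote_load(const warp_state *w) {
    return w->shared_mask;
}

/* Runs the vote for a whole warp: preds[i] is lane i's predicate, and
   results[i] receives the ballot mask that lane i observes. */
void warp_ballot(const int preds[WARP_SIZE], uint32_t results[WARP_SIZE]) {
    warp_state w = {0};
    for (size_t lane = 0; lane < WARP_SIZE; ++lane)
        vote_store(&w, lane, preds[lane]);
    for (size_t lane = 0; lane < WARP_SIZE; ++lane)
        results[lane] = vote_load(&w);
}
```

In this model, skipping the `vote_store` phase for a single lane changes what every other lane observes, which is why a per-invocation reading of `pure` does not fit such operations.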

> It seems to me that this change weakens the definition of these attributes
> when combined with the convergent attribute to mean that the function *is*
> still allowed to store to mutable state that's shared with other lanes /
> other threads in the same warp, but only via convergent combined store/load
> primitives. That makes some sense, given that the behavior of the
> *execution* model does not (necessarily) treat each notional lane as a
> separate thread, and from that perspective the instruction can be viewed as
> operating on a vector and communicating only with itself, but it doesn't
> match the current definitions of the semantics of these attributes (which
> are specified in terms of the *source* model, in which each notional lane
> is a separate invocation of the function). So I'd like at least for some
> documentation to be added for our "pure" and "const" attributes, saying
> something like "if this is combined with the "convergent" attribute, the
> function may still communicate with other lanes through convergent
> operations, even though such communication notionally involves modification
> of mutable state visible to the other lanes". I'd suggest a similar change
> also be made to LLVM's LangRef.
>
>
> I've checked how Clang is using the "pure" attribute, and it seems
> like it should mostly do the right thing in this case. There are a few
> places where (using your amdgcn.mov.dpp example) we would cause a dpp
> instruction to be emitted where the source code called the relevant
> operation from within an operand that we do not notionally evaluate (for
> instance, the operand of a __assume or __builtin_object_size). Contrived
> example:
>
>   enum { N = 16 };  /* hypothetical size, for illustration only */
>   int arr[N];
>   int dpp(int n) __attribute__((convergent, pure));
>   void f(int id) {
>     int x = 0;
>     if (id % 2) x = __builtin_object_size(&arr[dpp(id)], 0);
>   }
>
> We'll emit code to call the dpp function here, because we believe it has
> no side-effects. However, in both cases where we do this, we require the
> relevant expression to have defined behaviour (even though we say we won't
> perform any side-effects contained within it) so it wouldn't be valid to
> call a convergent function except from a convergent point in the
> surrounding function. So I think the worst effect of this would be that we
> would emit extra convergent operations; the resulting code should still be
> correct.
>
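The unevaluated-operand behaviour can be checked with an ordinary function that has a visible side effect. In this sketch (`calls`, `next_index`, `arr`, and `probe_object_size` are hypothetical names), GCC documents that `__builtin_object_size` never evaluates its argument for side effects and returns `(size_t)-1` for type 0 when side effects are present, so `next_index` is never called. A callee marked `pure` is instead treated as side-effect free, which is why, per the analysis above, Clang may emit the call in that case.

```c
#include <stddef.h>

static int calls;  /* counts how many times next_index actually runs */

/* An ordinary (non-pure) function with a visible side effect. */
static int next_index(void) {
    ++calls;
    return 1;
}

static int arr[8];

/* The pointer operand contains a side-effecting call, so the builtin
   does not evaluate it and falls back to the "unknown" result for
   type 0, which is (size_t)-1. */
size_t probe_object_size(void) {
    return __builtin_object_size(&arr[next_index()], 0);
}
```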
>> There is no synchronization required, and there is no other way for one
>> work-item to access information private to another work-item. There’s
>> no observable global state from the perspective of a single lane. The
>> individual registers changed aren’t visible to the spawning host program
>> (perhaps with the exception of some debug hardware inspecting all of the
>> individual registers). Deleting these would be perfectly acceptable if the
>> result is unused.
>>
>> -Matt
>>
>