[llvm-dev] Memory scope proposal

Philip Reames via llvm-dev llvm-dev at lists.llvm.org
Fri Sep 2 17:52:22 PDT 2016



On 09/01/2016 08:52 AM, Tom Stellard via llvm-dev wrote:
> On Wed, Aug 31, 2016 at 12:23:34PM -0700, Justin Lebar via llvm-dev wrote:
>>> Some optimizations that are related to a single thread could be done without needing to know the actual memory scope.
>> Right, it's clear to me that there exist optimizations that you cannot
>> do if we model these ops as target-specific intrinsics.
>>
>> But what I think Mehdi and I were trying to get at is: How much of a
>> problem is this in practice?  Are there real-world programs that
>> suffer because we miss these optimizations?  If so, how much?
>>
>> The reason I'm asking this question is, there's a real cost to adding
>> complexity in LLVM.  Everyone in the project is going to pay that
>> cost, forever (or at least, until we remove the feature :).  So I want
>> to try to evaluate whether paying that cost is actually worth while,
>> as compared to the simple alternative (i.e., intrinsics).  Given the
>> tepid response to this proposal, I'm sort of thinking that now may not
>> be the time to start paying this cost.  (We can always revisit this in
>> the future.)  But I remain open to being convinced.
>>
> I think the cost of adding this information to the IR is really low.
> There is already a synchronization scope field present for LLVM atomic
> instructions, and it is already being encoded as 32-bits, so it is
> possible to represent the additional scopes using the existing bitcode
> format.  Optimization passes are already aware of this synchronization
> scope field, so they know how to preserve it when transforming the IR.
I disagree with this assessment.  Atomics are an area where additional 
complexity has a *substantial* conceptual cost.  I also question whether 
the single_thread scope is actually respected throughout the optimizer 
in practice.

I view the request to change the IR as a fairly big ask.  In 
particular, I'm really nervous about what the exact optimization 
semantics of such scopes would be.  Depending on how that was defined, 
this could be anything from fairly straightforward to outright messy.  
In particular, if there are optimizations which are legal for only some 
subset of scopes (or subset of pairs of scopes?), I'd really like to see 
a clear definition of when those optimizations apply.

(p.s. Is there a current patch with an updated LangRef for the proposal 
being discussed?  I've lost track of it.)

Let me give an example proposal just to illustrate my point.  This isn't 
really a counter proposal per se, just me thinking out loud.

Say we added 32 distinct concurrency domains.  One of them is used for 
"single_thread".  One is used for "everything else".  The remaining 30 
are defined in a target-specific manner, with the exception that they 
can't overlap with each other or with the two predefined ones.  The 
effect of a given atomic operation with respect to each concurrency 
domain could be defined in terms of a 32-bit mask.  If a bit is set, the 
operation is ordered (according to the separately stated ordering) with 
that domain.  If not, it is explicitly unordered w.r.t. that domain.  A 
memory operation would be tagged with the memory domains with which it 
might interact.

The key bit here is that I can describe transformations in terms of 
these abstract domains without knowing anything about how the frontend 
might be using such a domain or how the backend might lower it.  In 
particular, if I have the sequence:
  %v = load atomic i64, i64* %p scope={domain3 only}
  fence seq_cst scope={domain1 only}
  %v2 = load atomic i64, i64* %p scope={domain3 only}

I can tell that the two loads aren't ordered with respect to the fence 
and that I can do load forwarding here.


In general, an IR extension needs to be well defined, general enough to 
be used by multiple distinct users, and fairly battle-tested 
design-wise.  We're not completely afraid of having to remove bad ideas 
from the IR, but we really try to avoid adding things until they're 
fairly proven.


>
> The primary goal here is to pass synchronization scope information from
> the frontend to the backend.  We already have a mechanism for doing this,
> so why not use it?  That seems like the lowest cost option to me.
>
> -Tom
>
>> As a point of comparison, we have a rule of thumb that we'll add an
>> optimization that increases compilation time by x% if we have a
>> benchmark that is sped up by at least x%.  Similarly here, I'd want to
>> weigh the added complexity against the improvements to user code.
>>
>> -Justin
>>
>> On Tue, Aug 23, 2016 at 2:28 PM, Tye, Tony via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>>> Since the scope is “opaque” and target specific, can you elaborate what
>>>> kind of generic optimization can be performed?
>>>
>>>
>>> Some optimizations that are related to a single thread could be done without
>>> needing to know the actual memory scope. For example, an atomic acquire
>>> restricts the reordering of memory operations after it, but allows memory
>>> operations (except another atomic acquire) to be reordered before it,
>>> regardless of the memory scope.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> -Tony
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
