[llvm-dev] Memory scope proposal

Fri Sep 9 12:34:10 PDT 2016

Currently the Synchronization Scope (aka memory scope) information appears not to be used in any atomic-related optimizations. It would seem that any such optimizations should consider memory scope, and an approach such as the one suggested by Philip seems reasonable. Could that change be tackled as a separate patch? Initially, atomic optimizations could be restricted to cases where the memory scopes are exactly equal, which should be conservatively correct.
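
To make the "exactly equal" restriction concrete, here is a minimal Python sketch of the legality check. It is only a model, not LLVM code: the AtomicOp class, the scopes_allow_optimization helper, and the scope names used in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicOp:
    """Toy stand-in for an atomic instruction (hypothetical, not an LLVM class)."""
    kind: str      # e.g. "load", "store", "fence"
    ordering: str  # e.g. "acquire", "release", "seq_cst"
    scope: str     # opaque, target-defined scope name


def scopes_allow_optimization(a: AtomicOp, b: AtomicOp) -> bool:
    """Conservative rule: only reason about a pair of atomics whose scopes
    are exactly equal. Any mismatch is treated as 'do not optimize'."""
    return a.scope == b.scope
```

For example, two acquire loads tagged with the same (assumed) scope name "workgroup" pass the check, while a "workgroup" load paired with an "agent" load does not. A pass would still need its usual ordering checks on top of this; the sketch covers only the scope condition.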

This patch would be a first step towards adding support for atomics with memory scopes used by languages such as OpenCL. Doing this simplifies how memory scope information is passed from Clang to the code generator, as mentioned by Tom. There seem to be several companies interested in doing this, as it will simplify the code, allow atomics to be handled in a consistent way for all languages, and allow atomic optimizations to benefit these languages.

Thanks,
-Tony

On Sep 2, 2016, at 11:13 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:

On Sep 2, 2016, at 5:52 PM, Philip Reames via llvm-dev <llvm-dev at lists.llvm.org> wrote:

On 09/01/2016 08:52 AM, Tom Stellard via llvm-dev wrote:

On Wed, Aug 31, 2016 at 12:23:34PM -0700, Justin Lebar via llvm-dev wrote:

Some optimizations that are related to a single thread could be done without needing to know the actual memory scope.
Right, it's clear to me that there exist optimizations that you cannot
do if we model these ops as target-specific intrinsics.

But what I think Mehdi and I were trying to get at is: How much of a
problem is this in practice?  Are there real-world programs that
suffer because we miss these optimizations?  If so, how much?

The reason I'm asking this question is, there's a real cost to adding
complexity in LLVM.  Everyone in the project is going to pay that
cost, forever (or at least, until we remove the feature :).  So I want
to try to evaluate whether paying that cost is actually worthwhile,
as compared to the simple alternative (i.e., intrinsics).  Given the
tepid response to this proposal, I'm sort of thinking that now may not
be the time to start paying this cost.  (We can always revisit this in
the future.)  But I remain open to being convinced.
I think the cost of adding this information to the IR is really low.
There is already a synchronization scope field present for LLVM atomic
instructions, and it is already being encoded as 32 bits, so it is
possible to represent the additional scopes using the existing bitcode
format.  Optimization passes are already aware of this synchronization
scope field, so they know how to preserve it when transforming the IR.
I disagree with this assessment.  Atomics are an area where additional complexity has a *substantial* conceptual cost.  I also question whether the single_thread scope is actually respected throughout the optimizer in practice.

I view the request of changing the IR as a fairly big ask.  In particular, I'm really nervous about what the exact optimization semantics of such scopes would be.  Depending on how that was defined, this could be anything from fairly straightforward to outright messy.  In particular, if there are optimizations which are legal for only some subset of scopes (or subset of pairs of scopes?), I'd really like to see a clear definition given for how those are defined.
(p.s. Is there a current patch with an updated LangRef for the proposal being discussed?  I've lost track of it.)

Here is the patch: https://reviews.llvm.org/D21723

Let me give an example proposal just to illustrate my point.  This isn't really a counter proposal per se, just me thinking out loud.

Say we added 32 distinct concurrent domains.  One of them is used for "single_thread".  One is used for "everything else".  The remaining 30 are defined in a target specific manner w/the exception that they can't overlap with each other or with the two predefined ones.  The effect of a given atomic operation with respect to each concurrency domain could be defined in terms of a 32 bit mask.  If a bit was set, the operation is ordered (according to the separately stated ordering) with that domain.  If not, it is explicitly unordered w.r.t. that domain.  A memory operation would be tagged with the memory domains with which it might interact.

The key bit here is that I can describe transformations in terms of these abstract domains without knowing anything about how the frontend might be using such a domain or how the backend might lower it.  In particular, if I have the sequence:
%v = load i64, %p atomic scope {domain3 only}
fence seq_cst scope={domain1 only}
%v2 = load i64, %p atomic scope {domain3 only}

I can tell that the two loads aren't ordered with respect to the fence and that I can do load forwarding here.
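
The mask idea can be modeled in a few lines of Python. This is a rough sketch under stated assumptions, not a proposed implementation: the bit positions, the target_domain helper, and the function names are all invented for illustration, and the check looks only at the domain masks (it ignores intervening stores and every other condition a real load-forwarding transform would need).

```python
# Bit assignments below are assumptions made for this sketch only.
SINGLE_THREAD = 1 << 0    # predefined domain
EVERYTHING_ELSE = 1 << 1  # predefined domain


def target_domain(index: int) -> int:
    """Mask bit for one of the 30 target-defined domains (index 0..29)."""
    if not 0 <= index < 30:
        raise ValueError("this model has only 30 target-defined domains")
    return 1 << (index + 2)


def is_ordered_with(op_mask: int, other_mask: int) -> bool:
    """An operation is ordered with another only if they share a domain bit."""
    return (op_mask & other_mask) != 0


def fence_blocks_forwarding(load_mask: int, fence_mask: int) -> bool:
    """The fence stands between the two loads only if it is ordered with
    the domain(s) the loads are tagged with."""
    return is_ordered_with(fence_mask, load_mask)


# Stand-ins for "domain1" and "domain3" from the sequence above.
DOMAIN1 = target_domain(1)
DOMAIN3 = target_domain(3)
```

With these definitions, fence_blocks_forwarding(DOMAIN3, DOMAIN1) is False, matching the conclusion that the loads can be forwarded across the fence, while a fence tagged DOMAIN1 | DOMAIN3 would block it.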

I see the current proposal as a stripped-down version of what you describe: the optimizer can reason about operations inside a single scope, but can’t assume anything cross-scope (they may or may not interact with each other).

What you describe seems like always having non-overlapping domains (from the optimizer's point of view), and requiring the frontend to express the overlap by attaching a "list" of domains that an atomic operation interacts with.
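
One way to picture the frontend side of that is the Python sketch below. Everything in it is an assumption for illustration: the scope names, the domain bits, and the lowering table are hypothetical and do not come from the patch or from any language specification. It only shows overlapping frontend scopes being expressed as lists (masks) of non-overlapping optimizer domains.

```python
# Hypothetical non-overlapping domains, as the optimizer would see them.
DOMAIN_SINGLE_THREAD = 1 << 0
DOMAIN_WORK_GROUP = 1 << 1
DOMAIN_DEVICE = 1 << 2

# Assumed frontend lowering: a wider scope lists every domain it covers,
# so the overlap lives in the list rather than in the domains themselves.
FRONTEND_SCOPE_TO_DOMAINS = {
    "single_thread": DOMAIN_SINGLE_THREAD,
    "work_group": DOMAIN_WORK_GROUP,
    "device": DOMAIN_WORK_GROUP | DOMAIN_DEVICE,
}


def may_interact(scope_a: str, scope_b: str) -> bool:
    """Two operations may interact only if their domain lists intersect."""
    mask_a = FRONTEND_SCOPE_TO_DOMAINS[scope_a]
    mask_b = FRONTEND_SCOPE_TO_DOMAINS[scope_b]
    return (mask_a & mask_b) != 0
```

In this toy table, may_interact("work_group", "device") is True because the lists share a domain, and may_interact("single_thread", "device") is False. The optimizer never needs to know what the frontend scopes mean, only whether the masks intersect.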

I hope I make sense :)

Best,

—
Mehdi

In general, an IR extension needs to be well defined, general enough to be used by multiple distinct users, and fairly battle tested design wise.  We're not completely afraid of having to remove bad ideas from the IR, but we really try to avoid adding things until they're fairly proven.

The primary goal here is to pass synchronization scope information from
the frontend to the backend.  We already have a mechanism for doing this,
so why not use it?  That seems like the lowest cost option to me.

-Tom

As a point of comparison, we have a rule of thumb that we'll add an
optimization that increases compilation time by x% if we have a
benchmark that is sped up by at least x%.  Similarly here, I'd want to
weigh the added complexity against the improvements to user code.

-Justin

On Tue, Aug 23, 2016 at 2:28 PM, Tye, Tony via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

Since the scope is “opaque” and target specific, can you elaborate what
kind of generic optimization can be performed?

Some optimizations that are related to a single thread could be done without
needing to know the actual memory scope. For example, an atomic acquire can
restrict reordering memory operations after it, but allow reordering of
memory operations (except another atomic acquire) before it, regardless of
the memory scope.
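
As a small illustration of the acquire example above, the Python sketch below encodes that rule as a predicate. The function name and parameters are hypothetical. The scope arguments are accepted but deliberately never consulted, which is the point of the example: the decision does not depend on the memory scope.

```python
def can_reorder_across_acquire(op_is_acquire: bool,
                               op_precedes_acquire: bool,
                               op_scope: str = "",
                               acquire_scope: str = "") -> bool:
    """Model of the rule stated above for moving a memory operation across
    an atomic acquire. op_scope and acquire_scope are intentionally unused."""
    if not op_precedes_acquire:
        # Operations after the acquire may not be moved above it.
        return False
    # Operations before the acquire may be moved below it,
    # unless the operation is itself another atomic acquire.
    return not op_is_acquire
```

For instance, a plain store that precedes the acquire may be moved past it whatever scope strings are supplied, while an operation that follows the acquire, or an earlier acquire, may not.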

Thanks,

-Tony
