[LLVMdev] memory scopes in atomic instructions
Sahasrabuddhe, Sameer
sameer.sahasrabuddhe at amd.com
Wed Nov 19 09:54:23 PST 2014
On 11/19/2014 4:05 AM, Chandler Carruth wrote:
>
> On Fri, Nov 14, 2014 at 1:09 PM, Sahasrabuddhe, Sameer
> <sameer.sahasrabuddhe at amd.com>
> wrote:
>
> 1. Update the synchronization scope field in atomic instructions from a
> single bit to a wider field, say a 32-bit unsigned integer.
>
>
> I think this should be an arbitrary bit width integer. I think baking
> any size into this is a mistake unless that size is "1".
I noticed that the LRM never specifies a width for address spaces, but
the implementation uses "unsigned" everywhere, which is clearly not an
arbitrary width integer. Is this how memory scopes should also be
implemented?
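For comparison, here is how the two look in textual IR today, along with
a purely hypothetical spelling for a wider scope operand (the
synchscope(N) syntax below is only an illustration of the proposal, not
existing syntax):

  ; Address spaces already take an arbitrary unsigned integer in the IR
  ; text, even though the C++ implementation stores them as 'unsigned':
  @flag = addrspace(1) global i32 0

  define void @example(i32* %p) {
    ; Existing single-bit scope: either the default cross-thread scope...
    store atomic i32 1, i32* %p seq_cst, align 4
    ; ...or the 'singlethread' scope.
    store atomic i32 1, i32* %p singlethread seq_cst, align 4
    ; Hypothetical wider field (illustration only):
    ;   store atomic i32 1, i32* %p synchscope(2) seq_cst, align 4
    ret void
  }
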
> 4. The use of "single thread scope" is not clear.
>
>
> Consider trying to read memory written by a thread from a signal
> handler delivered to that thread. Essentially, there may be a need to
> write code which we know will execute in a single hardware thread, but
> where the compiler optimizations that atomics normally preclude still
> need to be precluded, because control flow within the hardware thread
> may arbitrarily move from one sequence of instructions to another.
>
> If it is required in
> target-independent transforms,
>
>
> Yes, it is. sig_atomic_t.
Thanks! This also explains why SingleThread is baked into tsan. I
couldn't find a way to work around __tsan_atomic_signal_fence if I
removed SingleThread as a well-known memory scope.
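For reference, this is exactly the case the existing 'singlethread' scope
captures: a C11 atomic_signal_fence only has to order operations against
a handler running on the same thread, and Clang lowers it to a
single-thread fence. A minimal IR sketch:

  ; Roughly what atomic_signal_fence(memory_order_seq_cst) becomes: it
  ; orders memory operations only within the current thread (e.g. against
  ; an asynchronous signal handler), so it acts as a compiler barrier and
  ; usually needs no hardware fence instruction.
  define void @signal_fence() {
    fence singlethread seq_cst
    ret void
  }

  ; A cross-thread fence, by contrast, omits the scope and must also
  ; order operations with respect to other hardware threads:
  define void @thread_fence() {
    fence seq_cst
    ret void
  }
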
>
> 5. Possibly add the following constraint on memory scopes: "The scope
> represented by a larger value is nested inside (is a proper subset of)
> the scope represented by a smaller value." This would also imply that
> the value used for single-thread scope must be the largest value used
> by the target.
> This constraint on "nesting" is easily satisfied by HSAIL (and also
> OpenCL), where synchronization scopes increase from a single
> work-item to the entire system. But it is conceivable that other
> targets do not have this constraint. For example, a platform may
> define synchronization scopes in terms of overlapping sets instead
> of proper subsets.
>
>
> I think this is the important thing to settle on in the design. I'd
> really like to hear from a diverse set of vendors and folks operating
> in the GPU space to understand whether having this constraint is
> critically important or problematic for any reasons.
I think "heterogenous systems" (in general, and not just HSA) might be a
better term since it covers more than just GPU devices.
Also, I don't see why this constraint in the general LLVM IR could be
critically important to any target, but I can see why it could be
problematic for one! If I understand correctly, the main issue is that if
we do not build nested scopes into the IR, then we can never have
target-independent optimizations that work with multiple memory scopes.
Is that correct? And is it really so important? What happens when we do
get a target whose memory scopes are not nested? Won't it then be harder
to remove this assumption from the target-independent optimizations?
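To make the nesting question concrete, here is a purely illustrative
numbering for an HSAIL/OpenCL-like target, written against the proposed
integer scope field (neither the synchscope(N) spelling nor the specific
values exist today):

  ; Hypothetical scope numbering where each larger value denotes a scope
  ; properly nested inside the previous one:
  ;   0 = system     (all-encompassing)
  ;   1 = agent      (one device)
  ;   2 = work-group
  ;   3 = wavefront
  ;   4 = work-item  (single thread -- necessarily the largest value)
  ;
  ; Under the proposed constraint, a work-group-scoped release such as
  ;   store atomic i32 1, i32 addrspace(1)* %p synchscope(2) release, align 4
  ; only has to be ordered with respect to work-items in the same
  ; work-group, and is strictly weaker than the same store at synchscope(1)
  ; or synchscope(0). A target whose scopes overlap instead of nesting
  ; admits no such total order, which is the case I am worried about.
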
> I think (unfortunately) it would be hard to add this later...
I am not sure I understand this part. The only effect I see is that
targets might use enumerations that do not follow a strict order in
their list of memory scopes. We can always encourage a forward-looking
convention of listing the memory scopes in nesting order. And in the
worst case, the enumerations can be reordered when the need arises, right?
Sameer.