[llvm-dev] RFC: Element-atomic memory intrinsics

Mon May 8 07:54:29 PDT 2017

Greetings all,
 I am picking up the work that was started in https://reviews.llvm.org/D27133 — adding support for an element-atomic memcpy/memset/memmove to LLVM. I would appreciate suggestions/thoughts/advice/comments on how to best proceed with this work in a way that will be acceptable to the LLVM group.

 I apologize in advance; this is going to be a long one...

**Background**

 Loads/stores of shared data in Java must be done in an atomic-like manner — it must essentially look like the entire variable is loaded/stored in a single operation. So, no doing stuff like splitting the load of a uint32 into four loads of uint8’s, or two loads of uint16’s; such a splitting could result in a data-race where each of the split loads can load a different version of the variable’s value (due to the variable being stored-to in other threads of execution). In LLVM, the atomic-like behaviour is achieved by emitting an atomic load/store with the ‘unordered’ ordering constraint.

 In LLVM there is a loop idiom recognition pass that will convert loops that look like memcpy/memmove/memset into a direct call to the corresponding intrinsic — which allows the rest of the LLVM passes to reason about the operation better, and for codegen to materialize probably-better optimized versions of the loop (ex: one that uses wider loads/stores).  But, as one might expect, this idiom recognition doesn’t touch atomic operations — i.e. if a loop that’s doing a memcpy is doing so with unordered loads & stores then loop idiom should not convert that into a memcpy. Reason being that memcpy/memset/memmove implementations are allowed to operate on the byte-level when doing the op, and that would break our requirement that our data be loaded/stored in an atomic-like manner if we’re, say, copying arrays of uint32’s.

**Goal**

 We would like for loop idiom to be able to convert unordered load/store loops into memory intrinsics, but doing so requires that we have some way of expressing that the resulting memory intrinsic be *unordered* and what the element-size is. We would also like all/most of the existing LLVM passes that understand the semantics of the memory intrinsics to understand the semantics of these ‘unordered’ versions.

**Method**

 Clearly we are going to have to teach LLVM about unordered memory intrinsics. There are, as I can see it, a couple of ways to accomplish this. I’d like your opinions on which are preferable, or if you can think of any other options. In no particular order…

Option 1)
 Introduce a new unordered/element-atomic version of each of the memory intrinsics.
  Ex: @llvm.memcpy_element_atomic — work was already started to introduce this one in D27133, but could be backed out and restarted.
  Intrinsic prototype: @llvm.memcpy_element_atomic.<overload desc>(<ty>* dest, <ty>* src, <ty> len, i32 align, i2 isunordered, i16 element_size)
    Semantics:
       * Will do a memcpy of len bytes from src to dest.
       * len must = k * lcm( #bytes in dest type, #bytes in src type), for some non-negative integer k [note: lcm = least-common multiple]
       * load/store size given by the constant power-of-2 parameter “element_size”; expected to be the lcm(sizeof(dest_ty), sizeof(src_ty))
       * isunordered param: bit 0 == 1 => stores to dest must be marked ‘unordered’; bit 1 == 1 => loads from src must be marked ‘unordered'
       * expanded memcpy code has this sort of form (ignoring alignment for simplicity):
                template<typename len_ty, uint16_t element_size>
                void memcpy_element_atomic(char *dest, char *src, len_ty len) {
                    copy_ty = <int type with sizeof(type) == element_size>;
                    len_ty num_iters = len / sizeof(copy_ty);
                    for (len_ty i=0; i < num_iters; i++)
                      *(copy_ty*)(dest + i*sizeof(copy_ty)) = *(copy_ty*)(src + i*sizeof(copy_ty));
                    // Note: restriction on len means we don’t need a residual loop.
                }

   We would inject the new intrinsic into the MemIntrinsic introspection hierarchy (IR/InstrinsicInst.h).

     Pros:
       * New intrinsic doesn’t interfere with existing memory intrinsics code.
       * Adding new intrinsic to MemIntrinsic hierarchy means that many LLVM passes will handle the new intrinsic “for free” (though, not *really* free; some work would need to be done)

     Cons:
       * Passes that check intrinsic ID against Intrinsics::memcpy, etc would need to be updated to handle the new intrinsic ID.
       * When new code is introduced by others to exploit/handle memcpy, then they may not remember to handle the memcpy_element_atomic version.

Option 2)
  Expand the current/existing memory intrinsics to identify the unordered constraint, if one exists, in much the same way that volatility is expressed — i.e. add an ‘isunordered’ parameter(s) to the intrinsic.
  This option has the same semantics as option 1; the only difference is, literally, that we expand the existing memcpy/memset/memmove intrinsics to have an ‘isunordered’ parameter and an ‘element_size’ parameter, so the prototype becomes something like:
   @llvm.memcpy.<overload desc>(<ty>* dest, <ty>* src, <ty> len, i32 align, i1 isvolatile, i2 isunordered, i16 element_size)

 Pros:
   * Minimal extra work to handle the new version in existing passes — only need to change passes that create calls to memory intrinsics, expand memory intrinsics, or that need to care about unordered (which none should that are reasoning about memory intrinsic semantics).
   * New code that’s introduced by others to exploit/handle memory intrinsics should just handle unordered for free — unordered being a part of the memory intrinsic means it’s more likely that the person will realize that they have to think about it (i.e. it raises the profile of unordered memory intrinsics).

 Cons:
   * Breaks backward compatibility of the IR — is there a mechanism for migrating old IR into a new IR version when loading the IR into LLVM?
   * Memory intrinsic prototypes changing could break some 3rd party consumers of LLVM-IR until they update their parsers.
   * Some risk that existing passes (LLVM or 3rd party) don’t handle the new attribute of memory intrinsic properly — introducing bugs. Though, I suspect that this risk is small/tiny as a functional issue would only materialize if they’re materializing unordered-atomic load/stores (which only Java does, I think).

**Note on generalizing to other orderings**

 We had some internal discussion on generalizing this to the other LLVM atomic orderings: monotonic, acquire, release, acquire_release, and sequentially_consistent.

 We don’t think that it’s worth generalizing the memory intrinsics to general memory atomic-ordering constraints. The only benefit that we can see is that doing so, and teaching loop idiom about them, would allow passes to reason about memcpy/memset/memmove semantics via the intrinsic instead of inferring info from the loop behaviour — arguably, a big benefit. However, there are complications in doing so:

 i) The atomic ordering constraints are memory barriers — this would mean that all passes that have to know about atomic-load/store vs load/store would also have to be taught about atomic memory intrinsics. i.e. The memory intrinsic itself would become a memory barrier, the type of which would be a union of the src & dest atomic-orderings.

 ii) Rematerialzing the loop from the memory intrinsic will be tricky and have a lot of special cases. The rematerialized loop would have to exactly match the loop that the memory intrinsic was created from due to the requirements of the memory ordering barriers. We would probably have to have a separate atomic-ordering for the src & dest parameters of the memory intrinsic to be able to faithfully reproduce the original loop semantics.

 iii) The loop would *always* have to be rematerialized from the intrinsic in codegen — if the intrinsic ever becomes a lib call, then there’s a bug. No different from the unordered case that we want to handle, really, but it’s worth pointing out since this bug would show as a data race and would be a beast to debug.

 iv) I doubt that any real-world code will *ever* exist that would become one of these atomic-ordered memory intrinsics, so it’s probably not worth the trouble.

Thanks for reading! Thoughts, questions, and comments are welcome.

-Daniel

---
Daniel Neilson, Ph.D.
Azul Systems

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170508/4f6708eb/attachment.html>