Greetings all,
<div class=""> I am picking up the work that was started in <a href="https://reviews.llvm.org/D27133" class="">
https://reviews.llvm.org/D27133</a> — adding support for an element-atomic memcpy/memset/memmove to LLVM. I would appreciate suggestions/thoughts/advice/comments on how to best proceed with this work in a way that will be acceptable to the LLVM group.</div>
<div class=""><br class="">
</div>
<div class=""> I apologize in advance; this is going to be a long one...</div>
<div class=""><br class="">
</div>
<div class=""><b class=""><u class="">**Background**</u></b></div>
<div class=""><b class=""><u class=""><br class="">
</u></b></div>
<div class=""> Loads/stores of shared data in Java must be done in an atomic-like manner — it must essentially look like the entire variable is loaded/stored in a single operation. So, no doing stuff like splitting the load of a uint32 into four loads of uint8’s,
or two loads of uint16’s; such a splitting could result in a data-race where each of the split loads can load a different version of the variable’s value (due to the variable being stored-to in other threads of execution). In LLVM, the atomic-like behaviour
is achieved by emitting an atomic load/store with the ‘unordered’ ordering constraint.</div>
<div class=""><br class="">
</div>
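 For reference, an unordered atomic access of a 32-bit value looks like this in LLVM IR (the pointer names are just placeholders):

   ; the i32 is read/written as a single unit, with no ordering guarantee beyond that
   %val = load atomic i32, i32* %src unordered, align 4
   store atomic i32 %val, i32* %dest unordered, align 4
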
<div class=""> In LLVM there is a loop idiom recognition pass that will convert loops that look like memcpy/memmove/memset into a direct call to the corresponding intrinsic — which allows the rest of the LLVM passes to reason about the operation better, and
for codegen to materialize probably-better optimized versions of the loop (ex: one that uses wider loads/stores). But, as one might expect, this idiom recognition doesn’t touch atomic operations — i.e. if a loop that’s doing a memcpy is doing so with unordered
loads & stores then loop idiom should not convert that into a memcpy. Reason being that memcpy/memset/memmove implementations are allowed to operate on the byte-level when doing the op, and that would break our requirement that our data be loaded/stored in
an atomic-like manner if we’re, say, copying arrays of uint32’s.</div>
<div class=""><br class="">
</div>
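 A rough, hand-written sketch (not output from any particular frontend) of the kind of loop in question:

   define void @copy_i32_unordered(i32* %dest, i32* %src, i64 %len) {
   entry:
     %is.empty = icmp eq i64 %len, 0
     br i1 %is.empty, label %exit, label %loop

   loop:
     %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
     %s = getelementptr inbounds i32, i32* %src, i64 %i
     %d = getelementptr inbounds i32, i32* %dest, i64 %i
     ; each element is copied with a single unordered-atomic load/store pair
     %v = load atomic i32, i32* %s unordered, align 4
     store atomic i32 %v, i32* %d unordered, align 4
     %i.next = add nuw i64 %i, 1
     %done = icmp eq i64 %i.next, %len
     br i1 %done, label %exit, label %loop

   exit:
     ret void
   }
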
<div class=""><b class=""><u class="">**Goal**</u></b></div>
<div class=""><b class=""><u class=""><br class="">
</u></b></div>
<div class=""> We would like for loop idiom to be able to convert unordered load/store loops into memory intrinsics, but doing so requires that we have some way of expressing that the resulting memory intrinsic be *unordered* and what the element-size is. We
would also like all/most of the existing LLVM passes that understand the semantics of the memory intrinsics to understand the semantics of these ‘unordered’ versions.</div>
<div class=""><br class="">
</div>
<div class=""><u class=""><b class="">**Method**</b></u></div>
<div class=""><u class=""><b class=""><br class="">
</b></u></div>
<div class=""> Clearly we are going to have to teach LLVM about unordered memory intrinsics. There are, as I can see it, a couple of ways to accomplish this. I’d like your opinions on which are preferable, or if you can think of any other options. In no particular
order…</div>
<div class=""><br class="">
</div>
<div class=""><i class="">Option 1)</i></div>
<div class=""> Introduce a new unordered/element-atomic version of each of the memory intrinsics.</div>
<div class=""> Ex: @llvm.memcpy_element_atomic — work was already started to introduce this one in D27133, but could be backed out and restarted.</div>
<div class=""> Intrinsic prototype: @llvm.memcpy_element_atomic.<overload desc>(<ty>* dest, <ty>* src, <ty> len, i32 align, i2 isunordered, i16 element_size)</div>
<div class=""> Semantics: </div>
<div class=""> * Will do a memcpy of len bytes from src to dest. </div>
<div class=""> * len must = k * lcm( #bytes in dest type, #bytes in src type), for some non-negative integer k [note: lcm = least-common multiple]</div>
<div class=""> * load/store size given by the constant power-of-2 parameter “element_size”; expected to be the lcm(sizeof(dest_ty), sizeof(src_ty))</div>
<div class=""> * isunordered param: bit 0 == 1 => stores to dest must be marked ‘unordered’; bit 1 == 1 => loads from src must be marked ‘unordered'</div>
<div class=""> * expanded memcpy code has this sort of form (ignoring alignment for simplicity):</div>
<div class=""> template<typename len_ty, uint16_t element_size></div>
<div class=""> void memcpy_element_atomic(char *dest, char *src, len_ty len) {</div>
<div class=""> copy_ty = <int type with sizeof(type) == element_size>;</div>
<div class=""> len_ty num_iters = len / sizeof(copy_ty);</div>
<div class=""> for (len_ty i=0; i < num_iters; i++)</div>
<div class=""> *(copy_ty*)(dest + i*sizeof(copy_ty)) = *(copy_ty*)(src + i*sizeof(copy_ty));</div>
<div class=""> // Note: restriction on len means we don’t need a residual loop.</div>
<div class=""> }</div>
<div class=""><br class="">
</div>
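 To make that concrete, a call emitted by loop idiom under Option 1 might look something like the following (hypothetical name mangling and argument values, since the intrinsic doesn't exist yet):

   ; hypothetical: copy 1024 bytes as 4-byte elements, with both the loads from %src
   ; and the stores to %dest marked 'unordered' (isunordered = 3, i.e. both bits set)
   call void @llvm.memcpy_element_atomic.p0i8.p0i8.i64(i8* %dest, i8* %src,
                                                       i64 1024, i32 4, i2 3, i16 4)
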
<div class=""> We would inject the new intrinsic into the MemIntrinsic introspection hierarchy (IR/InstrinsicInst.h).</div>
<div class=""><br class="">
</div>
<div class=""> Pros:</div>
<div class=""> * New intrinsic doesn’t interfere with existing memory intrinsics code.</div>
<div class=""> * Adding new intrinsic to MemIntrinsic hierarchy means that many LLVM passes will handle the new intrinsic “for free” (though, not *really* free; some work would need to be done)</div>
<div class=""><br class="">
</div>
<div class=""> Cons:</div>
<div class=""> * Passes that check intrinsic ID against Intrinsics::memcpy, etc would need to be updated to handle the new intrinsic ID.</div>
<div class=""> * When new code is introduced by others to exploit/handle memcpy, then they may not remember to handle the memcpy_element_atomic version.</div>
<div class=""><br class="">
</div>
<div class="">Option 2)</div>
<div class=""> Expand the current/existing memory intrinsics to identify the unordered constraint, if one exists, in much the same way that volatility is expressed — i.e. add an ‘isunordered’ parameter(s) to the intrinsic.</div>
<div class=""> This option has the same semantics as option 1; the only difference is, literally, that we expand the existing memcpy/memset/memmove intrinsics to have an ‘isunordered’ parameter and an ‘element_size’ parameter, so the prototype becomes something
like:</div>
<div class=""> @llvm.memcpy.<overload desc>(<ty>* dest, <ty>* src, <ty> len, i32 align, i1 isvolatile, i2 isunordered, i16 element_size)</div>
<div class=""><br class="">
</div>
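 For comparison, a sketch of a call today next to the extended form (hypothetical argument values):

   ; current form:
   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dest, i8* %src, i64 1024, i32 4, i1 false)

   ; extended form under Option 2: the same call with the isunordered and
   ; element_size parameters appended
   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dest, i8* %src, i64 1024, i32 4,
                                        i1 false, i2 3, i16 4)
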
<div class=""> Pros:</div>
<div class=""> * Minimal extra work to handle the new version in existing passes — only need to change passes that create calls to memory intrinsics, expand memory intrinsics, or that need to care about unordered (which none should that are reasoning about
memory intrinsic semantics).</div>
<div class=""> * New code that’s introduced by others to exploit/handle memory intrinsics should just handle unordered for free — unordered being a part of the memory intrinsic means it’s more likely that the person will realize that they have to think about
it (i.e. it raises the profile of unordered memory intrinsics).</div>
<div class=""><br class="">
</div>
<div class=""> Cons:</div>
<div class=""> * Breaks backward compatibility of the IR — is there a mechanism for migrating old IR into a new IR version when loading the IR into LLVM?</div>
<div class=""> * Memory intrinsic prototypes changing could break some 3rd party consumers of LLVM-IR until they update their parsers.</div>
<div class=""> * Some risk that existing passes (LLVM or 3rd party) don’t handle the new attribute of memory intrinsic properly — introducing bugs. Though, I suspect that this risk is small/tiny as a functional issue would only materialize if they’re materializing
unordered-atomic load/stores (which only Java does, I think).</div>
<div class=""><br class="">
</div>
<div class=""><b class=""><u class="">**Note on generalizing to other orderings**</u></b></div>
<div class=""><b class=""><u class=""><br class="">
</u></b></div>
<div class=""> We had some internal discussion on generalizing this to the other LLVM atomic orderings: monotonic, acquire, release, acquire_release, and sequentially_consistent. </div>
<div class=""><br class="">
</div>
<div class=""> We don’t think that it’s worth generalizing the memory intrinsics to general memory atomic-ordering constraints. The only benefit that we can see is that doing so, and teaching loop idiom about them, would allow passes to reason about memcpy/memset/memmove
semantics via the intrinsic instead of inferring info from the loop behaviour — arguably, a big benefit. However, there are complications in doing so:</div>
<div class=""><br class="">
</div>
<div class=""> i) The atomic ordering constraints are memory barriers — this would mean that all passes that have to know about atomic-load/store vs load/store would also have to be taught about atomic memory intrinsics. i.e. The memory intrinsic itself would
become a memory barrier, the type of which would be a union of the src & dest atomic-orderings.</div>
<div class=""><br class="">
</div>
<div class=""> ii) Rematerialzing the loop from the memory intrinsic will be tricky and have a lot of special cases. The rematerialized loop would have to exactly match the loop that the memory intrinsic was created from due to the requirements of the memory
ordering barriers. We would probably have to have a separate atomic-ordering for the src & dest parameters of the memory intrinsic to be able to faithfully reproduce the original loop semantics.</div>
<div class=""><br class="">
</div>
<div class=""> iii) The loop would *always* have to be rematerialized from the intrinsic in codegen — if the intrinsic ever becomes a lib call, then there’s a bug. No different from the unordered case that we want to handle, really, but it’s worth pointing
out since this bug would show as a data race and would be a beast to debug.</div>
<div class=""><br class="">
</div>
<div class=""> iv) I doubt that any real-world code will *ever* exist that would become one of these atomic-ordered memory intrinsics, so it’s probably not worth the trouble.</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class="">Thanks for reading! Thoughts, questions, and comments are welcome.</div>
<div class=""><br class="">
</div>
<div class="">-Daniel</div>
<div class=""><br class="">
</div>
<div class="">---</div>
<div class="">
<div style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;">
Daniel Neilson, Ph.D.</div>
<div style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;">
Azul Systems</div>
<div style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">
<br class="">
</div>
<br class="Apple-interchange-newline">
</div>
<br class="">
</body>
</html>