<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">


<div class="">TLDR: a proposal to add GC-parseable lowering to element atomic </div>


<div class="">memcpy/memmove instrinsics controlled by a new "requires-statepoint” </div>


<div class="">call attribute. </div>


<div class=""><br class="">


</div>


<div class="">Currently llvm.{memcpy|memmove}.element.unordered.atomic calls are </div>


<div class="">considered as GC leaf functions (like most other intrinsics). As a </div>


<div class="">result GC cannot occur while copy operation is in progress. This might</div>


<div class="">have negative effect on GC latencies when large amounts of data are </div>


<div class="">copied. To avoid this problem copying large amounts of data can be </div>


<div class="">done in chunks with GC safepoints in between. We'd like to be able to </div>


<div class="">represent such copy using existing instrinsics [1].</div>


<div class=""><br class="">


</div>


<div class="">For that I'd like to propose a new attribute for </div>


<div class="">llvm.{memcpy|memmove}.element.unordered.atomic calls </div>


<div class="">"requires-statepoint". This attribute on a call will result in a </div>


<div class="">different lowering, which makes it possible to have a GC safepoint </div>


<div class="">during the copy operation.</div>


<div class=""><br class="">


</div>


<div class="">There are three parts to the new lowering:</div>


<div class=""><br class="">


</div>


<div class="">1) The calls with the new attribute will be wrapped into a statepoint </div>


<div class="">by RewriteStatepointsForGC (RS4GC). This way the stack at the calls </div>


<div class="">will be GC parceable. </div>


<div class=""><br class="">


</div>


<div class="">2) Currently these intrinsics are lowered to GC leaf calls to the symbols</div>


<div class="">__llvm_{memcpy|memmove}_element_unordered_atomic_<element_size>. </div>


<div class="">The calls with the new attribute will be lowered to calls to different </div>


<div class="">symbols, let's say</div>


<div class="">__llvm_{memcpy|memmove}_element_unordered_atomic_safepoint_<element_size>.</div>


<div class="">This way the runtime can provide copy implementations with safepoints.</div>


<div class=""><br class="">


</div>


<div class="">3) Currently memcpy/memmove calls take derived pointers as arguments. </div>


<div class="">If we copy with safepoints we might need to relocate the underlying </div>


<div class="">source/destination objects on a safepoint. In order to do this we need </div>


<div class="">to know the base pointers as well. How do we make the base pointers </div>


<div class="">available in the copy routine? I suggest we add them explicitly as </div>


<div class="">arguments during lowering. </div>


<div class=""><br class="">


</div>


<div class="">For example: </div>


<div class="">__llvm_memcpy_element_unordered_atomic_safepoint_1(</div>


<div class="">  dest_base, dest_derived, src_base, src_derived, length)</div>


<div class=""><br class="">


</div>


<div class="">It will be up to RS4GC to do the new lowering and prepare the arguments.</div>


<div class="">RS4GC knows how to compute base pointers for a given derived pointer.</div>


<div class="">It also already does lowering for deoptimize intrinsics by replacing </div>


<div class="">an intrinsic call with a symbol call. So there is a precedent here.</div>


<div class=""><br class="">


</div>


<div class="">Other alternatives:</div>


<div class="">- Change llvm.{memcpy|memmove}.element.unordered.atomic API to accept </div>


<div class="">  base pointers + offsets instead of derived pointers. This will </div>


<div class="">  require autoupgrade of old representation. Changing API of a generic </div>


<div class="">  intrinsic to facilitate GC-specific lowering doesn't look like the</div>


<div class="">  best idea. This will not work if we want to do the same for non-atomic </div>


<div class="">  intrinsics.</div>


<div class="">- Teach GC infrastructure to record base pointers for all derived </div>


<div class="">  pointer arguments. This looks like an overkill for single use case.</div>


<div class=""><br class="">


</div>


<div class="">Here is the proposed implementation in a single patch:</div>


<div class=""><a href="https://reviews.llvm.org/D87954" class="">https://reviews.llvm.org/D87954</a></div>


<div class="">If there are no objections I will split it into individual reviews and</div>


<div class="">add langref changes. </div>


<div class=""><br class="">


</div>


<div class="">Thoughts?</div>


<div class=""><br class="">


</div>


<div class="">Artur</div>


<div class=""><br class="">


</div>


<div class="">[1] An alternative approach would be to make the frontend generate a </div>


<div class="">chunked copy loop with a safepoint inside. The downsides are: </div>


<div class="">- It's harder for the optimizer to see that this loop is just a copy</div>


<div class="">  of a range of bytes. </div>


<div class="">- It forces one particular lowering with the chunked loop inlined in </div>


<div class="">  compiled code. We can't outline the copy loop into the copy routine. </div>


<div class="">  With the intrinsic representation of a chunked copy we can choose </div>


<div class="">  different lowering strategies if we want. </div>


<div class="">- In our system we have to outline the copy loop into the copy routine</div>


<div class="">  due to interactions with deoptimization.</div>


</body>


</html>