[PATCH] Statepoint infrastructure for garbage collection

Tue Oct 21 14:26:36 PDT 2014

On 10/16/2014 02:51 PM, Philip Reames wrote:
>
> On 10/15/2014 02:52 PM, Philip Reames wrote:
>> Kevin,
>>
>> Let me try to answer the point you're getting at.  In doing so, I 
>> want to explicitly separate the statepoint intrinsics which are 
>> currently up for review, and the future late safepoint placement. The 
>> statepoint intrinsics have value separate from the late safepoint 
>> placement approach, and I want to justify them on their own merits.
>>
>> The basic problem we're trying to solve with these intrinsics is 
>> supporting fully relocating collectors.  By definition, such a 
>> collector needs to be precise w.r.t. root tracking.  Even worse, we 
>> need to ensure that *all copies* of a pointer are updated. It is not 
>> acceptable to make two copies of a pointer, update one of them, then 
>> use the other for a memory access.
>>
>> If the compiler is allowed to introduce derived pointers (i.e. 
>> pointer valued temporaries created by the compiler which point 
>> somewhere within an object, or outside it, but associated with it), 
>> we also need to track which *object* each *pointer* to be updated is 
>> associated with.  This is required to safely update the pointers.
>>
>> For the sake of argument, let's say our frontend does safepoint 
>> insertion.
>>
>> There's a couple of approaches which seem like they might work, let's 
>> explore each in turn:
>> - We could use patchpoints to record all the values needed for the GC 
>> stack map.  This mostly works, but requires that the patchpoint not 
>> be marked readonly or readnone (to prevent illegal reorderings).  
>> That could be a usage convention.  The real problem is that the 
>> compiler is still free to introduce multiple *copies* of an SSA value 
>> over the patchpoint.  (This is completely legal under SSA 
>> semantics.)  When it does so, it creates a situation where the gc 
>> could fail to update a pointer which will then be dereferenced. 
>> That's a bug.  Worth stating explicitly, I believe the patchpoint 
>> scheme would be sufficient *if you do not ever relocate a root*.
>> - We could use the gc.root.  gc.root defines the allocs, but does not 
>> define the call format, or any of the mechanisms to ensure proper 
>> relocation.  As such, it *by itself* is not viable.  Also, gc.root 
>> inherently assumes every value will have a stack slot. Without 
>> *heavy* reengineering, there's no way to have a gc pointer in a 
>> callee saved register over a call site. This is an unfortunate 
>> limitation.  Any call representation without explicit relocation 
>> suffers from the same bug as the patchpoint scheme.
>> - We could combine gc.root allocas and patchpoints.  This essentially 
>> combines the flaws (no gc pointers in callee saved registers over 
>> calls, and missed copies), with no benefit.
>>
>> The statepoint intrinsics are basically the patchpoint option above, 
>> but with relocation made explicit in the IR.  While it's still legal 
>> for the optimizer to create a copy of the value feeding a statepoint, 
>> that's now okay.  By construction, there can be no use of the 
>> original SSA value (and thus the copy) after the statepoint. Instead, 
>> the explicitly relocated value is used.
>>
>> To summarize: We need (something like) statepoints for correctness of 
>> fully relocating collectors.
>>
>> (The points I'm making here are somewhat subtle.  If it would help to 
>> have IR examples here, ask.  I'm deferring writing them because it's 
>> time consuming.)
> I need to withdraw this part of my comments.  After further reflection 
> and discussion offline, I was reminded that you can implement full 
> relocation semantics with gcroot.  The parts about patchpoints stand, 
> but the gcroot comments are inaccurate.
>
> I need to leave early today, but I plan to respond tomorrow with a 
> more complete analysis of the tradeoffs between gcroots and 
> statepoints.  Sorry for the confusion.
Ok, let's take a second try at explaining the differences between 
statepoints and gc.roots.  I managed to get myself confused last time 
and made a couple of statements which were inaccurate.  As a reminder, 
this is not talking about late safepoint placement at all.  LSP can work 
with either mechanism.

From a functional correctness standpoint, gc.root and statepoint are 
equivalent.  They can both support relocating collectors, including 
those which relocate roots.  To prevent future confusion, let me review 
how each works.

gc.root uses explicit spill slots in the IR in the form of allocas. Each 
alloca escapes (through the gcroot call itself); as a result, the 
compiler must assume that any readwrite call can both consume and update 
the values in question.  Additionally, the fact that all calls are 
readwrite prevents reordering of unrelated loads past the call.  gcroot 
relies on the fact that no SSA value relocated at a call site is used at 
a site reachable from the call.  Instead, a new SSA value (whose 
relation to the original is unknown by the compiler) is introduced by 
loading from the (potentially clobbered) alloca.  gcroot creates a 
single stack map table for the entire function.  It is the compiled 
code's responsibility to ensure that all values in the allocas are 
either valid live pointers or null.
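
To make the gc.root discipline above concrete, here is a minimal sketch 
in Python.  It is a toy model, not LLVM code: Frame, relocating_call, 
the forwarding table, and the addresses are all hypothetical names 
invented for illustration.  It only shows the store / clobbering call / 
reload pattern and the single per-function set of root slots.

```python
# Toy model of the gc.root discipline. Nothing here is LLVM API;
# Frame, relocating_call, and the slot layout are hypothetical.

class Frame:
    def __init__(self, num_roots):
        # One slot per gcroot alloca. The single per-function stack map
        # is simply "all of these slots".
        self.slots = [None] * num_roots


def relocating_call(frame, forwarding):
    """Stand-in for a readwrite call at which the collector may run.
    It rewrites every non-null root slot in place."""
    for i, ptr in enumerate(frame.slots):
        if ptr is not None:
            frame.slots[i] = forwarding.get(ptr, ptr)


def compiled_function(forwarding):
    frame = Frame(num_roots=1)
    obj = 0x1000                 # SSA value before the call
    frame.slots[0] = obj         # store to the alloca (spill)
    relocating_call(frame, forwarding)
    obj_after = frame.slots[0]   # reload: a fresh value, unrelated to obj
    return obj, obj_after


old, new = compiled_function({0x1000: 0x2000})
```

The point of the sketch is that obj is never used after the call; only 
the reloaded obj_after is, which is what makes relocation safe.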

Statepoints use most of the same techniques.  We rely on not having an 
SSA value used on both sides of a call, but we manage the relocation via 
explicit IR relocation operations, not loads and stores.  We require the 
call to be read/write to prevent reordering of unrelated loads.  Since 
the spill slots are not visible in the IR, we do not need the reasoning 
about escapes that gc.root does.
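
The same toy example, written in the statepoint style, might look like 
the sketch below.  Again this is Python, not IR, and statepoint() and 
relocate() are hypothetical stand-ins that do not match the real 
intrinsic signatures.  The relocated value is produced by an explicit 
operation on the call's result rather than by a load from a slot.

```python
# Toy model of explicit relocation. statepoint() and relocate() are
# hypothetical stand-ins, not the real intrinsic signatures.

def statepoint(live_ptrs, forwarding):
    """Model a call site where the collector may run. Returns a token
    holding the post-collection value of every live pointer."""
    return {ptr: forwarding.get(ptr, ptr) for ptr in live_ptrs}


def relocate(token, ptr):
    """Model the relocation operation: yield the new value for ptr."""
    return token[ptr]


def compiled_function(forwarding):
    obj = 0x1000
    token = statepoint([obj], forwarding)
    obj_relocated = relocate(token, obj)
    # From here on only obj_relocated is used; obj is dead.
    return obj_relocated


result = compiled_function({0x1000: 0x2000})
```

No memory is visible in this version, so there is no escape analysis to 
do; the only rule is that obj has no uses after the statepoint.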

To explicitly state this again since I screwed this up once before, both 
statepoints and gc.roots can correctly represent relocation semantics in 
the IR.  In fact, the underlying reasoning about their correctness is 
rather similar.

They do differ fairly substantially in the details though.  Let's 
consider a few examples.

*SSA vs Memory* - gcroot encodes relocations as memory operations 
(stores, clobbering calls, loads) where statepoint uses first class SSA 
values.  We believe this makes optimizations more straightforward.

Consider a simple optimization for null pointer relocation.  If the 
optimizer manages to establish that one of the values being relocated is 
null, propagating this across a statepoint is straightforward. (For each 
gc.relocate, if source is null, replaceAllUsesWith null.) Implementing 
this same optimization for gc.root is harder since the store and load 
may have been reordered from immediately around the call.  This isn't an 
unsolvable problem by any means, but it would be a GVN change, not an 
InstCombine one.  In practice, we believe InstCombine style 
optimizations to be advantageous since they're simpler to write and 
debug.  Arguably, they're also more powerful given the current pipeline 
since they have multiple opportunities to trigger.
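
The null forwarding rule described above can be sketched as a small 
peephole over a toy instruction list.  This is Python with invented 
classes (Value, Relocate, Use), not the actual InstCombine code; it only 
illustrates the "if source is null, replace all uses with null" step.

```python
# Toy IR: every class here is hypothetical and exists only for this sketch.

class Value:
    def __init__(self, name):
        self.name = name


class Relocate(Value):
    """Stand-in for the result of one relocation operation."""
    def __init__(self, name, source):
        super().__init__(name)
        self.source = source


class Use(Value):
    """Any instruction with a single operand."""
    def __init__(self, name, operand):
        super().__init__(name)
        self.operand = operand


NULL = Value("null")


def forward_null_relocations(instructions):
    """For each relocate whose source is null, rewrite all of its users
    to use null directly. Returns the number of relocates folded."""
    folded = 0
    for inst in instructions:
        if isinstance(inst, Relocate) and inst.source is NULL:
            for user in instructions:
                if isinstance(user, Use) and user.operand is inst:
                    user.operand = NULL
            folded += 1
    return folded


p = Value("p")
r_null = Relocate("r0", NULL)
r_live = Relocate("r1", p)
u0 = Use("u0", r_null)
u1 = Use("u1", r_live)
count = forward_null_relocations([r_null, r_live, u0, u1])
```

The check is purely local to each relocate, which is why it fits an 
InstCombine style pass rather than needing memory reasoning.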

*Derived Pointers* - gcroot can represent derived pointers, but only 
via convention.  There is no convention specified, so it's up to the 
frontend to create its own.  Statepoints define a convention 
(explicitly in the relocation operation) which makes describing 
optimizations straightforward.

One thing we plan to do with the statepoint representation is to 
implement an "easily derived pointer" optimization (to run near 
CodeGenPrep).  On X86, it's far cheaper to recreate a GEP base + 5 
derived pointer than to relocate it.  Recognizing this case is quite 
straightforward given the statepoint representation.
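
As a rough sketch of the idea (Python, toy model; relocate_base, 
relocate_derived, split_relocations, and the known_offsets table are all 
hypothetical names, and the real pass has not been written yet), a 
derived pointer with a known constant offset can be recomputed from the 
relocated base instead of being relocated itself:

```python
def relocate_base(base, forwarding):
    """New address of the object, or the old one if it did not move."""
    return forwarding.get(base, base)


def relocate_derived(base, derived, forwarding):
    """General case: the collector needs the (base, derived) pair so it
    can preserve the offset into the moved object."""
    return relocate_base(base, forwarding) + (derived - base)


def split_relocations(pairs, known_offsets):
    """Partition (base, derived) pairs into those that must be relocated
    and those cheap enough to recompute from the relocated base.
    known_offsets maps a derived pointer to a constant offset when the
    compiler can prove derived == base + offset."""
    must_relocate, recompute = [], []
    for base, derived in pairs:
        if derived in known_offsets:
            recompute.append((base, known_offsets[derived]))
        else:
            must_relocate.append((base, derived))
    return must_relocate, recompute


forwarding = {0x1000: 0x2000}
base, derived = 0x1000, 0x1005

via_relocation = relocate_derived(base, derived, forwarding)
must, cheap = split_relocations([(base, derived)], {derived: 5})
via_recompute = relocate_base(cheap[0][0], forwarding) + cheap[0][1]
```

Both paths give the same address; the second one simply removes the 
derived pointer from the set of values that has to be tracked across the 
call.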

A frontend could implement a similar optimization for gcroot at IR 
generation time.  You could also implement such an optimization over the 
load/call/store representation, but the implementation would be much 
more complex (analogous to the null optimization above).

To be fair, gc.root may need such an optimization less.  Since 
call-safepoints are inserted early, CSE has not yet run.  As a result, 
there may be fewer "easily derived pointers" live across a call.

*Format* - Statepoints use a standard format.  gc.root supports custom 
formats.  Either could be extended to support the other without much 
difficulty.

The more material difference between the two is that gc.root generates a 
single stack map for the entire function while statepoints generate a 
unique stack map per call site.  Having a single stack map imposes a 
slight penalty on code compiled with gc.root since dead values must 
explicitly be removed from the alloca (by a write of null).  In the 
wrong situation (say a tight loop with two calls), this could be material.
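
A small sketch of the difference, in Python with invented helper names 
(roots_single_map, roots_per_call_site) and made-up addresses.  It only 
models which values are reported as roots at each call, not the actual 
table formats.

```python
def roots_single_map(slots):
    """gc.root style: one map for the function. Every non-null slot is
    reported as a root at every call site."""
    return [s for s in slots if s is not None]


def roots_per_call_site(stack_maps, call_site, values):
    """statepoint style: each call site lists only what is live there."""
    return [values[name] for name in stack_maps[call_site]]


values = {"a": 0x1000, "b": 0x2000}

# Suppose 'a' is dead before the second call.
slots = [values["a"], values["b"]]
at_call_1 = roots_single_map(slots)
stale = roots_single_map(slots)   # without a null write, 'a' is still a root
slots[0] = None                   # the explicit write of null
at_call_2 = roots_single_map(slots)

stack_maps = {"call1": ["a", "b"], "call2": ["b"]}
sp_call_1 = roots_per_call_site(stack_maps, "call1", values)
sp_call_2 = roots_per_call_site(stack_maps, "call2", values)
```

In the single map version the extra store is the cost; in the per call 
site version the cost moves into the size of the emitted tables.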

*Lowering* - Currently, both gc.root and statepoint lower to stack 
slots.  gc.root does this at the IR level; statepoints do so in 
SelectionDAG.

The design of statepoints is intended to allow pushing the explicit 
relocations back through the backend.  The reason this is desirable is 
that pointers can be left in callee saved registers over call sites.  
Without substantial re-engineering, such a thing is not possible for 
gc.root.  The importance of this from a performance perspective is 
debatable.  It is my belief that the key benefit would be in a) reducing 
frame sizes (by not requiring spill slots), and b) avoiding spills 
around calls.

An advantage of gc.root is that the backend can remain largely ignorant 
of the gc.root mechanism.  By the point the backend encounters them, a 
gc.root is just another alloca.  One potential problem with the current 
implementation is that the escape is lost when lowering; the gcroot call 
is lowered to an entry into a side table and the alloca no longer 
escapes.  This is a source of possible bugs, but is also a 
straightforward fix.

As to the lowering currently implemented, it's debatable which is 
better.  The statepoint lowering optimizes constants and unifies values 
based on SDValue.  As a result, two IR level values of different types 
(with the same bit pattern) can end up sharing the same stack slot.  
However, it suffers when trying to assign stack slots.  We currently use 
heuristics, but you can 
end up with ugly shuffling of values around on the stack across basic 
blocks.  (There's a number of ways to improve that, but it's not yet 
implemented.)  gc.root doesn't suffer from this problem since stack 
slots are assigned by the frontend.

Since the stack spills and reloads are visible at the IR layer, gcroot 
gets the full ability of the optimizer to remove redundant reloads.  
Statepoints only get to leverage the pieces in the backend.  In theory, 
this could result in materially worse spill/reload code for 
statepoints.  In practice, this appears not to matter much provided the 
same value is assigned to the same slot across both calls, but I don't 
actually have much data here to say anything conclusively yet.

I haven't tried to measure frame size for gc.root vs statepoints.  I 
suspect that statepoints may come out slightly ahead, but I doubt this 
is material.  There are also cases (see "easily derived pointers" 
above), where gc.root may come out ahead.

*IR Level Optimization* - Both gc.root and statepoints cripple 
optimization (by design!).  gcroot works better with inlining today, but 
statepoints could be easily enhanced to handle this case.  (The same 
work would benefit symbolic patchpoints.)

It is my belief that statepoints are easier to optimize (i.e. teach to 
LICM), but this is purely my guess with no real evidence.  Both suffer 
from the fact that calls must be marked readwrite.  Not having to reason 
about memory seems easier, but I'm open to other arguments here.

*Community Support & Compatibility*
From a practical perspective, statepoints have active users behind 
them.  We are interested in continuing to enhance and optimize them in 
the public tree.  The same support does not seem to exist for gcroot.

The implementation of statepoints is largely aligned with that of 
patchpoints.  The implementation of gcroot is completely separate and 
poorly understood by the majority of the community.

It wouldn't be hard to write a translation pass from gcroot to 
statepoints or from statepoints to gcroot.  If folks are concerned about 
compatibility, this would be a reasonable option.  The largest challenge 
to transparently replacing one with the other is in generating the right 
output format.

*Summary*
To summarize, gcroot and statepoints are functionally equivalent (modulo 
possible bugs).  In their current form, the two are largely comparable 
with each having some benefits.  Long term, we believe a statepoint 
representation will allow better code generation and IR level 
optimization of code with safepoints inserted.  We believe statepoints 
to be easier to optimize both at the IR level and in the backend.

Again, the late safepoint proposal is independent and could be done with 
either representation.  It's currently implemented on statepoints, but 
it could be extended to gcroot without too much work.
>>
>>
>> Other advantages of the statepoint approach:
>>
>> The gc.relocate intrinsics (part of the statepoint proposal) also 
>> make it explicit in the IR what the base object of each pointer to 
>> be relocated is.  This isn't *required* (you could encode the same 
>> information in the arguments of the statepoint), but making it 
>> explicit is much cleaner.
>>
>> The explicit relocation notation has the potential to be extended 
>> into the backend.  With some register allocator changes (not part of 
>> this patch!), we could support gc pointers in callee saved 
>> registers.  This is possible with the (incorrect) patchpoint scheme.  
>> It is possible, but *hard*, with the gc.root scheme.
>>
>> The posted patch includes a couple of small optimizations (i.e. null 
>> forwarding) that help performance, but could (probably) be 
>> implemented on top of another scheme.  We have a number of planned 
>> optimizations on the statepoint mechanism.
>>
>>
>> Now, let me finally bring up late safepoint placement. The only real 
>> impact on this patch is that, to date, we have only focused on the 
>> *correctness* of a statepoint passing through the optimizer.  We have 
>> not attempted to teach the optimizer about how to leverage one or 
>> perform optimizations over one.  There's room for improvement here 
>> (i.e. not completely blocking inlining), but we prefer to approach 
>> this problem by simply inserting them late.   You could instead 
>> choose to insert them at generation time, and teach the optimizer 
>> about their semantics.  That *strategy choice* is independent of the 
>> representation chosen provided that representation is *correct*.
>>
>> Yours,
>> Philip
>>
>> On 10/14/2014 07:01 PM, Kevin Modzelewski wrote:
>>> I think a change like this might be more compelling if you could 
>>> give more detail on how it would actually help (I can't find the 
>>> detail I'm looking for in your blog posts).  It seems like the value 
>>> of this patch is that it will work with late safepoint placement, 
>>> but it'd be nice to see some examples of cases where late safepoint 
>>> placement gives you something that early safepoint placement (ie by 
>>> the frontend) doesn't.  It kind of feels like either approach will 
>>> work well with only non-gc values, and neither approach will be able 
>>> to do much optimization when you do function calls. I'm not trying 
>>> to claim that that's necessarily true, but it'd be easier to 
>>> understand your point if there was some example IR.
>>>
>>> http://reviews.llvm.org/D5683
>>>
>>>
>>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
