[PATCH] Statepoint infrastructure for garbage collection
Philip Reames
listmail at philipreames.com
Tue Oct 21 14:26:36 PDT 2014
On 10/16/2014 02:51 PM, Philip Reames wrote:
>
> On 10/15/2014 02:52 PM, Philip Reames wrote:
>> Kevin,
>>
>> Let me try to answer the point you're getting at. In doing so, I
>> want to explicitly separate the statepoint intrinsics which are
>> currently up for review, and the future late safepoint placement. The
>> statepoint intrinsics have value separate from the late safepoint
>> placement approach, and I want to justify them on their own merits.
>>
>> The basic problem we're trying to solve with these intrinsics is
>> supporting fully relocating collectors. By definition, such a
>> collector needs to be precise w.r.t. root tracking. Even worse, we
>> need to ensure that *all copies* of a pointer are updated. It is not
>> acceptable to make two copies of a pointer, update one of them, then
>> use the other for a memory access.
>>
>> If the compiler is allowed to introduce derived pointers (i.e.
>> pointer valued temporaries created by the compiler which point
>> somewhere within an object, or outside it, but associated with it),
>> we also need to track which *object* each *pointer* to be updated is
>> associated with. This is required to safely update the pointers.
>>
>> For the sake of argument, let's say our frontend does safepoint
>> insertion.
>>
>> There's a couple of approaches which seem like they might work, let's
>> explore each in turn:
>> - We could use patchpoints to record all the values needed for the GC
>> stack map. This mostly works, but requires that the patchpoint not
>> be marked readonly or readnone (to prevent illegal reorderings).
>> That could be a usage convention. The real problem is that the
>> compiler is still free to introduce multiple *copies* of an SSA value
>> over the patchpoint. (This is completely legal under SSA
>> semantics.) When it does so, it creates a situation where the gc
>> could fail to update a pointer which will then be dereferenced.
>> That's a bug. Worth stating explicitly, I believe the patchpoint
>> scheme would be sufficient *if you do not ever relocate a root*.
>> - We could use gc.root. gc.root defines the allocas, but does not
>> define the call format, or any of the mechanisms to ensure proper
>> relocation. As such, it *by itself* is not viable. Also, gc.root
>> inherently assumes every value will have a stack slot. Without
>> *heavy* reengineering, there's no way to have a gc pointer in a
>> callee saved register over a call site. This is an unfortunate
>> limitation. Any call representation without explicit relocation
>> suffers from the same bug as the patchpoint scheme.
>> - We could combine gc.root allocas and patchpoints. This essentially
>> combines the flaws (no gc pointers in callee saved registers over
>> calls, and missed copies), with no benefit.
>>
>> The statepoint intrinsics are basically the patchpoint option above,
>> but with relocation made explicit in the IR. While it's still legal
>> for the optimizer to create a copy of the value feeding a statepoint,
>> that's now okay. By construction, there can be no use of the
>> original SSA value (and thus the copy) after the statepoint. Instead,
>> the explicitly relocated value is used.
>>
>> To summarize: We need (something like) statepoints for correctness of
>> fully relocating collectors.
>>
>> (The points I'm making here are somewhat subtle. If it would help to
>> have IR examples here, ask. I'm deferring writing them because it's
>> time consuming.)
> I need to withdraw this part of my comments. After further reflection
> and discussion offline, I was reminded that you can implement full
> relocation semantics with gcroot. The parts about patchpoints stand,
> but the gcroot comments are inaccurate.
>
> I need to leave early today, but I plan to respond tomorrow with a
> more complete analysis of the tradeoffs between gcroots and
> statepoints. Sorry for the confusion.
Ok, let's take a second try at explaining the differences between
statepoints and gc.roots. I managed to get myself confused last time
and made a couple of statements which were inaccurate. As a reminder,
this is not talking about late safepoint placement at all. LSP can work
with either mechanism.
From a functional correctness standpoint, gc.root and statepoint are
equivalent. They can both support relocating collectors, including
those which relocate roots. To prevent future confusion, let me review
how each works.
gc.root uses explicit spill slots in the IR in the form of allocas. Each
alloca escapes (through the gcroot call itself); as a result, the
compiler must assume that any readwrite call can both consume and update
the values in question. Additionally, the fact that all calls are
readwrite prevents reordering of unrelated loads past the call. gcroot
relies on the fact that no SSA value relocated at a call site is used at
a site reachable from the call. Instead, a new SSA value (whose
relation to the original is unknown by the compiler) is introduced by
loading from the (potentially clobbered) alloca. gcroot creates a
single stack map table for the entire function. It is the compiled
code's responsibility to ensure that all values in the allocas are
either valid live pointers or null.
Statepoints use most of the same techniques. We rely on not having an
SSA value used on both sides of a call, but we manage the relocation via
explicit IR relocation operations, not loads and stores. We require the
call to be read/write to prevent reordering of unrelated loads. Since
the spill slots are not visible in the IR, we do not need the reasoning
about escapes that gc.root does.
To explicitly state this again since I screwed this up once before, both
statepoints and gc.roots can correctly represent relocation semantics in
the IR. In fact, the underlying reasoning about their correctness is
rather similar.
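To make the shared reasoning concrete, here is a toy Python model. It is not LLVM IR and uses no LLVM API: ToyHeap, gcroot_style, and statepoint_style are names invented for this sketch, integers stand in for addresses, and collect() stands in for a call that reaches a safepoint and moves every live object.

```python
# Toy model only: integers stand in for addresses, a dict for the heap.
class ToyHeap:
    def __init__(self):
        self.objects = {}        # address -> payload
        self.next_addr = 0x1000

    def alloc(self, payload):
        addr = self.next_addr
        self.next_addr += 0x10
        self.objects[addr] = payload
        return addr

    def collect(self, roots):
        """Move every rooted object; return the relocated roots in order."""
        forwarding = {}
        moved = {}
        for addr in roots:
            if addr is None or addr in forwarding:
                continue
            forwarding[addr] = addr + 0x100000      # hypothetical new location
            moved[forwarding[addr]] = self.objects[addr]
        self.objects = moved                        # old addresses are now invalid
        return [None if a is None else forwarding[a] for a in roots]


def gcroot_style(heap):
    slots = [None]                   # one "alloca" registered as a root
    obj = heap.alloc("payload")
    slots[0] = obj                   # store before the call
    slots[:] = heap.collect(slots)   # the call may clobber the escaped slot
    obj_after = slots[0]             # reload: a new value, unrelated to `obj`
    return obj, obj_after, heap.objects[obj_after]


def statepoint_style(heap):
    obj = heap.alloc("payload")
    (obj_relocated,) = heap.collect([obj])   # explicit relocation result
    return obj, obj_relocated, heap.objects[obj_relocated]
```

In both functions the pre-call value `obj` is stale once the collection has run. The only difference is where the post-call value comes from: a reload of a memory slot in the first case, an explicit relocation result in the second.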
They do differ fairly substantially in the details though. Let's
consider a few examples.
*SSA vs Memory* - gcroot encodes relocations as memory operations
(stores, clobbering calls, loads), whereas statepoint uses first class SSA
values. We believe this makes optimizations more straightforward.
Consider a simple optimization for null pointer relocation. If the
optimizer manages to establish that one of the values being relocated is
null, propagating this across a statepoint is straightforward. (For each
gc.relocate, if source is null, replaceAllUsesWith null.) Implementing
this same optimization for gc.root is harder since the store and load
may have been reordered from immediately around the call. This isn't an
unsolvable problem by any means, but it would be a GVN change, not an
InstCombine one. In practice, we believe InstCombine style
optimizations to be advantageous since they're simpler to write and
debug. Arguably, they're also more powerful given the current pipeline
since they have multiple opportunities to trigger.
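As a rough sketch of the null forwarding rule described above, the following works over a made-up straight-line instruction list rather than real LLVM IR. Inst, NULL, and fold_null_relocates are hypothetical names, and the sketch assumes every definition appears before its uses.

```python
from dataclasses import dataclass

NULL = "null"


@dataclass
class Inst:
    name: str       # SSA name defined by this instruction
    op: str         # e.g. "relocate", "icmp_eq"
    operands: list  # operand names; for "relocate", operands[0] is the source


def fold_null_relocates(insts):
    """For each relocate whose source is null, replace all uses with null."""
    replaced = {}
    kept = []
    for inst in insts:
        # Rewrite operands through replacements made so far.
        ops = [replaced.get(o, o) for o in inst.operands]
        if inst.op == "relocate" and ops[0] == NULL:
            replaced[inst.name] = NULL   # stand-in for replaceAllUsesWith(null)
            continue                     # the relocate itself is now dead
        kept.append(Inst(inst.name, inst.op, ops))
    return kept
```

Because the rule only looks at one relocate and its source, it is a local rewrite. A null that flows through several statepoints in a row folds away in a single pass over this toy list.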
*Derived Pointers* - gcroot can represent derived pointers, but only
via convention. There is no convention specified, so it's up to the
frontend to create its own. Statepoints define a convention
(explicitly in the relocation operation) which makes describing
optimizations straightforward.
One thing we plan to do with the statepoint representation is to
implement an "easily derived pointer" optimization (to run near
CodeGenPrep). On X86, it's far cheaper to recreate a GEP base + 5
derived pointer than relocate it. Recognizing this case is quite
straightforward given the statepoint representation.
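The two ideas here can be sketched in a few lines of Python. This is an illustrative model only: relocate_derived and split_easily_derived are invented names, addresses are plain integers, and "easily derived" is assumed to mean "base plus a known constant offset".

```python
def relocate_derived(old_base, old_derived, new_base):
    """Relocate a derived pointer by preserving its offset from its base.

    This is why the base object must be known: the offset may even point
    outside the object, so the derived value alone is not enough.
    """
    return new_base + (old_derived - old_base)


def split_easily_derived(live):
    """Partition values live across a call.

    live: list of (name, base_name, const_offset_or_None).
    Returns (names the collector must relocate,
             {name: (base_name, offset)} to recompute after the call).
    """
    relocate, rematerialize = [], {}
    for name, base, offset in live:
        if name != base and offset is not None:
            # Cheaper to recompute base + offset from the relocated base.
            rematerialize[name] = (base, offset)
        else:
            relocate.append(name)
    return relocate, rematerialize
```

For example, with a base `obj`, a field pointer at `obj + 5`, and an element pointer at an unknown offset, only `obj` and the element pointer would need relocation entries. The field pointer would be recomputed from the relocated `obj`.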
A frontend could implement a similar optimization for gcroot at IR
generation time. You could also implement such an optimization over the
load/call/store representation, but the implementation would be much
more complex (analogous to the null optimization above).
To be fair, gc.root may need such an optimization less. Since
call-safepoints are inserted early, CSE has not yet run. As a result,
there may be fewer "easily derived pointers" live across a call.
*Format* - Statepoints use a standard format. gc.root supports custom
formats. Either could be extended to support the other without much
difficulty.
The more material difference between the two is that gc.root generates a
single stack map for the entire function while statepoints generate a
unique stack map per call site. Having a single stack map imposes a
slight penalty on code compiled with gc.root since dead values must
explicitly be removed from the alloca (by a write of null). In the
wrong situation (say a tight loop with two calls), this could be material.
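A small Python model of the difference, under the assumption that a "stack map" is just the set of slots the collector treats as roots at a call. The function names and the slot dictionary are made up for this sketch and do not correspond to either real format.

```python
def roots_with_function_map(slots):
    """One map for the whole function: every non-null slot is a root
    at every call, whether or not the value is still live."""
    return [value for value in slots.values() if value is not None]


def roots_with_callsite_map(slots, live_at_call):
    """One map per call site: only slots recorded for this call are roots."""
    return [slots[name] for name in live_at_call]


# Value `a` is dead by the second call; value `b` is still live.
slots = {"a": 0x1000, "b": 0x2000}

# Function-wide map: `a` is still reported unless the code nulls it first.
stale = roots_with_function_map(slots)
slots["a"] = None                      # the extra store the compiled code pays for
clean = roots_with_function_map(slots)

# Per-call-site map: no store is needed, the map simply omits `a`.
slots["a"] = 0x1000
per_call = roots_with_callsite_map(slots, ["b"])
```

Here `stale` contains both slots, while `clean` and `per_call` contain only the slot for `b`. The null store is the cost being described: in a loop with two calls it would execute on every iteration.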
*Lowering* - Currently, both gc.root and statepoint lower to stack
slots. gc.root does this at the IR level, statepoints do so in
SelectionDAG.
The design of statepoints is intended to allow pushing the explicit
relocations back through the backend. The reason this is desirable is
that pointers can be left in callee saved registers over call sites.
Without substantial re-engineering, such a thing is not possible for
gc.root. The importance of this from a performance perspective is
debatable. It is my belief that the key benefit would be in a) reducing
frame sizes (by not requiring spill slots), and b) avoiding spills
around calls.
An advantage of gc.root is that the backend can remain largely ignorant
of the gc.root mechanism. By the point the backend encounters them, a
gc.root is just another alloca. One potential problem with the current
implementation is that the escape is lost when lowering; the gcroot call
is lowered to an entry into a side table and the alloca no longer
escapes. This is a source of possible bugs, but is also a
straightforward fix.
As to the lowering currently implemented, it's debatable which is
better. Statepoints optimize constants and unify based on SDValue.
As a result, two IR level values of different types (with the same bit
pattern) can end up sharing the same stack slot. However, statepoints
suffer when trying to assign stack slots. We currently use heuristics, but you can
end up with ugly shuffling of values around on the stack across basic
blocks. (There's a number of ways to improve that, but it's not yet
implemented.) gc.root doesn't suffer from this problem since stack
slots are assigned by the frontend.
Since the stack spills and reloads are visible at the IR layer, gcroot
gets the full ability of the optimizer to remove redundant reloads.
Statepoints only get to leverage the pieces in the backend. In theory,
this could result in materially worse spill/reload code for
statepoints. In practice, this appears not to matter much provided the
same value is assigned to the same slot across both calls, but I don't
actually have much data here to say anything conclusively yet.
I haven't tried to measure frame size for gc.root vs statepoints. I
suspect that statepoints may come out slightly ahead, but I doubt this
is material. There are also cases (see "easily derived pointers"
above), where gc.root may come out ahead.
*IR Level Optimization* - Both gc.root and statepoints cripple
optimization (by design!). gcroot works better with inlining today, but
statepoints could be easily enhanced to handle this case. (The same
work would benefit symbolic patchpoints.)
It is my belief that statepoints are easier to optimize (i.e. easier to
teach passes such as LICM about), but this is purely my guess with no real evidence. Both suffer
from the fact that calls must be marked readwrite. Not having to reason
about memory seems easier, but I'm open to other arguments here.
*Community Support & Compatibility*
From a practical perspective, statepoints have active users behind
them. We are interested in continuing to enhance and optimize them in
the public tree. The same support does not seem to exist for gcroot.
The implementation of statepoints is largely aligned with that of
patchpoints. The implementation of gcroot is completely separate and
poorly understood by the majority of the community.
It wouldn't be hard to write a translation pass from gcroot to
statepoints or from statepoints to gcroot. If folks are concerned about
compatibility, this would be a reasonable option. The largest challenge
to transparently replacing one with the other is in generating the right
output format.
*Summary*
To summarize, gcroot and statepoints are functionally equivalent (modulo
possible bugs). In their current form, the two are largely comparable
with each having some benefits. Long term, we believe a statepoint
representation will allow better code generation and IR level
optimization of code with safepoints inserted. We believe statepoints
to be easier to optimize both at the IR level and backend.
Again, the late safepoint proposal is independent and could be done with
either representation. It's currently implemented on statepoints, but
it could be extended to gcroot without too much work.
>>
>>
>> Other advantages of the statepoint approach:
>>
>> The gc.relocate intrinsics (part of the statepoint proposal) also
>> make it explicit in the IR what the base object of each pointer to
>> be relocated is. This isn't *required* (you could encode the same
>> information in the arguments of the statepoint), but making it
>> explicit is much cleaner.
>>
>> The explicit relocation notation has the potential to be extended in
>> to the backend. With some register allocator changes (not part of
>> this patch!), we could support gc pointers in callee saved
>> registers. This is possible with the (incorrect) patchpoint scheme.
>> It is possible, but *hard*, with the gc.root scheme.
>>
>> The posted patch includes a couple of small optimizations (i.e. null
>> forwarding) that help performance, but could (probably) be
>> implemented on top of another scheme. We have a number of planned
>> optimizations on the statepoint mechanism.
>>
>>
>> Now, let me finally bring up late safepoint placement. The only real
>> impact on this patch is that, to date, we have only focused on the
>> *correctness* of a statepoint passing through the optimizer. We have
>> not attempted to teach the optimizer about how to leverage one or
>> perform optimizations over one. There's room for improvement here
>> (i.e. not completely blocking inlining), but we prefer to approach
>> this problem by simply inserting them late. You could instead
>> choose to insert them at generation time, and teach the optimizer
>> about their semantics. That *strategy choice* is independent of the
>> representation chosen provided that representation is *correct*.
>>
>> Yours,
>> Philip
>>
>> On 10/14/2014 07:01 PM, Kevin Modzelewski wrote:
>>> I think a change like this might be more compelling if you could
>>> give more detail on how it would actually help (I can't find the
>>> detail I'm looking for in your blog posts). It seems like the value
>>> of this patch is that it will work with late safepoint placement,
>>> but it'd be nice to see some examples of cases where late safepoint
>>> placement gives you something that early safepoint placement (ie by
>>> the frontend) doesn't. It kind of feels like either approach will
>>> work well with only non-gc values, and neither approach will be able
>>> to do much optimization when you do function calls. I'm not trying
>>> to claim that that's necessarily true, but it'd be easier to
>>> understand your point if there was some example IR.
>>>
>>> http://reviews.llvm.org/D5683
>>>
>>>
>>
>