[PATCH] Statepoint infrastructure for garbage collection

Fri Oct 24 10:40:55 PDT 2014

On 10/23/2014 11:01 PM, Andrew Trick wrote:
>> On Oct 8, 2014, at 2:24 PM, Philip Reames <listmail at philipreames.com> wrote:
>>
>> Hi hfinkel, chandlerc, nicholas, sanjoy, atrick, ributzka, theraven,
>>
>> The attached patch implements an approach to supporting garbage collection in LLVM that has been mentioned on the mailing list a number of times by now.  There's a couple of issues that need to be addressed before submission, but I wanted to get this up to give maximal time for review.
> This is really awesome work. It will be great to have solid and efficient design for anyone wanting precise GC with LLVM compiled code.
>
>> The statepoint intrinsics are intended to enable precise root tracking through the compiler as to support garbage collectors of all types.  Our testing to date has focused on fully relocating collectors (where pointers can change at any safepoint poll, or call site), but the infrastructure should support collectors of other styles.  The addition of the statepoint intrinsics to LLVM should have no impact on the compilation of any program which does not contain them.  There are no side tables created, no extra metadata, and no inhibited optimizations.
> Soon we should promote llvm.experimental.patchpoint to a first-class intrinsic, so I want to know if there's something better we can do to handle your use case.
>
> There are a variety of use-cases for patchpoint-like intrinsics. They will each use or ignore some subset of the operands. I realize now that the most sane approach is to define one well-supported intrinsic and use call attributes to denote subtle difference in semantics. For example, what if someone wants to implement polymorphic inline caches at call sites with precise GC? They are going to need a single intrinsic that does everything patchpoint and statepoint do. Two separate intrinsics would not give them valid semantics, unless they were somehow tied together and lowered as one instruction.
>
> I understand that you don't want statepoints to carry patching-related cruft. But it's really just one constant operand, numbytes, that you set to zero!
I actually completely utterly agree with you here.  Long term, I would 
like to see statepoints, patchpoints, and stackmaps rolled into a single 
intrinsic.

I would like to avoid having perfection be the enemy of good though.  I 
would strongly prefer we land the changes for statepoint, then 
incrementally revise statepoint and patchpoint into a single intrinsic.  
We can start with the implementation commonalities, then merge the 
source intrinsics.
>
> One problem is how you lower the gc_result. However, it should be possible to detect during lowering that patchpoint is used by gc_result and react accordingly.
The gc_result part also isn't particularly GC specific.  This is simply 
a way to extract a result value out of a patchpoint/statepoint while 
still having a token to tie the gc_relocates to.  I don't see immediate 
use for this in other usages of patchpoint like things; can you think of 
any?

But yes, we could easily special case this off a single intrinsic.
>
> If there are other difficulties in how you custom lower, for example tracking pointer relocations, we could add a call attribute to identify the patchpoint as having statepoint semantics. We might also need that attribute to force frame index operands to be volatile stores.
>
> It's certainly nice to see the distinction between deopt and gc args, but is it essential for any LLVM passes to make that distinction downstream from the safepoint insertion pass?
Actually yes.  Deopt arguments are read only.  They do not need to be 
reloaded or updated after a statepoint.  GC arguments are read/write and 
do need updated.  This causes both differences in lowering and legal 
optimizations.  (I'm speaking here w.r.t. semantics, not neccessarily 
what's actually implemented today...)

I agree that this handling could be merged with patchpoint.  You could 
think of the current deopt argument section as being similar in intent 
to patchpoints arguments.  There's some differences in lowering 
currently, but we can take a close look at why these differences exist.  
They may not actually matter.
>
> The only issue I can think of us generating stackmap records. But this really surprises me:
>
> + The ID field of the 'StkMapRecord' for a statepoint is meaningless and it's value is
> + explicitly unspecified.
>
> I envision stack maps working as follows:
>
> (1) LLVM emits a "raw" stackmap record.
>
> (2) The JIT parses and compresses each record, using the ID to map to its own metadata describing the stackmap location. For example, LLVM can emit the deopt and GC parameters as a single list of location. The JIT knows the number of deopt parameters for this ID, and can use the correct encoding for the remaining GC pairs.
You could do it this way.  However, you loose the ability for LLVM to 
distinguish between the two.  Given the difference in semantics w.r.t. 
lowering and optimization legality, I'd considered that undesirable.

FYI, the ID field bit was mostly to avoid accidentally specifying a 
behaviour that we're not using.  I am more than fine picking a 
reasonable behaviour and specifying it for that field.  I'd prefer to do 
that as an incremental change in tree though.  :)
>
>> A statepoint works by transforming a call site (or safepoint poll site) into an explicit relocation operation.  It is the frontend's responsibility (or eventually the safepoint insertion pass we've developed, but that's not part of this patch) to ensure that any live pointer to a GC object is correctly added to the statepoint and explicitly relocated.  The relocated value is just a normal SSA value (as seen by the optimizer), so merges of relocated and unrelocated values are just normal phis.  The explicit relocation operation, the fact the statepoint is assumed to clobber all memory, and the optimizers standard semantics ensure that the relocations flow through IR optimizations correctly.
>>
>> During the lowering process, we currently spill aggressively to stack.  This is not entirely ideal (and we have plans to do better), but it's functional, relatively straight forward, and matches closely the implementations of the patchpoint intrinsics.  We leverage the existing StackMap section format, which is already used by the patchpoint intrinsics, to report where pointer values live.  Unlike a patchpoint, these locations are known (by the backend) to be writeable during the call.  This enables the garbage collector to transparently read and update pointer values if required.  We do optimize lowering in certain well known cases (constant pointers, a.k.a. null, being the key one.)
> It's very sad that you force spilling. That somewhat defeats the purpose of patch points and stack maps. Is that because you don't have a way to lower the statepoint such that it defines all the virtual registers that hold pointer values? That seems fixable with the right machine operand flags.
This is entirely due to implementation problems.  We plan to support gc 
values in callee saved registers.  We've even had two or three attempts 
at implementing it, but every approach we've tried has been problematic 
for some reason.  One of the reasons I want to get this in tree is so 
that I can get some help on figuring out the "right" way to solve this.

In theory, we could easily support deopt values in registers today. They 
don't need the update semantics.    (I don't really remember why we're 
not right now.  I think there might have been some problem with the 
number of virtual register arguments to a single psuedo op?)  Again, 
this should be an incremental change in tree.

Something to keep in mind w.r.t. how much this matters: we're only 
spilling values around call sites.  Between calls things can be in 
registers.  In practice, many values stay on the stack (without reloads 
or uses) between successive statepoints.  It matters, but I've been 
slightly shocked by how far we could get without this.

>
> Is there also an issue with your runtime restoring callee saves when unwinding? This is unfortunately a platform-specific issue that LLVM JITs currently face. You may have solved it given you're only targetting one platform. Please bring it up in the MCJIT BOF Tuesday if it's an issue for you.
In short, yes.  Our runtime has limitations here.  In practice, we have 
support for tracking deopt values through CSR (and the resulting 
spills), and GC values in some very particular cases, but not in 
general.  What we need from LLVM is a) to only spill in cases we 
support, and b) a description of CSR spill locations (which we can get 
from .eh_frame).

One thing we need to do is figure out how to make (a) above 
parametrizable since I imagine every runtime will have it's own 
constraints.  Again, I'd prefer an incremental change after this lands.  :)
>
>> There are a few areas of this patch which could use improvement:
>> - The patch needs rebased against TOT.  It's currently based against a roughly 3 week old snapshot.
>> - The intrinsics should probably be renamed to include an "experimental" prefix.
>> - The usage of Direct and Indirect location types are currently inverted as compared to the definition used by patchpoint.  This is a simple fix.
>> - The test coverage could be improved.  Most of the tests we've actually been using are built on top of the safepoint insertion mechanism (not included here) and our runtime.  We need to improve the IR level tests for optimizer semantics (i.e. not doing illegal transforms), and lowering.  There are some minimal tests in place for the lowering of simple statepoints.
>> - The documentation is "in progress" (to put it kindly.)
>> - Many functions are missing doxygen comments
>> - There's a hack in to force the use of RSP+Offset addressing vs RBP-Offset addressing for references in the StackMap section.  This works, shouldn't break anyone else, but should definitely be cleaned up.  The choice of addressing preference should be up to the runtime.
>>
>> When reviewing, I would greatly appreciate feedback on which issues need to be fixed before submission and those which can be addressed afterwards.  It is my plan to actively maintain and enhance this infrastructure over next few months (and years).  It's already been developed out of tree entirely too long (our fault!), and I'd like to move to incremental work in tree as quickly as feasible.
> This isn't a proper review, but I do have one comment on the code. There is an awful lot of custom code in SelectionDAGBuilder (600 lines). Very few LLVM devs will want to see this, so I suggest finding a way to move it into another file even if it's not easy to do.
I am completely open to improving the code in this respect, but I'm not 
entirely clear on what you're asking for.  Are you saying that a) this 
code should simply be moved to a helper file?  or b) that this code 
should not be part of SelectionDAGBuilder (the class) at all?  If you're 
going for (b), where else would we put it?

(After some of the gcroot vs statepoint discussions the other day, I 
released we could actually do some preparation work at the IR level.  
Doing so would make sense for the moment, but seems to conflict with the 
long term desire to lower to registers.  I'm hesitant to move away from 
the long term objective.)

Thanks for the thoughtful comments.
>
> -Andy
>
>> Planned enhancements after submission:
>> - The ordering of arguments in statepoints is essentially historical cruft at this point.  I'm open to suggestions on how to make this more approachable.  Reordering arguments would (preferably) be a post commit action.
>> - Support for relocatable pointers in callee saved registers over call sites.  This will require the notation of an explicit relocation psuedo op and support for it throughout the backend (particularly the register allocator.)
>> - Optimizations for non-relocating collectors.  For example, the clobber semantics of the spill slots aren't needed if the collector isn't relocating roots.
>> - Further optimizations to reduce the cost of spilling around each statepoint (when required at all).
>> - Support for invokable statepoints.
>> - Once this has baked in tree for a while, I plan to delete the existing gc_root code.  It is unsound, and essentially unused.
>>
>> In addition to the enhancements to the infrastructure in the currently proposed patch, we're also working on a number of follow up changes:
>> - Verification passes to confirm that safepoints were inserted in a semantically valid way (i.e. no memory access of a value after it has been inserted)
>> - A transformation pass to convert naive IR to include both safepoint polling sites, and statepoints on every non-leaf call.  This transformation pass can be used at initial IR creation time to simplify the frontend authors' work, but is also designed to run on *fully optimized* IR, provided the initial IR meets certain (fairly loose) restrictions.
>> - A transformation pass to convert normal loads and stores into user provided load and store barriers.
>> - Further optimizations to reduce the number of safepoints required, and improve the infrastructure as a whole.
>>
>> We've been working on these topics for a while, but the follow on patches aren't quite as mature as what's being proposed now.  Once these pieces stabilize a bit, we plan to upstream them as well.  For those who are curious, our work on those topics is available here: https://github.com/AzulSystems/llvm-late-safepoint-placement
>>
>> http://reviews.llvm.org/D5683
>>
>> Files:
>>   docs/Statepoints.rst
>>   include/llvm/CodeGen/FunctionLoweringInfo.h
>>   include/llvm/CodeGen/MachineInstr.h
>>   include/llvm/CodeGen/StackMaps.h
>>   include/llvm/IR/Intrinsics.td
>>   include/llvm/IR/Statepoint.h
>>   include/llvm/Target/Target.td
>>   include/llvm/Target/TargetFrameLowering.h
>>   include/llvm/Target/TargetOpcodes.h
>>   lib/Analysis/TargetTransformInfo.cpp
>>   lib/CodeGen/InlineSpiller.cpp
>>   lib/CodeGen/LocalStackSlotAllocation.cpp
>>   lib/CodeGen/PrologEpilogInserter.cpp
>>   lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp
>>   lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
>>   lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
>>   lib/CodeGen/StackMaps.cpp
>>   lib/CodeGen/TargetLoweringBase.cpp
>>   lib/IR/CMakeLists.txt
>>   lib/IR/Function.cpp
>>   lib/IR/LLVMContext.cpp
>>   lib/IR/Statepoint.cpp
>>   lib/IR/Verifier.cpp
>>   lib/Target/X86/X86FrameLowering.cpp
>>   lib/Target/X86/X86FrameLowering.h
>>   lib/Target/X86/X86ISelLowering.cpp
>>   lib/Target/X86/X86MCInstLower.cpp
>>   lib/Transforms/InstCombine/InstCombineCalls.cpp
>>   test/CodeGen/X86/statepoint-call-lowering.ll
>>   test/CodeGen/X86/statepoint-stack-usage.ll
>>   test/CodeGen/X86/statepoint-stackmap-format.ll
>>   test/Verifier/statepoint-non-gc-ptr.ll
>>   test/Verifier/statepoint.ll
>>   utils/TableGen/CodeGenTarget.cpp
>> <D5683.14604.patch>