[llvm-dev] gc relocations on exception path w/RS4GC currently broken

Thu Feb 11 10:56:21 PST 2016

I think we accidentally un-CC'ed everybody.

On Sun, Feb 7, 2016 at 4:41 PM, Sanjoy Das
<sanjoy at playingwithpointers.com> wrote:
> Joseph Tremoulet wrote:
>
>>> I think the right place ... is in StatepointLowering... reload from
>>> the Indirect slot and spill to the DoubleIndirect slot unconditionally
>>> on the exception edge
>>
>> Do you mean in SelectionDAGBuilder::visitStatepoint?  I thought that
>> SelectionDAGBuilder worked a block-at-a-time and didn't mutate the IR,
>> so I'm confused as to how it should orchestrate inserting certain code
>> in the head of every block that is an EH pad with a statepoint
>> predecessor.  I haven't worked with the selection dag much, so may
>> well just be overlooking something obvious; what makes sense there?

After discussing this with Philip, doing this orchestration in CGP is
probably okay too -- given that it is logically a part of lowering.
That's simpler than the scheme I described.

> So the short story is that even though the key parts of the
> SelectionDAG algorithm are per-block, you are free to do
> whole-function analysis and cache whole-function information that will
> inform what you do in each specific block later.
>
> More specifically, here is one way that might work:
>
> (re-iterating some bits so that we don't talk past each other):
>
>  - Let's say we have two safepointing invokes, at blocks %X and %Y,
>    that both unwind to %U.  %U uses a heap ref, "%ref = phi i8
>    addrspace(1)* [ %R_X, %X ], [ %R_Y, %Y ]".
>
>  - In RS4GC, we change block %X to store %R_X to %Alloc (an alloca)
>    before the invoke (that will now be wrapped in a gc.statepoint) and
>    change block %Y to store %R_Y to %Alloc before the statepoint.  The
>    landingpad replaces %ref with a load from %Alloc.  %Alloc gets
>    reported in the statepoints in blocks %X and %Y.  If the runtime
>    supports DoubleIndirect statepoint slots, this is all that's
>    needed.
>
>  - If the runtime does not support DoubleIndirect slots, we do some
>    adjustments in SelectionDAG for DoubleIndirect allocas where the
>    alloca has been obscured (and thus the DoubleIndirect allocas
>    cannot be "demoted" to an Indirect slot).  In
>    FunctionLoweringInfo::set (or some other helper that runs in the
>    same period), we do a whole function analysis for landingpads whose
>    associated invokes have non-demotable DoubleIndirect slots, and
>    remember this set (of landingpads).  When emitting code for these
>    landingpads, we "pretend" the IR for the landingpad was:
>
>        %stck.addr = phi ...
>        %ref.addr  = phi ...
>        (*%stck.addr) = (*%ref.addr)
>
>      x times the max number of non-demotable DoubleIndirect allocas
>        over any one statepoint (this number can be pre-calculated in
>        FunctionLoweringInfo::set as well)
>
>  - The invoke basic blocks can then "pass in" the load and store
>    addresses to the unwind block through these PHI nodes.  There may
>    be cases like, e.g., %R_X == null above, where there is nothing to
>    spill / fill.  In these cases, we can exploit the fact that the
>    (*%stck.addr) = (*%ref.addr) bit is a no-op if %stck.addr ==
>    %ref.addr, and pass in any valid stack slot in the two PHI nodes.
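
To make the padding trick in the last two bullets concrete, here is a minimal sketch in plain C++ (no LLVM headers) that models memory as a map from address to word. Everything named here (`CopyPair`, `maxNonDemotableSlots`, `padPairs`, `runLandingPadCopies`) is hypothetical and only illustrates the idea; none of it is existing SelectionDAG or FunctionLoweringInfo API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model only: "memory" maps an address to the word stored there.
using Memory = std::unordered_map<std::uint64_t, std::uint64_t>;

// One pair of addresses that an invoke block passes to the unwind block
// through the %stck.addr and %ref.addr PHI nodes.
struct CopyPair {
  std::uint64_t stckAddr;
  std::uint64_t refAddr;
};

// Hypothetical stand-in for the whole-function pre-computation: the max
// number of non-demotable DoubleIndirect allocas over any one statepoint.
std::size_t maxNonDemotableSlots(const std::vector<std::size_t> &perStatepoint) {
  if (perStatepoint.empty())
    return 0;
  return *std::max_element(perStatepoint.begin(), perStatepoint.end());
}

// An invoke with fewer real pairs than the landing pad expects pads the
// rest with (slot, slot): copying a slot onto itself changes nothing.
std::vector<CopyPair> padPairs(std::vector<CopyPair> pairs, std::size_t n,
                               std::uint64_t anyValidSlot) {
  while (pairs.size() < n)
    pairs.push_back(CopyPair{anyValidSlot, anyValidSlot});
  return pairs;
}

// The "pretend" landing pad prologue: (*%stck.addr) = (*%ref.addr), once
// per PHI pair.  Throws std::out_of_range if a source address is unmapped.
void runLandingPadCopies(Memory &mem, const std::vector<CopyPair> &pairs) {
  for (const CopyPair &p : pairs) {
    const std::uint64_t value = mem.at(p.refAddr);
    mem[p.stckAddr] = value;
  }
}
```

The point of the padding is that every predecessor can feed the same fixed number of PHI pairs, so the landing pad needs no per-predecessor control flow.
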
>
> -- Sanjoy
>
>
>>
>> Thanks
>> -Joseph
>>
>> -----Original Message-----
>> From: Sanjoy Das [mailto:sanjoy at playingwithpointers.com]
>> Sent: Friday, February 5, 2016 10:02 PM
>> To: Joseph Tremoulet<jotrem at microsoft.com>
>> Cc: Philip Reames<listmail at philipreames.com>; Manuel
>> Jacob<me at manueljacob.de>; llvm-dev<llvm-dev at lists.llvm.org>
>> Subject: Re: [llvm-dev] gc relocations on exception path w/RS4GC currently
>> broken
>>
>> Joseph Tremoulet wrote:
>>> Thanks, I think that's a useful way to look at it (though if I wanted
>>> to bikeshed I'd suggest the name "DoubleIndirect" as a bit more
>>> precise than "VeryIndirect").
>>
>> Yup, that sounds better.  :)
>>
>>> An aspect of it that I'm still puzzling over is that my target runtime
>>> (at least in its current form) doesn't have a way to represent/process
>>> a "VeryIndirect" pointer.  So I'd like to be able to guarantee that
>>> only "Direct" and (single)"Indirect" slots get reported.  And then
>>> it's not clear to me what bit of code should be responsible for
>>> ensuring that there are no "VeryIndirect" slots at the end of the day.
>>> Does statepoint lowering on the DAG need to be able to inject
>>> loads/stores to convert a "VeryIndirect" to a (single)"Indirect"?
>>> Should CodeGenPrepare be responsible for doing that rewrite at the IR
>>> level (and is it reasonable to assume that nothing after CGP would do
>>> the inverse)?  Is it "good enough" to just know that RS4GC won't
>>> directly emit the pattern that lowers to "VeryIndirect", and have
>>> clients that care like LLILC run RS4GC "right before" CGP?  Or were
>>> you suggesting something different, like somehow on the machine code
>>> we should insert loads and stores if needed when we see a
>>> "VeryIndirect" in the stack map?
>>
>> I think the right place to make DoubleIndirect "go away" is in
>> StatepointLowering.  That's the point after which we **know** what's a stack
>> slot and what isn't; and as you note below, the mechanism won't be
>> substantially more complex / different than what we have today.
>>
>>> It occurs to me that the expansion needed is very similar to the
>>> expansion that the lowering currently does to spill gc-pointer SSA
>>> values and produce "Indirect" slots, just with a load prepended before
>>> the spill and stores appended after the fills at the ends; but the
>>> current mechanism for that lowering keys off the gc.relocate calls for
>>> generating the fills, and the gc.relocate calls wouldn't be present
>>> for the cases that need to be "VeryIndirect"...
>>
>> I think the simplest solution is to reload from the Indirect slot and
>> spill to the DoubleIndirect slot unconditionally on the exception edge (i.e.
>> without regard to where the uses of the relocated references are).  I don't
>> think we need to do anything more from a correctness standpoint.
>>
>> For DoubleIndirect slots we *know* that the users will automatically be
>> lowered into loads from the correct location without us doing anything
>> special, so in that way handling DoubleIndirect this way is easier than
>> handling gc.relocate.
>>
>> -- Sanjoy
>>
>>> -----Original Message-----
>>> From: Sanjoy Das [mailto:sanjoy at playingwithpointers.com]
>>> Sent: Friday, February 5, 2016 7:05 PM
>>> To: Joseph Tremoulet<jotrem at microsoft.com>
>>> Cc: Philip Reames<listmail at philipreames.com>; Manuel
>>> Jacob<me at manueljacob.de>; llvm-dev<llvm-dev at lists.llvm.org>
>>> Subject: Re: [llvm-dev] gc relocations on exception path w/RS4GC
>>> currently broken
>>>
>>> For #1, perhaps we need a third kind of encoding, which we could call
>>> (for the lack of a better name), "VeryIndirect".  A VeryIndirect location
>>> implies that the heap reference is stored in the location **(Reg + Offset).
>>>
>>> With that in place, we'll have three different forms of locations:
>>>
>>> Direct       == the reference *is* Reg+Offset
>>> Indirect     == the reference is *(Reg+Offset)
>>> VeryIndirect == the reference is **(Reg+Offset)
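
As a rough illustration of the three forms above, here is a small self-contained C++ sketch of how a stack map consumer might turn a location into the heap reference it describes. The `Memory` map, `LocKind`, and `resolveReference` are hypothetical names for this toy model, not part of the LLVM stack map format or any runtime's API. ("VeryIndirect" is the name that is renamed to "DoubleIndirect" later in this thread.)

```cpp
#include <cstdint>
#include <unordered_map>

// Toy model only: "memory" maps an address to the 64-bit word stored there.
using Memory = std::unordered_map<std::uint64_t, std::uint64_t>;

enum class LocKind { Direct, Indirect, VeryIndirect };

// Hypothetical helper: given a (Reg + Offset) style location, return the
// heap reference it describes.  Throws std::out_of_range on unmapped reads.
std::uint64_t resolveReference(LocKind kind, std::uint64_t reg,
                               std::int64_t offset, const Memory &mem) {
  const std::uint64_t addr = reg + static_cast<std::uint64_t>(offset);
  switch (kind) {
  case LocKind::Direct:
    return addr;                 // the reference *is* Reg+Offset
  case LocKind::Indirect:
    return mem.at(addr);         // the reference is *(Reg+Offset)
  case LocKind::VeryIndirect:
    return mem.at(mem.at(addr)); // the reference is **(Reg+Offset)
  }
  return 0; // not reached for valid kinds
}
```
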
>>>
>>> (This following bit is re-iterating what Joseph and I talked about on
>>> Skype, so that everyone is up to speed)
>>>
>>> gc.statepoint would then have two different "argument regions" for
>>> reporting heap references, "unspilled" and "spilled".  Lowering for the
>>> "unspilled" region would be what we currently have for GC references.  Lowering
>>> for the "spilled" region would be: emit code normally (i.e. what we do
>>> today), but if you were going to report the location as Direct, then report
>>> it as Indirect (since the spill is already present in the IR), and if you
>>> were going to report it as Indirect, then report it as VeryIndirect (since
>>> we'll have two spills now).
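
The reporting rule for the two argument regions can be written down as a tiny mapping. This is only a sketch in standalone C++; `ArgRegion`, `LocKind`, and `reportedKind` are hypothetical names, and the handling of an already-VeryIndirect input is an assumption, since lowering as described only ever produces Direct or Indirect.

```cpp
enum class LocKind { Direct, Indirect, VeryIndirect };
enum class ArgRegion { Unspilled, Spilled };

// Hypothetical helper: given the location kind that lowering would report
// today, return the kind to report for an argument in the given region.
LocKind reportedKind(ArgRegion region, LocKind lowered) {
  if (region == ArgRegion::Unspilled)
    return lowered; // unchanged: what we currently do for GC references
  switch (lowered) {
  case LocKind::Direct:
    return LocKind::Indirect;     // the spill is already present in the IR
  case LocKind::Indirect:
    return LocKind::VeryIndirect; // IR-level spill plus lowering's spill
  case LocKind::VeryIndirect:
    break; // assumption: not produced by lowering, so left unchanged
  }
  return lowered;
}
```
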
>>>
>>> RS4GC would construct statepoint with normal SSA references in the
>>> "unspilled" section, and allocas in the "spilled" section.
>>>
>>>
>>> For #2, I like your idea of teaching RS4GC to not emit "live derived
>>> pointers" at all.  It is conceptually the same transform as our
>>> "rematerialize simple GEPs" optimization, except that we now need to be able
>>> to do this for correctness.
>>>
>>> -- Sanjoy
>>>
>>>
>>> Joseph Tremoulet wrote:
>>>> Sorry to reply to myself here, but I had an idea regarding "issue #2"
>>>> -- possibly what makes the most sense for those clients/targets is to
>>>> pull the pointer difference computation/reapplication into RS4GC
>>>> itself -- it could have a pass just before or after
>>>> rematerialization, which runs based on a configuration flag
>>>> (eventually to be driven by GCStrategy), which performs rewrites like
>>>> below to ensure that only base pointers are live across statepoints
>>>> when it's done (plus a bit of bookkeeping w.r.t. recomputeliveness
>>>> and/or the ssa update at the end to make sure the rewritten pointers
>>>> don't get reported and do get uses of the original value replaced
>>>> with them)
>>>>
>>>> Thanks
>>>> -Joseph
>>>>
>>>> -----Original Message-----
>>>> Subject: Re: [llvm-dev] gc relocations on exception path w/RS4GC
>>>> currently broken
>>>>
>>>> Working on this, I've run into a couple potential issues regarding which
>>>> I'd like to solicit feedback.
>>>>
>>>> To give a concrete example, we're talking about having RS4GC see a
>>>> GC-safepoint call like so:
>>>>
>>>>        %a = _  ; gc pointer
>>>>        %b = _  ; gc pointer
>>>>        ...
>>>>        invoke void @callee()
>>>>          to label %cont unwind label %pad
>>>>      cont:
>>>>        _ = %a
>>>>      ...
>>>>      pad:
>>>>        landingpad _
>>>>        _ = %b
>>>>      ...
>>>>
>>>> and transform it into:
>>>>
>>>>        %b.gc_spill = alloca<ty>
>>>>        ...
>>>>        %a = _
>>>>        %b = _
>>>>        ...
>>>>        store<ty>    %b,<ty>* %b.gc_spill
>>>>        %sp = invoke token @llvm.experimental.gc.statepoint(<arg list
>>>> that indicates %a is a gc pointer and %b.gc_spill holds a gc pointer>)
>>>>          to label %cont unwind label %pad
>>>>      cont:
>>>>        %a.reloc = call<ty>    @llvm.experimental.gc.relocate(token
>>>> %sp,<index of %a>,<index of %a>)
>>>>        _ = %a.reloc
>>>>      ...
>>>>      pad:
>>>>        landingpad _
>>>>        %b.gc_reload = load<ty>,<ty>* %b.gc_spill
>>>>        _ = %b.gc_reload
>>>>      ...
>>>>
>>>> which would then get lowered to a call with a stack map reporting %a (or
>>>> the slot that lowering spills %a to) and %b.gc_spill as holding live gc
>>>> pointers.
>>>>
>>>>
>>>> Issue #1: obscurability of the %b.gc_spill use on the gc.statepoint
>>>> invoke
>>>>
>>>> Some target runtimes/GCs (CoreCLR included) need to have stack slots
>>>> reported directly by offset.  If code runs between RS4GC and lowering that
>>>> somehow rewrites the argument on the statepoint corresponding to b's spill
>>>> to be anything other than a direct use of the static alloca that RS4GC
>>>> allocated to hold the spill, the best we could do is have the lowering
>>>> introduce another layer of indirection.
>>>> E.g., continuing the above example, if something after RS4GC obscures
>>>> %b.gc_spill on the statepoint:
>>>>
>>>>        %b.gc_spill = alloca<ty>
>>>>        ...
>>>>        %a = _
>>>>        %b = _
>>>>        ...
>>>>        %p = _ ; like maybe a PHI that has %b.gc_spill as an incoming
>>>> value
>>>>        ...
>>>>        store<ty>    %b,<ty>* %b.gc_spill ; may or may not have been
>>>> rewritten as store into %p
>>>>        %sp = invoke token @llvm.experimental.gc.statepoint(<arg list
>>>> that indicates %a is a gc pointer and %p holds a gc pointer>)
>>>>          to label %cont unwind label %pad
>>>>      cont:
>>>>        %a.reloc = call<ty>    @llvm.experimental.gc.relocate(token
>>>> %sp,<index of %a>,<index of %a>)
>>>>        _ = %a.reloc
>>>>      ...
>>>>      pad:
>>>>        landingpad _
>>>>        %b.gc_reload = load<ty>,<ty>* %b.gc_spill
>>>>        _ = %b.gc_reload
>>>>      ...
>>>>
>>>> then lowering would effectively have to insert another indirection:
>>>>
>>>>        %b.gc_spill = alloca<ty>
>>>>        ...
>>>>        %a = _
>>>>        %b = _
>>>>        ...
>>>>        %p = _ ; like maybe a PHI that has %b.gc_spill as an incoming
>>>> value
>>>>        ...
>>>>        store<ty>    %b,<ty>* %b.gc_spill ; may or may not have been
>>>> rewritten as store into %p
>>>>        %p.deref = load<ty>,<ty>* %p
>>>>        %sp = invoke token @llvm.experimental.gc.statepoint(<arg list
>>>> that indicates %a and %p.deref are gc pointers>)
>>>>          to label %cont unwind label %pad
>>>>      cont:
>>>>        store<ty>    %p.deref,<ty>* %p
>>>>        %a.reloc = call<ty>    @llvm.experimental.gc.relocate(token
>>>> %sp,<index of %a>,<index of %a>)
>>>>        _ = %a.reloc
>>>>      ...
>>>>      pad:
>>>>        landingpad _
>>>>        store<ty>    %p.deref,<ty>* %p
>>>>        %b.gc_reload = load<ty>,<ty>* %b.gc_spill
>>>>        _ = %b.gc_reload
>>>>      ...
>>>>
>>>> (and the code to insert the %p <-> %p.deref loads/stores would have to
>>>> be something in CodeGenPrep or [ugh] directly in StatepointLowering).
>>>> I'm curious to know if others think this is problematic or not.  I know
>>>> that for LLILC we intend to run RS4GC late in the pass list and could
>>>> probably just discount the possibility of the spill slot allocas on the
>>>> gc.statepoint invoke getting obscured (or at least could be ok with having
>>>> the bail-out lowering/CGP code that patches things up with extra
>>>> stores/loads, on the assumption that it's rare in practice to hit these
>>>> cases), but I'm not sure how representative LLILC is of the community in
>>>> that regard.  Similarly, I have the impression that we're moving generally
>>>> toward wedding RS4GC more with CGP, but I would be interested to know if I'm
>>>> off the mark there.
>>>>
>>>>
>>>> Issue #2: Relocating derived pointers by IR injection
>>>>
>>>> I know there's been some discussion about runtimes which require the
>>>> pointers reported directly to them to be base object pointers.  So e.g. with
>>>> code like this:
>>>>
>>>>        %p = _ ;<some base object pointer>
>>>>        %q =<getelementptr getting a pointer at some offset from %p>
>>>>        ...
>>>>        call @callee()
>>>>        _ = %q
>>>>
>>>> then RS4GC will generate
>>>>
>>>>        %p = _ ;<some base object pointer>
>>>>        %q =<getelementptr getting a pointer at some offset from %p>
>>>>        ...
>>>>        %sp = call token @llvm.experimental.gc.statepoint(<args
>>>> indicating %p and %q are gc pointers, with %p as %q's base>)
>>>>        %p.reloc = call @llvm.experimental.gc.relocate(token %sp,<index
>>>> of p>,<index of p>)
>>>>        %q.reloc = call @llvm.experimental.gc.relocate(token %sp,<index
>>>> of p>,<index of q>)
>>>>        _ = %q.reloc
>>>>
>>>> The default lowering of that is to spill %p and %q to the stack just
>>>> before the call, and lower the gc.relocate calls as loads from those slots
>>>> after the calls, with the understanding that the stack map will communicate
>>>> to the GC that q's slot is derived from p's slot and the GC will update both
>>>> pointers appropriately.  However, for targets where the interface with the
>>>> runtime only allows reporting base pointers, lowering would have to effect
>>>> something like the following transformation:
>>>>
>>>>        %p = _ ;<some base object pointer>
>>>>        %q =<getelementptr getting a pointer at some offset from %p>
>>>>        ...
>>>>        %d =<compute %q - %p>
>>>>        %sp = call token @llvm.experimental.gc.statepoint(<args
>>>> indicating %p is a gc pointer>)
>>>>        %p.reloc = call @llvm.experimental.gc.relocate(token %sp,<index
>>>> of p>,<index of p>) ; lower to load
>>>>        %q.reloc = call @llvm.experimental.gc.relocate(token %sp,<index
>>>> of p>,<index of q>) ; lower to %p.reloc + %d
>>>>        _ = %q.reloc
>>>>
>>>> Now, if we consider the same situation but on an exception path where
>>>> there are no explicit gc.relocate calls because RS4GC spilled along the
>>>> exception path:
>>>>
>>>>        %p.gc_spill = alloca _
>>>>        %q.gc_spill = alloca _
>>>>        %p = _ ;<some base object pointer>
>>>>        %q =<getelementptr getting a pointer at some offset from %p>
>>>>        ...
>>>>        store<ty>    %p,<ty>* %p.gc_spill
>>>>        store<ty>    %q,<ty>* %q.gc_spill
>>>>        %sp = invoke token @llvm.experimental.gc.statepoint(<args
>>>> indicating %p.gc_spill and %q.gc_spill hold gc pointers, with %p.gc_spill as
>>>> %q.gc_spill's base>)
>>>>          to label _, unwind label %pad
>>>>      pad:
>>>>        landingpad _
>>>>        %p.reload = load<ty>,<ty>* %p.gc_spill
>>>>        %q.reload = load<ty>,<ty>* %q.gc_spill
>>>>        _ = %q.reload
>>>>
>>>> then the best you could do is something like this:
>>>>
>>>>        %p.gc_spill = alloca _
>>>>        %q.gc_spill = alloca _
>>>>        %p = _ ;<some base object pointer>
>>>>        %q =<getelementptr getting a pointer at some offset from %p>
>>>>        ...
>>>>        store<ty>    %p,<ty>* %p.gc_spill
>>>>        store<ty>    %q,<ty>* %q.gc_spill
>>>>        ; compute difference
>>>>        %p.to_compute_d = load<ty>,<ty>* %p.gc_spill
>>>>        %q.to_compute_d = load<ty>,<ty>* %q.gc_spill
>>>>        %d =<compute %q.to_compute_d - %p.to_compute_d>
>>>>        %sp = invoke token @llvm.experimental.gc.statepoint(<args
>>>> indicating %p.gc_spill and %q.gc_spill hold gc pointers, with %p.gc_spill as
>>>> %q.gc_spill's base>)
>>>>          to label _, unwind label %pad
>>>>      pad:
>>>>        landingpad _
>>>>        %p.to_compute_q = load<ty>,<ty>* %p.gc_spill
>>>>        %q.recomputed =<compute %p.to_compute_q + %d>
>>>>        store<ty>    %q.recomputed,<ty>* %q.gc_spill
>>>>        ; continue as normal
>>>>        %p.reload = load<ty>,<ty>* %p.gc_spill
>>>>        %q.reload = load<ty>,<ty>* %q.gc_spill
>>>>        _ = %q.reload
>>>>
>>>> There are a number of redundant loads/stores in that sequence, and I'm
>>>> not sure it's reasonable to generate them and expect them to get cleaned up
>>>> later (especially since "later" is post-optimizer).
>>>>
>>>> So it seems like, for that use case, explicit spilling in RS4GC gets in
>>>> the way more than it helps.  But I'm not sure how important that use case is
>>>> to anybody (Manuel, is this the approach PyPy is taking?), or what would be
>>>> preferable to people for whom it is important: simply to continue using
>>>> gc.relocates on the exceptional path and live with non-token linkage between
>>>> your landingpads and gc.relocates?  A different scheme where RS4GC doesn't
>>>> generate `alloca`s and `store`s and `load`s directly, but rather some family
>>>> of intrinsics like `gc.spill.alloca`/`gc.spill.store`/`gc.spill.load` where
>>>> the conceptual slot's type would be token rather than pointer, so as to be
>>>> unobscurable until lowering?  Something else entirely?
>>>>
>>>>
>>>> I'm interested to hear what others think, about both issues above.
>>>>
>>>> Thanks
>>>> -Joseph
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Philip Reames [mailto:listmail at philipreames.com]
>>>> Sent: Friday, January 22, 2016 3:36 PM
>>>> To: llvm-dev<llvm-dev at lists.llvm.org>; Joseph
>>>> Tremoulet<jotrem at microsoft.com>; Manuel Jacob<me at manueljacob.de>;
>>>> chenli at azulsystems.com; Sanjoy
>>>> Das<sanjoy at playingwithpointers.com>
>>>> Subject: FYI: gc relocations on exception path w/RS4GC currently
>>>> broken
>>>>
>>>> For anyone following along on ToT using the gc.statepoint mechanism, you
>>>> should know that ToT is currently not able to express arbitrary exceptional
>>>> control flow and relocations along exceptional edges. This is a direct
>>>> result of moving the gc.statepoint representation to using a token type
>>>> landingpad.  Essentially, we have a design inconsistency where we expect to
>>>> be able to "resume" a phi of arbitrary landing pads, but we expect
>>>> relocations to be tied specifically to a particular invoke.
>>>>
>>>> Chen, Joseph, and I have spent some time talking about how to resolve
>>>> this.  All of the schemes we've come up with for representing relocations
>>>> using gc.relocates on the exceptional path require either a change to
>>>> how we define an invoke instruction (something we'd really like to
>>>> avoid) or a new intrinsic with special treatment in the optimizer so
>>>> that it basically "becomes part of" the landing pad without actually being
>>>> the landing pad.  None of us were particularly thrilled by the changes
>>>> involved.
>>>>
>>>> Given exceptional paths are nearly by definition cold, we're currently
>>>> exploring another option.  We're considering having RS4GC insert explicit
>>>> spill slots at the IR level (via allocas) for values live along exceptional
>>>> paths, and leaving all of the normal path values represented as
>>>> gc.relocates.  This avoids the need for another IR extension, makes it
>>>> slightly easier to meet an ABI requirement Joseph has, and provides a better
>>>> platform for lowering experimentation. Joseph is working on implementing
>>>> this and will probably have something up for review next week or the week
>>>> after. Once that's in, we're going to run some performance experiments to
>>>> see if it's a viable lowering strategy even without Joseph's particular ABI
>>>> requirement, and if so, make that the standard way of representing
>>>> relocations on exceptional edges.
>>>>
>>>> Assuming this approach works, we're going to defer solving the problem
>>>> of how to cleanly represent explicit relocations along the exceptional path
>>>> until a later point in time.  In particular, the value of the explicit
>>>> relocations comes mainly from being able to lower them efficiently to
>>>> register uses.  Since the work to integrate relocations with the register
>>>> allocator hasn't happened and doesn't look like it's going to happen in the
>>>> near term (*), this seems like a reasonable compromise.
>>>>
>>>> Philip
>>>>
>>>> (*) To give some context on this, it turns out one of our initial
>>>> starting assumptions was wrong in practice.  We expected the quality of
>>>> lowering for the gc arguments at statepoint/safepoint to be very important
>>>> for overall code quality.  While this may some day become true, we've found
>>>> that whenever we encounter a hot safepoint, the problem is usually that we
>>>> didn't inline appropriately.  As a result, we've ended up fixing (out of
>>>> tree) inlining or devirtualization bugs rather than working on the lowering
>>>> itself. For us, a truly hot megamorphic call site has turned out to be a
>>>> very rare beast.  Worth noting is that this is only true because we're a
>>>> high tier JIT with good profiling information.  It's likely that other users
>>>> who don't have the same design point may find the lowering far more
>>>> problematic; in fact, we have some evidence this may already be true.
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list

-- 
Sanjoy Das
http://playingwithpointers.com