Patch: StackMap: Use IndirectMemRefOp more.

Tue Nov 25 18:21:38 PST 2014

Finally getting back to this thread; sorry for letting it sit for a 
while without response.

On 11/07/2014 03:22 AM, Gustaw Smolarczyk wrote:
> 2014-11-07 4:45 GMT+01:00 Andrew Trick <atrick at apple.com>:
>>> On Nov 4, 2014, at 3:21 PM, Gustaw Smolarczyk <wielkiegie at gmail.com> wrote:
>>>
>>> Hi Philip,
>>>
>>> 2014-10-30 22:33 GMT+01:00 Philip Reames <listmail at philipreames.com>:
>>>> Gustaw,
>>>>
>>>> I looked over part of your first patch and like the general direction you're
>>>> taking.  Before getting in to the details, I want to ask a couple of
>>>> question about your intent.
>>>>
>>>> It seems like your arbitrary base register matching could easily match
>>>> non-stack locations.  I'm assuming this is by design?  What's the original
>>>> source construct you're hoping to represent better?
>>>>
>>>>  From the code, I'm guessing this would match things like:
>>>> struct S { int a; }
>>>> S* s = ...
>>>> patchpoint(..., s->a, ...)
>>>>
>>>> Is this what you're aiming for?
>>> In general, yes. This is the kind of code I was hoping to optimize the
>>> references to. I wanted not to waste registers/stack spill locations
>>> for the values that are already in some structure, especially when
>>> many entries of the same structure need to be referenced.
>> Thanks for sharing your use case and patch. It's really interesting.
>>
>>> My original problem involved a processor emulator and a state
>>> structure with a register file inside. Each LLVM function would
>>> represent a sequence of instructions that would be JIT compiled and
>>> executed more efficiently than going through interpreter. However, I
>>> still wanted to have precise interrupts without any large penalty for
>>> a common case. A memory access instruction could fail and fire a
>>> SIGSEGV signal, and the signal handler would need to reconstruct the
>>> emulated CPU state from host CPU state. I wanted to use the patchpoint
>>> intrinsic in order to do that (I would need to emit the memory access
>>> instruction in assembly by myself in the patchpoint area).
>>> Unfortunately, referencing the large register file using the current
>>> implementation would be prohibitively expensive - I don't want to
>>> touch any cold registers from the register file at all.
>> I'm not sure what "cold" registers are. Maybe they're registers that you know by convention are not live into the region and have never been written to.
> They are the registers in the file that had not been written to. Other
> registers are "live" in LLVM values and would be written back to the
> register file at the end of the LLVM function representing a sequence.
> This way I don't have to write them to the register file before any
> faulting instruction.
>
>>> Later on (after I submitted this patch) I came up with a very simple
>>> solution. I don't really have to reference the cold registers. I know
>>> their location and know which ones are cold so I only need to
>>> reference already used registers (that would currently be "in flight"
>>> in LLVM values). So for my particular use case, there really doesn't
>>> seem to be a need for this patch anymore. For others, just referencing
>>> the pointer to the structure could be a simple workaround.
>> Yes. It is the runtime's job to just store a reference to S. That's opaque to LLVM.
>>
>>> However, it
>>> won't always be possible or desirable for the runtime to track which
>>> values would benefit from such treatment and more dynamic approach
>>> like the one this patch introduces might be beneficial.
>> I'm glad you have a workaround. I'm not clear on the solution. Are you still using a stack map to find the values that need to be in each register?
> Only live registers are now referenced by the patchpoint intrinsic. I
> couldn't use stackmaps, since I need to know their location during the
> whole memory (or other faulting) instruction.
> An emulated processor memory read instruction (at address pointer by
> register r10) could be implemented as an LLVM read:
>
>      %ptr.8 = getelementptr i8* %memory_base, i32 %r10
>      %ptr = bitcast i8* %ptr.8 to i32*
>      %val = load i32* %ptr
>
> If I put the stackmap intrinsic just before or just after the load
> instruction, it would still be possible for the stackmap references to
> be wrong during the load instruction (when the page fault happens).
> The (not so clean) solution I came up with is the following:
>
>      %val.64 = call anyregcc @llvm.experimental.patchpoint.i64(i64 123,
> i32 8, i8* null, i32 2, i8* %memory_base, i32 %r10, <other live
> regs>...)
>      %val = trunc i64 %val.64 to i32
>
> And immediately after compilation I would put an x86 instruction into
> the patched area that would load i32 from %memory_base+%r10 into
> %val.64 (they would all be in registers). This way, I am 100% sure
> where the live registers are if this load instruction faulted.
This sounds like a very workable approach.  An alternate idea would be 
to introduce a "faulting load" intrinsic.  We've talked a bit about that 
before, but it's never gotten too much traction.  I'm not suggesting you 
actually switch to such an approach.  :)
>
>>>> More generally, any information about motivation you can provide would be
>>>> appreciated.  Once we have that, we can think about the best approach for
>>>> the problem.
>>>>
>>>> If I'm understanding this properly, I think we'd need to make a few
>>>> adjustments to make this viable.  First, we probably need a new operand
>>>> type.  Indirect is documented as being a stack spill, which this is not.
>>>> Second, we need some type of flag to control the introduction of such
>>>> operands.  I know that my runtime (for one) would need to be extended to
>>>> handle such cases.  Having a way to disable this is a must.
>>> Of course, I understand the need for the new kind of reference. The
>>> name "IndirectMemRefOp" seems to be a bit misleading (I originally
>>> assumed, the LLVM could do what this patch implements), it would be
>>> better to call it "SpilledMemRefOp", or something like that. If
>>> renaming the original type is not possible, then we would need to
>>> invent a new name for this reference kind, like "FreeIndirectMemRefOp"
>>> or simply "MemRefOp”.
>>  From the point of view of stack maps, they are not "MemRefs" at all. They are just stack slots. But they are naturally represented as MemRefs in the machine instruction.
>>
>> The LLVM language reference should make it clear what a Direct vs. Indirect location is. They are both stack slots, but are interpreted differently by the runtime.
>> http://llvm.org/docs/StackMaps.html
> The naming confused my a little, but in the end they all reference
> "stack locations" (like automatic C/C++ variables, which could live on
> stack or in registers at any given time). My proposal would extend
> their scope beyond just these stack locations and make them reference
> any memory. But maybe complicating an already complicated interface
> would not be beneficial enough for that to make sense, as I see it
> now.
>
>>>> Once we've gotten those issues settled, splitting the SelectionDAGBuilder
>>>> and FastIsel changes for review would probably be a good idea.  That's
>>>> really up to the preferences of Juergen since he's most familiar with these
>>>> pieces.
>>>>
>>>> Philip
>>> I understand that this idea is not that simple to implement properly.
>>> The biggest problem is the register allocator. There needs to be a way
>>> to spill the IndirectMemRefOps when it runs out of registers for the
>>> base register reference (it would need to change to a stack spill).
>>> The current implementation that counts unique values for
>>> IndirectMemRefOp references is not particularly clean. Also, the
>>> FrameIndex nodes need to be handled properly, since they are simple
>>> offsets in the resulting code. If I understand correctly, they need an
>>> already allocated stack in order to resolve them to offsets, after
>>> which point any spill is impossible... The logic would be similar to
>>> the DirectMemRefOps handling.
>>>
>>> I am not familiar enough with the LLVM code in order to solve these
>>> problems. If you are still interested in merging this code, I would
>>> appreciate if you provided me with some pointers.
>> Direct stackmap entries are currently used for a special case where both an alloca and stackmap are positioned in the entry block and we don't allow spilling. The semantics are different than normal stackmap values and I want to add a new intrinsic just for this special case to avoid confusion.
>>
>> More generally, direct stackmap entries could be used for encoding the address of a stack slot without requiring the address to be stored in a stack slot or register. This is a minor feature that provides a little flexibility in encoding stack map locations. There's no need to keep a base register alive, as we always assume a frame pointer is available.
>>
>> I don't think it's LLVM's job to do what you were trying to do. You did the right thing by working around it in the runtime.
> You might be right, since I cannot come up with a good enough
> justification for complicating the interface now.
> Thank you for the explanation.
>
> Regards,
> Gustaw
>
>> -Andy
>>
>>> Regards,
>>> Gustaw
>>>
>>>>
>>>> On 09/21/2014 12:33 PM, Gustaw Smolarczyk wrote:
>>>>
>>>> Hello once again,
>>>>
>>>> I managed to create a second version of my patch. This time, I don't
>>>> disallow FrameIndex from the IndirectMemRefOp optimization. Instead, I track
>>>> the number of unique SDValues with the reasoning that each of them will
>>>> require a register. In case the number of them grows too much, I fall back
>>>> to the old logic that handles registers spills gracefully (instead of the
>>>> register allocator error).
>>>>
>>>> Regards,
>>>> Gustaw
>>>>
>>>> 2014-09-21 19:11 GMT+02:00 Gustaw Smolarczyk <wielkiegie at gmail.com>:
>>>>> Hello all.
>>>>>
>>>>> This is my first patch to the LLVM project. I hope you will find it
>>>>> useful.
>>>>>
>>>>> This patch improves the experimental stackmap and patchpoint intrinsics.
>>>>> It tries to use the IndirectMemRefOp in more cases. Currently, they are
>>>>> only emitted for values spilled to a local stack. However, they could be
>>>>> applied for any [Register + Offset] memory load.
>>>>>
>>>>> I had problems with indirect FrameIndex references. Because my code can't
>>>>> retrieve the offset of such a memory reference, it allocates a register for
>>>>> every such access, which caused one of the tests to fail. This is why I
>>>>> disabled it for this case.
>>>>>
>>>>> I am not very familiar with the LLVM code base. I might have done
>>>>> something in a roundabout or simply incorrect way. I would be glad if I had
>>>>> been given some pointers.
>>>>>
>>>>> Regards,
>>>>> Gustaw
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>
>>>>