[llvm-dev] Creating a virtual machine: stack, regs alloc & other problems

Fri Aug 7 15:27:34 PDT 2015


On 08/07/2015 02:35 PM, Alex Nordwood wrote:
> Hello.
>
>>> It is a stack VM, and one designed to utilize all the advantages of the assembly language implementation.
>> This sounds very, very familiar.  Are you willing to share which
>> VM/language you're working on?
> I would like to because I think the additional context would be helpful...let me ask for permission first.
>
>>> doing this using C (by C function calls, CPS) led to significant performance loss.
>> I assume the continuation is a tail call?  If so, have you examined
>> where the performance loss originated?  An obvious tail call in C code
>> being compiled by modern Clang should be code generated as a tail call.
>> You might be stumbling across an implementation limit and adjusting your
>> input slightly might bypass that.
> Yes, both the executor and its continuation are tail calls. And we tried gcc 4.8, 4.9, 5.1
> and clang 3.6 C compilers. All of them infer a tail call if the function call is explicit. But if
> a continuation function pointer is popped from the VM stack, none of them was able to produce a jump,
> leading to machine stack overflow. Not sure why clang wasn't able to do this, because with
> LLVM IR it works (using test code, of course).
I would suggest looking into this.  The smallest C reproducer which 
doesn't get a tail call would be interesting to see and might get fixed.  :)
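
As a starting point, here is a minimal sketch of what such a reproducer might look like. It is not taken from the VM under discussion: the names `struct vm`, `vm_cont`, `vm_push`, `vm_step` and `vm_run` are hypothetical, and the stack is a fixed 16-entry array to keep the example small. The interesting line is the indirect call at the end of `vm_step`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical VM state: a stack of continuation pointers plus a counter. */
struct vm;
typedef void (*vm_cont)(struct vm *);

struct vm {
    vm_cont stack[16];   /* continuations waiting to run */
    size_t  top;         /* number of live entries in stack */
    long    acc;         /* units of work done so far */
};

static void vm_push(struct vm *vm, vm_cont k) {
    assert(vm->top < 16);
    vm->stack[vm->top++] = k;
}

/* Executor: do one unit of work, then pop the continuation from the VM
 * stack and call it in tail position. The target is only known at run time. */
void vm_step(struct vm *vm) {
    vm->acc += 1;
    if (vm->top == 0)
        return;                        /* nothing left to run */
    vm_cont k = vm->stack[--vm->top];
    k(vm);                             /* the call we would like to become a jump */
}

/* Helper for experiments: queue n copies of vm_step (n <= 16) and run them. */
long vm_run(int n) {
    struct vm vm = { .top = 0, .acc = 0 };
    for (int i = 0; i < n; i++)
        vm_push(&vm, vm_step);
    vm_step(&vm);
    return vm.acc;
}
```

Compiling this with something like `clang -O2 -S` and checking whether the call through `k` in `vm_step` comes out as a call or a jump would be one way to narrow things down before filing a bug.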

One thing you might be running into if your VM is in C vs C++ is that 
C++ pointers-to-member functions aren't just function pointers.
>
>>> We are considering extending LLVM by creating a special calling convention which
>>> forces a function (using this convention) to pass args in registers and to
>>> be force tail-call optimized.
>> You absolutely will need a custom calling convention for the register
>> assignments and such.  If your source IR uses musttail, in principle you
>> shouldn't need to do anything special for the tail calls provided you're
>> running on one of the architectures where musttail has been implemented.
> musttail has some limitations (e.g., the caller and callee prototypes must match), but tail
> with -tailcallopt works just fine.
> Looks like we were on the right track with that question. Thanks!
>
>> Creating a new calling convention is easy.  It will require a custom
>> LLVM build to get started since you have to change td and cpp files in
>> the target.  For examples, see the existing ones in
>> Target/X86/X86CallingConv.td
> Yes, we already have a custom build. Thanks!
>
>>> 2. Because the existing VM runtime is written in x86 assembly, and doesn't do function calls, it uses ESP register for VM stack purposes
>>> (again, it is not in use for low-level calls). We want to do the same.
>> This will be tricky... Do you absolutely absolutely need this?
> We still have to support x86 32-bit and this arch has a lack of GP registers, and
> a) esp register becomes almost unused,
> b) we will have to do stack operations using other regs, which may lead to more spilling.
> So, it's good to have the esp reg doing what it has to...but for the VM stack.
> We don't want to keep the original x86 assembly version along with our new LLVM-based one (hopefully we will make it).
How does your current runtime track spills inserted by the compiler?  Is 
that integrated with the VM stack?  Or is that a distinct stack?  If you 
didn't mind spills being interwoven with vm frames, you could model the 
vm stack operations as dynamic allocas potentially.


>
>>> We think that it could be implemented as intrinsics as well? Or perhaps we should create intrinsics for arbitrary machine stack access?
>>> We tried, for example, stacksave-sub-store-stackrestore sequence, but it never folds into a single push operation.
>> I would suggest just implementing your virtual machine stack as a normal
>> bit of memory.  There's no reason that the compiler needs to know that
>> this is the VM stack versus some other buffer.  You will need to provide
>> aliasing facts, but that's a much smaller extension*.  Trying to muck
>> with the frames at runtime using the intrinsics is going to end very badly.
>> * Don't underestimate this point.  Providing aliasing metadata will be
>> *really* important for this scheme to work reasonably well.  You may
>> need to add custom extensions locally or propose extensions upstream to
>> encode the information you need.
> Could you please be more specific? Pointing to some docs or examples will work just fine. :)
See LangRef.  Search for noalias, tbaa, invariant.load, inbounds, 
readonly, readnone, argmemonly, nonnull...

See the alias analysis docs and the TBAA pass as an example of how to 
write and integrate a custom AA pass.
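
To make the idea concrete at the C level: the closest source-level analogue of `noalias` is `restrict`, which Clang lowers to a `noalias` attribute on pointer parameters. The sketch below is only an illustration under that assumption; `vm_add_consts` and its parameters are hypothetical names, and the layout (stack pointer pointing one past the top slot, a separate constant pool) is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical handler: add two constants to the value on top of the VM
 * stack. The VM stack and the constant pool are passed as separate
 * restrict-qualified pointers, which promises the compiler that stores
 * through sp never modify what consts points to. */
int64_t *vm_add_consts(int64_t *restrict sp,
                       const int64_t *restrict consts,
                       uint32_t a, uint32_t b) {
    /* sp points one past the top-of-stack slot. */
    sp[-1] += consts[a];
    /* Thanks to restrict, consts[b] may be loaded before the store above. */
    sp[-1] += consts[b];
    return sp;
}
```

The promise has to be true, of course: if the constant pool ever lived inside the VM stack buffer, this would be undefined behaviour rather than an optimization hint.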
>
>> Another option: If you know the size of your vm frame statically, you
>> can emit loads from the vm stack locations into SSA values (or allocas,
>> which will become SSA values) and spill as needed to ensure the VM stack
>> is up to date as required by your language requirements.  This will
>> likely a) decrease your dependence on the pass ordering above, and b)
>> give slightly better results since LLVM is going to have to be
>> conservative about calls into your runtime and your custom lowering gets
>> to use language specific knowledge.
> Hmm.. this sounds interesting. thanks.
> So, if I understand correctly, if we need to allocate a VM stack frame, the idea is to create
> enough allocas, then store there values which need to be in the VM frame?
> But how can it survive optimizing passes?
The allocas specifically shouldn't survive optimization.  That's the 
point.  :)  If the vm stack has been escaped at all the relevant points, 
the spills to the vm stack memory can't be eliminated.  As a result, 
you'd get the effect of having a materialized vm stack when you need it, 
and everything in SSA/execution stack the rest of the time.
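
A rough C-level picture of that pattern, assuming a frame of three slots whose size is known statically. Everything here is hypothetical: `vm_frame3` stands in for a handler, and `vm_runtime_inspect` stands in for any runtime entry point that is allowed to look at the VM stack. Locals in C start life as allocas in Clang's output and are promoted to SSA values by the optimizer, which is the effect described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical runtime hook that may inspect the VM stack (think GC or
 * debugger entry point). Here it just sums the live slots. */
int64_t vm_runtime_inspect(const int64_t *base, int depth) {
    int64_t sum = 0;
    for (int i = 0; i < depth; i++)
        sum += base[i];
    return sum;
}

/* Handler for a VM frame whose size (3 slots) is known statically. */
int64_t vm_frame3(int64_t *frame) {
    /* Load the frame into locals; the optimizer turns these into SSA values. */
    int64_t a = frame[0];
    int64_t b = frame[1];
    int64_t c = frame[2];

    /* Work happens entirely on the locals, with no VM stack traffic. */
    a = a + b;
    c = c * a;
    b = c - b;

    /* Spill: the runtime is about to look at the frame, so it must be
     * up to date. The frame pointer escapes into the call below, so these
     * stores cannot be eliminated. */
    frame[0] = a;
    frame[1] = b;
    frame[2] = c;
    return vm_runtime_inspect(frame, 3);
}
```

The stores before the call are the "materialized vm stack"; between such points the values are free to live in registers.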
> (I assume that we did stackload before and esp points to VM stack)
> Could you please explain more?
I was assuming you still had a separate execution stack and vm stack.  
Mixing the two without letting LLVM spill things between VM stack 
sections would be "interesting".
>>> 3. Since the machine stack is a VM stack, we are not allowed to use alloca. It's not a problem, but the machine register allocator/spiller
>>> can still use the machine stack for register spilling purposes.
>>> How could this be solved? Should we provide our own register allocator? Or could it be solved by providing a machine function pass,
>>> which will run on a function marked with our calling conv and substitute machine instructions responsible for spilling to stack
>>> with load/store instructions to our heap memory?
>> I don't understand what you're trying to ask here.  If you can spill to
>> the machine frame (instead of the VM stack frame), what's the problem?
> I mean that if we do stackload and the machine stack points to the VM stack (and we
> somehow solved the problem above), LLVM still wants to spill regs to stack.
> It would be good to have the spill slots in the VM context, but not in the stack
> (neither machine nor VM).
This is going to be a really problematic design point.  The entire LLVM 
backend assumes it owns the execution stack.  That's a really really 
built in assumption.  Trying to change that would be extremely 
challenging.  (See comment below)
>
>>> Thank you for your time.
>> At a meta level, let me give my standard warning: implementing a
>> functional compiler for a language of your choice on LLVM is relatively
>> easy; you're looking at around 1-3 man years of effort depending on the
>> language.  Implementing a *performant* compiler is far, far harder.
>> Unless you're willing to budget for upwards of 10 man years of skilled
>> compiler engineering time, you may need to adjust your expectations.
>> How hard the problem will be also depends on how good your current
>> implementation is of course.  :)
> I appreciate the warning.
> Strictly speaking, we don't implement the compiler itself. It's only a runtime
> for interpreting bytecodes compiled earlier. JIT is upcoming, but not for now.
> It also doesn't include a memory manager - that works well enough written in C.
> Our task is really not easy, but not so hard as you think. :) We think. :)
Wait, what?  I think I got something confused at some point.  All of my 
answers above were with a JIT in mind.  :)  Doing an interpreter is a 
slightly different beast.

For the record, trying to not have an execution stack for spilling just 
started seeming a lot more sane.  :)  I would still start with a design 
that uses an extra (ESP) register for the execution stack, but tuning 
the IR for an interpreter to not spill or changing the base pointer to 
be something special with a fixed scratch pad seems approachable.  
Dealing with a restricted bit of code is much more approachable than 
whatever a JIT might emit.  :)  Still challenging though.
>
>> To give a flavour for the tuning involved, you might find this document
>> helpful:
>> http://llvm.org/docs/Frontend/PerformanceTips.html
>> If you're serious about the project, I highly recommend that you make an
>> effort to attend the developers conference in Oct.  You'll want to have
>> a bunch of high bandwidth conversations with people who've been down
>> this road before and email just doesn't quite work for that.
> Thanks! We will keep this in mind.