[LLVMdev] Proposal: Loads/stores with deterministic trap/unwind behavior

Peter Collingbourne peter at pcc.me.uk
Sun Apr 6 21:53:29 PDT 2014


Hi Filip,

Thanks for sharing these insights.

Since I've already started on an implementation, what I think I'll do is
experiment locally and see if these new instructions give a performance
improvement in the context of the compiler I'm working on -- I'll probably
need to implement explicit null checks anyway for the sake of supporting
targets that don't support unwinding through signal handlers. Once I have
some numbers I'll re-evaluate.

Thanks,
Peter

On Sun, Apr 06, 2014 at 09:06:49PM -0700, Filip Pizlo wrote:
> Hey Peter,
> 
> I've now had multiple opportunities to evaluate the benefit of virtual-memory-based null checks versus explicit branches.  Here are some of the papers where we did this.  In all cases, we found that explicit null check branches are just as fast as using a virtual memory trap.
> 
> http://www.filpizlo.com/papers/baker-ccpe09-accurate.pdf    (Tomas stumbled on this observation while comparing OpenVM's C backend to its C++ backend.)
> http://www.filpizlo.com/papers/pizlo-eurosys2010-fijivm.pdf   (I designed Fiji VM around never using virtual-memory-based optimizations and then I evaluated the cost of null check branches.  They were free.)
> 
> I've also done this experiment in WebKit.  In JavaScript the equivalent thing is a cell check, and we turn them into explicit branches.  We of course emit a minimal set of such branches through GVN, LICM, etc.  After that, these branches are essentially free.
> 
> In all cases, I've found that on modern hardware, the explicit null check branches have essentially zero cost.  Redundant null checks are trivial to eliminate using any serious compiler infrastructure, LLVM included.  The remaining null checks are trivially predicted by the hardware.
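The explicit-branch scheme described above amounts to the following pattern (names like `load_field` and the runtime hook are hypothetical; a real VM's slow path would throw the language-level exception):

```c
#include <stddef.h>

/* Sketch of explicit null-check lowering: a compare-and-branch guards
 * each load.  GVN/LICM-style passes delete redundant checks, and the
 * surviving branches are almost never taken, so the hardware predicts
 * them essentially for free. */
static int null_check_failures;          /* stand-in for throwing */

static void runtime_null_check_failed(void) {
    ++null_check_failures;               /* a real VM throws here */
}

int load_field(const int *obj) {
    if (obj == NULL) {                   /* the explicit null check */
        runtime_null_check_failed();
        return 0;
    }
    return *obj;
}
```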
> 
> To me it feels like you're proposing to add new instructions to the IR to support something that isn't a worthwhile optimization.
> 
> (Also, this is specific to languages where the only non-pointer value that we might attempt to load from is null.  It wouldn't work for languages like JavaScript/Ruby/Python/etc where we may attempt to load from an integer or double that uses a high-bit type tag, for instance.  So this is an optimization that really only applies to Java and Go, and in Java it's not clear that your IR gives you what you want - Java is basically always JITed and you'd want to deoptimize on null.)
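The tagged-value point can be made concrete with a sketch. The high-bit tagging scheme below is purely illustrative (real engines use NaN-boxing or low-bit tags, with varying layouts): a non-pointer value is an arbitrary bit pattern rather than 0, so dereferencing it would fault, if at all, at an address far outside any guard page, and the cell check must stay an explicit branch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical high-bit tagging: values with the top bit set are
 * immediates (ints/doubles), not heap pointers.  A zero-page trap
 * cannot distinguish "cell check failed" from a stray bad address,
 * so the check below cannot be replaced by a virtual-memory trap. */
#define TAG_IMMEDIATE (UINT64_C(1) << 63)

static bool is_heap_cell(uint64_t v) {
    return (v & TAG_IMMEDIATE) == 0;   /* explicit cell check */
}
```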
> 
> -Phil
> 
> 
> On April 6, 2014 at 8:54:39 PM, Peter Collingbourne (peter at pcc.me.uk) wrote:
> 
> On Sat, Apr 05, 2014 at 12:21:17AM -0700, Andrew Trick wrote:
> >  
> > On Mar 31, 2014, at 6:58 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
> >  
> > > Hi,
> > >  
> > > I wanted to propose an IR extension that would allow us to support zero-cost
> > > exception handling for non-call operations that may trap. I wanted to start
> > > with loads and stores through a null pointer, and later we might extend this to
> > > div/rem/mod by zero. This feature is obviously useful for implementing languages
> > > such as Java and Go which deterministically translate such operations into
> > > exceptions which may be caught by the user.
> > >  
> > > There are a couple of somewhat orthogonal features that this would entail:
> > >  
> > > 1) Deterministic handling for loads and stores through a null pointer.
> > > 2) Ability to unwind a load/store to a specific basic block, like invoke.
> > >  
> > > At the moment, we do not exactly have 1), as the optimizer considers
> > > non-volatile loads/stores through a null pointer to have undefined
> > > behavior. Volatile loads/stores are closer, but they come with their own
> > > set of baggage that can inhibit optimization. (For example, if we can prove
> > > that a load would always succeed, 'volatile' prevents us from reordering
> > > the load or deleting it if it is dead.) So I propose to add an attribute to
> > > 'load' and 'store', which we can call, say, 'nullcheck', with the following
> > > additional semantics:
> > >  
> > > - If the memory address is between zero and a target-defined value (i.e. the
> > > size of the zero page) the instruction is guaranteed to trap in a
> > > target-defined manner.
> > >  
> > > - The optimizer may only delete or reorder nullcheck instructions if the
> > > program cannot observe such a transformation by installing a signal handler
> > > for the trap. Therefore, the optimizer would be able to drop the attribute
> > > if it can prove that the address will always be non-null.
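The first guarantee can be restated as a simple address-range predicate. In this sketch, `guard_size` stands for the target-defined zero-page size (4 KiB is assumed in the test, purely for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* The 'nullcheck' trap guarantee as a predicate: an access traps
 * exactly when its address falls inside the target's unmapped zero
 * page.  Covering [0, guard_size) rather than just address 0 is what
 * lets the scheme handle getfield-style loads at small constant
 * offsets from a null base pointer. */
static bool nullcheck_must_trap(uintptr_t addr, uintptr_t guard_size) {
    return addr < guard_size;
}
```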
> > >  
> > > To support 2), I propose a couple of new instructions. I haven't come up with
> > > great names for these instructions, but:
> > >  
> > > - 'iload' is to 'load' as 'invoke' is to 'call'. That is, the instruction is
> > > a terminator and has normal and unwind destinations. e.g.
> > >  
> > > %v = iload i8* %ptr to label %try.cont unwind label %lpad
> > >  
> > > - Similarly, 'istore' is to 'store' as 'invoke' is to 'call'.
> > >  
> > > istore i8 %v, i8* %ptr to label %try.cont unwind label %lpad
> > >  
> > > These instructions always have 'nullcheck' semantics, plus:
> > >  
> > > - If the instruction traps and the program has installed a signal handler
> > > for the trap which unwinds, the unwind is guaranteed to land at the
> > > landing pad.
> > >  
> > > I've been working on an implementation of 'iload' and 'istore' which are
> > > in the attached patches, if you are interested. (They aren't ready to go
> > > in yet.) I have asm parsing/printing for both, and code generation for
> > > 'iload'. Would be interested in getting feedback on code generation as this
> > > is my first serious foray into the backend -- I haven't tried running the
> > > generated code yet and the DAG builder is a mashup of the DAG builders for
> > > 'invoke' and 'load', but I've eyeballed the asm it generates (e.g. llc produces
> > > iload-exception.s for the attached iload-exception.ll) and it looks reasonable.
> >  
> > Hi Peter. All due respect, I don’t think it’s right to introduce new load/store instructions with equivalent semantics to and+icmp+br+load/store.
> 
> I don't think the new instructions have equivalent semantics. If the null check
> fails with the iload/istore instructions, we need to throw the appropriate
> language-specific exception and evaluate it against the landing pad. There
> may also need to be an active exception that can be resumed.
> 
> As far as I can tell, the frontend would need to emit IR that calls the
> language runtime to manually throw the exception. This IR would need to be
> recognized by the IR optimization that converts the icmp+br+load/store to
> a checked load/store. It seems to me that it would be simpler to just start
> with the checked load/store.
> 
> > ‘invoke’ is different. It is needed because there is no way for the caller to explicitly check for an exception.
> >  
> > We do introduce intrinsics that encapsulate things like overflow checks. This is done to eliminate control flow edges that tend to inhibit LLVM’s optimizer and instruction selection. But you’re not removing the control flow, so this technique does not apply. Null checks should actually be exposed in IR so general optimizations can remove redundant checks.
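The overflow-check intrinsics mentioned above surface in C as the GCC/Clang `__builtin_add_overflow` family (assuming one of those compilers). The contrast is that the intrinsic fuses the check into the arithmetic, so instruction selection can consume the CPU's overflow flag directly, whereas a null check necessarily keeps its compare against the address:

```c
#include <limits.h>
#include <stdbool.h>

/* An llvm.sadd.with.overflow-style check as exposed in C.  The
 * builtin computes the sum and reports overflow in one operation;
 * only the branch consuming the result remains as control flow. */
static bool checked_add(int a, int b, int *out) {
    return !__builtin_add_overflow(a, b, out);  /* true on success */
}
```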
> 
> My idea for removing redundant checks is to teach the IR optimizer to
> treat iloads/istores as if they were null checks. Is there any reason
> why this wouldn't work?
> 
> > Ideally this would just be a machine code pass that can hoist a load/store above a branch and nuke the compare. However, I can see how it’s easier to relate the compare operand to the address arithmetic at IR level.
> >  
> > To do this at IR level, you could introduce a pre-CodeGen pass that converts cmp+br+load/store into a checked load intrinsic. Since this intrinsic only exists for instruction selection, the optimizer doesn’t need to know about it.
> 
> I did initially consider implementing the checked load/store as an
> intrinsic. But there are relatively boring reasons why this won't work at
> present. For example, there is no current way to represent a struct load
> using an intrinsic, as there is no mangling for struct types. Also, named
> structs would need a mangling that is resistant to renames. Rather than
> solve these problems, I decided to avoid intrinsics entirely.
> 
> > The intrinsic would need to be lowered to an MI pseudo-instruction that feels like a load/store to the backend, but is a terminator. During code emission you could grab the address of that pseudo load/store and its resulting branch target to inform the runtime.
> 
> As far as I know, a load can lower to multiple machine instructions. This
> will definitely be the case for the Go frontend that I'm working with, as
> its IR tends to use struct loads/stores quite frequently. So I'm not sure
> if this will work. I think it needs to look a lot like how the lowering for
> invokes currently looks, with a pair of EH_LABELs around a set of ordinary
> load/store MIs -- which is how I implemented it.
> 
> Thanks,
> --  
> Peter
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-- 
Peter
