[LLVMdev] RFC: GEP as canonical form for pointer addressing

Mon Feb 17 02:31:58 PST 2014

On 15 Feb 2014, at 23:55, Andrew Trick <atrick at apple.com> wrote:

> On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at philipreames.com> wrote:
> 
>> RFC: GEP as canonical form for pointer addressing
>> 
>> I would like to propose that we designate GEPs as the canonical form for pointer addressing in LLVM IR before CodeGenPrepare.
>> 
>> Corollaries
>> 1) It is legal for an optimizer to convert ptrtoint+arithmetic+inttoptr sequences to GEPs, but not vice versa.
>> 2) Input IR that contains no inttoptr instructions will still contain none after optimization (before CodeGenPrepare).
>> 
>> I've spoken with Nick Lewycky & Owen Anderson offline at the last social.  On first reflection, both were okay with the proposal, but I'd like broader buy-in and discussion.  Nick & Owen, if I've accidentally misrepresented our discussion or you've had second thoughts since, please speak up.
> 
> FWIW, I think it would be nice if standard optimization passes have this property of being well behaved with respect to pointer types, and I don’t see a good reason for canonical IR passes to lose pointer types. I also think it’s the only way to mix the optimization of pointer values with precise GC. It seems that you just want LLVM developers to generally agree that certain passes will be well behaved (you can disable any others). It may just be a matter of documenting those passes. Ideally we could formalize this by declaring a pass as pointer-safe and verifying. Can we easily verify that no memory access is based on inttoptr?
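
For anyone skimming the thread, here is a rough C-level sketch of the two addressing styles that corollary 1 in the quoted RFC distinguishes. The function names are hypothetical and chosen only for illustration. A front end would typically lower the first to a ptrtoint, an add and an inttoptr, and the second to a single GEP, though the exact IR depends on the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Integer round trip: roughly ptrtoint, integer add, inttoptr.
   The result no longer visibly derives from `base` as a pointer. */
int *elem_via_integer(int *base, size_t i) {
    uintptr_t addr = (uintptr_t)base;   /* pointer -> integer */
    addr += i * sizeof(int);            /* plain integer arithmetic */
    return (int *)addr;                 /* integer -> pointer */
}

/* Pointer arithmetic: roughly a single GEP on `base`.
   The result stays typed as a pointer derived from `base`. */
int *elem_via_pointer(int *base, size_t i) {
    return base + i;
}
```

Both functions compute the same address on common platforms; the difference is only in whether the derivation from `base` stays visible as pointer arithmetic.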

Not directly related, but our canonical form for loops involving pointers[1] turns a GEP indexed by the loop induction variable into a pointer that is itself advanced by a GEP on each iteration.  This has two annoying properties for code generation:

- A GEP with the induction variable as the offset maps cleanly to CPU addressing modes, so we generate better code without this canonicalisation and end up trying to undo it in the back end (yuck).

- If the base pointer is the start of an object, this behaviour is GC-hostile: IR that used to contain a pointer to the object's start now contains only a pointer into its middle, so the GC has to deal with inner pointers.
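
As a source-level analogy for the two loop shapes (a sketch only; the function names are made up for illustration, and a C compiler is free to turn either one into the other):

```c
#include <stddef.h>

/* Indexed form: `base` stays live for the whole loop and each
   access is base + i, which is the shape that maps onto a
   base+index addressing mode. */
long sum_indexed(const long *base, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += base[i];
    return total;
}

/* Incremented form: the cursor `p` is advanced every iteration.
   After the first iteration `p` points into the middle of the
   array, and `base` itself is no longer needed inside the loop. */
long sum_bumped(const long *base, size_t n) {
    long total = 0;
    const long *end = base + n;
    for (const long *p = base; p != end; ++p)
        total += *p;
    return total;
}
```

In the second form, once the loop is running, the only pointers the body needs are `p` and `end`, neither of which points at the object's start. That is the inner-pointer situation described in the second bullet.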

It would be nice to have canonical forms such that, if the front end ensures that every inner pointer in the IR is accompanied by a pointer to the start of its object, the optimisers don't break this.

David

[1] Are canonical forms actually documented anywhere, or are they simply undocumented implicit contracts?