[LLVMdev] RFC: GEP as canonical form for pointer addressing

Andrew Trick atrick at apple.com
Tue Feb 18 13:13:57 PST 2014


On Feb 18, 2014, at 11:29 AM, Philip Reames <listmail at philipreames.com> wrote:

> 
> On 02/17/2014 02:31 AM, David Chisnall wrote:
>> On 15 Feb 2014, at 23:55, Andrew Trick <atrick at apple.com> wrote:
>> 
>>> On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at philipreames.com> wrote:
>>> 
>>>> RFC: GEP as canonical form for pointer addressing
>>>> 
>>>> I would like to propose that we designate GEPs as the canonical form for pointer addressing in LLVM IR before CodeGenPrepare.
>>>> 
>>>> Corollaries
>>>> 1) It is legal for an optimizer to convert ptrtoint+arithmetic+inttoptr sequences to GEPs, but not vice versa.
>>>> 2) Input IR which does not contain inttoptr instructions will never contain inttoptr instructions (before CodeGenPrepare).
>>>> 
>>>> I've spoken with Nick Lewycky & Owen Anderson offline at the last social.  On first reflection, both were okay with the proposal, but I'd like broader buy-in and discussion.  Nick & Owen, if I've accidentally misrepresented our discussion or you've had second thoughts since, please speak up.
>>> FWIW, I think it would be nice if standard optimization passes have this property of being well behaved with respect to pointer types, and I don’t see a good reason for canonical IR passes to lose pointer types. I also think it’s the only way to mix the optimization of pointer values with precise GC. It seems that you just want LLVM developers to generally agree that certain passes will be well behaved (you can disable any others). It may just be a matter of documenting those passes. Ideally we could formalize this by declaring a pass as pointer-safe and verifying. Can we easily verify that no memory access is based on inttoptr?
>> Not directly related, but our canonical form for loops involving pointers[1] turns a loop that contains a GEP with the loop induction variable into a GEP with the increment inside the loop.  This has two annoying properties for code generation:
>> 
>> - The GEP with the induction variable as the offset maps cleanly to CPU addressing modes and so we generate better code if we don't do this canonicalisation, and therefore end up trying to undo it in the back end (yuck).
>> 
>> - If the source is the start of an object, then this behaviour is GC-hostile because it means that IR that contains a pointer to an object start now only contains a pointer to the middle, requiring the GC to deal with inner pointers.
>> 
>> It would be nice if we could have canonical forms such that if the front end ensures that there are no inner pointers without pointers to the object's start in the IR, the optimisers don't break this.
> While I agree that from a stylistic point of view this would be an improvement, we don't actually *need* this to support precise GC.  It would definitely result in cleaner code generation than our current scheme though.
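
For readers who want the two loop forms David describes spelled out, here is a rough C-level analogy. It is only a sketch: the function names are made up for illustration, and it is C source rather than the IR any particular pass emits. The indexed form keeps the object start live and computes base + i on each iteration; the pointer-increment form carries only a moving pointer, which points into the middle of the object after the first iteration.

```c
#include <stddef.h>

/* Indexed form: every access is expressed as base + i, so `base`
   (the object start) is referenced on each iteration. */
long sum_indexed(const long *base, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += base[i];
    return total;
}

/* Pointer-increment form: the loop body only uses the moving pointer
   `p`. After the first iteration `p` is an interior pointer. */
long sum_pointer_bump(const long *base, size_t n) {
    long total = 0;
    const long *end = base + n;   /* one past the end, legal in C */
    for (const long *p = base; p != end; ++p)
        total += *p;
    return total;
}
```

Both functions compute the same sum; the difference that matters for this thread is which pointer values are live inside the loop.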

I’m not opposed to preserving the GEP’s original base as a matter of convention when there’s no good reason not to. But in general, passes expect to be able to break up GEPs into smaller steps. We can’t guarantee that the original base will be directly referenced at every point of use.
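
As a loose C analogy for breaking an address computation into smaller steps (a sketch only; the struct and function names are hypothetical, and this is not actual pass output), the same final address can be written in one expression rooted at the base, or built in stages where the last step no longer mentions the base at all:

```c
#include <stddef.h>

struct item { int key; int vals[4]; };

/* One-step form: the address is written directly in terms of `base`. */
int *field_direct(struct item *base, size_t i, size_t j) {
    return &base[i].vals[j];
}

/* Split form: same address, built in stages. The final step refers
   only to an intermediate pointer, not to `base`. */
int *field_split(struct item *base, size_t i, size_t j) {
    struct item *elem = base + i;   /* step 1: select the element */
    int *vals = elem->vals;         /* step 2: select the field */
    return vals + j;                /* step 3: index within the field */
}
```

Both functions return the same pointer. In the split form, a later use that sees only `vals + j` has no direct reference to the original base.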

We should certainly avoid generating out-of-bounds GEPs without retaining some in-bounds pointer, because that would break everyone’s conservative GC as well.

-Andy


