[LLVMdev] RFC: GEP as canonical form for pointer addressing

Philip Reames listmail at philipreames.com
Fri Feb 14 17:18:21 PST 2014


RFC: GEP as canonical form for pointer addressing

I would like to propose that we designate GEPs as the canonical form for 
pointer addressing in LLVM IR before CodeGenPrepare.

Corollaries
1) It is legal for an optimizer to convert ptrtoint+arithmetic+inttoptr 
sequences to GEPs, but not vice versa.
2) Input IR which does not contain inttoptr instructions will not have 
inttoptr instructions introduced into it by optimization passes (before 
CodeGenPrepare).
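
To make the two forms concrete, here is a minimal C++ sketch of the same 
address computed both ways.  The function names are hypothetical, and the 
lowering described in the comments is what Clang typically emits rather 
than anything guaranteed.  The integer round trip also relies on 
implementation-defined pointer/integer conversions that behave as expected 
on mainstream flat-address targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Typed pointer arithmetic.  Clang typically lowers this to a single
// getelementptr, so the result is still visibly derived from `base`.
std::int32_t *element_via_pointer(std::int32_t *base, std::ptrdiff_t idx) {
    return base + idx;
}

// The same address computed through an integer round trip.  Clang typically
// lowers this to ptrtoint, integer multiply/add, then inttoptr, which is
// the form the proposal asks optimizers not to introduce.
std::int32_t *element_via_integer(std::int32_t *base, std::ptrdiff_t idx) {
    std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(base);
    raw += static_cast<std::uintptr_t>(idx) * sizeof(std::int32_t);
    return reinterpret_cast<std::int32_t *>(raw);
}
```

Both functions return the same address; the difference is only in whether 
the intermediate values are recognizable as pointers.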

I've spoken with Nick Lewycky & Owen Anderson offline at the last 
social.  On first reflection, both were okay with the proposal, but I'd 
like broader buy-in and discussion.  Nick & Owen, if I've accidentally 
misrepresented our discussion or you've had second thoughts since, 
please speak up.


Background & Motivation

We want to support precise garbage collection(1) in LLVM.  To do so, we 
have written a pass which inserts safepoints and read and write barriers 
as appropriate.  This pass needs to be able to reliably(2) identify 
pointer vs non-pointer values.  It's advantageous to run this pass as 
late as practical in the optimization pipeline, but we can schedule it 
before lowering begins (i.e. before CodeGenPrepare).

We control the initial IR which is generated and can ensure that it does 
not contain any inttoptr instructions.  We're looking to have a 
guarantee(*) that a random LLVM optimization pass will not decide to 
replace GEPs with a sequence of ptrtoint, int arithmetic, and inttoptr, 
a sequence which is hard for us to reason about.
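
As a rough illustration of checking that property, the sketch below scans 
textual IR for the inttoptr opcode.  The helper name is hypothetical and 
the approach is deliberately naive: it drops everything after the first 
';' on a line as a comment and would be fooled by a ';' inside a string 
constant or by an identifier that happens to contain "inttoptr".  A real 
check would more likely walk the in-memory IR instead.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: counts lines of textual IR that mention `inttoptr`
// outside a trailing ';' comment.
std::size_t count_inttoptr_lines(const std::string &ir_text) {
    std::istringstream in(ir_text);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        // find() returns npos when there is no ';', so substr keeps the
        // whole line in that case.
        std::string code = line.substr(0, line.find(';'));
        if (code.find("inttoptr") != std::string::npos) {
            ++count;
        }
    }
    return count;
}
```

Running this on the IR before and after an optimization pass would show 
whether that pass introduced inttoptr into otherwise clean input.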

* "guarantee" isn't really the right word here.  I'm really just looking 
to make sure that the community is comfortable with GEPs as canonical 
form.  If some pass decides to insert inttoptr instructions into 
otherwise clean IR, I want some assurance a patch fixing that would 
stand a good chance of being accepted.  I'm happy to do any cleanup 
required.

In addition to my own use case, here are a few others which might come up:
- Backends for targets which support different operations on pointers vs 
integers.  Examples would be some of the older mainframe architectures.  
(There'd be a lot more work needed to support this.)
- Various security-related applications (e.g. CFI w.r.t. function pointers)

I don't really want to get into these applications in detail, mostly 
because I'm not particularly knowledgeable on those topics.  I'd 
appreciate any other applications anyone wants to throw out, but let's 
try to keep from derailing the discussion.  (As I did to Nick's original 
thread on DataLayout. :))

Notes:
1) We're not using the existing gc.root implementation strategy.  I plan 
on explaining why in a lot more detail once we're closer to having a 
complete implementation that we can upstream.  That should be coming 
relatively shortly.  (i.e. months, not weeks, not years)

2) As Nick pointed out in a separate thread, other types of typecasts 
can obscure pointer vs integer classifications.  (e.g. casting the base 
type of a pointer we then load through could load a field of the "wrong" 
type.)  I plan on responding to his point separately, but let's leave 
that out of this discussion for the moment.  Having GEPs as canonical 
form is a step forward by itself, even if I decide to propose something 
further down the road.

Philip



