[llvm-dev] getelementptr inbounds with offset 0

Doerfert, Johannes via llvm-dev llvm-dev at lists.llvm.org
Tue Mar 26 14:19:46 PDT 2019


Hi Ralf,

On 03/26, Ralf Jung wrote:
> >> So, the thinking here is: LLVM cannot exclude the possibility of an
> >> object of size 0 existing at any given address.  The pointer returned
> >> by "GEPi p 0" then would be one-past-the-end of such a 0-sized object.
> >> Thus, "GEPi p 0" is the identity function for any p, it will not
> >> return poison.
> > 
> > I don't see the problem. The behavior I hope we want and implement is:
> > 
> > Either LLVM knows that %p points to an invalid address (=non-object) or
> > it doesn't. If it does, %p and all GEPs on it yield poison. If it
> > doesn't, it has to assume %p points to a valid address and offset 0, 1,
> > 2, ... might all yield valid pointers. The special case is when we know
> > %p is valid and has an extent of (at most) S, then all offsets <= S,
> > including 0, are potentially valid (negative extents are similar).
> 
> So you are basically saying whether the offset is 0 or not does not matter, but
> whether the base is an object LLVM can know about or not does?  I see.  That
> makes sense.

Yes, if we are not in the special case (object valid and extent is known).

> The reason I restricted myself to offset 0 is that we'd like to do this without
> actually having any accessible objects anywhere, which works out if the objects
> have size 0.

Now that reasoning works from a conceptual standpoint only for
non-inbounds GEPs, I think. From a practical standpoint my above
description will probably make sure everything works out just fine (see
also my rephrased answer down below!). I say this because I think the
following lang-ref passage makes sure everything, not only memory
accesses, involving a non-pointer-to-object* GEP is poison:
  "If the inbounds keyword is present, the result value of the
   getelementptr is a poison value if the base pointer is not an in
   bounds address of an allocated object"

* I would argue every object needs to have an extent, hence cannot be
  zero-sized.
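
To make the disagreement concrete, here is a toy Rust model of the quoted
rule. It is only a sketch: `Allocation`, `in_bounds`, and `gep_inbounds` are
made-up names, `None` stands in for poison, and the model assumes a pointer
counts as "in bounds" when it lies inside a live allocation or one past its
end. It is not how LLVM represents or checks any of this.

```rust
/// Toy model of a live allocation: a base address and a size in bytes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Allocation {
    pub base: usize,
    pub size: usize,
}

/// True if `addr` lies within `alloc` or exactly one past its end.
pub fn in_bounds(alloc: Allocation, addr: usize) -> bool {
    addr >= alloc.base && addr - alloc.base <= alloc.size
}

/// Toy version of the quoted rule (an assumption, not LLVM code):
/// returns `Some(addr + offset)` if the base and the result are both in
/// bounds of the same live allocation, and `None` (modelling poison) otherwise.
pub fn gep_inbounds(live: &[Allocation], addr: usize, offset: usize) -> Option<usize> {
    let result = addr.checked_add(offset)?;
    if live
        .iter()
        .any(|a| in_bounds(*a, addr) && in_bounds(*a, result))
    {
        Some(result)
    } else {
        None
    }
}
```

With no live allocation at address 4, `gep_inbounds(&[], 4, 0)` returns
`None`. If a zero-sized `Allocation { base: 4, size: 0 }` is allowed to
exist, the same call returns `Some(4)`. Whether such a zero-sized object may
exist is exactly the open question.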


> FWIW, in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf> we anyway had to
> make "getelementptr inbounds" on integer pointers (pointers obtained by casting
> an integer to a pointer) never yield poison directly and instead defer the
> in-bound check to the time when the actual access happens.  That nicely
> accommodates all uses of getelementptr that just compute addresses without ever
> using them for a memory access (using them only, e.g. to compute offsets or
> compare pointers).  But this is not how the LLVM LangRef is written, unfortunately.

I see. Is there a quick answer to the question of why you need inbounds
GEPs in that case? Can't you just use non-inbounds GEPs if you know you
might not have a valid base ptr and "optimize" it to inbounds once that
is proven?
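
For what it is worth, the Rust side of this suggestion might look like the
sketch below. To my understanding rustc lowers `wrapping_add` on raw
pointers to a plain `getelementptr` and `add` to `getelementptr inbounds`;
treat that lowering as an assumption to check against the rustc version in
use. The helper name `offset_without_inbounds` is made up for illustration.

```rust
/// Hypothetical helper: builds a pointer from a fixed integer address
/// (like `int2ptr 4` in example1) and offsets it by `count` bytes without
/// promising that the base or the result is in bounds of any object.
pub fn offset_without_inbounds(addr: usize, count: usize) -> *const u8 {
    // Integer-to-pointer cast; no memory is accessed here.
    let base = addr as *const u8;
    // `wrapping_add` is a safe method: it only computes an address and
    // carries no in-bounds requirement.
    base.wrapping_add(count)
}
```

The result is only ever cast or compared here, never dereferenced, which is
the use case described for example1.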

> >> # example1
> >>
> >> %P1 = int2ptr 4
> >> %G1 = gep inbounds %P1 0
> >>
> >> # example2
> >>
> >> %P2 = call noalias i8* @malloc(i64 12)
> >> call void @free(i8* %P2)
> >> %G2 = gep inbounds %P2 0
> >>
> >> The first happens in Rust all the time, and we rely on not getting
> >> poison.  The second doesn't occur in Rust (to my knowledge), but it
> >> seems somewhat inconsistent to return poison in one case and not the
> >> other.
> > 
> > Let's start with example2, note that I renamed the values above.
> > 
> > %P2 is dangling (and we know it) after the free. %P2 is therefore
> > poison* and so is %G2.
> > 
> > * or undef; I always confuse the two, which might be bad in this conversation.
> 
> Wait, I know that C has a rule that dangling pointers are "indeterminate" but
> this is the first time I hear that LLVM has it as well.  Is that written down
> anywhere?  Rust relies heavily on dangling pointers being well-behaved when used
> only in comparisons and casts (no accesses), so this would be a big deal.
> (Also, this rule in C is pretty much impossible to formalize and serves no
> purpose that I know of, but that is a separate discussion.)

I am not very formal in this thread and I realize that this might be a
problem, sorry. The above quote from the lang-ref [0] is why I think
"dangling" inbounds GEPs are poison, do you concur?

[0] https://llvm.org/docs/LangRef.html#getelementptr-instruction


> > In example1, without further information, I'd say that there is no
> > poison (statically). Address 4 could be an allocated object until proven
> > otherwise.
> > 
> > 
> > I am still a little confused about the problem you see. If what I wrote
> > about the implemented behavior holds true (which I am not totally sure
> > of), you should not have a problem with poison even if you would
> > sprinkle GEP (inbounds) %p 0 all over the place. Either %p was known to
> > be invalid and so is the GEP, or %p was not known to be invalid and
> > neither is the GEP. Am I missing something here?
> 
> The thing is, I am not asking about the behavior implemented today but about the
> behavior of the "abstract LLVM machine" that is described by the LangRef and
> that the optimizer has to justify its transformations against.  Analyses become
> smarter every day, so looking at what LLVM deduces from certain instructions is
> but a snapshot.

I agree with your intent, but: My argument here was not to say we cannot
figure X out today so all is good. What I wanted to say/should have said
is something more along the lines of:
  Undefined behavior in C/LLVM-IR is often (runtime) value dependent and
  therefore statically not decidable. When it is not decidable, the code
  must be assumed to have defined (="the normal") behavior statically. This
  should be preserved by current and future LLVM passes. Your particular
  example (example1) seems to me like such a case in which the semantics
  is statically not decidable and therefore I do not see any problem.

Again, I might just be wrong about this. Please don't pin it on me at the end
of the day.

> But also, your response assumes "dangling pointers are undef/poison", which is
> new to me.  I'd be rather shocked if this is something LLVM actually relies on
> anywhere.

Again, that is how I read the quoted lang-ref wording above for
inbounds GEPs. I agree with you that non-inbounds GEPs have a "normal"
value that can be used for all non-access instructions in the usual way
without producing undef/poison.
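
A minimal Rust sketch of that "usual way", assuming that `wrapping_add` on
raw pointers lowers to a non-inbounds GEP (worth verifying for the rustc
version in use). The function name `dangling_pointer_is_stable` is made up
for illustration, and the pointer is never dereferenced after the free.

```rust
/// Hypothetical check: a dangling pointer, offset by 0 without `inbounds`,
/// still compares and casts like the original pointer.
pub fn dangling_pointer_is_stable() -> bool {
    let boxed = Box::new(0u8);
    let p: *const u8 = &*boxed;
    let address_before_free = p as usize;

    // Free the allocation; `p` is dangling from here on.
    drop(boxed);

    // Offset by 0 with no in-bounds claim, then use the result only in a
    // cast and a comparison, never in a memory access.
    let q = p.wrapping_add(0);
    q as usize == address_before_free && q == p
}
```

This mirrors example2 with the `inbounds` keyword dropped: the pointer is
used for casts and comparisons only, which is the pattern Rust relies on.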

Cheers,
  Johannes


> >>> A side-effect based on the GEP will however __locally__ introduce a
> >>> dereferenceability assumption (in my opinion at least). Let's say the
> >>> code looks like this:
> >>>
> >>>
> >>>   %G = gep inbounds (int2ptr 4) 0
> >>>   ; We don't know anything about the dereferenceability of
> >>>   ; the memory at address 4 here.
> >>>   br %cnd, %BB0, %BB1
> >>>
> >>> BB0:
> >>>   ; We don't know anything about the dereferenceability of
> >>>   ; the memory at address 4 here.
> >>>   load %G
> >>>   ; We know the memory at address 4 is dereferenceable here.
> >>>   ; Though, that is due to the load and not the inbounds.
> >>>   ...
> >>>   br %BB1
> >>>
> >>> BB1:
> >>>   ; We don't know anything about the dereferenceability of
> >>>   ; the memory at address 4 here.
> >>>
> >>>
> >>> It is a different story if you start to use the GEP in other
> >>> operations, e.g., to alter control flow. Then the (potential)
> >>> undefined value can propagate.
> >>>
> >>>
> >>> Any thoughts on this? Did I at least get your problem description
> >>> right?
> >>>
> >>> Cheers, Johannes
> >>>
> >>>
> >>>
> >>> P.S. Sorry if this breaks the thread and apologies that I had to
> >>> remove Bruce from the CC. It turns out replying to an email you did
> >>> not receive is complicated and getting on the LLVM-Dev list is
> >>> nowadays as well...
> >>>
> >>>
> >>> On 02/25, Ralf Jung via llvm-dev wrote:
> >>>> Hi Bruce,
> >>>>
> >>>> On 25.02.19 13:10, Bruce Hoult wrote:
> >>>>> LLVM has no idea whether the address computed by GEP is actually
> >>>>> within a legal object. The "inbounds" keyword is just you, the
> >>>>> programmer, promising LLVM that you know it's ok and that you
> >>>>> don't care what happens if it is actually out of bounds.
> >>>>>
> >>>>> https://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds
> >>>>
> >>>> The LangRef says I get a poison value when I am violating the
> >>>> bounds. What I am asking is what exactly this means when the offset
> >>>> is 0 -- what *are* the conditions under which an offset-by-0 is
> >>>> "out of bounds" and hence yields poison?  Of course LLVM cannot
> >>>> always statically determine this, but it relies on (dynamically, on
> >>>> the "LLVM abstract machine") such things not happening, and I am
> >>>> asking what exactly these dynamic conditions are.
> >>>>
> >>>> Kind regards, Ralf
> >>>>
> >>>>>
> >>>>> On Sun, Feb 24, 2019 at 9:05 AM Ralf Jung via llvm-dev
> >>>>> <llvm... at lists.llvm.org> wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> What exactly are the rules for `getelementptr inbounds` with
> >>>>>> offset 0?
> >>>>>>
> >>>>>> In Rust, we are relying on the fact that if we use, for example,
> >>>>>> `inttoptr` to turn `4` into a pointer, we can then do
> >>>>>> `getelementptr inbounds` with offset 0 on that without LLVM
> >>>>>> deducing that there actually is any dereferenceable memory at
> >>>>>> location 4.  The argument is that we can think of there being a
> >>>>>> zero-sized allocation. Is that a reasonable assumption?  Can
> >>>>>> something like this be documented in the LangRef?
> >>>>>>
> >>>>>> Relatedly, how does the situation change if the pointer is not
> >>>>>> created "out of thin air" from a fixed integer, but is actually a
> >>>>>> dangling pointer obtained previously from `malloc` (or `alloca`
> >>>>>> or whatever)?  Is `getelementptr inbounds` with offset 0 on such a
> >>>>>> pointer a NOP, or does it result in `poison`?  And if that makes
> >>>>>> a difference, how does that square with the fact that, e.g., the
> >>>>>> integer `0x4000` could well be inside such an allocation, but
> >>>>>> doing `getelementptr inbounds` with offset 0 on that would fall
> >>>>>> under the first question above?
> >>>>>>
> >>>>>> Kind regards, Ralf
> >>>>>> _______________________________________________ LLVM Developers
> >>>>>> mailing list llvm... at lists.llvm.org
> >>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>>
> > 

-- 

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert at anl.gov