[llvm-dev] getelementptr inbounds with offset 0

Wed Mar 27 01:09:24 PDT 2019

Hi Johannes,

> Now that reasoning works from a conceptual standpoint only for
> non-inbounds GEPs, I think. From a practical standpoint my above
> description will probably make sure everything works out just fine (see
> also my rephrased answer down below!). I say this because I think the
> following lang-ref passage makes sure everything, not only memory
> accesses, involving a non-pointer-to-object* GEP is poison:
>   "If the inbounds keyword is present, the result value of the
>    getelementptr is a poison value if the base pointer is not an in
>    bounds address of an allocated object"
> 
> * I would argue every object needs to have an extend, hence cannot be
>   zero-sized.

I would find that a rather surprising exception / special case.  There's nothing
wrong with objects of size 0.

>> FWIW, in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf> we anyway had to
>> make "getelementptr inbounds" on integer pointers (pointers obtained by casting
>> an integer to a pointer) never yield poison directly and instead defer the
>> in-bound check to the time when the actual access happens.  That nicely
>> accommodates all uses of getelementptr that just compute addresses without ever
>> using them for a memory access (using them only, e.g. to compute offsets or
>> compare pointers).  But this is not how the LLVM LangRef is written, unfortunately.
> 
> I see. Is there a quick answer to the questions why you need inbounds
> GEPs in that case? Can't you just use non-inbounds GEPs if you know you
> might not have a valid base ptr and "optimize" it to inbounds once that
> is proven?

You mean on the Rust side?  We emit GEPi for field accesses and array indexing.
 We cannot always statically determine if this is happening for a ZST or not.
At the same time, given that no memory access ever happens for a ZST, allocating
a ZST (Box::new in Rust, think of it like new in C++) does not actually allocate
any memory, it just returns an integer (sufficiently aligned) cast to a pointer.

>>> Let's start with example2, note that I renamed the values above.
>>>
>>> %P2 is dangling (and we know it) after the free. %P2 is therefore
>>> poison* and so is %G2.
>>>
>>> * or undef I'm always confused which might be bad in this conversation.
>>
>> Wait, I know that C has a rule that dangling pointers are "indeterminate" but
>> this is the first time I hear that LLVM has it as well.  Is that written down
>> anywhere?  Rust relies heavily in dangling pointers being well-behaved when used
>> only on comparisons and casts (no accesses), so this would be a big deal.
>> (Also, this rule in C is pretty much impossible to formalize and serves no
>> purpose that I know of, but that is a separate discussion.)
> 
> I am not very formal in this thread and I realize that this might be a
> problem, sorry. The above quote from the lang-ref [0] is why I think
> "dangling" inbounds GEPs are poison, do you concur?
> 
> [0] https://llvm.org/docs/LangRef.html#getelementptr-instruction

You said above that even the *input* (%P2) would be poison.  That is the part I
am doubting.
If the input is not poison (just dangling), then we come back to the original
question behind example2 -- and yes, I can see a reading of the GEPi spec that
makes this poison.  On the other hand, this would make a dangling pointer
(formerly pointing to an object) behave different than an integer pointer that
never pointed to any object, which seems odd.

> I agree with your intent, but: My argument here was not to say we cannot
> figure X out today so all is good. What I wanted to say/should have said
> is something more along the line of:
>   Undefined behavior in C/LLVM-IR is often (runtime) value dependent and
>   therefore statically not decidable. If it is not, the code must be
>   assumed to have defined (="the normal") behavior statically. This
>   should be preserved by current and future LLVM passes. Your particular
>   example (example1) seems to me like such a case in which the semantics
>   is statically not decidable and therefore I do not see any problem.
> 
> Again, I might just be wrong about. Please don't pin it on me at the end
> of the day.

Sure, UB is definitely *defined* in a runtime-value dependent way.  The problem
here is that it is not defined in a precise way -- something where one could
write an interpreter that tracks all the extra state that is needed (like
poison/undef and where allocations lie) and then says precisely under which
conditions we have UB and under which we do not.
What I am asking here for is the exact definition of GEPi if, *at run-time*, the
offset is 0, and the base pointer is (a) an integer, or (b) dangling.

>> But also, your response assumes "dangling pointers are undef/posion", which is
>> new to me.  I'd be rather shocked if this is something LLVM actually relies on
>> anywhere.
> 
> Again, that is how I read the quoted lang-ref wording above for
> inbounds GEPs. I agree with you that non-inbounds GEPs have a "normal"
> value that can be used for all non-access instructions in the usual way
> without producing undef/poison.

I must be missing something here.  You said above "%P2 is dangling (and we know
it) after the free. %P2 is therefore poison" -- at this point, GEPi has not even
happened yet!  If GEPi does something, it will make the *output* poison (%G2),
but you are saying the *input* becomes poison (%P1), and that cannot be a
consequence of GEPi at all.

Kind regards,
Ralf

> 
> Cheers,
>   Johannes
> 
> 
>>>>> A side-effect based on the GEP will however __locally__ introduce an
>>>>> dereferencability assumption (in my opinion at least). Let's say the
>>>>> code looks like this:
>>>>>
>>>>>
>>>>>   %G = gep inbounds (int2ptr 4) 0 ; We don't know anything about the
>>>>>   dereferencability of ; the memory at address 4 here.  br %cnd,
>>>>>   %BB0, %BB1
>>>>>
>>>>> BB0: ; We don't know anything about the dereferencability of ; the
>>>>> memory at address 4 here.  load %G ; We know the memory at address 4
>>>>> is dereferenceable here.  ; Though, that is due to the load and not
>>>>> the inbounds.  ...  br %BB1
>>>>>
>>>>> BB1: ; We don't know anything about the dereferencability of ; the
>>>>> memory at address 4 here.
>>>>>
>>>>>
>>>>> It is a different story if you start to use the GEP in other
>>>>> operations, e.g., to alter control flow. Then the (potential)
>>>>> undefined value can propagate.
>>>>>
>>>>>
>>>>> Any thought on this? Did I at least get your problem description
>>>>> right?
>>>>>
>>>>> Cheers, Johannes
>>>>>
>>>>>
>>>>>
>>>>> P.S. Sorry if this breaks the thread and apologies that I had to
>>>>> remove Bruce from the CC. It turns out replying to an email you did
>>>>> not receive is complicated and getting on the LLVM-Dev list is
>>>>> nowadays as well...
>>>>>
>>>>>
>>>>> On 02/25, Ralf Jung via llvm-dev wrote:
>>>>>> Hi Bruce,
>>>>>>
>>>>>> On 25.02.19 13:10, Bruce Hoult wrote:
>>>>>>> LLVM has no idea whether the address computed by GEP is actually
>>>>>>> within a legal object. The "inbounds" keyword is just you, the
>>>>>>> programmer, promising LLVM that you know it's ok and that you
>>>>>>> don't care what happens if it is actually out of bounds.
>>>>>>>
>>>>>>> https://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds
>>>>>>
>>>>>> The LangRef says I get a poison value when I am violating the
>>>>>> bounds. What I am asking is what exactly this means when the offset
>>>>>> is 0 -- what *are* the conditions under which an offset-by-0 is
>>>>>> "out of bounds" and hence yields poison?  Of course LLVM cannot
>>>>>> always statically determine this, but it relies on (dynamically, on
>>>>>> the "LLVM abstract machine") such things not happening, and I am
>>>>>> asking what exactly these dynamic conditions are.
>>>>>>
>>>>>> Kind regards, Ralf
>>>>>>
>>>>>>>
>>>>>>> On Sun, Feb 24, 2019 at 9:05 AM Ralf Jung via llvm-dev
>>>>>>> <llvm... at lists.llvm.org> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> What exactly are the rules for `getelementptr inbounds` with
>>>>>>>> offset 0?
>>>>>>>>
>>>>>>>> In Rust, we are relying on the fact that if we use, for example,
>>>>>>>> `inttoptr` to turn `4` into a pointer, we can then do
>>>>>>>> `getelementptr inbounds` with offset 0 on that without LLVM
>>>>>>>> deducing that there actually is any dereferencable memory at
>>>>>>>> location 4.  The argument is that we can think of there being a
>>>>>>>> zero-sized allocation. Is that a reasonable assumption?  Can
>>>>>>>> something like this be documented in the LangRef?
>>>>>>>>
>>>>>>>> Relatedly, how does the situation change if the pointer is not
>>>>>>>> created "out of thin air" from a fixed integer, but is actually a
>>>>>>>> dangling pointer obtained previously from `malloc` (or `alloca`
>>>>>>>> or whatever)?  Is getelementptr inbounds` with offset 0 on such a
>>>>>>>> pointer a NOP, or does it result in `poison`?  And if that makes
>>>>>>>> a difference, how does that square with the fact that, e.g., the
>>>>>>>> integer `0x4000` could well be inside such an allocation, but
>>>>>>>> doing `getelementptr inbounds` with offset 0 on that would fall
>>>>>>>> under the first question above?
>>>>>>>>
>>>>>>>> Kind regards, Ralf
>>>>>>>> _______________________________________________ LLVM Developers
>>>>>>>> mailing list llvm... at lists.llvm.org
>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>> _______________________________________________ LLVM Developers
>>>>>> mailing list llvm... at lists.llvm.org
>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>
>