[llvm-dev] [RFC] Introducing a byte type to LLVM

Tue Jun 8 20:28:10 PDT 2021

Hi Nicolai,

> I feel like the entire discussion is extremely confused and confusing
> because there is lack of clarity about what values are in fact carried by
> the types i<N>, ptr (assuming untyped pointers already), and the proposed
> b<N>. There are statements about "whether a b<N> contains a pointer or an
> integer" without a clear definition of what any of that really means.

> By the way: the discussion of GVN states that i<N> do not carry provenance
> information, but LangRef appears to contradict this, because it states (in
> the Pointer Aliasing section) that "A pointer value formed by an inttoptr
> is *based* on all pointer values that contribute (directly or indirectly)
> to the computation of the pointer’s value." The only way to make this
> statement operational is for the operand of the inttoptr to carry
> provenance information; this definitely needs to be addressed by this
> proposal.

> My educated guess is that what the proposal really wants to say is that an
> i<N> value is poison or an integer without pointer provenance (i.e., an
> i<N> value is never "based on" a pointer) while a b<N> value is poison or
> an integer that may have provenance (i.e., it may be "based on" one or more
> pointers). Everything else should follow from this core definition. With
> this phrasing, I can see a coherent picture emerging that can be discussed
> without a lot of the confusion in this thread (though still _some_
> confusion, as the topic is inherently subtle).

> @George, Nuno, and Juneyoung: is that a good lens through which to see the
> proposal?

Yes, our understanding is that only values of pointer types carry
provenance whereas non-pointer types (i<N>, float, ...) don't carry
provenance.
We believe that when people were writing optimizations on non-pointer types
(instcombine, reassociate, gvn, etc) they did not assume the existence of
provenance in such types.
As mentioned, it contradicts with LangRef's wording about inttoptr though.

I agree that clarifying these is a good first step to avoid confusion.
Specifically, via pinning down what people agree on in LangRef.
I think it is good to start with removing the inttoptr case from LangRef's
based-on part and adding relevant texts.
IMO, existing alias analysis and optimizations were already avoiding uses
of inttoptr's based-on information (experts' comments needed to double
check).
Also, to provide alternatives for ptrtoint/inttoptr round trip with better
analysis, llvm.ptrmask and clang-tidy's performance-no-int-to-ptr were
added as well.

If people agree, I'll write a draft for this and share it via phabricator.

Thanks,
Juneyoung

On Wed, Jun 9, 2021 at 8:02 AM Nicolai Hähnle via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hi,
>
> On Mon, Jun 7, 2021 at 7:40 PM Harald van Dijk via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> On 04/06/2021 16:24, George Mitenkov via llvm-dev wrote:
>>
>>
>>    1.
>>
>>    When copying pointers as bytes implicit pointer-to-integer casts are
>>    avoided.The current setting of performing a memcpy using i8s leads to
>>    miscompilations (example: bug report 37469
>>    <https://bugs.llvm.org/show_bug.cgi?id=37469>) and is bad for alias
>>    analysis.
>>
>> While trying to understand what exactly is going wrong in this bug I
>> think I stumbled upon incomplete documentation. At
>> https://llvm.org/docs/LangRef.html#pointeraliasing it is documented that
>>
>>    - A memory access must be done through a pointer associated with the
>>    address range.
>>    - A pointer value is associated with the addresses associated with
>>    any value it is based on.
>>    - A pointer value formed by an instruction is based on another
>>    pointer value the instruction is getelementptr, bitcast, or inttoptr.
>>
>> What is not mentioned is what a pointer value formed by a load
>> instruction is based on. If a pointer value formed by a load instruction is
>> not associated with any address range, then it cannot be used to load or
>> store *any* value. Clearly that is not wanted.
>>
>> If memory is completely untyped, then a load of a pointer value needs to
>> be equivalent to a load of an integer value followed by inttoptr. In that
>> model, it is not possible to replace a load of a pointer value by a
>> previously stored value, it is only possible to replace it by
>> inttoptr(ptrtoint(previously stored value)). In that model, the problem is
>> that LLVM does replace a load of a pointer value by a previously stored
>> value, and the bug can be fixed (at a cost of reduced optimisations of
>> other code) by making sure to insert ptrtoint and inttoptr as needed,
>> including for any pointer members of structs etc.
>>
>> Your proposal is based on an interpretation of the problem using a memory
>> model where memory is not completely untyped. In your memory model, what
>> address range is a pointer value that came from a load instruction
>> associated with?
>>
>
> I second this.
>
> I feel like the entire discussion is extremely confused and confusing
> because there is lack of clarity about what values are in fact carried by
> the types i<N>, ptr (assuming untyped pointers already), and the proposed
> b<N>. There are statements about "whether a b<N> contains a pointer or an
> integer" without a clear definition of what any of that really means.
>
> By the way: the discussion of GVN states that i<N> do not carry provenance
> information, but LangRef appears to contradict this, because it states (in
> the Pointer Aliasing section) that "A pointer value formed by an inttoptr
> is *based* on all pointer values that contribute (directly or indirectly)
> to the computation of the pointer’s value." The only way to make this
> statement operational is for the operand of the inttoptr to carry
> provenance information; this definitely needs to be addressed by this
> proposal.
>
> My educated guess is that what the proposal really wants to say is that an
> i<N> value is poison or an integer without pointer provenance (i.e., an
> i<N> value is never "based on" a pointer) while a b<N> value is poison or
> an integer that may have provenance (i.e., it may be "based on" one or more
> pointers). Everything else should follow from this core definition. With
> this phrasing, I can see a coherent picture emerging that can be discussed
> without a lot of the confusion in this thread (though still _some_
> confusion, as the topic is inherently subtle).
>
> @George, Nuno, and Juneyoung: is that a good lens through which to see the
> proposal? If no, why not? If yes, I believe there are some immediate
> implications and questions, such as:
>
> 1. Forbidding arithmetic and bitwise operations in b<N> seems pointless.
> Just define them as the corresponding i<N> op plus the union of provenance
> of the operands. This allows consistent implementation of char/unsigned
> char as b8, without having to jump back and forth between b8 and i8 all the
> time.
>
> 2. What's the provenance of a b<N> `select`?
>
> 3. A b<N>-to-i<N> conversion necessarily loses all provenance information.
>
> 4. What's the provenance of the result of an i<N>-to-b<N> conversion? The
> only possible choices are "empty" or "full" provenance, where "full"
> provenance means that the value is "based on" _all_ pointers. Depending on
> the details of allowed instruction signatures, we may want to allow both
> choices. "Full" provenance gives us something that's useful as a step
> towards inttoptr. "Empty" provenance allows us to lift i<N> to b<N> for
> arithmetic operations without being overly conservative (e.g., consider the
> somewhat hypothetical problem of lowering a GEP to a sequence of arithmetic
> on b<N> types in IR -- we want the result to have the provenance of the
> base pointer, which enters the arithmetic as a b<N> with provenance; adding
> the various i<N> offset operands should have no effect on provenance).
>
> 5. Memory is untyped but can carry provenance; i.e., it consists of b8.
> Load/store of i<N> performs implicit i-to-b/b-to-i conversions,
> respectively. Which kind of i-to-b conversion is used during a store? The
> "empty" one or the "full" one? There are several reasonable options for how
> to address this, and a choice must be made.
>
> 6. (How) are pointer types fundamentally different from b<N> types of the
> correct size? (By this I mean: is there any interesting difference in the
> values that these types can carry? Ignore surface differences like the fact
> that GEP traditionally goes with pointers while `add` goes with integer
> types -- we could have a GEP instruction on a correctly sized b<N>)
>
> It seems pretty clear that there is still a lot of work to be done on this
> proposal before we can be sufficiently confident that it won't just add new
> correctness problems elsewhere. Besides, the cost of adding a new family of
> types is very high, and I am not (yet?) convinced that it pays for itself.
> But with the above I can at least see better where you're coming from.
>
> Cheers,
> Nicolai
>
>
> Cheers,
>> Harald van Dijk
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
>
> --
> Lerne, wie die Welt wirklich ist,
> aber vergiss niemals, wie sie sein sollte.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

-- 

Juneyoung Lee
Software Foundation Lab, Seoul National University
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210609/c5382219/attachment.html>