[llvm-dev] [RFC] Introducing a byte type to LLVM

Tue Jun 15 10:29:47 PDT 2021

On Tue, Jun 15, 2021 at 4:07 PM John McCall <rjmccall at apple.com> wrote:

> On 15 Jun 2021, at 1:49, Juneyoung Lee wrote:
>
> On Tue, Jun 15, 2021 at 1:08 AM John McCall via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> The semantics you seem to want are that LLVM’s integer types cannot carry
> information from pointers. But I can cast a pointer to an integer in C and
> vice-versa, and compilers have de facto defined the behavior of subsequent
> operations like breaking the integer up (and then putting it back
> together), adding numbers to it, and so on. So no, as a C compiler writer,
> I do not have a choice; I will have to use a type that can validly carry
> pointer information for integers in C.
>
> An int->ptr cast can reconstruct the pointer information, so making
> integer types not carry pointer information does not necessarily mean
> that dereferencing a pointer cast from an integer is UB.
>
> What exactly is the claimed formal property of byte types, then,
> that integer types will lack? Because it seems to me that converting
> from an integer gives us valid provenance in strictly more situations
> than converting from bytes, since it reconstructs provenance if there’s
> any object at that address (under still-debated restrictions),
> while converting from bytes always preserves the original provenance
> (if any). I don’t understand how that can possibly give us *more*
> flexibility to optimize integers.
>
When two objects are adjacent and an integer holds exactly the address of
the boundary between them, the provenance cannot be properly recovered
from that integer.

int x[1], y[1];
llvm.assume((intptr_t)&x[0] == 0x100 && (intptr_t)&y[0] == 0x104);
int *p = (int*)(intptr_t)&x[1];
// Q: Is p's provenance x or y?

If '*(p-1)' is expected to be equivalent to '*x', p's provenance should
be x.
However, based on the llvm.assume, optimizations on integers can
replace (intptr_t)&x[1] with (intptr_t)&y[0] (which is what happened in the
bug report).
Then '*(p-1)' suddenly becomes an out-of-bounds access, which is UB.
So p's provenance cannot simply be x or y; it has to be something that can
access both x and y.
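
As a minimal sketch of the address coincidence only (not of the
optimization itself), the C below forces adjacency by placing both arrays
in a struct. The struct and the function name are hypothetical; separate
objects like the x and y above are not guaranteed to be adjacent, and the
sketch assumes there is no padding between two int arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the struct exists only to make x and y adjacent. */
struct adjacent {
    int x[1];
    int y[1];
};

/* Returns 1 when the integer value of the one-past-the-end address of x
   equals the integer value of the address of y[0]. */
int one_past_x_equals_y(const struct adjacent *s)
{
    uintptr_t end_of_x = (uintptr_t)(const void *)&s->x[1];
    uintptr_t start_of_y = (uintptr_t)(const void *)&s->y[0];
    return end_of_x == start_of_y;
}
```

Once both values are plain integers, nothing in them records whether the
address came from x or from y, which is why an integer optimization is
free to substitute one for the other.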

This implies that, unless there is a guarantee that all allocated objects
are one or more bytes apart, there is no type that can perfectly store a
pointer byte.
memcpy(x, y, 8) isn't equivalent to 'v = load i64 y; store i64 v, x'
because v has already lost the pointer information.
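
For comparison, here is a rough C-level sketch of the two copies. The
function names are hypothetical, and the sketch assumes uintptr_t has the
same size as a pointer. On real hardware both functions move the same
bits; the difference described above exists only in the IR semantics,
where the integer temporary v does not carry provenance.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(uintptr_t) == sizeof(int *),
               "sketch assumes a pointer-sized uintptr_t");

/* Copies the pointer stored in *src to *dst as raw bytes; no integer
   value is ever formed. */
void copy_ptr_bytes(int **dst, int *const *src)
{
    memcpy(dst, src, sizeof *src);
}

/* Copies the same bytes through an integer temporary, mirroring
   'v = load iN src; store iN v, dst'. */
void copy_ptr_via_integer(int **dst, int *const *src)
{
    uintptr_t v;
    memcpy(&v, src, sizeof v);  /* v = load iN src */
    memcpy(dst, &v, sizeof v);  /* store iN v, dst */
}
```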

A byte type stores the pointer information perfectly. However,
optimizations based on arithmetic properties, such as the one above, are
no longer correct for it.
Here is an example with a byte-type version:

int x[1], y[1];
// byte_8 is a 64-bit byte type
llvm.assume((byte_8)&x[0] == 0x100 && (byte_8)&y[0] == 0x104);
int *p = (int*)(byte_8)&x[1];
// p's provenance is always x.

For a byte type, an equality comparison evaluating to true does not mean
that the two values are precisely equal.
Since (byte_8)&x[1] and (byte_8)&y[0] have different provenances, replacing
one with the other must be avoided.
Instead, we can guarantee that p is precisely equivalent to &x[1].
Another benefit is that optimizations on integers no longer need to worry
about pointer provenance;
e.g., the optimization on llvm.assume above can survive, and it does not
need to check whether an integer variable is derived from a pointer value.
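
A toy model in C may make the distinction concrete. Everything here
(struct toy_byte, the provenance tags, the helper names) is hypothetical
and is not the proposed LLVM semantics; it only assumes the layout from the
example, with x at 0x100 and y at 0x104.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical provenance tags for the two objects in the example. */
enum toy_prov { PROV_X, PROV_Y };

/* Toy pointer-carrying value: an address plus the object it came from. */
struct toy_byte {
    uint64_t addr;
    enum toy_prov prov;
};

/* Assumed layout: x occupies [0x100, 0x104), y occupies [0x104, 0x108). */
static const uint64_t toy_base[] = { 0x100, 0x104 };
static const uint64_t toy_int_size = 4;

/* What an equality comparison observes: the addresses only. */
int toy_compare_eq(struct toy_byte a, struct toy_byte b)
{
    return a.addr == b.addr;
}

/* Whether one value may replace the other: address and provenance match. */
int toy_interchangeable(struct toy_byte a, struct toy_byte b)
{
    return a.addr == b.addr && a.prov == b.prov;
}

/* Pointer arithmetic keeps the provenance and moves the address by ints. */
struct toy_byte toy_offset(struct toy_byte v, int64_t elems)
{
    v.addr += (uint64_t)(elems * (int64_t)toy_int_size);
    return v;
}

/* Whether an int-sized access at v.addr stays inside v's own object. */
int toy_access_in_bounds(struct toy_byte v)
{
    uint64_t base = toy_base[v.prov];
    return v.addr >= base && v.addr + toy_int_size <= base + toy_int_size;
}
```

With end_x = {0x104, PROV_X} and start_y = {0x104, PROV_Y}, the comparison
reports them equal but they are not interchangeable: stepping back one
element from end_x stays inside x, while the same step from start_y lands
outside y.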

> Since you seem to find this sort of thing compelling, please note that
> even a simple assignment like char c2 = c1 technically promotes through
> int in C, and so int must be able to carry pointer information if char
> can.
>
> IIUC integer promotion is done when it is used as an operand of arithmetic
> ops or switch's condition, so I think assignment operation is okay.
>
> Hmm, I was misremembering the rule, you’re right.
>
> John.
>

-- 

Juneyoung Lee
Software Foundation Lab, Seoul National University