[llvm-dev] [RFC] Introducing a byte type to LLVM
Juneyoung Lee via llvm-dev
llvm-dev at lists.llvm.org
Tue Jun 15 10:29:47 PDT 2021
On Tue, Jun 15, 2021 at 4:07 PM John McCall <rjmccall at apple.com> wrote:
> On 15 Jun 2021, at 1:49, Juneyoung Lee wrote:
> On Tue, Jun 15, 2021 at 1:08 AM John McCall via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> The semantics you seem to want are that LLVM’s integer types cannot carry
> information from pointers. But I can cast a pointer to an integer in C and
> vice-versa, and compilers have de facto defined the behavior of subsequent
> operations like breaking the integer up (and then putting it back
> together), adding numbers to it, and so on. So no, as a C compiler writer,
> I do not have a choice; I will have to use a type that can validly carry
> pointer information for integers in C.
> int->ptr cast can reconstruct the pointer information, so making integer
> types not carry pointer information does not necessarily mean that
> dereferencing a pointer cast from an integer is UB.
> What exactly is the claimed formal property of byte types, then,
> that integer types will lack? Because it seems to me that converting
> from an integer gives us valid provenance in strictly more situations
> than converting from bytes, since it reconstructs provenance if there’s
> any object at that address (under still-debated restrictions),
> while converting from bytes always preserves the original provenance
> (if any). I don’t understand how that can possibly give us *more*
> flexibility to optimize integers.
When two objects are adjacent, and an integer is exactly pointing to the
location between them, its provenance cannot be properly recovered.
int x, y;
llvm.assume((intptr_t)&x == 0x100 && (intptr_t)&y == 0x104);
int *p = (int*)(intptr_t)&x;
// Q: Is p's provenance x or y?
If it is expected that '*(p-1)' is equivalent to *x, p's provenance should
be x.
However, based on llvm.assume, optimizations on integers can
replace (intptr_t)&x with (intptr_t)&y (which is what happened in the
Then, '*(p-1)' suddenly becomes out-of-bounds access, which is UB.
So, p's provenance isn't simply x or y; it should be something that can
access both x and y.
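
As a rough illustration of the adjacency problem above, here is a minimal C
sketch. The struct 'pair' and the function 'one_past_x_equals_y' are
hypothetical names used only for this example; the struct is there to force
x and y to sit next to each other, which separate objects are not guaranteed
to do. On mainstream ABIs there is no padding between the two ints, so the
one-past-the-end address of x and the address of y have the same integer
value, and the integer alone cannot say which object it came from.

```c
#include <stdint.h>

/* Hypothetical helper type: used only to make x and y adjacent. */
struct pair {
    int x;
    int y;
};

/* Returns 1 when the one-past-the-end address of p->x has the same
   integer value as the address of p->y. */
int one_past_x_equals_y(struct pair *p)
{
    uintptr_t end_of_x = (uintptr_t)(&p->x + 1);
    uintptr_t start_of_y = (uintptr_t)&p->y;
    return end_of_x == start_of_y;
}
```

This only shows that the two integers coincide; which provenance a pointer
rebuilt from such an integer should have is the question the example above
raises.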
This implies that, unless there is a guarantee that all allocated objects
are one or more bytes apart, there is no type that can perfectly store a
memcpy(x, y, 8) isn't equivalent to 'v=load i64 y;store i64 v, x' because v
already lost the pointer information.
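
A hedged C-level sketch of the two copies being contrasted. The function
names are hypothetical, and the uintptr_t round trip is only the closest
portable C analogue of the 'load i64 / store i64' pair, not the same thing.
At run time both functions produce the same bits; the difference discussed
here lives in the IR semantics, where the integer path does not carry the
pointer information along.

```c
#include <stdint.h>
#include <string.h>

/* Copy the object representation of a pointer byte by byte. */
void copy_ptr_bytes(int **dst, int **src)
{
    memcpy(dst, src, sizeof *src);
}

/* Copy the same pointer by going through an integer value. */
void copy_ptr_via_integer(int **dst, int **src)
{
    uintptr_t v = (uintptr_t)*src; /* pointer to integer */
    *dst = (int *)v;               /* integer back to pointer */
}
```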
The pointer information is perfectly stored in a byte type. But, arithmetic
property-based optimizations such as the above one are not correct anymore.
Here is an example with a byte-type version:
int x, y;
// byte_8 is a 64-bit byte type
llvm.assume((byte_8)&x == 0x100 && (byte_8)&y == 0x104);
int *p = (int*)(byte_8)&x;
// p's provenance is always x.
For a byte type, an equality comparison evaluating to true does not mean
that the two values are precisely equal.
Since (byte_8)&x and (byte_8)&y have different provenances, replacing
one with another must be avoided.
Instead, we can guarantee that p is precisely equivalent to &x.
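
To make the distinction between "compares equal" and "precisely equal"
concrete, here is a toy model in C. Everything in it is an assumption made
for illustration: 'struct byte_val', the 'prov' field, and both functions
are hypothetical, and modelling equality as an address-only comparison is my
reading of the paragraph above, not a definition from the proposal.

```c
#include <stdint.h>

/* Toy model (hypothetical): a byte-typed value is an address plus the
   identity of the object it was derived from. */
struct byte_val {
    uint64_t addr;
    int prov; /* 0 = no provenance, otherwise an object id */
};

/* Comparison as modelled here: only the address bits are compared. */
int byte_eq(struct byte_val a, struct byte_val b)
{
    return a.addr == b.addr;
}

/* Two values are interchangeable only if the provenance matches too. */
int byte_same(struct byte_val a, struct byte_val b)
{
    return a.addr == b.addr && a.prov == b.prov;
}
```

In this model, two values with address 0x104 but different 'prov' ids
satisfy byte_eq and fail byte_same, which is why replacing one with the
other is not a valid rewrite.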
Another benefit is that optimizations on integers do not need to suffer
from these pointer issues anymore;
e.g., the optimization on llvm.assume above can survive and it does not
need to check whether an integer variable is derived from a pointer value.
> Since you seem to find this sort of thing compelling, please note that
> even a simple assignment like char c2 = c1 technically promotes through
> int in C, and so int must be able to carry pointer information if char
> IIUC integer promotion is done when it is used as an operand of arithmetic
> ops or switch's condition, so I think assignment operation is okay.
> Hmm, I was misremembering the rule, you’re right.
Software Foundation Lab, Seoul National University