[cfe-dev] [RFC] Introducing a byte type to LLVM
Joshua Cranmer via cfe-dev
cfe-dev at lists.llvm.org
Fri Jun 4 15:27:23 PDT 2021
On 6/4/2021 2:25 PM, John McCall via cfe-dev wrote:
> I don’t believe this is correct. LLVM does not have an innate
> concept of typed memory. The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access. There are no restrictions on
> what types can be used to load or store from certain objects.
>
> C-style type aliasing restrictions are imposed using tbaa
> metadata, which are unrelated to the IR type of the access.
>
> John.
>
I've never been thoroughly involved in any of the actual optimizations
here, but it seems to me that there is a soundness hole in the LLVM
semantics that we gloss over when we say that LLVM doesn't have typed
memory.
Working backwards from what a putative operational semantics of LLVM
might look like (and I'm going to ignore poison/undef because it's not
relevant), I think there is agreement that integer types in LLVM are
purely bitvectors. Any value of i64 5 can be replaced with any other
value of i64 5 no matter where it came from. At the same time, when we
have pointers involved, this is not true. Two pointers may have the same
numerical value (e.g., when cast to integers), but one might not be
replaceable with the other because there's other data that might not be
the same. So in operational terms, pointers have both a numerical value
and a bag of provenance data (probably other stuff, but let's be simple
and call it provenance).
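As a rough illustration of that last point, here is a toy model in Python. It is only a sketch and not LLVM's actual semantics: the names Ptr, ALLOCS and can_access are made up for this example, and provenance is assumed to be nothing more than the name of the allocation a pointer was derived from.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ptr:
    addr: int   # numerical value, what a cast to integer would yield
    prov: str   # hypothetical provenance: name of the source allocation

# Two adjacent 8-byte allocations (assumed layout, for illustration only):
# a = [0x1000, 0x1008) and b = [0x1008, 0x1010).
ALLOCS = {"a": range(0x1000, 0x1008), "b": range(0x1008, 0x1010)}

def can_access(p: Ptr) -> bool:
    """An access is allowed only inside the allocation named by the provenance."""
    return p.addr in ALLOCS[p.prov]

end_of_a = Ptr(0x1008, "a")    # one past the end of a
start_of_b = Ptr(0x1008, "b")  # first byte of b

same_address = end_of_a.addr == start_of_b.addr  # the integers compare equal
interchangeable = end_of_a == start_of_b         # the pointers do not
```

Both pointers have the numerical value 0x1008, but only start_of_b may be used for an access in this model, so replacing one with the other changes behavior.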
Now we have to ask what the semantics of converting between integers and
pointers are. Integers, as we've defined, don't have provenance data. So
an inttoptr instruction has to synthesize that provenance somehow.
Ideally, we'd want to grab that data from the ptrtoint instruction that
generated the integer, but the semantics of integers means we can only
launder that data globally, so that an inttoptr has the union of all of
the provenance data that was ever fed into a ptrtoint (I suspect the
actual semantics we use is somewhat more precise than this in that it
only considers those pointers that point to still-live data, which
doesn't invalidate anything I'm about to talk about).
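One way to read this "global laundering" rule is sketched below in Python. This is a toy model under stated assumptions, not LLVM's implementation: EXPOSED, Ptr and the two helper functions are hypothetical names, provenance is modeled as a set of allocation names, and the refinement about still-live data is ignored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ptr:
    addr: int
    prov: frozenset  # hypothetical: allocation names this pointer may access

EXPOSED = set()  # provenance that has escaped through any ptrtoint so far

def ptrtoint(p: Ptr) -> int:
    # The resulting integer is a bare bitvector; the provenance is only
    # recorded globally, not attached to the integer.
    EXPOSED.update(p.prov)
    return p.addr

def inttoptr(n: int) -> Ptr:
    # The integer cannot say which ptrtoint produced it, so the new pointer
    # gets the union of everything exposed so far.
    return Ptr(n, frozenset(EXPOSED))

p = Ptr(0x1000, frozenset({"a"}))
q = Ptr(0x2000, frozenset({"b"}))
ip = ptrtoint(p)
iq = ptrtoint(q)
p2 = inttoptr(ip)  # same address as p, but provenance {"a", "b"}
```

In this model the round trip keeps the numerical value but widens the provenance, so p2 is not the same abstract value as p.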
Okay, what about memory? I believe what most people mean when
they say that LLVM's memory is untyped is that a load or store of any
type is equivalent to first converting it to an integer and then storing
the integer into memory. E.g. these two functions are semantically
equivalent:
define void @foo(ptr %mem, i8* %foo) {
  store i8* %foo, ptr %mem
  ret void
}
define void @bar(ptr %mem, i8* %foo) {
  %asint = ptrtoint i8* %foo to i64 ; Or whatever pointer size you have
  store i64 %asint, ptr %mem
  ret void
}
In other words, we are to accept that every load and store instruction
of a pointer has an implicit inttoptr or ptrtoint attached to it. But as
I mentioned earlier, pointers have extra metadata attached to them
that is lost when converting to an integer. Under this strict
interpretation of memory, we *lose* that metadata every time a pointer
is stored in memory, as if we did an inttoptr(ptrtoint x). Thus, the
following two functions are *not* semantically equivalent in that model:
define i8* @basic(i8* %in) {
  ret i8* %in
}
define i8* @via_alloc(i8* %in) {
  %mem = alloca i8*
  store i8* %in, i8** %mem
  %out = load i8*, i8** %mem
  ret i8* %out
}
In order to allow these two functions to be equivalent, we have to let
the load of a pointer recover the provenance data stored by the store of
the pointer, and nothing more general. If either one of those were
instead an integer load or store, then no provenance data could be
communicated, so the integer and the pointer loads *must* be
nonequivalent (although loading an integer instead of a pointer would
presumably be a pessimistic transformation).
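The difference between the two readings of memory can be sketched with a toy Python model. Everything here is an assumption made for illustration (the class names, the EXPOSED set, provenance as a set of allocation names); it is not how LLVM is implemented, only a way to see why @basic and @via_alloc come apart under the strict reading.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ptr:
    addr: int
    prov: frozenset  # hypothetical: allocation names this pointer may access

EXPOSED = set()  # provenance exposed by any ptrtoint so far

def ptrtoint(p):
    EXPOSED.update(p.prov)
    return p.addr

def inttoptr(n):
    return Ptr(n, frozenset(EXPOSED))

class IntOnlyMemory:
    """Strict 'untyped' reading: cells hold integers, nothing else."""
    def __init__(self):
        self.cells = {}
    def store_ptr(self, slot, p):
        self.cells[slot] = ptrtoint(p)        # implicit ptrtoint on store
    def load_ptr(self, slot):
        return inttoptr(self.cells[slot])     # implicit inttoptr on load

class ProvenanceMemory:
    """Alternative reading: a pointer store keeps the provenance."""
    def __init__(self):
        self.cells = {}
    def store_ptr(self, slot, p):
        self.cells[slot] = p
    def store_int(self, slot, n):
        self.cells[slot] = n                  # integers carry no provenance
    def load_ptr(self, slot):
        v = self.cells[slot]
        return v if isinstance(v, Ptr) else inttoptr(v)

def basic(p):
    return p

def via_alloc(mem, p):
    mem.store_ptr("slot", p)
    return mem.load_ptr("slot")

other = Ptr(0x2000, frozenset({"b"}))
ptrtoint(other)                               # some unrelated pointer escapes
p = Ptr(0x1000, frozenset({"a"}))

strict = via_alloc(IntOnlyMemory(), p)        # provenance widened to {a, b}
kept = via_alloc(ProvenanceMemory(), p)       # provenance preserved

m = ProvenanceMemory()
m.store_int("slot", ptrtoint(p))              # integer store of the same bits
mixed = m.load_ptr("slot")                    # pointer load cannot recover {a}
```

Under IntOnlyMemory the round trip through memory is observably different from basic, while ProvenanceMemory makes the two agree as long as both the store and the load are pointer-typed.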
In short, pointers have pointery bits that aren't reflected in the
bitvector representation that an integer has. LLVM has some optimizations
that assume that loads and stores only have bitvector manipulation
semantics, while other optimizations (and most of the frontends) expect
that loads and stores will preserve the pointery bits. And when these
interact with each other, it's undoubtedly possible that the pointery
bits get lost along the way.
--
Joshua Cranmer