[cfe-dev] [RFC] Introducing a byte type to LLVM

Fri Jun 4 15:27:23 PDT 2021

On 6/4/2021 2:25 PM, John McCall via cfe-dev wrote:
> I don’t believe this is correct. LLVM does not have an innate
> concept of typed memory. The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access. There are no restrictions on
> what types can be used to load or store from certain objects.
>
> C-style type aliasing restrictions are imposed using |tbaa|
> metadata, which are unrelated to the IR type of the access.
>
> John.
>
I've never been thoroughly involved in any of the actual optimizations 
here, but it seems to me that there is a soundness hole in the LLVM 
semantics that we gloss over when we say that LLVM doesn't have typed 
memory.

Working backwards from what a putative operational semantics of LLVM 
might look like (and I'm going to ignore poison/undef because it's not 
relevant), I think there is agreement that integer types in LLVM are 
purely bitvectors. Any value of i64 5 can be replaced with any other 
value of i64 5 no matter where it came from. At the same time, when we 
have pointers involved, this is not true. Two pointers may have the same 
numerical value (e.g., when cast to integers), but one might not be 
replaceable with the other because there's other data that might not be 
the same. So in operational terms, pointers have both a numerical value 
and a bag of provenance data (probably other stuff, but let's be simple 
and call it provenance).
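
To make the distinction concrete, here is a toy model in Python. It is
only a sketch of the idea, not LLVM's actual semantics; the Ptr class
and the provenance labels "A" and "B" are made up for illustration.

```python
from dataclasses import dataclass

# Toy model only: `Ptr` and the provenance labels are invented for this sketch.
@dataclass(frozen=True)
class Ptr:
    addr: int   # numerical value, what a cast to integer would observe
    prov: str   # stand-in for the "bag of provenance data"

def same_number(p: Ptr, q: Ptr) -> bool:
    """Compare only the numerical values, as if both were cast to integers."""
    return p.addr == q.addr

def interchangeable(p: Ptr, q: Ptr) -> bool:
    """One pointer may replace the other only if the provenance matches too."""
    return p == q

# For instance, one-past-the-end of object A can share an address with
# the start of object B.
end_of_a = Ptr(addr=0x1008, prov="A")
start_of_b = Ptr(addr=0x1008, prov="B")

print(same_number(end_of_a, start_of_b))      # True: the numbers agree
print(interchangeable(end_of_a, start_of_b))  # False: provenance differs
print(5 == 5)                                 # True: an integer is only its value
```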

Now we have to ask what the semantics of converting between integers and 
pointers are. Integers, as we've defined, don't have provenance data. So 
an inttoptr instruction has to synthesize that provenance somehow. 
Ideally, we'd want to grab that data from the ptrtoint instruction that 
generated the integer, but the semantics of integers means we can only 
launder that data globally, so that an inttoptr has the union of all of 
the provenance data that was ever fed into a ptrtoint (I suspect the 
actual semantics we use is somewhat more precise than this in that it 
only considers those pointers that point to still-live data, which 
doesn't invalidate anything I'm about to talk about).
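
A rough sketch of that "global laundering" rule, again as a toy Python
model rather than anything LLVM implements. The names Ptr, exposed,
ptrtoint and inttoptr are hypothetical helpers for this sketch, and the
liveness refinement mentioned above is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ptr:
    addr: int
    prov: frozenset   # set of allocation labels this pointer may access

# Global record of every provenance that has escaped into an integer.
exposed = set()

def ptrtoint(p: Ptr) -> int:
    exposed.update(p.prov)   # the resulting integer itself carries nothing
    return p.addr

def inttoptr(n: int) -> Ptr:
    # There is no way to tell which ptrtoint produced n, so take the
    # union of everything exposed so far.
    return Ptr(n, frozenset(exposed))

a = Ptr(0x1000, frozenset({"A"}))
b = Ptr(0x2000, frozenset({"B"}))
ia = ptrtoint(a)
ib = ptrtoint(b)
back = inttoptr(ia)

print(back.addr == a.addr)   # True: the number round-trips
print(sorted(back.prov))     # ['A', 'B']: provenance is the global union
print(back == a)             # False: not the pointer we started with
```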

Okay, what about memory? I believe what most people mean when 
they say that LLVM's memory is untyped is that a load or store of any 
type is equivalent to first converting it to an integer and then storing 
the integer into memory. E.g. these two functions are semantically 
equivalent:

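define void @foo(i8** %mem, i8* %foo) {
   store i8* %foo, i8** %mem
   ret void
}
define void @bar(i8** %mem, i8* %foo) {
   %asint = ptrtoint i8* %foo to i64 ; Or whatever pointer size you have
   ; Reinterpret the slot as an integer slot so the types line up
   %mem.int = bitcast i8** %mem to i64*
   store i64 %asint, i64* %mem.int
   ret void
}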

In other words, we are to accept that every load and store instruction 
of a pointer has an implicit inttoptr or ptrtoint attached to it. But as 
I mentioned earlier, pointers have this extra metadata attached to them 
that is lost when converting to an integer. Under this strict 
interpretation of memory, we *lose* that metadata every time a pointer 
is stored in memory, as if we did an inttoptr(ptrtoint x). Thus, the 
following two functions are *not* semantically equivalent in that model:

define i8* @basic(i8* %in) {
   ret i8* %in
}
define i8* @via_alloc(i8* %in) {
   %mem = alloca i8*
   store i8* %in, i8** %mem
   %out = load i8*, i8** %mem
   ret i8* %out
}

In order to allow these two functions to be equivalent, we have to let 
the load of a pointer recover the provenance data stored by the store of 
the pointer, and nothing more general. If either one of those were 
instead an integer load or store, then no provenance data can be 
communicated, so the integer and the pointer loads *must* be 
nonequivalent (although loading an integer instead of a pointer would 
presumably be a pessimistic transformation).
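
The two readings of memory can be compared in a toy Python model. This
is a sketch under my own assumptions, not LLVM's semantics: the classes
UntypedMemory and ProvenanceMemory, the ANY marker standing in for
whatever inttoptr synthesizes, and the helper functions are all
invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ptr:
    addr: int
    prov: frozenset

# Stand-in for the provenance that an inttoptr synthesizes in this toy.
ANY = frozenset({"<any exposed>"})

class UntypedMemory:
    """Every store goes through an integer, so only the number survives."""
    def __init__(self):
        self.cells = {}
    def store(self, slot, value):
        self.cells[slot] = value.addr if isinstance(value, Ptr) else value
    def load_ptr(self, slot):
        return Ptr(self.cells[slot], ANY)   # implicit inttoptr
    def load_int(self, slot):
        return self.cells[slot]

class ProvenanceMemory(UntypedMemory):
    """Pointer stores keep the pointer; integer accesses drop provenance."""
    def store(self, slot, value):
        self.cells[slot] = value
    def load_ptr(self, slot):
        v = self.cells[slot]
        return v if isinstance(v, Ptr) else Ptr(v, ANY)
    def load_int(self, slot):
        v = self.cells[slot]
        return v.addr if isinstance(v, Ptr) else v

def via_alloc(mem, p):
    """Store a pointer and load it back as a pointer."""
    mem.store("mem", p)
    return mem.load_ptr("mem")

def via_int_load(mem, p):
    """Store a pointer, load it as an integer, then convert back."""
    mem.store("mem", p)
    return Ptr(mem.load_int("mem"), ANY)

p = Ptr(0x1000, frozenset({"A"}))
print(via_alloc(UntypedMemory(), p) == p)        # False: provenance lost
print(via_alloc(ProvenanceMemory(), p) == p)     # True: @via_alloc acts like @basic
print(via_int_load(ProvenanceMemory(), p) == p)  # False: integer load drops it
```

In this sketch the integer load still returns the right number, which
is why the difference is invisible to anything that only looks at
bitvectors.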

In short, pointers have pointery bits that aren't reflected in the 
bitvector representation that an integer has. LLVM has some optimizations 
that assume that loads and stores only have bitvector manipulation 
semantics, while other optimizations (and most of the frontends) expect 
that loads and stores will preserve the pointery bits. And when these 
interact with each other, it's undoubtedly possible that the pointery 
bits get lost along the way.

-- 
Joshua Cranmer