<!DOCTYPE html>

<html>

<head>

<meta http-equiv="Content-Type" content="text/xhtml; charset=utf-8">

</head>

<body>

<div style="font-family:sans-serif"><div style="white-space:normal">

<p dir="auto">On 15 Jun 2021, at 13:29, Juneyoung Lee wrote:</p>


</div>

<div style="white-space:normal"><blockquote style="border-left:2px solid #3983C4; color:#3983C4; margin:0 0 5px; padding-left:5px"><p dir="auto">On Tue, Jun 15, 2021 at 4:07 PM John McCall <rjmccall@apple.com> wrote:</p>

<blockquote style="border-left:2px solid #3983C4; color:#7CBF0C; margin:0 0 5px; padding-left:5px; border-left-color:#7CBF0C"><p dir="auto">On 15 Jun 2021, at 1:49, Juneyoung Lee wrote:<br>

<br>

On Tue, Jun 15, 2021 at 1:08 AM John McCall via llvm-dev <<br>

llvm-dev@lists.llvm.org> wrote:<br>

<br>

The semantics you seem to want are that LLVM’s integer types cannot carry<br>

information from pointers. But I can cast a pointer to an integer in C and<br>

vice-versa, and compilers have de facto defined the behavior of subsequent<br>

operations like breaking the integer up (and then putting it back<br>

together), adding numbers to it, and so on. So no, as a C compiler writer,<br>

I do not have a choice; I will have to use a type that can validly carry<br>

pointer information for integers in C.<br>

<br>

int->ptr cast can reconstruct the pointer information, so making integer<br>

types not carry pointer information does not necessarily mean that<br>

dereferencing a pointer casted from integer is UB.<br>

<br>

What exactly is the claimed formal property of byte types, then,<br>

that integer types will lack? Because it seems to me that converting<br>

from an integer gives us valid provenance in strictly more situations<br>

than converting from bytes, since it reconstructs provenance if there’s<br>

any object at that address (under still-debated restrictions),<br>

while converting from bytes always preserves the original provenance<br>

(if any). I don’t understand how that can possibly give us *more*<br>

flexibility to optimize integers.<br>

</p>

</blockquote><p dir="auto">When two objects are adjacent, and an integer is exactly pointing to the<br>

location between them, its provenance cannot be properly recovered.<br>

<br>

int x[1], y[1];<br>

llvm.assume((intptr_t)&x[0] == 0x100 && (intptr_t)&y[0] == 0x104);<br>

int *p = (int*)(intptr_t)&x[1];<br>

// Q: Is p's provenance x or y?<br>

<br>

If it is expected that '*(p-1)' is equivalent to *x, p's provenance should<br>

be x.<br>

However, based on llvm.assume, optimizations on integers can<br>

replace (intptr_t)&x[1] with (intptr_t)&y[0] (which is what happened in the<br>

bug report).<br>

Then, '*(p-1)' suddenly becomes out-of-bounds access, which is UB.<br>

So, p's provenance isn't simply x or y; it should be something that can<br>

access both x and y.</p>

</blockquote></div>

<div style="white-space:normal">


<p dir="auto">This is something that the TR does not really have a firm grasp on yet.</p>


</div>

<div style="white-space:normal"><blockquote style="border-left:2px solid #3983C4; color:#3983C4; margin:0 0 5px; padding-left:5px"><p dir="auto">This implies that, unless there is a guarantee that all allocated objects<br>

are one or more bytes apart, there is no type that can perfectly store a<br>

pointer byte.<br>

memcpy(x, y, 8) isn't equivalent to 'v=load i64 y;store i64 v, x' because v<br>

already lost the pointer information .</p>

</blockquote></div>

<div style="white-space:normal">


<p dir="auto">Okay, so let me try to restate and summarize all this.  I’ve CC’ed<br>

a bunch of people back into this part of the thread.</p>


<p dir="auto">C is moving towards a provenance model; you can find the details in<br>

this committee TR that Joshua Cranmer linked:<br>

  <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf</a></p>


<p dir="auto">This TR is very clearly a work in progress and contains many<br>

digressions and several possible sets of rules with different<br>

implications.  I will try to briefly summarize.</p>


<p dir="auto">Formally, all storage has a unique provenance: a notional unique<br>

identifier for the specific allocation event of the storage, like a<br>

local variable coming into scope, or a specific call to <code>malloc</code>.<br>

Pointers formed to the storage formally carry that provenance (or<br>

invalid provenance, or in some corner cases ambiguous provenance).<br>

It is undefined behavior to do certain things with mismatched<br>

provenances.  For example, it is undefined behavior to access<br>

an object through a pointer whose provenance doesn’t match the<br>

current provenance of the storage containing that object.  This is<br>

a new way of formalizing the well-known rule that you can’t access<br>

a local variable after it’s gone out of scope.</p>


<p dir="auto">In the provenance model that looks most likely to be accepted, an<br>

int-to-pointer cast blesses the resulting pointer with the current<br>

provenance of the storage containing that address.  There does <em>not</em><br>

have to be any sort of data dependence between taking the address<br>

of the storage and the integer that’s used in the cast; it can<br>

simply be a lucky guess.  An integer could reasonably hold the<br>

representation of any address in the program, and so an<br>

int-to-pointer cast could bless its result with valid provenance<br>

for any storage in memory.  But if the pointed-to memory is<br>

subsequently deallocated, the provenance of the pointer will<br>

become out of date.</p>


<p dir="auto">Now, that rule as I’ve stated it would be really bad.  Allowing a<br>

lucky guess to resolve to absolutely anything would almost<br>

completely block the optimizer from optimizing memory.  For example,<br>

if a local variable came into scope, and then we called a function<br>

that returned a different pointer, we’d have to conservatively<br>

assume that that pointer might alias the local, even if the address<br>

of the local was never even taken, much less escaped:</p>


<pre style="border:thin solid gray; margin-left:15px; margin-right:15px; max-width:90vw; overflow-x:auto; padding:5px"><code>  int x = 0;

  int *p = guess_address_of_x();

  *p = 15;

  printf(“%d\n”, x); // provably 0?

</code></pre>


<p dir="auto">So the currently favored proposal adds a really important caveat:<br>

this blessing of provenance only works when a pointer with the<br>

correct provenance has been “exposed”.  There are several ways to<br>

expose a pointer, including I/O, but the most important is casting<br>

it to an integer.</p>


<p dir="auto">Again, there’s no requirement of a data dependence between the<br>

exposure and the int-to-pointer cast.  If the program casts a<br>

pointer to an integer, and an independently-produced integer<br>

that happens to be the same value is later cast to a pointer,<br>

and the storage hasn’t been reallocated in the meantime, the<br>

resulting pointer will have the right provenance for the memory<br>

and will be valid to use.  This implies that pointer-to-int casts<br>

(and other exposures) are semantically significant events in the<br>

program.  They don’t have side effects in the normal sense, but<br>

they must be treated by the compiler just like things that do have<br>

side effects: e.g. unless I’m missing something in the TR,<br>

eliminating a completely unused pointer-to-int cast may make<br>

later code UB.</p>


<p dir="auto">And in fact, it turns out that this is crucially important for<br>

optimization.  If the optimizer wants to allow arbitrary<br>

replacement of integers based on conditional equality, like<br>

in GVN, then replacement totally breaks direct data dependence,<br>

and you can easily be left with no remaining uses of a pointer-to-int<br>

cast when the original code would have had a data dependence.  So<br>

you cannot reason about provenance through int-to-pointer casts:<br>

the pointer can alias any storage whose provenance has potentially<br>

been exposed, and the optimizer must be conservative about optimizing<br>

memory that has potentially been exposed.</p>


<p dir="auto">Of course, a lot of this isn’t new.  If code passes a pointer to an<br>

opaque function, then yeah, the optimizer has to assume that the<br>

function might expose it by performing a pointer-to-int cast on it.<br>

But that’s okay, because the optimizer already has to assume that<br>

the function can escape the pointer in a more standard way.  Exposure<br>

is just another form of escape.</p>


<p dir="auto">Everything I’ve been talking about so far is a C-level concept:<br>

an int-to-pointer cast is e.g. <code>(float*) myInt</code>, not <code>inttoptr</code><br>

in LLVM IR.  But I think people have an expectation of what these<br>

things mean in LLVM IR, and I haven’t seen it written out explicitly,<br>

so let’s do that now.</p>


<p dir="auto">The first assumption here is that int-to-pointer and pointer-to-int<br>

casts in C will translate to <code>inttoptr</code> and <code>ptrtoint</code> operations<br>

in IR.  Now, this is already problematic, because those operations<br>

do not currently have the semantics they need to have to make the<br>

proposed optimization model sound.  In particular:</p>


<ul>

<li><p dir="auto"><code>ptrtoint</code> does not have side-effects and can be dead-stripped<br>

when unused, which as discussed above is not okay.</p></li>

<li><p dir="auto"><code>ptrtoint</code> on a constant is folded to a constant expression,<br>

not an instruction, which is therefore no longer anchored in the<br>

code and does not reliably record that the global may have escaped.<br>

(Unused constant expressions do not really exist, and we absolutely<br>

cannot allow them to affect the semantics of the IR.)</p>


<p dir="auto">Of course, this is only significant for globals that don’t already<br>

have to be treated conservatively because they might have other<br>

uses.  That is, it only really matters for globals with, say,<br>

internal or private linkage.</p></li>

<li><p dir="auto"><code>inttoptr</code> can be reordered with other instructions, which is<br>

not allowed because different points in a function may have<br>

different sets of storage with escaped provenance.</p></li>

<li><p dir="auto"><code>inttoptr(ptrtoint)</code> can be peepholed; ignoring the dead-stripping<br>

aspects of removing the <code>inttoptr</code>, this also potentially<br>

introduces UB because the original <code>inttoptr</code> “launders” the<br>

provenance of the pointer to the current provenance of the<br>

storage, whereas the original pointer may have stale provenance.</p></li>

</ul>


<p dir="auto">There are probably other problems.  There are two ways that I see<br>

of addressing all of this.</p>


<p dir="auto">The first way is to take them on directly and change the core<br>

semantic rules of LLVM IR.  I think this is a huge change, but<br>

maybe it’s necessary in order to get the semantics we want.  We<br>

will need to introduce some new IR features to make this possible,<br>

both for performance and for expressivity.  For example, the way<br>

we currently express relative pointers in constant initializers<br>

involves <code>ptrtoint</code> and <code>sub</code> constant expressions, so if we<br>

remove <code>ptrtoint</code> constants, we’ll need something else.  And<br>

frontends will need to be much more cautious about not introducing<br>

unnecessary int<->pointer casts because they’ll be heavily<br>

pessimizing.</p>


<p dir="auto">The second way is to say that <code>inttoptr</code> and <code>ptrtoint</code> are just<br>

superficial type conversions and don’t affect provenance, then<br>

add builtins that do affect provenance.  But this leaves us still<br>

tracking provenance through integers, so any optimization that’s<br>

not valid on pointers will also not be valid on integers.</p>


<p dir="auto">All of this is not even considering the need for byte types yet.<br>

Here’s the thinking behind byte types as I understand it.</p>


<p dir="auto">The claim is that, to make all of the above work, we need<br>

int<->pointer conversions to be explicit in IR.  If there’s some<br>

other way of doing an int<->pointer conversion, we’re in trouble,<br>

because maybe we won’t realize that the conversion has happened,<br>

and so we won’t be appropriately conservative.  And unfortunately<br>

C provides some other ways to do these conversions, which happen<br>

to all go through memory: copying representations around, doing<br>

aliased loads and stores, whatever.  So we need to have some way<br>

to represent conversions that happen through these mechanisms.<br>

And C says that these mechanisms don’t generally affect provenance,<br>

so inserting <code>inttoptr</code>/<code>ptrtoint</code> casts is at the very least<br>

imprecise because it launders provenance.  So what we want is a<br>

type that maintains the original provenance, and therefore is<br>

subject to the same optimization restrictions as pointers.</p>


<p dir="auto">I don’t find either side of this argument very convincing.</p>


<p dir="auto">First, the compiler already has to be very conservative about<br>

memory.  If a pointer is stored into memory, we generally have<br>

to treat it as having fully escaped: unless we can prove that the<br>

memory isn’t loaded by other code, we have to assume that it is,<br>

and that the loading code will do arbitrary other stuff.  That<br>

could include doing a pointer-to-int cast and sharing the pointer<br>

back to us as an integer.  Therefore, in the general case, our<br>

ability to optimize when a pointer has escaped into memory is at<br>

least as bad as if it had escaped via an int-to-pointer cast.  If<br>

we <em>can</em> nonetheless optimize, because we can reason about some of<br>

the uses together or prove that there aren’t any other uses,<br>

then okay, maybe we see that there’s an int<->pointer conversion.<br>

But translating this to <code>ptrtoint</code>/<code>inttoptr</code> should be, at<br>

worst, overly conservative; it’s not unsound, for reasons I’m<br>

about to get into.</p>


<p dir="auto">Second, adding casts through an integer type is always valid.<br>

Doing so might block the optimizer, but it doesn’t break semantics.<br>

If I have a program that does e.g <code>*px = 15</code>, and I change it to<br>

do <code>*(int*)(intptr_t)px = 15</code>, my program has become well-defined<br>

in strictly more situations: in any case, there must be valid<br>

storage at <code>px</code> for this not to be UB, but previously <code>px</code> might<br>

have had the wrong provenance, and now it never does as long as<br>

the provenance for that storage has previously escaped.</p>


<p dir="auto">If we find that we’re generating too many unnecessary casts<br>

through integer types and it’s really blocking the optimizer too<br>

much, then we should consider the best solutions to those problems<br>

as they arise.  It may be the case that we need a better solution<br>

for type conversions introduced through manual memcpy-like code<br>

so that we maintain the original provenance instead of adding<br>

explicit int<->pointer conversions that launder provenance.<br>

I don’t know that byte types are the best solution to that, but<br>

we can consider them.</p>


<p dir="auto">But this whole conversation about byte types seems to be coming<br>

at it the wrong way around.  We need to be centering the first<br>

set of problems around int<->pointer casts.</p>


<p dir="auto">John.</p>

</div>

</div>

</body>

</html>