[llvm-dev] [RFC] Introducing a byte type to LLVM

Tue Jun 22 02:58:58 PDT 2021

Hi John,

> Unfortunately, though, I think this non-determinism still doesn’t allow LLVM
> to be anywhere near as naive about pointer-to-int casts as it is today.

Definitely. There are limits to how naive one can be; beyond those limits, 
miscompilations lurk. <https://www.ralfj.de/blog/2020/12/14/provenance.html> 
explains this by showing such a miscompilation arising from three naive 
optimizations being chained together.

> The rule is intended to allow the compiler to start doing use-analysis
> of exposures; let’s assume that this analysis doesn’t see any
> un-analyzable uses, since of course it would need to conservatively
> treat them as escapes. But if we can optimize uses of integers as if
> they didn’t carry pointer data — say, in a function that takes integer
> parameters — and then we can apply those optimized uses to integers
> that concretely result from pointer-to-int casts — say, by inlining
> that function into one of its callers — can’t we end up with a use
> pattern for one or more of those pointer-to-int casts that no longer
> reflects the fact that it’s been exposed? It seems to me that either
> (1) we cannot do those optimizations on opaque integers or (2) we
> need to record that we did them in a way that, if it turns out that
> they were created by a pointer-to-int cast, forces other code to
> treat that pointer as opaquely exposed.

There is a third option: don't optimize away ptr-int-ptr roundtrips. Then you 
can still do all the same optimizations on integers that LLVM does today, 
completely naively -- the integer world remains "sane". Only the pointer world 
has to be "strange".
(You also cannot do things like GVN replacement of *pointer-typed* values, but 
for values of integer type this remains unproblematic.)

I don't think it makes sense for LLVM to adopt an explicit "exposed" flag in its 
semantics. Reasoning based on non-determinism works fine, and has the advantage 
of keeping ptr-to-int casts a pure, side-effect-free operation. This is the 
model we explored in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf>, and 
we were able to formally show quite a few of LLVM's standard optimizations 
correct. Some changes are still needed as you noted, but those changes will be 
required anyway even if LLVM were to adopt PNVI-ae:
- No removal of ptr-int-ptr roundtrips. 
(https://bugs.llvm.org/show_bug.cgi?id=34548)
- No GVN replacement of pointer-typed values. 
(https://bugs.llvm.org/show_bug.cgi?id=35229)
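
To sketch why the second point matters, here is a toy Python model. It is an 
illustration only, not LLVM's actual semantics: `Ptr`, `ALLOCS`, `can_access` 
and `ptr_eq` are hypothetical names, and the adjacent layout of A and B is an 
assumption. Two pointers can compare equal by address and still not be 
interchangeable:

```python
from typing import NamedTuple


class Ptr(NamedTuple):
    addr: int
    alloc: str  # hypothetical provenance tag: the allocation this pointer is derived from


# Assumed layout: two adjacent 4-byte allocations.
ALLOCS = {"A": range(0x1000, 0x1004), "B": range(0x1004, 0x1008)}


def can_access(p: Ptr) -> bool:
    """A pointer may only access addresses inside its own allocation."""
    return p.addr in ALLOCS[p.alloc]


def ptr_eq(p: Ptr, q: Ptr) -> bool:
    """Models a pointer comparison that only looks at the address."""
    return p.addr == q.addr


p = Ptr(0x1004, "A")  # one past the end of A
q = Ptr(0x1004, "B")  # first byte of B

print(ptr_eq(p, q), can_access(q), can_access(p))
```

In this model, replacing q by p after a successful comparison would turn an 
allowed access into UB. Doing the same replacement on plain integers loses 
nothing, because integers carry no such tag.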

>     (I'm not sure whether this is a good place to introduce this, but) we
>     actually have semantics for pointer castings tailored to LLVM (link
>     <https://sf.snu.ac.kr/publications/llvmtwin.pdf>).
>     In this proposal, ptrtoint does not have an escaping side effect; ptrtoint
>     and inttoptr are scalar operations.
>     inttoptr simply returns a pointer which can access any object.
> 
> Skimming your paper, I can see how this works *except* that I don’t
> see any way not to treat ptrtoint as an escape. And really I think
> you’re already partially acknowledging that, because that’s the only
> real sense of saying that inttoptr(ptrtoint p) can’t be reduced to
> p. If those are really just scalar operations that don’t expose
> p in ways that might be disconnected from the uses of the inttoptr
> then that reduction ought to be safe.

They are indeed just scalar operations, but the reduction is not safe.
The reason is that pointer-typed variables have values of the form "(addr, 
provenance)". There is essentially an 'invisible' component in each pointer 
value that tracks some additional information -- the "provenance" of the 
pointer. Casting a ptr to an int removes that provenance. Casting an int to a 
ptr picks a "default" provenance. So the overall effect of inttoptr(ptrtoint p) 
is to turn "(addr, provenance)" into "(addr, DEFAULT_PROVENANCE)".
Clearly that is *not* a NOP, and hence performing the reduction actually changes 
the result of this operation. Before the reduction, the resulting pointer had 
DEFAULT_PROVENANCE; after the reduction, it maintains the original provenance of 
"p". This can introduce UB into previously UB-free programs.

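As a minimal sketch of that argument, the following toy Python model treats a 
pointer as an (addr, provenance) pair. It is an illustration under assumptions, 
not the semantics from the paper: `Ptr`, `ALLOCS`, `store` and 
`UndefinedBehavior` are made-up names, the two allocations are assumed to be 
adjacent, and DEFAULT_PROVENANCE is modeled as "may access any object".

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PROVENANCE = None  # assumption: the default provenance may access any object


@dataclass(frozen=True)
class Ptr:
    addr: int
    prov: Optional[str]  # name of the originating allocation, or DEFAULT_PROVENANCE


# Assumed layout: two adjacent 4-byte allocations.
ALLOCS = {"A": range(0x1000, 0x1004), "B": range(0x1004, 0x1008)}


class UndefinedBehavior(Exception):
    pass


def ptrtoint(p: Ptr) -> int:
    return p.addr  # the provenance is dropped


def inttoptr(i: int) -> Ptr:
    return Ptr(i, DEFAULT_PROVENANCE)  # a default provenance is picked


def store(p: Ptr, memory: dict, value: int) -> None:
    # Find the allocation that owns the address, then check the provenance.
    owner = next((name for name, r in ALLOCS.items() if p.addr in r), None)
    if owner is None or p.prov not in (DEFAULT_PROVENANCE, owner):
        raise UndefinedBehavior(f"{p} may not access address {p.addr:#x}")
    memory[p.addr] = value


p = Ptr(0x1004, "A")               # one past the end of A, same address as B's start
roundtrip = inttoptr(ptrtoint(p))  # (addr, DEFAULT_PROVENANCE)

memory = {}
store(roundtrip, memory, 42)       # allowed: default provenance can reach B
try:
    store(p, memory, 42)           # the "reduced" program: p may not reach B
except UndefinedBehavior as err:
    print("UB after the reduction:", err)
```

The store through the roundtripped pointer succeeds, while the same store 
through p raises, so in this model replacing inttoptr(ptrtoint p) by p changes 
a UB-free program into one with UB.
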
Kind regards,
Ralf