[llvm-dev] [RFC] Introducing a byte type to LLVM
Ralf Jung via llvm-dev
llvm-dev at lists.llvm.org
Tue Jun 22 02:58:58 PDT 2021
Hi John,
> Unfortunately, though, I think this non-determinism still doesn’t allow LLVM
> to be anywhere near as naive about pointer-to-int casts as it is today.
Definitely. There are limits to how naive one can be; beyond those limits,
miscompilations lurk. <https://www.ralfj.de/blog/2020/12/14/provenance.html>
explains this by showing such a miscompilation arising from three naive
optimizations being chained together.
> The rule is intended to allow the compiler to start doing use-analysis
> of exposures; let’s assume that this analysis doesn’t see any
> un-analyzable uses, since of course it would need to conservatively
> treat them as escapes. But if we can optimize uses of integers as if
> they didn’t carry pointer data — say, in a function that takes integer
> parameters — and then we can apply those optimized uses to integers
> that concretely result from pointer-to-int casts — say, by inlining
> that function into one of its callers — can’t we end up with a use
> pattern for one or more of those pointer-to-int casts that no longer
> reflects the fact that it’s been exposed? It seems to me that either
> (1) we cannot do those optimizations on opaque integers or (2) we
> need to record that we did them in a way that, if it turns out that
> they were created by a pointer-to-int cast, forces other code to
> treat that pointer as opaquely exposed.
There is a third option: don't optimize away ptr-int-ptr roundtrips. Then you
can still do all the same optimizations on integers that LLVM does today,
completely naively -- the integer world remains "sane". Only the pointer world
has to be "strange".
(You also cannot do things like GVN replacement of *pointer-typed* values, but
for values of integer types this remains unproblematic.)
I don't think it makes sense for LLVM to adopt an explicit "exposed" flag in its
semantics. Reasoning based on non-determinism works fine, and has the advantage
of keeping ptr-to-int casts a pure, side-effect-free operation. This is the
model we explored in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf>, and
we were able to formally show quite a few of LLVM's standard optimizations
correct. Some changes are still needed, as you noted, but those changes would
be required even if LLVM were to adopt PNVI-ae:
- No removal of ptr-int-ptr roundtrips.
(https://bugs.llvm.org/show_bug.cgi?id=34548)
- No GVN replacement of pointer-typed values.
(https://bugs.llvm.org/show_bug.cgi?id=35229)
> (I'm not sure whether this is a good place to introduce this, but) we
> actually have semantics for pointer castings tailored to LLVM (link
> <https://sf.snu.ac.kr/publications/llvmtwin.pdf>).
> In this proposal, ptrtoint does not have an escaping side effect; ptrtoint
> and inttoptr are scalar operations.
> inttoptr simply returns a pointer which can access any object.
>
> Skimming your paper, I can see how this works /except/ that I don’t
> see any way not to treat |ptrtoint| as an escape. And really I think
> you’re already partially acknowledging that, because that’s the only
> real sense of saying that |inttoptr(ptrtoint p)| can’t be reduced to
> |p|. If those are really just scalar operations that don’t expose
> |p| in ways that might be disconnected from the uses of the |inttoptr|
> then that reduction ought to be safe.
They are indeed just scalar operations, but the reduction is not safe.
The reason is that pointer-typed variables have values of the form "(addr,
provenance)". There is essentially an 'invisible' component in each pointer
value that tracks some additional information -- the "provenance" of the
pointer. Casting a ptr to an int removes that provenance. Casting an int to a
ptr picks a "default" provenance. So the overall effect of inttoptr(ptrtoint p)
is to turn "(addr, provenance)" into "(addr, DEFAULT_PROVENANCE)".
Clearly that is *not* a NOP, and hence performing the reduction actually changes
the result of this operation. Before the reduction, the resulting pointer had
DEFAULT_PROVENANCE; after the reduction, it maintains the original provenance of
"p". This can introduce UB into previously UB-free programs.
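To make this concrete, here is a rough toy model in Python. It is only a sketch
under stated assumptions, not LLVM's actual semantics: the names Pointer,
ALLOCATIONS and access_is_defined are made up for illustration, and
DEFAULT_PROVENANCE is modeled as "may access any object", following the
description of inttoptr quoted above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sentinel: a pointer with this provenance may access any object.
DEFAULT_PROVENANCE = None


@dataclass(frozen=True)
class Pointer:
    addr: int
    provenance: Optional[str]  # name of the allocation, or DEFAULT_PROVENANCE


# Toy memory (assumed layout): allocation name -> (base address, size in bytes).
# B starts exactly one past the end of A.
ALLOCATIONS = {"A": (100, 4), "B": (104, 4)}


def ptrtoint(p: Pointer) -> int:
    # The cast keeps the address and drops the provenance.
    return p.addr


def inttoptr(i: int) -> Pointer:
    # The cast picks the default provenance.
    return Pointer(i, DEFAULT_PROVENANCE)


def access_is_defined(p: Pointer) -> bool:
    # An access is defined only if the address lies inside an allocation
    # that the pointer's provenance permits it to access.
    for name, (base, size) in ALLOCATIONS.items():
        if base <= p.addr < base + size:
            return p.provenance is DEFAULT_PROVENANCE or p.provenance == name
    return False


# p is derived from A but points one past its end, i.e. at the start of B.
p = Pointer(104, "A")
q = inttoptr(ptrtoint(p))

print(q == p)                # the roundtrip is not the identity
print(access_is_defined(q))  # accessing B through q is fine in this model
print(access_is_defined(p))  # the same access through p is not
```

In this model, a program that accesses memory through q is fine, but after
rewriting q to p the same access is no longer allowed. That is the sense in
which the reduction can introduce UB.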
Kind regards,
Ralf