[llvm-dev] Demystifying the byte type

Tue Oct 19 13:43:40 PDT 2021

The way I understand it, the problem that the byte type is meant to solve is part of a broader problem: the inconsistency of pointer semantics in LLVM (and other compilers, for that matter). Subtle differences in how individual optimization passes understand pointer semantics cause misoptimizations, and identifying which pass is the culprit is challenging. This is not helped by the LLVM language reference being outright incorrect here: it describes provenance in terms of data dependence, even through integers, which is not how any of our analyses actually work; they generally reason in terms of escape analysis instead.

However, the byte type proposal feels to me like it is motivated by a minor portion of the problem, so narrow that it only really solves the “how to write memcpy in standard C” aspect. It doesn’t really address how adding byte types would fix miscompilations, especially anything beyond memcpy (for example, C code compiled with -fno-strict-aliasing). It doesn’t suggest any fixes to the currently known inconsistencies in the language specification. And as a result, it ends up being rather dismissive about why isolated fixes to various optimization passes are insufficient to achieve coherent semantics.

Stepping back a bit, it’s helpful to understand that, for the purposes of building an operational semantics, a pointer is not an i64 but a { i64, BOOM (Bag Of Other Metadata) }, where the BOOM contains sufficient information to explain when a load or store through a pointer is undefined behavior, including liveness information, provenance, and noalias rules [1]. Described like this, three things should be clear. First, the inttoptr instruction has to recreate the BOOM given no information, which is necessarily a pessimistic assumption (it may be useful to have intrinsics that provide a less pessimistic recreation of the BOOM). Second, loads and stores of pointers in memory need to preserve the BOOM, presumably through a generally inaccessible shadow memory feature. Finally, the interaction of non-pointer types with the representation of the BOOM in memory needs to be defined.
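
As a rough illustration of the { i64, BOOM } picture, a toy model in C might look like the following. Everything in it is an assumption made for the sketch and not LLVM's actual semantics: the ToyPtr and ToyAlloc names, the reduction of the BOOM to a single allocation id, and the choice to let inttoptr mean "any allocation whose address has already escaped through ptrtoint".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: names and rules are illustrative assumptions. */
enum { PROV_NONE = -1, PROV_ANY_EXPOSED = -2, MAX_ALLOCS = 8 };

typedef struct {
    uint64_t addr;  /* the i64 half of the pointer */
    int prov;       /* the BOOM half: an allocation id or a PROV_* marker */
} ToyPtr;

typedef struct {
    uint64_t base, size;
    bool live;      /* would become false once the allocation is freed */
    bool exposed;   /* true once its address escaped via ptrtoint */
} ToyAlloc;

static ToyAlloc toy_allocs[MAX_ALLOCS];
static int toy_alloc_count = 0;

/* Create an allocation and return a pointer that carries its id. */
static ToyPtr toy_alloc(uint64_t base, uint64_t size) {
    assert(toy_alloc_count < MAX_ALLOCS);
    int id = toy_alloc_count++;
    toy_allocs[id] = (ToyAlloc){ base, size, true, false };
    return (ToyPtr){ base, id };
}

/* True if [addr, addr + len) lies inside a live allocation. */
static bool in_bounds(const ToyAlloc *a, uint64_t addr, uint64_t len) {
    return a->live && addr >= a->base && len <= a->size &&
           addr - a->base <= a->size - len;
}

/* ptrtoint drops the BOOM but records that the allocation escaped. */
static uint64_t toy_ptrtoint(ToyPtr p) {
    if (p.prov >= 0)
        toy_allocs[p.prov].exposed = true;
    return p.addr;
}

/* inttoptr has no information, so it returns a pessimistic wildcard. */
static ToyPtr toy_inttoptr(uint64_t addr) {
    return (ToyPtr){ addr, PROV_ANY_EXPOSED };
}

/* Would an access of len bytes through p be defined in this model? */
static bool toy_access_ok(ToyPtr p, uint64_t len) {
    if (p.prov >= 0)
        return in_bounds(&toy_allocs[p.prov], p.addr, len);
    if (p.prov == PROV_ANY_EXPOSED) {
        for (int i = 0; i < toy_alloc_count; i++)
            if (toy_allocs[i].exposed && in_bounds(&toy_allocs[i], p.addr, len))
                return true;
    }
    return false;
}
```

In this sketch, a pointer built with toy_inttoptr from a guessed address is rejected by toy_access_ok until some pointer to that allocation has passed through toy_ptrtoint, which is one way to read ptrtoint as a vehicle for escaping pointers.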

Fundamentally, then, the problem is inttoptr (and, to a lesser degree, ptrtoint, since it is a vehicle for escaping pointers), and memory is involved only insofar as it constitutes a ‘hidden’ inttoptr (and ptrtoint). But byte doesn’t really expose the ‘hidden’ inttoptr; it just hides it in a different place. Indeed, the existing ones remain if you load a pointer as an i64. To me, byte appears useful only as a way to canonicalize @llvm.memcpy into a regular load type, but an entirely new type doesn’t seem necessary for that: intrinsics that give access to reading and writing the shadow BOOM seem like they would be sufficient. You might argue that such intrinsics would eliminate the ability of users to write their own copies of memcpy, but even here byte is an insufficient proposal, because there’s no way to write a word-based memcpy in C with it (assuming -fno-strict-aliasing, of course).
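
The ‘hidden’ inttoptr in memory can be sketched with a toy shadow memory in C. This is only an illustration under stated assumptions: the names (ToyPtr, mem_shadow, copy_bytes, copy_words) are hypothetical, the BOOM is reduced to one allocation id per byte, and copy_bytes stands in for any copy that carries the shadow along, whether that is done by a byte-like type or by shadow-access intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: every name and rule below is a hypothetical
   illustration, not LLVM's semantics or the byte type proposal. */
enum {
    PROV_NONE = 0,          /* byte was written as a plain integer */
    PROV_ANY_EXPOSED = -1,  /* pessimistic result of a hidden inttoptr */
    MEM_SIZE = 64,
    WORD = 8
};

typedef struct {
    uint64_t addr;
    int prov;               /* allocation id (> 0) or a PROV_* marker */
} ToyPtr;

static uint8_t mem_data[MEM_SIZE];   /* ordinary, visible bytes */
static int mem_shadow[MEM_SIZE];     /* per-byte shadow provenance */

/* Pointer store: writes the address bytes and their shadow. */
static void store_ptr(size_t off, ToyPtr p) {
    assert(off + WORD <= MEM_SIZE);
    for (size_t i = 0; i < WORD; i++) {
        mem_data[off + i] = (uint8_t)(p.addr >> (8 * i));
        mem_shadow[off + i] = p.prov;
    }
}

/* Integer load: sees only the data bytes (a hidden ptrtoint). */
static uint64_t load_i64(size_t off) {
    assert(off + WORD <= MEM_SIZE);
    uint64_t v = 0;
    for (size_t i = 0; i < WORD; i++)
        v |= (uint64_t)mem_data[off + i] << (8 * i);
    return v;
}

/* Integer store: writes data and clears the shadow. */
static void store_i64(size_t off, uint64_t v) {
    store_ptr(off, (ToyPtr){ v, PROV_NONE });
}

/* Pointer load: keeps the shadow only if all bytes agree on a real
   allocation id; otherwise it behaves like inttoptr. */
static ToyPtr load_ptr(size_t off) {
    ToyPtr p = { load_i64(off), mem_shadow[off] };
    for (size_t i = 0; i < WORD; i++)
        if (mem_shadow[off + i] != p.prov || p.prov <= 0)
            p.prov = PROV_ANY_EXPOSED;
    return p;
}

/* Shadow-preserving copy, one byte at a time. */
static void copy_bytes(size_t dst, size_t src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        mem_data[dst + i] = mem_data[src + i];
        mem_shadow[dst + i] = mem_shadow[src + i];
    }
}

/* Word-based copy through integer loads and stores: the data
   arrives intact, but the shadow does not. */
static void copy_words(size_t dst, size_t src, size_t n) {
    for (size_t i = 0; i + WORD <= n; i += WORD)
        store_i64(dst + i, load_i64(src + i));
}
```

In this model, a pointer copied with copy_bytes loads back with its original allocation id, while the same pointer copied with copy_words loads back with only the pessimistic PROV_ANY_EXPOSED marker, even though the address bits are identical.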

With that in mind, I’d like to ask a few questions:

Have you been tracking the WG14 study group on provenance?

Have you attempted to put together some form of provenance semantics in a tool like Alive2 to more comprehensively catalogue miscompilations in existing optimizations?

[1] My first instinct is to say that the BOOM is the set of allocations the pointer may point to, but there may be edge cases that I’m not immediately thinking of. Formal semantics is not my forte, after all.

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of George Mitenkov via llvm-dev
Sent: Friday, October 15, 2021 14:41
To: llvm-dev <llvm-dev at lists.llvm.org>; cfe-dev at lists.llvm.org Developers <cfe-dev at lists.llvm.org>
Subject: [llvm-dev] Demystifying the byte type

Hi all,

In May 2021, together with Nuno Lopes and Juneyoung Lee, we proposed adding a byte type to LLVM to fix load type punning issues. The initial RFC touched on some subtle aspects of LLVM IR and its semantics, and sparked a lot of questions, concerns, and discussions.

We decided to write a post that would summarise the thread and the complicated topic:

https://gist.github.com/georgemitenkov/3def898b8845c2cc161bd216cbbdb81f

We hope that our post clarifies initial concerns raised on the mailing list. As always, any questions, suggestions and advice are welcome!

Thanks,
George