[llvm-dev] [RFC] Introducing a byte type to LLVM
Ralf Jung via llvm-dev
llvm-dev at lists.llvm.org
Sun Jun 13 08:09:33 PDT 2021
to add to what Nuno said, I came up with a series of optimizations which
demonstrate that dead store elimination, integer "substitution" (GVN replacing
'a' by 'b' after an 'a == b'), and provenance-based alias analysis, are in
conflict with each other: the current LLVM semantics are inconsistent and can
lead to miscompilations.
I posted them here as Rust code:
A direct translation to C wouldn't be correct due to strict aliasing, but if you
imagine this being LLVM IR (or C code compiled with -fno-strict-aliasing), then
I hope this example demonstrates that *something* needs to give. This is not
just a small bug somewhere, it is a case of the implicit assumptions made by
different optimization passes / instructions being mutually incompatible.
On 06.06.21 20:26, Nuno Lopes wrote:
> Let me try to explain the issue around typed memory. It’s orthogonal to TBAA.
> Plus I agree the allocation type is irrelevant; it’s the type of the load/store
> operations that matters.
> Fact 1: LLVM needs an “universal holder” type in its IR.
> * Frontends load data from memory and don’t necessarily know the type they are
> loading. When you load a char in C, it can be part of a pointer or an
> integer, for example. Clang has no way to know.
> * That’s necessary to express implementations of memcpy in IR, as it copies
> raw data without knowing what the type is.
> * LLVM lowers small memcpys into load/store.
> I hope you agree LLVM needs a type that can contain any other type.
> Fact 2: optimizations for this “universal holder” type need to be correct for
> any LLVM IR type.
> Since this type can hold anything, optimizations that apply to this type must be
> correct for integers, pointers, etc. Otherwise, the results would be incorrect
> if the data was cast to the “wrong” type.
> The corollary of this fact is that GVN is not correct for the “universal holder”
> type in general. That’s because GVN is not correct for pointers in general
> (though correct in some cases). GVN doesn’t work for pointers since just because
> 2 pointers compare equal, it doesn’t mean they have the same provenance. As
> LLVM’s AA uses provenance information, this information must be preserved.
> Hopefully you agree with everything I wrote so far. If not please stop me and ask.
> Ok, so what’s the solution?
> We have two candidates for this “universal holder” type:
> * integer (i/<x>/)
> * byte (b/<x>/), a new dedicated type
> The advantage of going with integers is that they already exist. The
> disadvantage? To make LLVM correct, we would need to disable GVN in most cases
> or stop using provenance information in AA (make pointers == integers).
> This sounds pretty bad to me due to the expect performance hit.
> Introducing a new type requires work, yes. But it allows us to keep all integer
> optimizations as aggressive as we want, and only throttle down when encountering
> the byte type. Another benefit is that loading a byte from memory doesn’t escape
> any pointer, while with the first solution you need to consider all stored
> pointers as escaped when loading an integer.
> Introducing the byte type allow us to fix all the integer/pointer casts bugs in
> LLVM IR. And it enables more optimizations as we won’t need to be careful
> because of the LLVM IR bug.
> A 3^rd option is to keep everything as-is. That is, keep wrong optimizations in
> LLVM and keep adding hacks to limit the miscompilations in real-world programs.
> That has bitten us multiple times. Last time someone told me my miscompilation
> example was academic, LLVM ended up miscompiling itself, and then miscompiled
> Google’s code. Only this year we are planning to fully fix that bug (we have a
> hack right now).
> Last week Alive2 caught a miscompilation in the Linux kernel, in the network
> stack. The optimization got the pointer arithmetic wrong. Pretty scary, and the
> bug may have security implications.
> Anyway, just to argue that leaving things as-is is not an option. We have
> momentum and 3 GSoC students ready to help to fix the last few design bugs in
> LLVM IR.
> I know this stuff is complex; please ask questions if something is not clear or
> if you disagree with anything I wrote above.
> *From:* Chris Lattner via llvm-dev
> *Sent:* 06 June 2021 05:27
> *To:* John McCall <rjmccall at apple.com>
> *Cc:* llvm-dev at lists.llvm.org; Ralf Jung <jung at mpi-sws.org>; cfe-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM
> On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe-dev at lists.llvm.org
> <mailto:cfe-dev at lists.llvm.org>> wrote:On 4 Jun 2021, at 11:24, George Mitenkov
> Hi all,
> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
> type to LLVM to fix miscompilations due to load type punning. Please see
> the proposal below. It would be great to hear the
> char and unsigned char are considered to be universal holders in C. They
> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
> counterpart but it does not have such semantics, which is also not
> desirable as it would disable many optimizations.
> I don’t believe this is correct. LLVM does not have an innate
> concept of typed memory. The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access. There are no restrictions on
> what types can be used to load or store from certain objects.
> C-style type aliasing restrictions are imposed using |tbaa|
> metadata, which are unrelated to the IR type of the access.
> I completely agree with John. “i8” in LLVM doesn’t carry any implications about
> aliasing (in fact, LLVM pointers are going towards being typeless). Any such
> thing occurs at the accesses, and are part of TBAA.
> I’m opposed to adding a byte type to LLVM, as such semantic carrying types are
> entirely unprecedented, and would add tremendous complexity to the entire system.
More information about the llvm-dev