[llvm-dev] [RFC] Introducing a byte type to LLVM

Sun Jun 13 08:09:33 PDT 2021

Hi all,

to add to what Nuno said, I came up with a series of optimizations which 
demonstrate that dead store elimination, integer "substitution" (GVN replacing 
'a' by 'b' after an 'a == b'), and provenance-based alias analysis, are in 
conflict with each other: the current LLVM semantics are inconsistent and can 
lead to miscompilations.

I posted them here as Rust code: 
<https://github.com/rust-lang/unsafe-code-guidelines/issues/286#issuecomment-860189806>. 
A direct translation to C wouldn't be correct due to strict aliasing, but if you 
imagine this being LLVM IR (or C code compiled with -fno-strict-aliasing), then 
I hope this example demonstrates that *something* needs to give. This is not 
just a small bug somewhere, it is a case of the implicit assumptions made by 
different optimization passes / instructions being mutually incompatible.

Kind regards,
Ralf

On 06.06.21 20:26, Nuno Lopes wrote:
> Let me try to explain the issue around typed memory. It’s orthogonal to TBAA. 
> Plus I agree the allocation type is irrelevant; it’s the type of the load/store 
> operations that matters.
> 
> Fact 1: LLVM needs an “universal holder” type in its IR.
> 
> Why?
> 
>   * Frontends load data from memory and don’t necessarily know the type they are
>     loading. When you load a char in C, it can be part of a pointer or an
>     integer, for example. Clang has no way to know.
>   * That’s necessary to express implementations of memcpy in IR, as it copies
>     raw data without knowing what the type is.
>   * LLVM lowers small memcpys into load/store.
> 
> I hope you agree LLVM needs a type that can contain any other type.
> 
> Fact 2: optimizations for this “universal holder” type need to be correct for 
> any LLVM IR type.
> 
> Since this type can hold anything, optimizations that apply to this type must be 
> correct for integers, pointers, etc. Otherwise, the results would be incorrect 
> if the data was cast to the “wrong” type.
> 
> The corollary of this fact is that GVN is not correct for the “universal holder” 
> type in general. That’s because GVN is not correct for pointers in general 
> (though correct in some cases). GVN doesn’t work for pointers since just because 
> 2 pointers compare equal, it doesn’t mean they have the same provenance. As 
> LLVM’s AA uses provenance information, this information must be preserved.
> 
> Hopefully you agree with everything I wrote so far. If not please stop me and ask.
> 
> Ok, so what’s the solution?
> 
> We have two candidates for this “universal holder” type:
> 
>   * integer (i/<x>/)
>   * byte (b/<x>/), a new dedicated type
> 
> The advantage of going with integers is that they already exist. The 
> disadvantage? To make LLVM correct, we would need to disable GVN in most cases 
> or stop using provenance information in AA (make pointers == integers).
> 
> This sounds pretty bad to me due to the expect performance hit.
> 
> Introducing a new type requires work, yes. But it allows us to keep all integer 
> optimizations as aggressive as we want, and only throttle down when encountering 
> the byte type. Another benefit is that loading a byte from memory doesn’t escape 
> any pointer, while with the first solution you need to consider all stored 
> pointers as escaped when loading an integer.
> 
> Introducing the byte type allow us to fix all the integer/pointer casts bugs in 
> LLVM IR. And it enables more optimizations as we won’t need to be careful 
> because of the LLVM IR bug.
> 
> A 3^rd option is to keep everything as-is. That is, keep wrong optimizations in 
> LLVM and keep adding hacks to limit the miscompilations in real-world programs.
> 
> That has bitten us multiple times. Last time someone told me my miscompilation 
> example was academic, LLVM ended up miscompiling itself, and then miscompiled 
> Google’s code. Only this year we are planning to fully fix that bug (we have a 
> hack right now).
> 
> Last week Alive2 caught a miscompilation in the Linux kernel, in the network 
> stack. The optimization got the pointer arithmetic wrong. Pretty scary, and the 
> bug may have security implications.
> 
> Anyway, just to argue that leaving things as-is is not an option. We have 
> momentum and 3 GSoC students ready to help to fix the last few design bugs in 
> LLVM IR.
> 
> I know this stuff is complex; please ask questions if something is not clear or 
> if you disagree with anything I wrote above.
> 
> Thanks,
> 
> Nuno
> 
> *From:* Chris Lattner via llvm-dev
> *Sent:* 06 June 2021 05:27
> *To:* John McCall <rjmccall at apple.com>
> *Cc:* llvm-dev at lists.llvm.org; Ralf Jung <jung at mpi-sws.org>; cfe-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM
> 
> On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe-dev at lists.llvm.org 
> <mailto:cfe-dev at lists.llvm.org>> wrote:On 4 Jun 2021, at 11:24, George Mitenkov 
> wrote:
> 
>         Hi all,
> 
>         Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
>         type to LLVM to fix miscompilations due to load type punning. Please see
>         the proposal below. It would be great to hear the
>         feedback/comments/suggestions!
> 
> 
>         Motivation
>         ==========
> 
>         char and unsigned char are considered to be universal holders in C. They
>         can access raw memory and are used to implement memcpy. i8 is the LLVM’s
>         counterpart but it does not have such semantics, which is also not
>         desirable as it would disable many optimizations.
> 
>     I don’t believe this is correct. LLVM does not have an innate
>     concept of typed memory. The type of a global or local allocation
>     is just a roundabout way of giving it a size and default alignment,
>     and similarly the type of a load or store just determines the width
>     and default alignment of the access. There are no restrictions on
>     what types can be used to load or store from certain objects.
> 
>     C-style type aliasing restrictions are imposed using |tbaa|
>     metadata, which are unrelated to the IR type of the access.
> 
> I completely agree with John.  “i8” in LLVM doesn’t carry any implications about 
> aliasing (in fact, LLVM pointers are going towards being typeless).  Any such 
> thing occurs at the accesses, and are part of TBAA.
> 
> I’m opposed to adding a byte type to LLVM, as such semantic carrying types are 
> entirely unprecedented, and would add tremendous complexity to the entire system.
> 
> -Chris
> 

-- 
Website: https://people.mpi-sws.org/~jung/