[llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM

Mon Jun 7 05:15:51 PDT 2021

IMHO, the ptr_provenance support that is part of the full restrict patches [0]
can likely help for these cases ? The allow to explicitly associate the ptr_provenance
of a load/store instruction.

In the past LLVM AA Tech Calls, we decided to pull those out of the full restrict
patches and prepare them for inclusion. (Which I still have on my list of todo's)

Greetings,

Jeroen Dobbelaere
[0] https://reviews.llvm.org/D68488 [PATCH 03/27] [noalias] [IR] Introduce ptr_provenance for LoadInst/StoreInst

> -----Original Message-----
> From: cfe-dev <cfe-dev-bounces at lists.llvm.org> On Behalf Of Nuno Lopes via
> cfe-dev
> Sent: Sunday, June 6, 2021 21:16
> To: 'David Blaikie' <dblaikie at gmail.com>
> Cc: 'llvm-dev' <llvm-dev at lists.llvm.org>; 'Ralf Jung' <jung at mpi-sws.org>;
> 'Clang Dev' <cfe-dev at lists.llvm.org>; 'Chris Lattner' <clattner at nondot.org>
> Subject: Re: [cfe-dev] [RFC] Introducing a byte type to LLVM
> 
> Of course, let's look at a simple example with GVN:
> if (p == q) {
>   *(p-1) = 0;
> }
> 
> If p/q are integers, then it's perfectly fine to replace p with q or vice-
> versa inside the 'if'. GVN may decide to produce this:
> if (p == q) {
>   *(q-1) = 0;
> }
> 
> Now consider that p/q are pointers. In particular they are the following:
> char a[n];
> char q[n];
> char *p = a + n;
> 
> inlining the definitions, originally we had:
> if (a+n == q) {
>   *(a+n-1) = 0;
> }
> 
> The program is well-defined. However, GVN produces a ill-defined program:
> if (a+n == q) {
>   *(q-1) = 0;
> }
> 
> While a+n may have the same address as q, their provenance is different. You
> can dereference a[n-1], but you can't dereference q[-1], even if they have
> exactly the same address.
> 
> Pointer equalities can only be propagated if both pointers have the same
> provenance information. If two pointers are inbounds and compare equal, then
> they have the same provenance, for example. That's a case when GVN for
> pointers is correct.
> 
> Another class of issues is related with alias analysis and implicit casts done
> by integer loads. AA doesn't consider these implicit casts, which is
> incorrect. When we load an integer, we don't know if we are actually loading
> an integer or a "raw byte", as there's no way to distinguish. So, to be
> correct, we need to consider all integer loads as escaping pointers (unless we
> introduce a way to distinguish "raw byte" loads from integer loads). I gave
> one such example in my last reply to Johannes.
> 
> Please let me know if the example above isn't clear. Happy to elaborate
> further :)
> 
> Thanks,
> Nuno
> 
> 
> -----Original Message-----
> From: David Blaikie <dblaikie at gmail.com>
> Sent: 06 June 2021 19:58
> Subject: Re: [llvm-dev] [RFC] Introducing a byte type to LLVM
> 
> Perhaps a small concrete example of the GVN (or otherwise, if there are
> smaller examples in other optimizations) transformation that you find to be
> invalid - showing how it's important for integers, and invalid for pointers
> (smallest example that manifests that invalidity in an observable way) might
> help ground the conversation a bit.
> 
> On Sun, Jun 6, 2021 at 11:26 AM Nuno Lopes via llvm-dev <llvm-
> dev at lists.llvm.org> wrote:
> >
> > Let me try to explain the issue around typed memory. It’s orthogonal to
> TBAA. Plus I agree the allocation type is irrelevant; it’s the type of the
> load/store operations that matters.
> >
> >
> >
> > Fact 1: LLVM needs an “universal holder” type in its IR.
> >
> > Why?
> >
> > Frontends load data from memory and don’t necessarily know the type they are
> loading. When you load a char in C, it can be part of a pointer or an integer,
> for example. Clang has no way to know.
> > That’s necessary to express implementations of memcpy in IR, as it copies
> raw data without knowing what the type is.
> > LLVM lowers small memcpys into load/store.
> >
> >
> >
> > I hope you agree LLVM needs a type that can contain any other type.
> >
> >
> >
> >
> >
> > Fact 2: optimizations for this “universal holder” type need to be correct
> for any LLVM IR type.
> >
> > Since this type can hold anything, optimizations that apply to this type
> must be correct for integers, pointers, etc. Otherwise, the results would be
> incorrect if the data was cast to the “wrong” type.
> >
> >
> >
> > The corollary of this fact is that GVN is not correct for the “universal
> holder” type in general. That’s because GVN is not correct for pointers in
> general (though correct in some cases). GVN doesn’t work for pointers since
> just because 2 pointers compare equal, it doesn’t mean they have the same
> provenance. As LLVM’s AA uses provenance information, this information must be
> preserved.
> >
> >
> >
> > Hopefully you agree with everything I wrote so far. If not please stop me
> and ask.
> >
> >
> >
> >
> >
> > Ok, so what’s the solution?
> >
> > We have two candidates for this “universal holder” type:
> >
> > integer (i<x>)
> > byte (b<x>), a new dedicated type
> >
> >
> >
> > The advantage of going with integers is that they already exist. The
> disadvantage? To make LLVM correct, we would need to disable GVN in most cases
> or stop using provenance information in AA (make pointers == integers).
> >
> > This sounds pretty bad to me due to the expect performance hit.
> >
> >
> >
> > Introducing a new type requires work, yes. But it allows us to keep all
> integer optimizations as aggressive as we want, and only throttle down when
> encountering the byte type. Another benefit is that loading a byte from memory
> doesn’t escape any pointer, while with the first solution you need to consider
> all stored pointers as escaped when loading an integer.
> >
> > Introducing the byte type allow us to fix all the integer/pointer casts bugs
> in LLVM IR. And it enables more optimizations as we won’t need to be careful
> because of the LLVM IR bug.
> >
> >
> >
> >
> >
> > A 3rd option is to keep everything as-is. That is, keep wrong optimizations
> in LLVM and keep adding hacks to limit the miscompilations in real-world
> programs.
> >
> > That has bitten us multiple times. Last time someone told me my
> miscompilation example was academic, LLVM ended up miscompiling itself, and
> then miscompiled Google’s code. Only this year we are planning to fully fix
> that bug (we have a hack right now).
> >
> > Last week Alive2 caught a miscompilation in the Linux kernel, in the network
> stack. The optimization got the pointer arithmetic wrong. Pretty scary, and
> the bug may have security implications.
> >
> >
> >
> > Anyway, just to argue that leaving things as-is is not an option. We have
> momentum and 3 GSoC students ready to help to fix the last few design bugs in
> LLVM IR.
> >
> >
> >
> > I know this stuff is complex; please ask questions if something is not clear
> or if you disagree with anything I wrote above.
> >
> >
> >
> > Thanks,
> >
> > Nuno
> >
> >
> >
> >
> >
> >
> >
> > From: Chris Lattner via llvm-dev
> > Sent: 06 June 2021 05:27
> > To: John McCall <rjmccall at apple.com>
> > Cc: llvm-dev at lists.llvm.org; Ralf Jung <jung at mpi-sws.org>;
> > cfe-dev at lists.llvm.org
> > Subject: Re: [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to
> > LLVM
> >
> >
> >
> > On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe-
> dev at lists.llvm.org> wrote:On 4 Jun 2021, at 11:24, George Mitenkov wrote:
> >
> > Hi all,
> >
> > Together with Nuno Lopes and Juneyoung Lee we propose to add a new
> > byte type to LLVM to fix miscompilations due to load type punning.
> > Please see the proposal below. It would be great to hear the
> > feedback/comments/suggestions!
> >
> >
> > Motivation
> > ==========
> >
> > char and unsigned char are considered to be universal holders in C.
> > They can access raw memory and are used to implement memcpy. i8 is the
> > LLVM’s counterpart but it does not have such semantics, which is also
> > not desirable as it would disable many optimizations.
> >
> > I don’t believe this is correct. LLVM does not have an innate concept
> > of typed memory. The type of a global or local allocation is just a
> > roundabout way of giving it a size and default alignment, and
> > similarly the type of a load or store just determines the width and
> > default alignment of the access. There are no restrictions on what
> > types can be used to load or store from certain objects.
> >
> > C-style type aliasing restrictions are imposed using tbaa metadata,
> > which are unrelated to the IR type of the access.
> >
> > I completely agree with John.  “i8” in LLVM doesn’t carry any implications
> about aliasing (in fact, LLVM pointers are going towards being typeless).  Any
> such thing occurs at the accesses, and are part of TBAA.
> >
> >
> >
> > I’m opposed to adding a byte type to LLVM, as such semantic carrying types
> are entirely unprecedented, and would add tremendous complexity to the entire
> system.
> >
> >
> >
> > -Chris
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > https://urldefense.com/v3/__https://lists.llvm.org/cgi-
> bin/mailman/listinfo/llvm-
> dev__;!!A4F2R9G_pg!OaorXrTDpbE6DTSL_1JEUI55JMoefP2WL3XL_8iPWUad0B4K4HEnbYNE32W
> YUj8oOMYIgVLz$
> 
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://urldefense.com/v3/__https://lists.llvm.org/cgi-
> bin/mailman/listinfo/cfe-
> dev__;!!A4F2R9G_pg!OaorXrTDpbE6DTSL_1JEUI55JMoefP2WL3XL_8iPWUad0B4K4HEnbYNE32W
> YUj8oOCB56bnt$