[cfe-dev] [RFC] Introducing a byte type to LLVM

Richard Smith via cfe-dev cfe-dev at lists.llvm.org
Thu Jun 10 17:25:39 PDT 2021


On Sun, 6 Jun 2021 at 12:15, Nuno Lopes via cfe-dev <cfe-dev at lists.llvm.org>
wrote:

> Of course, let's look at a simple example with GVN:
> if (p == q) {
>   *(p-1) = 0;
> }
>
> If p/q are integers, then it's perfectly fine to replace p with q or
> vice-versa inside the 'if'.


If we start with:

char a[n];
char b[n];
long p = (long)(a+n);
long q = (long)b;
if (p == q) {
   *((char*)p-1) = 0;
}

... what prevents us from observing the same issue?

GVN may decide to produce this:
> if (p == q) {
>   *(q-1) = 0;
> }
>
> Now consider that p/q are pointers. In particular they are the following:
> char a[n];
> char q[n];
> char *p = a + n;
>
> inlining the definitions, originally we had:
> if (a+n == q) {
>   *(a+n-1) = 0;
> }
>
> The program is well-defined. However, GVN produces a ill-defined program:
> if (a+n == q) {
>   *(q-1) = 0;
> }
>
> While a+n may have the same address as q, their provenance is different.
> You can dereference a[n-1], but you can't dereference q[-1], even if they
> have exactly the same address.
>
> Pointer equalities can only be propagated if both pointers have the same
> provenance information. If two pointers are inbounds and compare equal,
> then they have the same provenance, for example. That's a case when GVN for
> pointers is correct.
>
> Another class of issues is related with alias analysis and implicit casts
> done by integer loads. AA doesn't consider these implicit casts, which is
> incorrect. When we load an integer, we don't know if we are actually
> loading an integer or a "raw byte", as there's no way to distinguish. So,
> to be correct, we need to consider all integer loads as escaping pointers
> (unless we introduce a way to distinguish "raw byte" loads from integer
> loads). I gave one such example in my last reply to Johannes.
>
> Please let me know if the example above isn't clear. Happy to elaborate
> further :)
>
> Thanks,
> Nuno
>
>
> -----Original Message-----
> From: David Blaikie <dblaikie at gmail.com>
> Sent: 06 June 2021 19:58
> Subject: Re: [llvm-dev] [RFC] Introducing a byte type to LLVM
>
> Perhaps a small concrete example of the GVN (or otherwise, if there are
> smaller examples in other optimizations) transformation that you find to be
> invalid - showing how it's important for integers, and invalid for pointers
> (smallest example that manifests that invalidity in an observable way)
> might help ground the conversation a bit.
>
> On Sun, Jun 6, 2021 at 11:26 AM Nuno Lopes via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >
> > Let me try to explain the issue around typed memory. It’s orthogonal to
> TBAA. Plus I agree the allocation type is irrelevant; it’s the type of the
> load/store operations that matters.
> >
> >
> >
> > Fact 1: LLVM needs an “universal holder” type in its IR.
> >
> > Why?
> >
> > Frontends load data from memory and don’t necessarily know the type they
> are loading. When you load a char in C, it can be part of a pointer or an
> integer, for example. Clang has no way to know.
> > That’s necessary to express implementations of memcpy in IR, as it
> copies raw data without knowing what the type is.
> > LLVM lowers small memcpys into load/store.
> >
> >
> >
> > I hope you agree LLVM needs a type that can contain any other type.
> >
> >
> >
> >
> >
> > Fact 2: optimizations for this “universal holder” type need to be
> correct for any LLVM IR type.
> >
> > Since this type can hold anything, optimizations that apply to this type
> must be correct for integers, pointers, etc. Otherwise, the results would
> be incorrect if the data was cast to the “wrong” type.
> >
> >
> >
> > The corollary of this fact is that GVN is not correct for the “universal
> holder” type in general. That’s because GVN is not correct for pointers in
> general (though correct in some cases). GVN doesn’t work for pointers since
> just because 2 pointers compare equal, it doesn’t mean they have the same
> provenance. As LLVM’s AA uses provenance information, this information must
> be preserved.
> >
> >
> >
> > Hopefully you agree with everything I wrote so far. If not please stop
> me and ask.
> >
> >
> >
> >
> >
> > Ok, so what’s the solution?
> >
> > We have two candidates for this “universal holder” type:
> >
> > integer (i<x>)
> > byte (b<x>), a new dedicated type
> >
> >
> >
> > The advantage of going with integers is that they already exist. The
> disadvantage? To make LLVM correct, we would need to disable GVN in most
> cases or stop using provenance information in AA (make pointers ==
> integers).
> >
> > This sounds pretty bad to me due to the expect performance hit.
> >
> >
> >
> > Introducing a new type requires work, yes. But it allows us to keep all
> integer optimizations as aggressive as we want, and only throttle down when
> encountering the byte type. Another benefit is that loading a byte from
> memory doesn’t escape any pointer, while with the first solution you need
> to consider all stored pointers as escaped when loading an integer.
> >
> > Introducing the byte type allow us to fix all the integer/pointer casts
> bugs in LLVM IR. And it enables more optimizations as we won’t need to be
> careful because of the LLVM IR bug.
> >
> >
> >
> >
> >
> > A 3rd option is to keep everything as-is. That is, keep wrong
> optimizations in LLVM and keep adding hacks to limit the miscompilations in
> real-world programs.
> >
> > That has bitten us multiple times. Last time someone told me my
> miscompilation example was academic, LLVM ended up miscompiling itself, and
> then miscompiled Google’s code. Only this year we are planning to fully fix
> that bug (we have a hack right now).
> >
> > Last week Alive2 caught a miscompilation in the Linux kernel, in the
> network stack. The optimization got the pointer arithmetic wrong. Pretty
> scary, and the bug may have security implications.
> >
> >
> >
> > Anyway, just to argue that leaving things as-is is not an option. We
> have momentum and 3 GSoC students ready to help to fix the last few design
> bugs in LLVM IR.
> >
> >
> >
> > I know this stuff is complex; please ask questions if something is not
> clear or if you disagree with anything I wrote above.
> >
> >
> >
> > Thanks,
> >
> > Nuno
> >
> >
> >
> >
> >
> >
> >
> > From: Chris Lattner via llvm-dev
> > Sent: 06 June 2021 05:27
> > To: John McCall <rjmccall at apple.com>
> > Cc: llvm-dev at lists.llvm.org; Ralf Jung <jung at mpi-sws.org>;
> > cfe-dev at lists.llvm.org
> > Subject: Re: [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to
> > LLVM
> >
> >
> >
> > On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:On 4 Jun 2021, at 11:24, George Mitenkov
> wrote:
> >
> > Hi all,
> >
> > Together with Nuno Lopes and Juneyoung Lee we propose to add a new
> > byte type to LLVM to fix miscompilations due to load type punning.
> > Please see the proposal below. It would be great to hear the
> > feedback/comments/suggestions!
> >
> >
> > Motivation
> > ==========
> >
> > char and unsigned char are considered to be universal holders in C.
> > They can access raw memory and are used to implement memcpy. i8 is the
> > LLVM’s counterpart but it does not have such semantics, which is also
> > not desirable as it would disable many optimizations.
> >
> > I don’t believe this is correct. LLVM does not have an innate concept
> > of typed memory. The type of a global or local allocation is just a
> > roundabout way of giving it a size and default alignment, and
> > similarly the type of a load or store just determines the width and
> > default alignment of the access. There are no restrictions on what
> > types can be used to load or store from certain objects.
> >
> > C-style type aliasing restrictions are imposed using tbaa metadata,
> > which are unrelated to the IR type of the access.
> >
> > I completely agree with John.  “i8” in LLVM doesn’t carry any
> implications about aliasing (in fact, LLVM pointers are going towards being
> typeless).  Any such thing occurs at the accesses, and are part of TBAA.
> >
> >
> >
> > I’m opposed to adding a byte type to LLVM, as such semantic carrying
> types are entirely unprecedented, and would add tremendous complexity to
> the entire system.
> >
> >
> >
> > -Chris
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20210610/5077bba2/attachment.html>


More information about the cfe-dev mailing list