[llvm-dev] [RFC] Introducing a byte type to LLVM

Fri Jun 25 00:41:03 PDT 2021

Hi Jeroen,

>>> My interpretation (well, not just mine; we did have discussions about this
>>> in our group) wrt restrict handling is that the use of decrypt/encrypt
>>> triggers undefined behavior.
>>
>> Yes, that is exactly what I am pushing back against. :)  I cannot see a
>> reading of the standard where this is UB.  I also don't think it is the
>> intention of the standard to make this UB.  Note that the line I showed could
>> be very far away from the 'restrict' annotation. Basically if this is UB then
>> a 'restrict' pointer cannot be passed to other functions unless we know
>> exactly that they do not do ptr-to-int casts.
> 
> Sure, this might be a liberal reading of that sentence wrt restrict.
> And that is how it is done today in the full restrict patches, but of course,
> that does not mean that this is where we need to settle when including the
> functionality. It is good to have the reviews that steer us to a solution that
> is more broadly applicable.

Fair enough. The standard is certainly not as unambiguous as one would hope.

Having suffered from an endless stream of 'noalias' bugs on the Rust side, I am 
very excited that this part of LLVM is being overhauled. :)
I was hoping at some point to delve into those restrict patches and try to 
understand them from a PL/semantics perspective, but so far I haven't had the 
time -- and it's also a large patchset, much of which naturally is about the 
implementation (which I can't really follow) and not about the high-level 
description of the LLVM IR spec that makes the new analyses correct.
When/if I find some time -- what would be a good starting point to try to 
understand the concepts of those patches without having to understand the C++ 
details?

>>> Now that we are going over the different pieces of the implementation and
>>> see how we can use them in a broader context, the situation is different:
>>> instead of just tracking the 'restrict/noalias' provenance, we now want to
>>> use that part of the infrastructure to track provenance in general. Because
>>> of that, it also makes sense to reconsider what 'policy' we want to use. In
>>> that context, mapping an 'int2ptr' to a
>>> 'add_provenance(int2ptr(%Decrypt), null)' indicating that it can point to
>>> anything makes sense, but is still orthogonal to the infrastructure.
>>
>> That is not sufficient though. You also need to know that the provenance of
>> the 'restrict'ed pointer can now be acquired by other pointers created
>> literally anywhere via int2ptr. *That* is what makes this so tricky, I think.
>>
>> int foo(int *restrict x) {
>>     *x = 0;
>>     unk1();
>>     assert(*x == 0); // can be optimized to 'true'
>>     unk2((uintptr_t)x);
>>     assert(*x == 0); // can *not* be optimized to 'true'
>> }
> 
> Escape analysis must also be done for restrict, so this case can be handled
> as well.

Sure, smarter analyses can handle the easy cases, but I was asking about what 
part of the spec of these operations forces the analysis to work like that. 
Defeating the analysis is not that hard, so here's another example:

static int foo(int *restrict x, uintptr_t y) {
    *x = 0;
    unk1();
    assert(*x == 0); // can be optimized to 'true'
    uintptr_t addr = (uintptr_t)x;
    if (addr == y)
      unk2(addr);
    assert(*x == 0); // can *not* be optimized to 'true'
}

Now GVN replaces 'addr' by 'y' inside the branch, where the two integers are
known to be equal:

static int foo(int *restrict x, uintptr_t y) {
    *x = 0;
    unk1();
    assert(*x == 0); // can be optimized to 'true'
    uintptr_t addr = (uintptr_t)x;
    if (addr == y)
      unk2(y);
    assert(*x == 0); // can *not* be optimized to 'true'
}

Now let us assume there is exactly one call site of this function (and foo is 
static so we know it can't be called from elsewhere, or maybe we are doing LTO), 
which looks like

foo(ptr, (uintptr_t)ptr);

This means we know that the "if" in "foo" will always evaluate to true, so we have

static int foo(int *restrict x, uintptr_t y) {
    *x = 0;
    unk1();
    assert(*x == 0); // can be optimized to 'true'
    uintptr_t addr = (uintptr_t)x;
    unk2(y);
    assert(*x == 0); // can *not* be optimized to 'true'
}

Now we can (seemingly) optimize away the "addr" variable entirely -- but at that 
point, there is no clue left for escape analysis to know that "unk2" might 
legally mutate "x".

That's why I am saying that with 'restrict', we have to treat ptr-to-int casts 
as side-effecting, and cannot optimize them away even if their result is unused.
They *always* have an "escape" effect, no matter what happens with their result.
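
To make that last step concrete, here is a minimal sketch with hypothetical
bodies for unk1 and unk2 (both are opaque in the examples above). It drops
'restrict' so that the program is well-defined under either reading of the
standard; 'cell' and 'run_single_call_site' are names made up for illustration.
It only shows that unk2 can reach *x through the integer y alone, with no use
of addr left in foo:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the opaque unk1(): does nothing. */
static void unk1(void) {}

/* Hypothetical stand-in for the opaque unk2(): turns the integer back
   into a pointer and writes through it. */
static void unk2(uintptr_t a) {
    *(int *)a = 1;
}

/* Same shape as the last version of foo above, minus 'restrict'
   (an assumption of this sketch, see the text). */
static int foo(int *x, uintptr_t y) {
    *x = 0;
    unk1();
    assert(*x == 0);               /* holds: this unk1 does nothing */
    uintptr_t addr = (uintptr_t)x; /* result unused after the rewrites */
    (void)addr;
    unk2(y);                       /* only the integer y reaches unk2 */
    return *x;                     /* reads the value written by unk2 */
}

/* Illustrative object for the single call site to point at. */
static int cell = 0;

/* The single call site: both arguments come from the same pointer. */
int run_single_call_site(void) {
    return foo(&cell, (uintptr_t)&cell);
}
```

With this unk2, run_single_call_site() returns 1: the second read of *x
observes the write made through y. Under the reading of 'restrict' argued for
above, putting 'restrict' back on x must not change that result, and the only
thing left in foo that could justify it is the cast.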

Kind regards,
Ralf

> 
> Greetings,
> 
> Jeroen Dobbelaere
> 
> 
>>
>>> For this particular example, it would also be nice if we could somehow
>>> indicate that the 'decrypt(encrypt(%P))' can only depend on %P. But that is
>>> another discussion.
>>
>> It would be nice if one could express this in the surface language (C/Rust),
>> but I don't think we should allow LLVM to infer this -- that would basically
>> require tracking provenance through integers, which is not a good idea.
>> Put differently: as the various examples in this thread show, integers can
>> easily acquire "provenance" of other values simply by comparing them for
>> equality -- so in a sense, after "x == y" evaluates to true, now 'x' also has
>> the "provenance" of 'y'. I don't think we want obscure effects like this in
>> the semantics of the Abstract Machine. (I am not even convinced this can be
>> done consistently.)
>> So then what we are left with are those transformations that are correct
>> without extra support from the abstract machine. And since these dependencies
>> can entirely disappear from the source code through optimizations like GVN
>> replacing 'x' by 'y', there are strong limits to what can be done here.
>>
>> Kind regards,
>> Ralf
>>
>>>
>>> Greetings,
>>>
>>> Jeroen
>>>
>>>>>>
>>>> Hi again Jeroen,
>>>>
>>>>>> However, I am a bit worried about what happens when we eventually add
>>>>>> proper support for 'restrict'/'noalias': the only models I know for that
>>>>>> one actually make 'ptrtoint' have side-effects on the memory state
>>>>>> (similar to setting the 'exposed' flag in the C provenance TS). I can't
>>>>>> (currently) demonstrate that
>>>>>
>>>>> For the 'C standard', it is undefined behavior to convert a restrict
>>>>> pointer to an integer and back to a pointer type.
>>>>>
>>>>> (At least, that is my interpretation of n2573 6.7.3.1 para 3:
>>>>>       Note that "based" is defined only for expressions with pointer types.
>>>>> )
>>>>
>>>> After sleeping on it, I think I want to push back against this
>>>> interpretation a bit more strongly. Consider a program snippet like
>>>>
>>>> int *out = (int*) decrypt(encrypt( (uintptr_t)in  ));
>>>>
>>>> It doesn't matter what "encrypt" and "decrypt" do, as long as they are
>>>> inverses of each other.
>>>> "out" is definitely of pointer type. And by the dependency-based definition
>>>> of the standard, it is the case that modifying "in" to point elsewhere would
>>>> also make "out" point elsewhere. Thus "out" is 'based on' "in". And hence it
>>>> is okay to use "out" to access the object "in" points to, even in the
>>>> presence of 'restrict'.
>>>>
>>>> Kind regards,
>>>> Ralf
>>>>
>>>>>
>>>>> For the full restrict patches, we do not track restrict provenance across
>>>>> a ptr2int, except for the 'int2ptr(ptr2int %P)' (which we do, as llvm
>>>>> sometimes introduced these pairs; not sure if this is still valid).
>>>>>
>>>>> Greetings,
>>>>>
>>>>> Jeroen Dobbelaere
>>>>>
>>>>>> this is *required*, but I also don't know an alternative. So if this
>>>>>> remains the case, and if we say "load i64" performs a ptrtoint when
>>>>>> needed, then that would mean we could not do dead load elimination any
>>>>>> more as that would remove the ptrtoint side-effect.
>>>>>>
>>>>>> There also is the somewhat conceptual concern that LLVM ought to have a
>>>>>> type that can losslessly hold all kinds of data that exist in LLVM.
>>>>>> Currently, that is not the case -- 'iN' cannot hold data with provenance.
>>>>>>
>>>>>> Kind regards,
>>>>>> Ralf
>>>>>
>>>>
> 

-- 
Website: https://people.mpi-sws.org/~jung/