[LLVMdev] PROPOSAL: struct-access-path aware TBAA

Mon Mar 11 16:23:54 PDT 2013

On Mon, Mar 11, 2013 at 3:45 PM, Manman Ren <mren at apple.com> wrote:
>
> On Mar 11, 2013, at 2:37 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>
>> On Mon, Mar 11, 2013 at 2:06 PM, Manman Ren <mren at apple.com> wrote:
>>>
>>> On Mar 11, 2013, at 1:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>
>>>> On Mon, Mar 11, 2013 at 11:41 AM, Manman Ren <mren at apple.com> wrote:
>>>>> Based on discussions with John McCall
>>>>>
>>>>> We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.
>>>>>
>>>>> Fundamental rules from C11
>>>>> --------------------------
>>>>> An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
>>>>> 1. a type compatible with the effective type of the object,
>>>>> 2. a qualified version of a type compatible with the effective type of the object,
>>>>> 3. a type that is the signed or unsigned type corresponding to the effective type of the object,
>>>>> 4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
>>>>> 5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
>>>>> 6. a character type.
>>>>>
>>>>
>>>>> Example
>>>>> -------
>>>>> struct A {
>>>>>   int x;
>>>>>   int y;
>>>>> };
>>>>> struct B {
>>>>>   A a;
>>>>>   int z;
>>>>> };
>>>>> struct C {
>>>>>   B b1;
>>>>>   B b2;
>>>>>   int *p;
>>>>> };
>>>>>
>>>>> Type DAG:
>>>>> int <- A::x <- A
>>>>> int <- A::y <- A <- B::a <- B <- C::b1 <- C
>>>>> int <----------------- B::z <- B <- C::b2 <- C
>>>>> any pointer <--------------------- C::p  <- C
>>>>>
>>>>> The type DAG has two types of TBAA nodes:
>>>>> 1> the existing scalar nodes
>>>>> 2> the struct nodes (this is different from the current tbaa.struct)
>>>>>  A struct node has a unique name plus a list of pairs (field name, field type).
>>>>>  For example, struct node for "C" should look like
>>>>>  !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
>>>>>  where !3 is the struct node for "B", !2 is pointer type.
>>>>>
>>>>> Given a field access
>>>>> struct B *bp = ...;
>>>>> bp->a.x = 5;
>>>>> we annotate it as B::a.x.
>>>>
>>>> In the case of multiple structures containing substructures, how are
>>>> you differentiating?
>>>>
>>>> IE given
>>>>
>>>> struct A {
>>>> struct B b;
>>>> }
>>>> struct C {
>>>> struct B b;
>>>> }
>>>>
>>>> How do you know the above struct B *bp =...; is B::b from C and not B::b from A?
>>>>
>>>> (I agree you can know in the case of direct aggregates, but I argue
>>>> you have no way to know in the case of pointer arguments without
>>>> interprocedural analysis)
>>>> It gets worse the more levels you have.
>>>>
>>>> ie if you add
>>>> struct B {
>>>> struct E e;
>>>> }
>>>>
>>>> and have struct E *e = ...
>>>> how do you know it's the B::e contained in struct C, or the B::e
>>>> contained in struct A?
>>>>
>>>>
>>>> Again, i agree you can do both scalar and direct aggregates, but not
>>>> aggregates and scalars through pointers.
>>>>
>>>
>>> I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.
>>>
>>>>
>>>>> Implementing the Hierarchy
>>>>> --------------------------
>>>>> We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
>>>>> Each tag can be a sequence of nodes in the type DAG.
>>>>> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>>>>>
>>>>
>>>> This can get quite deep quite quickly.
>>>> Are there actually cases where knowing the outermost aggregate type +
>>>> {byte offset, size} of field access does not give you the exact same
>>>> disambiguation capability?
>>> The answer is no. We should be able to deduce the access path from {byte offset, size}.
>>> However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.
>>
>> Well, that part is easy, it's just overlap, once you know they are
>> both contained in the same outermost type.
>
> How do you figure out the outermost common type of two access 'x' and 'y'?

How are you annotating each access with a path as you go right now?
It should be the same thing to just annotate it then.
Without thinking too hard:
If it's all direct aggregate access in the original source, whatever
you see as the leftmost part should work.

I haven't thought hard about it, but I believe it should wrok.

Note that again, neither works for pointers unless you can guarantee
you have the entire access path, which is hard:

union u {
struct A a;
struct B b;
};

int foo(struct A *a, struct B *b) {
....

}

void bar() {
union u whoops;
foo(&whoops.a, &whoops.b);
}

If you annotated foo with a partial access path, you may or may not
get the right answer.
You would think your type dag solves this, because you'll see they
share a supertype.

But it's actually worse than that:

file: bar.h
 union u {
struct A a;
struct B b;
};

int foo(struct A *, struct B*);

file: foo.c

int foo(struct A *a, struct B *b) {
....

}

file: bar.c

#include "bar.h"

void bar() {
union u whoops;
foo(&whoops.a, &whoops.b);
}

You will never see compiling  foo.c that tells you both types are in a
union together, or how to annotate these access paths.

Hence the "you need to punt on pointers without more analysis" claim I made :)

> Once that is figured out, it should be easy to check alias('x', 'y').
>>
>> You have to track what the types are in the frontend to generate the
>> access path anyway, so you can compute the offsets + sizes into those
>> types at the same time.
> Generating the offsets+sizes are not hard.
>>
>> Essentially, it it translating your representation into something that
>> can be intersected in constant time, unless i'm missing something
>> about how you are annotating the access paths.
> If we can query alias('x','y') in constant time, irrelevant to the depth of each access, that will be great.
>
> -Manman
>