[LLVMdev] PROPOSAL: struct-access-path aware TBAA

Mon Mar 11 19:52:25 PDT 2013

On Mon, Mar 11, 2013 at 4:56 PM, Manman Ren <mren at apple.com> wrote:
>
> On Mar 11, 2013, at 4:23 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>
>> On Mon, Mar 11, 2013 at 3:45 PM, Manman Ren <mren at apple.com> wrote:
>>>
>>> On Mar 11, 2013, at 2:37 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>
>>>> On Mon, Mar 11, 2013 at 2:06 PM, Manman Ren <mren at apple.com> wrote:
>>>>>
>>>>> On Mar 11, 2013, at 1:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>
>>>>>> On Mon, Mar 11, 2013 at 11:41 AM, Manman Ren <mren at apple.com> wrote:
>>>>>>> Based on discussions with John McCall
>>>>>>>
>>>>>>> We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.
>>>>>>>
>>>>>>> Fundamental rules from C11
>>>>>>> --------------------------
>>>>>>> An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
>>>>>>> 1. a type compatible with the effective type of the object,
>>>>>>> 2. a qualified version of a type compatible with the effective type of the object,
>>>>>>> 3. a type that is the signed or unsigned type corresponding to the effective type of the object,
>>>>>>> 4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
>>>>>>> 5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
>>>>>>> 6. a character type.
>>>>>>>
>>>>>>
>>>>>>> Example
>>>>>>> -------
>>>>>>> struct A {
>>>>>>>  int x;
>>>>>>>  int y;
>>>>>>> };
>>>>>>> struct B {
>>>>>>>  A a;
>>>>>>>  int z;
>>>>>>> };
>>>>>>> struct C {
>>>>>>>  B b1;
>>>>>>>  B b2;
>>>>>>>  int *p;
>>>>>>> };
>>>>>>>
>>>>>>> Type DAG:
>>>>>>> int <- A::x <- A
>>>>>>> int <- A::y <- A <- B::a <- B <- C::b1 <- C
>>>>>>> int <----------------- B::z <- B <- C::b2 <- C
>>>>>>> any pointer <--------------------- C::p  <- C
>>>>>>>
>>>>>>> The type DAG has two types of TBAA nodes:
>>>>>>> 1> the existing scalar nodes
>>>>>>> 2> the struct nodes (this is different from the current tbaa.struct)
>>>>>>> A struct node has a unique name plus a list of pairs (field name, field type).
>>>>>>> For example, struct node for "C" should look like
>>>>>>> !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
>>>>>>> where !3 is the struct node for "B", !2 is pointer type.
>>>>>>>
>>>>>>> Given a field access
>>>>>>> struct B *bp = ...;
>>>>>>> bp->a.x = 5;
>>>>>>> we annotate it as B::a.x.
>>>>>>
>>>>>> In the case of multiple structures containing substructures, how are
>>>>>> you differentiating?
>>>>>>
>>>>>> IE given
>>>>>>
>>>>>> struct A {
>>>>>> struct B b;
>>>>>> }
>>>>>> struct C {
>>>>>> struct B b;
>>>>>> }
>>>>>>
>>>>>> How do you know the above struct B *bp =...; is B::b from C and not B::b from A?
>>>>>>
>>>>>> (I agree you can know in the case of direct aggregates, but I argue
>>>>>> you have no way to know in the case of pointer arguments without
>>>>>> interprocedural analysis)
>>>>>> It gets worse the more levels you have.
>>>>>>
>>>>>> ie if you add
>>>>>> struct B {
>>>>>> struct E e;
>>>>>> }
>>>>>>
>>>>>> and have struct E *e = ...
>>>>>> how do you know it's the B::e contained in struct C, or the B::e
>>>>>> contained in struct A?
>>>>>>
>>>>>>
>>>>>> Again, i agree you can do both scalar and direct aggregates, but not
>>>>>> aggregates and scalars through pointers.
>>>>>>
>>>>>
>>>>> I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.
>>>>>
>>>>>>
>>>>>>> Implementing the Hierarchy
>>>>>>> --------------------------
>>>>>>> We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
>>>>>>> Each tag can be a sequence of nodes in the type DAG.
>>>>>>> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>>>>>>>
>>>>>>
>>>>>> This can get quite deep quite quickly.
>>>>>> Are there actually cases where knowing the outermost aggregate type +
>>>>>> {byte offset, size} of field access does not give you the exact same
>>>>>> disambiguation capability?
>>>>> The answer is no. We should be able to deduce the access path from {byte offset, size}.
>>>>> However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.
>>>>
>>>> Well, that part is easy, it's just overlap, once you know they are
>>>> both contained in the same outermost type.
>>>
>>> How do you figure out the outermost common type of two access 'x' and 'y'?
>>
>> How are you annotating each access with a path as you go right now?
>> It should be the same thing to just annotate it then.
>
> I thought your suggestion was to replace
> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
> with
> !C::b2.a.x := [ "tbaa.offset", !C, offset within C, size ]
>
> !B::a := [ "tbaa.path", !B, "B:a", !A ]
> with
> !B::a := [ "tbaa.offset", !B, offset within B, size]
>

Yes, it is.

> Then we already lost the access path at IR level.

But you are *generating* the above metadata at the clang level, no?
That is were I meant you should annotate them as you go.

>
>> Without thinking too hard:
>> If it's all direct aggregate access in the original source, whatever
>> you see as the leftmost part should work.
>>
>> I haven't thought hard about it, but I believe it should work.
>
>
> At llvm IR level, it is not that straight-forward to figure out the outermost common type of C::b2.a.x and B::a with tbaa.offset.

The metadata should already have the outermost common type when it's
output, so it should just be there at the LLVM level.

> To me, the outermost common type should be "B", then we need to check the relative offset from B for both accesses
Agreed.

>
> -Manman
>
>>
>>
>> Note that again, neither works for pointers unless you can guarantee
>> you have the entire access path, which is hard:
>>
>> union u {
>> struct A a;
>> struct B b;
>> };
>>
>> int foo(struct A *a, struct B *b) {
>> ....
>>
>> }
>>
>> void bar() {
>> union u whoops;
>> foo(&whoops.a, &whoops.b);
>> }
>>
>> If you annotated foo with a partial access path, you may or may not
>> get the right answer.
>> You would think your type dag solves this, because you'll see they
>> share a supertype.
>>
>> But it's actually worse than that:
>>
>> file: bar.h
>> union u {
>> struct A a;
>> struct B b;
>> };
>>
>> int foo(struct A *, struct B*);
>>
>> file: foo.c
>>
>> int foo(struct A *a, struct B *b) {
>> ....
>>
>> }
>>
>> file: bar.c
>>
>> #include "bar.h"
>>
>> void bar() {
>> union u whoops;
>> foo(&whoops.a, &whoops.b);
>> }
>>
>>
>> You will never see compiling  foo.c that tells you both types are in a
>> union together, or how to annotate these access paths.
>>
>> Hence the "you need to punt on pointers without more analysis" claim I made :)
>>
>>
>>> Once that is figured out, it should be easy to check alias('x', 'y').
>>>>
>>>> You have to track what the types are in the frontend to generate the
>>>> access path anyway, so you can compute the offsets + sizes into those
>>>> types at the same time.
>>> Generating the offsets+sizes are not hard.
>>>>
>>>> Essentially, it it translating your representation into something that
>>>> can be intersected in constant time, unless i'm missing something
>>>> about how you are annotating the access paths.
>>> If we can query alias('x','y') in constant time, irrelevant to the depth of each access, that will be great.
>>>
>>> -Manman
>>>
>