[LLVMdev] PROPOSAL: struct-access-path aware TBAA

Mon Mar 11 16:56:36 PDT 2013

On Mar 11, 2013, at 4:23 PM, Daniel Berlin <dberlin at dberlin.org> wrote:

> On Mon, Mar 11, 2013 at 3:45 PM, Manman Ren <mren at apple.com> wrote:
>> 
>> On Mar 11, 2013, at 2:37 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>> 
>>> On Mon, Mar 11, 2013 at 2:06 PM, Manman Ren <mren at apple.com> wrote:
>>>> 
>>>> On Mar 11, 2013, at 1:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>> 
>>>>> On Mon, Mar 11, 2013 at 11:41 AM, Manman Ren <mren at apple.com> wrote:
>>>>>> Based on discussions with John McCall
>>>>>> 
>>>>>> We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.
>>>>>> 
>>>>>> Fundamental rules from C11
>>>>>> --------------------------
>>>>>> An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
>>>>>> 1. a type compatible with the effective type of the object,
>>>>>> 2. a qualified version of a type compatible with the effective type of the object,
>>>>>> 3. a type that is the signed or unsigned type corresponding to the effective type of the object,
>>>>>> 4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
>>>>>> 5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
>>>>>> 6. a character type.
>>>>>> 
>>>>> 
>>>>>> Example
>>>>>> -------
>>>>>> struct A {
>>>>>>  int x;
>>>>>>  int y;
>>>>>> };
>>>>>> struct B {
>>>>>>  A a;
>>>>>>  int z;
>>>>>> };
>>>>>> struct C {
>>>>>>  B b1;
>>>>>>  B b2;
>>>>>>  int *p;
>>>>>> };
>>>>>> 
>>>>>> Type DAG:
>>>>>> int <- A::x <- A
>>>>>> int <- A::y <- A <- B::a <- B <- C::b1 <- C
>>>>>> int <----------------- B::z <- B <- C::b2 <- C
>>>>>> any pointer <--------------------- C::p  <- C
>>>>>> 
>>>>>> The type DAG has two types of TBAA nodes:
>>>>>> 1> the existing scalar nodes
>>>>>> 2> the struct nodes (this is different from the current tbaa.struct)
>>>>>> A struct node has a unique name plus a list of pairs (field name, field type).
>>>>>> For example, struct node for "C" should look like
>>>>>> !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
>>>>>> where !3 is the struct node for "B", !2 is pointer type.
>>>>>> 
>>>>>> Given a field access
>>>>>> struct B *bp = ...;
>>>>>> bp->a.x = 5;
>>>>>> we annotate it as B::a.x.
>>>>> 
>>>>> In the case of multiple structures containing substructures, how are
>>>>> you differentiating?
>>>>> 
>>>>> IE given
>>>>> 
>>>>> struct A {
>>>>> struct B b;
>>>>> }
>>>>> struct C {
>>>>> struct B b;
>>>>> }
>>>>> 
>>>>> How do you know the above struct B *bp =...; is B::b from C and not B::b from A?
>>>>> 
>>>>> (I agree you can know in the case of direct aggregates, but I argue
>>>>> you have no way to know in the case of pointer arguments without
>>>>> interprocedural analysis)
>>>>> It gets worse the more levels you have.
>>>>> 
>>>>> ie if you add
>>>>> struct B {
>>>>> struct E e;
>>>>> }
>>>>> 
>>>>> and have struct E *e = ...
>>>>> how do you know it's the B::e contained in struct C, or the B::e
>>>>> contained in struct A?
>>>>> 
>>>>> 
>>>>> Again, i agree you can do both scalar and direct aggregates, but not
>>>>> aggregates and scalars through pointers.
>>>>> 
>>>> 
>>>> I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.
>>>> 
>>>>> 
>>>>>> Implementing the Hierarchy
>>>>>> --------------------------
>>>>>> We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
>>>>>> Each tag can be a sequence of nodes in the type DAG.
>>>>>> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>>>>>> 
>>>>> 
>>>>> This can get quite deep quite quickly.
>>>>> Are there actually cases where knowing the outermost aggregate type +
>>>>> {byte offset, size} of field access does not give you the exact same
>>>>> disambiguation capability?
>>>> The answer is no. We should be able to deduce the access path from {byte offset, size}.
>>>> However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.
>>> 
>>> Well, that part is easy, it's just overlap, once you know they are
>>> both contained in the same outermost type.
>> 
>> How do you figure out the outermost common type of two access 'x' and 'y'?
> 
> How are you annotating each access with a path as you go right now?
> It should be the same thing to just annotate it then.

I thought your suggestion was to replace
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
with
!C::b2.a.x := [ "tbaa.offset", !C, offset within C, size ]

!B::a := [ "tbaa.path", !B, "B:a", !A ]
with
!B::a := [ "tbaa.offset", !B, offset within B, size]

Then we already lost the access path at IR level.

> Without thinking too hard:
> If it's all direct aggregate access in the original source, whatever
> you see as the leftmost part should work.
> 
> I haven't thought hard about it, but I believe it should work.

At llvm IR level, it is not that straight-forward to figure out the outermost common type of C::b2.a.x and B::a with tbaa.offset.
To me, the outermost common type should be "B", then we need to check the relative offset from B for both accesses.

-Manman

> 
> 
> Note that again, neither works for pointers unless you can guarantee
> you have the entire access path, which is hard:
> 
> union u {
> struct A a;
> struct B b;
> };
> 
> int foo(struct A *a, struct B *b) {
> ....
> 
> }
> 
> void bar() {
> union u whoops;
> foo(&whoops.a, &whoops.b);
> }
> 
> If you annotated foo with a partial access path, you may or may not
> get the right answer.
> You would think your type dag solves this, because you'll see they
> share a supertype.
> 
> But it's actually worse than that:
> 
> file: bar.h
> union u {
> struct A a;
> struct B b;
> };
> 
> int foo(struct A *, struct B*);
> 
> file: foo.c
> 
> int foo(struct A *a, struct B *b) {
> ....
> 
> }
> 
> file: bar.c
> 
> #include "bar.h"
> 
> void bar() {
> union u whoops;
> foo(&whoops.a, &whoops.b);
> }
> 
> 
> You will never see compiling  foo.c that tells you both types are in a
> union together, or how to annotate these access paths.
> 
> Hence the "you need to punt on pointers without more analysis" claim I made :)
> 
> 
>> Once that is figured out, it should be easy to check alias('x', 'y').
>>> 
>>> You have to track what the types are in the frontend to generate the
>>> access path anyway, so you can compute the offsets + sizes into those
>>> types at the same time.
>> Generating the offsets+sizes are not hard.
>>> 
>>> Essentially, it it translating your representation into something that
>>> can be intersected in constant time, unless i'm missing something
>>> about how you are annotating the access paths.
>> If we can query alias('x','y') in constant time, irrelevant to the depth of each access, that will be great.
>> 
>> -Manman
>>