[LLVMdev] PROPOSAL: struct-access-path aware TBAA

Tue Mar 12 15:25:49 PDT 2013

Type-based alias analysis is intrinsically unsafe. I don't think it make 
big sense to see the whole program
being compiled and stitch together all type-include-graph together just 
for this optimization. We would otherwise
have to implement this optimization in interprocedural analysis phase 
(with whole program mode).

That said, if would be nice if front-end could figure out if a type is 
only accessiable inside a translation
unit; TBAA can take advantage this info to make it is unsafer.

On 3/11/13, 4:23 PM, Daniel Berlin wrote:
> On Mon, Mar 11, 2013 at 3:45 PM, Manman Ren <mren at apple.com> wrote:
>> On Mar 11, 2013, at 2:37 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>
>>> On Mon, Mar 11, 2013 at 2:06 PM, Manman Ren <mren at apple.com> wrote:
>>>> On Mar 11, 2013, at 1:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>
>>>>> On Mon, Mar 11, 2013 at 11:41 AM, Manman Ren <mren at apple.com> wrote:
>>>>>> Based on discussions with John McCall
>>>>>>
>>>>>> We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.
>>>>>>
>>>>>> Fundamental rules from C11
>>>>>> --------------------------
>>>>>> An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
>>>>>> 1. a type compatible with the effective type of the object,
>>>>>> 2. a qualified version of a type compatible with the effective type of the object,
>>>>>> 3. a type that is the signed or unsigned type corresponding to the effective type of the object,
>>>>>> 4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
>>>>>> 5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
>>>>>> 6. a character type.
>>>>>>
>>>>>> Example
>>>>>> -------
>>>>>> struct A {
>>>>>>    int x;
>>>>>>    int y;
>>>>>> };
>>>>>> struct B {
>>>>>>    A a;
>>>>>>    int z;
>>>>>> };
>>>>>> struct C {
>>>>>>    B b1;
>>>>>>    B b2;
>>>>>>    int *p;
>>>>>> };
>>>>>>
>>>>>> Type DAG:
>>>>>> int <- A::x <- A
>>>>>> int <- A::y <- A <- B::a <- B <- C::b1 <- C
>>>>>> int <----------------- B::z <- B <- C::b2 <- C
>>>>>> any pointer <--------------------- C::p  <- C
>>>>>>
>>>>>> The type DAG has two types of TBAA nodes:
>>>>>> 1> the existing scalar nodes
>>>>>> 2> the struct nodes (this is different from the current tbaa.struct)
>>>>>>   A struct node has a unique name plus a list of pairs (field name, field type).
>>>>>>   For example, struct node for "C" should look like
>>>>>>   !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
>>>>>>   where !3 is the struct node for "B", !2 is pointer type.
>>>>>>
>>>>>> Given a field access
>>>>>> struct B *bp = ...;
>>>>>> bp->a.x = 5;
>>>>>> we annotate it as B::a.x.
>>>>> In the case of multiple structures containing substructures, how are
>>>>> you differentiating?
>>>>>
>>>>> IE given
>>>>>
>>>>> struct A {
>>>>> struct B b;
>>>>> }
>>>>> struct C {
>>>>> struct B b;
>>>>> }
>>>>>
>>>>> How do you know the above struct B *bp =...; is B::b from C and not B::b from A?
>>>>>
>>>>> (I agree you can know in the case of direct aggregates, but I argue
>>>>> you have no way to know in the case of pointer arguments without
>>>>> interprocedural analysis)
>>>>> It gets worse the more levels you have.
>>>>>
>>>>> ie if you add
>>>>> struct B {
>>>>> struct E e;
>>>>> }
>>>>>
>>>>> and have struct E *e = ...
>>>>> how do you know it's the B::e contained in struct C, or the B::e
>>>>> contained in struct A?
>>>>>
>>>>>
>>>>> Again, i agree you can do both scalar and direct aggregates, but not
>>>>> aggregates and scalars through pointers.
>>>>>
>>>> I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.
>>>>
>>>>>> Implementing the Hierarchy
>>>>>> --------------------------
>>>>>> We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
>>>>>> Each tag can be a sequence of nodes in the type DAG.
>>>>>> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>>>>>>
>>>>> This can get quite deep quite quickly.
>>>>> Are there actually cases where knowing the outermost aggregate type +
>>>>> {byte offset, size} of field access does not give you the exact same
>>>>> disambiguation capability?
>>>> The answer is no. We should be able to deduce the access path from {byte offset, size}.
>>>> However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.
>>> Well, that part is easy, it's just overlap, once you know they are
>>> both contained in the same outermost type.
>> How do you figure out the outermost common type of two access 'x' and 'y'?
> How are you annotating each access with a path as you go right now?
> It should be the same thing to just annotate it then.
> Without thinking too hard:
> If it's all direct aggregate access in the original source, whatever
> you see as the leftmost part should work.
>
> I haven't thought hard about it, but I believe it should wrok.
>
>
> Note that again, neither works for pointers unless you can guarantee
> you have the entire access path, which is hard:
>
> union u {
> struct A a;
> struct B b;
> };
>
> int foo(struct A *a, struct B *b) {
> ....
>
> }
>
> void bar() {
> union u whoops;
> foo(&whoops.a, &whoops.b);
> }
>
> If you annotated foo with a partial access path, you may or may not
> get the right answer.
> You would think your type dag solves this, because you'll see they
> share a supertype.
>
> But it's actually worse than that:
>
> file: bar.h
>   union u {
> struct A a;
> struct B b;
> };
>
> int foo(struct A *, struct B*);
>
> file: foo.c
>
> int foo(struct A *a, struct B *b) {
> ....
>
> }
>
> file: bar.c
>
> #include "bar.h"
>
> void bar() {
> union u whoops;
> foo(&whoops.a, &whoops.b);
> }
>
>
> You will never see compiling  foo.c that tells you both types are in a
> union together, or how to annotate these access paths.
>
> Hence the "you need to punt on pointers without more analysis" claim I made :)
>
>
>> Once that is figured out, it should be easy to check alias('x', 'y').
>>> You have to track what the types are in the frontend to generate the
>>> access path anyway, so you can compute the offsets + sizes into those
>>> types at the same time.
>> Generating the offsets+sizes are not hard.
>>> Essentially, it it translating your representation into something that
>>> can be intersected in constant time, unless i'm missing something
>>> about how you are annotating the access paths.
>> If we can query alias('x','y') in constant time, irrelevant to the depth of each access, that will be great.
>>
>> -Manman
>>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev