[LLVMdev] PROPOSAL: struct-access-path aware TBAA

Mon Mar 11 15:45:09 PDT 2013

On Mar 11, 2013, at 2:37 PM, Daniel Berlin <dberlin at dberlin.org> wrote:

> On Mon, Mar 11, 2013 at 2:06 PM, Manman Ren <mren at apple.com> wrote:
>> 
>> On Mar 11, 2013, at 1:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>> 
>>> On Mon, Mar 11, 2013 at 11:41 AM, Manman Ren <mren at apple.com> wrote:
>>>> Based on discussions with John McCall
>>>> 
>>>> We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.
>>>> 
>>>> Fundamental rules from C11
>>>> --------------------------
>>>> An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
>>>> 1. a type compatible with the effective type of the object,
>>>> 2. a qualified version of a type compatible with the effective type of the object,
>>>> 3. a type that is the signed or unsigned type corresponding to the effective type of the object,
>>>> 4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
>>>> 5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
>>>> 6. a character type.
>>>> 
>>> 
>>>> Example
>>>> -------
>>>> struct A {
>>>>   int x;
>>>>   int y;
>>>> };
>>>> struct B {
>>>>   A a;
>>>>   int z;
>>>> };
>>>> struct C {
>>>>   B b1;
>>>>   B b2;
>>>>   int *p;
>>>> };
>>>> 
>>>> Type DAG:
>>>> int <- A::x <- A
>>>> int <- A::y <- A <- B::a <- B <- C::b1 <- C
>>>> int <----------------- B::z <- B <- C::b2 <- C
>>>> any pointer <--------------------- C::p  <- C
>>>> 
>>>> The type DAG has two types of TBAA nodes:
>>>> 1> the existing scalar nodes
>>>> 2> the struct nodes (this is different from the current tbaa.struct)
>>>>  A struct node has a unique name plus a list of pairs (field name, field type).
>>>>  For example, struct node for "C" should look like
>>>>  !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
>>>>  where !3 is the struct node for "B", !2 is pointer type.
>>>> 
>>>> Given a field access
>>>> struct B *bp = ...;
>>>> bp->a.x = 5;
>>>> we annotate it as B::a.x.
>>> 
>>> In the case of multiple structures containing substructures, how are
>>> you differentiating?
>>> 
>>> IE given
>>> 
>>> struct A {
>>> struct B b;
>>> }
>>> struct C {
>>> struct B b;
>>> }
>>> 
>>> How do you know the above struct B *bp =...; is B::b from C and not B::b from A?
>>> 
>>> (I agree you can know in the case of direct aggregates, but I argue
>>> you have no way to know in the case of pointer arguments without
>>> interprocedural analysis)
>>> It gets worse the more levels you have.
>>> 
>>> ie if you add
>>> struct B {
>>> struct E e;
>>> }
>>> 
>>> and have struct E *e = ...
>>> how do you know it's the B::e contained in struct C, or the B::e
>>> contained in struct A?
>>> 
>>> 
>>> Again, i agree you can do both scalar and direct aggregates, but not
>>> aggregates and scalars through pointers.
>>> 
>> 
>> I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.
>> 
>>> 
>>>> Implementing the Hierarchy
>>>> --------------------------
>>>> We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
>>>> Each tag can be a sequence of nodes in the type DAG.
>>>> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>>>> 
>>> 
>>> This can get quite deep quite quickly.
>>> Are there actually cases where knowing the outermost aggregate type +
>>> {byte offset, size} of field access does not give you the exact same
>>> disambiguation capability?
>> The answer is no. We should be able to deduce the access path from {byte offset, size}.
>> However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.
> 
> Well, that part is easy, it's just overlap, once you know they are
> both contained in the same outermost type.

How do you figure out the outermost common type of two access 'x' and 'y'?
Once that is figured out, it should be easy to check alias('x', 'y').
> 
> You have to track what the types are in the frontend to generate the
> access path anyway, so you can compute the offsets + sizes into those
> types at the same time.
Generating the offsets+sizes are not hard.
> 
> Essentially, it it translating your representation into something that
> can be intersected in constant time, unless i'm missing something
> about how you are annotating the access paths.
If we can query alias('x','y') in constant time, irrelevant to the depth of each access, that will be great.

-Manman