[LLVMdev] PROPOSAL: struct-access-path aware TBAA (should we use offset+size instead of path?)

Wed Mar 13 10:56:26 PDT 2013

On Mar 12, 2013, at 6:07 PM, Daniel Berlin wrote:

> On Tue, Mar 12, 2013 at 5:59 PM, Manman Ren <mren at apple.com> wrote:
>> 
>> On Mar 12, 2013, at 12:20 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>> 
>>> On Tue, Mar 12, 2013 at 10:10 AM, Manman Ren <mren at apple.com> wrote:
>>>> 
>>>> On Mar 11, 2013, at 7:52 PM, Daniel Berlin wrote:
>>>> 
>>>>> On Mon, Mar 11, 2013 at 4:56 PM, Manman Ren <mren at apple.com> wrote:
>>>>>> 
>>>>>> On Mar 11, 2013, at 4:23 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>> 
>>>>>>> On Mon, Mar 11, 2013 at 3:45 PM, Manman Ren <mren at apple.com> wrote:
>>>>>>>> 
>>>>>>>> On Mar 11, 2013, at 2:37 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>>>> 
>>>>>>>>> On Mon, Mar 11, 2013 at 2:06 PM, Manman Ren <mren at apple.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Mar 11, 2013, at 1:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> On Mon, Mar 11, 2013 at 11:41 AM, Manman Ren <mren at apple.com> wrote:
>>>>>>>>>>>> Based on discussions with John McCall
>>>>>>>>>>>> 
>>>>>>>>>>>> We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.
>>>>>>>>>>>> 
>>>>>>>>>>>> Fundamental rules from C11
>>>>>>>>>>>> --------------------------
>>>>>>>>>>>> An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
>>>>>>>>>>>> 1. a type compatible with the effective type of the object,
>>>>>>>>>>>> 2. a qualified version of a type compatible with the effective type of the object,
>>>>>>>>>>>> 3. a type that is the signed or unsigned type corresponding to the effective type of the object,
>>>>>>>>>>>> 4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
>>>>>>>>>>>> 5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
>>>>>>>>>>>> 6. a character type.
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> Example
>>>>>>>>>>>> -------
>>>>>>>>>>>> struct A {
>>>>>>>>>>>> int x;
>>>>>>>>>>>> int y;
>>>>>>>>>>>> };
>>>>>>>>>>>> struct B {
>>>>>>>>>>>> A a;
>>>>>>>>>>>> int z;
>>>>>>>>>>>> };
>>>>>>>>>>>> struct C {
>>>>>>>>>>>> B b1;
>>>>>>>>>>>> B b2;
>>>>>>>>>>>> int *p;
>>>>>>>>>>>> };
>>>>>>>>>>>> 
>>>>>>>>>>>> Type DAG:
>>>>>>>>>>>> int <- A::x <- A
>>>>>>>>>>>> int <- A::y <- A <- B::a <- B <- C::b1 <- C
>>>>>>>>>>>> int <----------------- B::z <- B <- C::b2 <- C
>>>>>>>>>>>> any pointer <--------------------- C::p  <- C
>>>>>>>>>>>> 
>>>>>>>>>>>> The type DAG has two types of TBAA nodes:
>>>>>>>>>>>> 1> the existing scalar nodes
>>>>>>>>>>>> 2> the struct nodes (this is different from the current tbaa.struct)
>>>>>>>>>>>> A struct node has a unique name plus a list of pairs (field name, field type).
>>>>>>>>>>>> For example, struct node for "C" should look like
>>>>>>>>>>>> !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
>>>>>>>>>>>> where !3 is the struct node for "B", !2 is pointer type.
>>>>>>>>>>>> 
>>>>>>>>>>>> Given a field access
>>>>>>>>>>>> struct B *bp = ...;
>>>>>>>>>>>> bp->a.x = 5;
>>>>>>>>>>>> we annotate it as B::a.x.
>>>>>>>>>>> 
>>>>>>>>>>> In the case of multiple structures containing substructures, how are
>>>>>>>>>>> you differentiating?
>>>>>>>>>>> 
>>>>>>>>>>> IE given
>>>>>>>>>>> 
>>>>>>>>>>> struct A {
>>>>>>>>>>> struct B b;
>>>>>>>>>>> }
>>>>>>>>>>> struct C {
>>>>>>>>>>> struct B b;
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> How do you know the above struct B *bp =...; is B::b from C and not B::b from A?
>>>>>>>>>>> 
>>>>>>>>>>> (I agree you can know in the case of direct aggregates, but I argue
>>>>>>>>>>> you have no way to know in the case of pointer arguments without
>>>>>>>>>>> interprocedural analysis)
>>>>>>>>>>> It gets worse the more levels you have.
>>>>>>>>>>> 
>>>>>>>>>>> ie if you add
>>>>>>>>>>> struct B {
>>>>>>>>>>> struct E e;
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> and have struct E *e = ...
>>>>>>>>>>> how do you know it's the B::e contained in struct C, or the B::e
>>>>>>>>>>> contained in struct A?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Again, i agree you can do both scalar and direct aggregates, but not
>>>>>>>>>>> aggregates and scalars through pointers.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> Implementing the Hierarchy
>>>>>>>>>>>> --------------------------
>>>>>>>>>>>> We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
>>>>>>>>>>>> Each tag can be a sequence of nodes in the type DAG.
>>>>>>>>>>>> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> This can get quite deep quite quickly.
>>>>>>>>>>> Are there actually cases where knowing the outermost aggregate type +
>>>>>>>>>>> {byte offset, size} of field access does not give you the exact same
>>>>>>>>>>> disambiguation capability?
>>>>>>>>>> The answer is no. We should be able to deduce the access path from {byte offset, size}.
>>>>>>>>>> However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.
>>>>>>>>> 
>>>>>>>>> Well, that part is easy, it's just overlap, once you know they are
>>>>>>>>> both contained in the same outermost type.
>>>>>>>> 
>>>>>>>> How do you figure out the outermost common type of two access 'x' and 'y'?
>>>>>>> 
>>>>>>> How are you annotating each access with a path as you go right now?
>>>>>>> It should be the same thing to just annotate it then.
>>>>>> 
>>>>>> I thought your suggestion was to replace
>>>>>> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>>>>>> with
>>>>>> !C::b2.a.x := [ "tbaa.offset", !C, offset within C, size ]
>>>>>> 
>>>>>> !B::a := [ "tbaa.path", !B, "B:a", !A ]
>>>>>> with
>>>>>> !B::a := [ "tbaa.offset", !B, offset within B, size]
>>>>>> 
>>>>> 
>>>>> Yes, it is.
>>>>> 
>>>>>> Then we already lost the access path at IR level.
>>>>> 
>>>>> But you are *generating* the above metadata at the clang level, no?
>>>>> That is were I meant you should annotate them as you go.
>>>> I may not get what you are saying.
>>>> But if we don't keep it in the LLVM IR, then the information is not available at the IR level.
>>> 
>>> I agree, but it's not being regenerated from the LLVM IR, it's being
>>> generated *into* the LLVM IR by clang or something.
>>> I'm saying you can generate the offset, size info there as well.
>> 
>> 
>> I thought your suggestion was to replace tbaa.path because it "can get quite deep quite quickly",
> 
> Yes.  Your access path based mechanism is going to basically try to
> find common prefixes or suffixes, which is expensive.
Yes, handling alias queries based on access path is similar to finding common prefixes/suffixes.
As given in the proposal, rule(x,y) where 'x' and 'y' are access paths:
Check the first element of 'y', if it is not in 'x', return rule_case1(x,y)
Check the next element of 'y' with the next element of 'x', if not the same, return false.
When we reach the end of either 'x' or 'y', return true.

> 
> 
> 
>> but here you are saying as well.
>> So clarification: you are suggesting to add {offset, size} into metadata tbaa.path, right :)
> No, i'm suggesting you replace it.
> 
>> 
>> I am going to add a few examples here:
>> Given
>> struct A {
>>   int x;
>>   int y;
>> };
>> struct B {
>>   A a;
>>   int z;
>> };
>> struct C {
>>   B b1;
>>   B b2;
>>   int *p;
>> };
>> struct D {
>>   C c;
>> };
>> 
>> with the proposed struct-access-path aware TBAA, we will say "C::b1.a" will alias with "B::a.x", "C::b1.a" will alias with "B",
>> "C::b1.a" does not alias with "D::c.b2.a.x".
>> 
>> The proposal is about the format of metadata in IR and how to implement alias queries in IR:
> 
> Yes, and when I suggested replacing it, you said my replacement would
> be difficult to generate.
That is the point of misunderstanding :]
Generating offset+size is not hard.
I was asking about how to answer alias(x,y) if we only have offset+size+outmost type.

Let's call the alias rule with access path: alias_path(x,y)
and call the alias rule with offset+size: alias_offset(x,y)
In order to have same answers from alias_path and alias_offset, I thought we have to regenerate the access path at IR level from the offset+size+outmost type information.
The regeneration is not easy.

> So what i started to ask is:
> What is generating this metadata?
> Clang?
> Something else?
> What is the actual algorithm you plan on using to generate it?
> 
> I'm trying to understand how you plan on actually implementing the
> metadata generation in order to give you a suggestion of how you would
> generate differently structured metadata that, while conveying the
> same information, would be able to be queried faster.

"C::b1.a" will be annotated as {!C, offset to C, size of A}
"B::a.x" will be annotated as {!B, offset to B, size of int}
"B" will be annotated as {!B, 0, size of B}
"D::c.b2.a.x" will be annotated as {!D, offset to D, size of int}

alias_path("C::b1.a", "B::a.x") = true
alias_path("C::b1.a", "B") = true
alias_path("C::b1.a" "D::c.b2.a.x") = false
What about
alias_offset("C::b1.a", "B::a.x") 
alias_offset("C::b1.a", "B")
alias_offset("C::b1.a", "D::c.b2.a.x")

Thanks,
Manman
> 
> 
>> A> metadata for the type DAG
>>  1> the existing scalar nodes
>>  2> the struct nodes (this is different from the current tbaa.struct)
>>  A struct node has a unique name plus a list of pairs (field name, field type).
>>  For example, struct node for "C" should look like
>>  !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
>>  where !3 is the struct node for "B", !2 is pointer type.
>> 
>>  An example type DAG:
>>  int <- "A::x" <- A
>>  int <- "A::y" <- A <- "B::a" <- B <- "C::b1" <- C
>>  int <------------------- "B::z" <- B <- "C::b2" <- C
>>  any pointer <-------------------------- "C::p"  <- C
>> B> We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
>>  Each tag can be a sequence of nodes in the type DAG.
>>  !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>>  where !C, !B, !A are metadata nodes in the type DAG, "C::b2" "B::a" "A::x" are strings.
>> 
>> I also had some discussion with Shuxin, we can start by a conservative and simpler implementation:
>> A> metadata for the type DAG
>>  1> the existing scalar nodes
>>  2> the struct nodes
>>  A struct node has a unique name plus a list of enclosed types
>>  For example, struct node for "C" should look like
>>  !4 = metadata !{"C", metadata !3, metadata !2}
>> 
>>  An example type DAG:
>>  int <------------- A <- B <-- C
>>  int <-------------------- B
>>  any pointer <----------- <- C
>> B> Each tag will have the last field access.
>>  !A.x := [ "tbaa.path", !A, "A::x", !int ] will be attached to field access C::b2.a.x
>>  The actual access has extra contextual information than the tag and the tag is more conservative than the actual access.
>>  We can increase number of accesses in the tag to get more accurate alias result.
>> 
>> -Manman
>> 
>>> 
>>> In the interest of trying to not talk past each other, let's start at
>>> the beginning a bit
>>> 
>>> 1. When in the compilation process were you planning on generating the
>>> tbaa.struct metadata
>>> 2. How were you planning on generating it (IE given a bunch of source
>>> code or clang AST's, what was the algorithm being used to generate
>>> it)?
>>> 
>>>> For alias query at IR level, we have to somehow regenerate the information if possible.
>>>> 
>>>>> 
>>>>>> 
>>>>>>> Without thinking too hard:
>>>>>>> If it's all direct aggregate access in the original source, whatever
>>>>>>> you see as the leftmost part should work.
>>>>>>> 
>>>>>>> I haven't thought hard about it, but I believe it should work.
>>>>>> 
>>>>>> 
>>>>>> At llvm IR level, it is not that straight-forward to figure out the outermost common type of C::b2.a.x and B::a with tbaa.offset.
>>>>> 
>>>>> The metadata should already have the outermost common type when it's
>>>>> output, so it should just be there at the LLVM level.
>>>> The outermost common type is between two accesses, and the metadata is attached to each access.
>>>> I don't quite get why the metadata should have the outermost common type.
>>> 
>>> You need some identifier to differentiate what it is an offset + size
>>> into.  You don't actually need the type, just a name.
>>> The same way you have identifiers named "C" so that you can tell that
>>> "C", "B", "B!a.x" is not the same path as "D", "B", "B!a.x".
>>> 
>>> You need to do the same thing here, so that if you have "C", "{0, 4}"
>>> and "D", "{0, 4"}, you know that they don't alias because C != D.
>> 
>>> 
>>> 
>>>> 
>>>> Thanks,
>>>> Manman
>>>>> 
>>>>>> To me, the outermost common type should be "B", then we need to check the relative offset from B for both accesses
>>>>> Agreed.
>>>>> 
>>>>>> 
>>>>>> -Manman
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Note that again, neither works for pointers unless you can guarantee
>>>>>>> you have the entire access path, which is hard:
>>>>>>> 
>>>>>>> union u {
>>>>>>> struct A a;
>>>>>>> struct B b;
>>>>>>> };
>>>>>>> 
>>>>>>> int foo(struct A *a, struct B *b) {
>>>>>>> ....
>>>>>>> 
>>>>>>> }
>>>>>>> 
>>>>>>> void bar() {
>>>>>>> union u whoops;
>>>>>>> foo(&whoops.a, &whoops.b);
>>>>>>> }
>>>>>>> 
>>>>>>> If you annotated foo with a partial access path, you may or may not
>>>>>>> get the right answer.
>>>>>>> You would think your type dag solves this, because you'll see they
>>>>>>> share a supertype.
>>>>>>> 
>>>>>>> But it's actually worse than that:
>>>>>>> 
>>>>>>> file: bar.h
>>>>>>> union u {
>>>>>>> struct A a;
>>>>>>> struct B b;
>>>>>>> };
>>>>>>> 
>>>>>>> int foo(struct A *, struct B*);
>>>>>>> 
>>>>>>> file: foo.c
>>>>>>> 
>>>>>>> int foo(struct A *a, struct B *b) {
>>>>>>> ....
>>>>>>> 
>>>>>>> }
>>>>>>> 
>>>>>>> file: bar.c
>>>>>>> 
>>>>>>> #include "bar.h"
>>>>>>> 
>>>>>>> void bar() {
>>>>>>> union u whoops;
>>>>>>> foo(&whoops.a, &whoops.b);
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> You will never see compiling  foo.c that tells you both types are in a
>>>>>>> union together, or how to annotate these access paths.
>>>>>>> 
>>>>>>> Hence the "you need to punt on pointers without more analysis" claim I made :)
>>>>>>> 
>>>>>>> 
>>>>>>>> Once that is figured out, it should be easy to check alias('x', 'y').
>>>>>>>>> 
>>>>>>>>> You have to track what the types are in the frontend to generate the
>>>>>>>>> access path anyway, so you can compute the offsets + sizes into those
>>>>>>>>> types at the same time.
>>>>>>>> Generating the offsets+sizes are not hard.
>>>>>>>>> 
>>>>>>>>> Essentially, it it translating your representation into something that
>>>>>>>>> can be intersected in constant time, unless i'm missing something
>>>>>>>>> about how you are annotating the access paths.
>>>>>>>> If we can query alias('x','y') in constant time, irrelevant to the depth of each access, that will be great.
>>>>>>>> 
>>>>>>>> -Manman
>>>>>>>> 
>>>>>> 
>>>> 
>>