[LLVMdev] PROPOSAL: struct-access-path aware TBAA (should we use offset+size instead of path?)
Manman Ren
mren at apple.com
Wed Mar 13 10:56:26 PDT 2013
On Mar 12, 2013, at 6:07 PM, Daniel Berlin wrote:
> On Tue, Mar 12, 2013 at 5:59 PM, Manman Ren <mren at apple.com> wrote:
>>
>> On Mar 12, 2013, at 12:20 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>
>>> On Tue, Mar 12, 2013 at 10:10 AM, Manman Ren <mren at apple.com> wrote:
>>>>
>>>> On Mar 11, 2013, at 7:52 PM, Daniel Berlin wrote:
>>>>
>>>>> On Mon, Mar 11, 2013 at 4:56 PM, Manman Ren <mren at apple.com> wrote:
>>>>>>
>>>>>> On Mar 11, 2013, at 4:23 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>>
>>>>>>> On Mon, Mar 11, 2013 at 3:45 PM, Manman Ren <mren at apple.com> wrote:
>>>>>>>>
>>>>>>>> On Mar 11, 2013, at 2:37 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>>>>
>>>>>>>>> On Mon, Mar 11, 2013 at 2:06 PM, Manman Ren <mren at apple.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Mar 11, 2013, at 1:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 11, 2013 at 11:41 AM, Manman Ren <mren at apple.com> wrote:
>>>>>>>>>>>> Based on discussions with John McCall
>>>>>>>>>>>>
>>>>>>>>>>>> We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.
>>>>>>>>>>>>
>>>>>>>>>>>> Fundamental rules from C11
>>>>>>>>>>>> --------------------------
>>>>>>>>>>>> An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
>>>>>>>>>>>> 1. a type compatible with the effective type of the object,
>>>>>>>>>>>> 2. a qualified version of a type compatible with the effective type of the object,
>>>>>>>>>>>> 3. a type that is the signed or unsigned type corresponding to the effective type of the object,
>>>>>>>>>>>> 4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
>>>>>>>>>>>> 5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
>>>>>>>>>>>> 6. a character type.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Example
>>>>>>>>>>>> -------
>>>>>>>>>>>> struct A {
>>>>>>>>>>>> int x;
>>>>>>>>>>>> int y;
>>>>>>>>>>>> };
>>>>>>>>>>>> struct B {
>>>>>>>>>>>> A a;
>>>>>>>>>>>> int z;
>>>>>>>>>>>> };
>>>>>>>>>>>> struct C {
>>>>>>>>>>>> B b1;
>>>>>>>>>>>> B b2;
>>>>>>>>>>>> int *p;
>>>>>>>>>>>> };
>>>>>>>>>>>>
>>>>>>>>>>>> Type DAG:
>>>>>>>>>>>> int <- A::x <- A
>>>>>>>>>>>> int <- A::y <- A <- B::a <- B <- C::b1 <- C
>>>>>>>>>>>> int <----------------- B::z <- B <- C::b2 <- C
>>>>>>>>>>>> any pointer <--------------------- C::p <- C
>>>>>>>>>>>>
>>>>>>>>>>>> The type DAG has two types of TBAA nodes:
>>>>>>>>>>>> 1> the existing scalar nodes
>>>>>>>>>>>> 2> the struct nodes (this is different from the current tbaa.struct)
>>>>>>>>>>>> A struct node has a unique name plus a list of pairs (field name, field type).
>>>>>>>>>>>> For example, struct node for "C" should look like
>>>>>>>>>>>> !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
>>>>>>>>>>>> where !3 is the struct node for "B", !2 is pointer type.
>>>>>>>>>>>>
>>>>>>>>>>>> Given a field access
>>>>>>>>>>>> struct B *bp = ...;
>>>>>>>>>>>> bp->a.x = 5;
>>>>>>>>>>>> we annotate it as B::a.x.
>>>>>>>>>>>
>>>>>>>>>>> In the case of multiple structures containing substructures, how are
>>>>>>>>>>> you differentiating?
>>>>>>>>>>>
>>>>>>>>>>> IE given
>>>>>>>>>>>
>>>>>>>>>>> struct A {
>>>>>>>>>>> struct B b;
>>>>>>>>>>> }
>>>>>>>>>>> struct C {
>>>>>>>>>>> struct B b;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> How do you know the above struct B *bp =...; is B::b from C and not B::b from A?
>>>>>>>>>>>
>>>>>>>>>>> (I agree you can know in the case of direct aggregates, but I argue
>>>>>>>>>>> you have no way to know in the case of pointer arguments without
>>>>>>>>>>> interprocedural analysis)
>>>>>>>>>>> It gets worse the more levels you have.
>>>>>>>>>>>
>>>>>>>>>>> ie if you add
>>>>>>>>>>> struct B {
>>>>>>>>>>> struct E e;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> and have struct E *e = ...
>>>>>>>>>>> how do you know it's the B::e contained in struct C, or the B::e
>>>>>>>>>>> contained in struct A?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Again, i agree you can do both scalar and direct aggregates, but not
>>>>>>>>>>> aggregates and scalars through pointers.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Implementing the Hierarchy
>>>>>>>>>>>> --------------------------
>>>>>>>>>>>> We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
>>>>>>>>>>>> Each tag can be a sequence of nodes in the type DAG.
>>>>>>>>>>>> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This can get quite deep quite quickly.
>>>>>>>>>>> Are there actually cases where knowing the outermost aggregate type +
>>>>>>>>>>> {byte offset, size} of field access does not give you the exact same
>>>>>>>>>>> disambiguation capability?
>>>>>>>>>> The answer is no. We should be able to deduce the access path from {byte offset, size}.
>>>>>>>>>> However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.
>>>>>>>>>
>>>>>>>>> Well, that part is easy, it's just overlap, once you know they are
>>>>>>>>> both contained in the same outermost type.
>>>>>>>>
>>>>>>>> How do you figure out the outermost common type of two access 'x' and 'y'?
>>>>>>>
>>>>>>> How are you annotating each access with a path as you go right now?
>>>>>>> It should be the same thing to just annotate it then.
>>>>>>
>>>>>> I thought your suggestion was to replace
>>>>>> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>>>>>> with
>>>>>> !C::b2.a.x := [ "tbaa.offset", !C, offset within C, size ]
>>>>>>
>>>>>> !B::a := [ "tbaa.path", !B, "B:a", !A ]
>>>>>> with
>>>>>> !B::a := [ "tbaa.offset", !B, offset within B, size]
>>>>>>
>>>>>
>>>>> Yes, it is.
>>>>>
>>>>>> Then we already lost the access path at IR level.
>>>>>
>>>>> But you are *generating* the above metadata at the clang level, no?
>>>>> That is were I meant you should annotate them as you go.
>>>> I may not get what you are saying.
>>>> But if we don't keep it in the LLVM IR, then the information is not available at the IR level.
>>>
>>> I agree, but it's not being regenerated from the LLVM IR, it's being
>>> generated *into* the LLVM IR by clang or something.
>>> I'm saying you can generate the offset, size info there as well.
>>
>>
>> I thought your suggestion was to replace tbaa.path because it "can get quite deep quite quickly",
>
> Yes. Your access path based mechanism is going to basically try to
> find common prefixes or suffixes, which is expensive.
Yes, handling alias queries based on access path is similar to finding common prefixes/suffixes.
As given in the proposal, rule(x,y) where 'x' and 'y' are access paths:
Check the first element of 'y', if it is not in 'x', return rule_case1(x,y)
Check the next element of 'y' with the next element of 'x', if not the same, return false.
When we reach the end of either 'x' or 'y', return true.
>
>
>
>> but here you are saying as well.
>> So clarification: you are suggesting to add {offset, size} into metadata tbaa.path, right :)
> No, i'm suggesting you replace it.
>
>>
>> I am going to add a few examples here:
>> Given
>> struct A {
>> int x;
>> int y;
>> };
>> struct B {
>> A a;
>> int z;
>> };
>> struct C {
>> B b1;
>> B b2;
>> int *p;
>> };
>> struct D {
>> C c;
>> };
>>
>> with the proposed struct-access-path aware TBAA, we will say "C::b1.a" will alias with "B::a.x", "C::b1.a" will alias with "B",
>> "C::b1.a" does not alias with "D::c.b2.a.x".
>>
>> The proposal is about the format of metadata in IR and how to implement alias queries in IR:
>
> Yes, and when I suggested replacing it, you said my replacement would
> be difficult to generate.
That is the point of misunderstanding :]
Generating offset+size is not hard.
I was asking about how to answer alias(x,y) if we only have offset+size+outmost type.
Let's call the alias rule with access path: alias_path(x,y)
and call the alias rule with offset+size: alias_offset(x,y)
In order to have same answers from alias_path and alias_offset, I thought we have to regenerate the access path at IR level from the offset+size+outmost type information.
The regeneration is not easy.
> So what i started to ask is:
> What is generating this metadata?
> Clang?
> Something else?
> What is the actual algorithm you plan on using to generate it?
>
> I'm trying to understand how you plan on actually implementing the
> metadata generation in order to give you a suggestion of how you would
> generate differently structured metadata that, while conveying the
> same information, would be able to be queried faster.
"C::b1.a" will be annotated as {!C, offset to C, size of A}
"B::a.x" will be annotated as {!B, offset to B, size of int}
"B" will be annotated as {!B, 0, size of B}
"D::c.b2.a.x" will be annotated as {!D, offset to D, size of int}
alias_path("C::b1.a", "B::a.x") = true
alias_path("C::b1.a", "B") = true
alias_path("C::b1.a" "D::c.b2.a.x") = false
What about
alias_offset("C::b1.a", "B::a.x")
alias_offset("C::b1.a", "B")
alias_offset("C::b1.a", "D::c.b2.a.x")
Thanks,
Manman
>
>
>> A> metadata for the type DAG
>> 1> the existing scalar nodes
>> 2> the struct nodes (this is different from the current tbaa.struct)
>> A struct node has a unique name plus a list of pairs (field name, field type).
>> For example, struct node for "C" should look like
>> !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
>> where !3 is the struct node for "B", !2 is pointer type.
>>
>> An example type DAG:
>> int <- "A::x" <- A
>> int <- "A::y" <- A <- "B::a" <- B <- "C::b1" <- C
>> int <------------------- "B::z" <- B <- "C::b2" <- C
>> any pointer <-------------------------- "C::p" <- C
>> B> We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
>> Each tag can be a sequence of nodes in the type DAG.
>> !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
>> where !C, !B, !A are metadata nodes in the type DAG, "C::b2" "B::a" "A::x" are strings.
>>
>> I also had some discussion with Shuxin, we can start by a conservative and simpler implementation:
>> A> metadata for the type DAG
>> 1> the existing scalar nodes
>> 2> the struct nodes
>> A struct node has a unique name plus a list of enclosed types
>> For example, struct node for "C" should look like
>> !4 = metadata !{"C", metadata !3, metadata !2}
>>
>> An example type DAG:
>> int <------------- A <- B <-- C
>> int <-------------------- B
>> any pointer <----------- <- C
>> B> Each tag will have the last field access.
>> !A.x := [ "tbaa.path", !A, "A::x", !int ] will be attached to field access C::b2.a.x
>> The actual access has extra contextual information than the tag and the tag is more conservative than the actual access.
>> We can increase number of accesses in the tag to get more accurate alias result.
>>
>> -Manman
>>
>>>
>>> In the interest of trying to not talk past each other, let's start at
>>> the beginning a bit
>>>
>>> 1. When in the compilation process were you planning on generating the
>>> tbaa.struct metadata
>>> 2. How were you planning on generating it (IE given a bunch of source
>>> code or clang AST's, what was the algorithm being used to generate
>>> it)?
>>>
>>>> For alias query at IR level, we have to somehow regenerate the information if possible.
>>>>
>>>>>
>>>>>>
>>>>>>> Without thinking too hard:
>>>>>>> If it's all direct aggregate access in the original source, whatever
>>>>>>> you see as the leftmost part should work.
>>>>>>>
>>>>>>> I haven't thought hard about it, but I believe it should work.
>>>>>>
>>>>>>
>>>>>> At llvm IR level, it is not that straight-forward to figure out the outermost common type of C::b2.a.x and B::a with tbaa.offset.
>>>>>
>>>>> The metadata should already have the outermost common type when it's
>>>>> output, so it should just be there at the LLVM level.
>>>> The outermost common type is between two accesses, and the metadata is attached to each access.
>>>> I don't quite get why the metadata should have the outermost common type.
>>>
>>> You need some identifier to differentiate what it is an offset + size
>>> into. You don't actually need the type, just a name.
>>> The same way you have identifiers named "C" so that you can tell that
>>> "C", "B", "B!a.x" is not the same path as "D", "B", "B!a.x".
>>>
>>> You need to do the same thing here, so that if you have "C", "{0, 4}"
>>> and "D", "{0, 4"}, you know that they don't alias because C != D.
>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> Manman
>>>>>
>>>>>> To me, the outermost common type should be "B", then we need to check the relative offset from B for both accesses
>>>>> Agreed.
>>>>>
>>>>>>
>>>>>> -Manman
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Note that again, neither works for pointers unless you can guarantee
>>>>>>> you have the entire access path, which is hard:
>>>>>>>
>>>>>>> union u {
>>>>>>> struct A a;
>>>>>>> struct B b;
>>>>>>> };
>>>>>>>
>>>>>>> int foo(struct A *a, struct B *b) {
>>>>>>> ....
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> void bar() {
>>>>>>> union u whoops;
>>>>>>> foo(&whoops.a, &whoops.b);
>>>>>>> }
>>>>>>>
>>>>>>> If you annotated foo with a partial access path, you may or may not
>>>>>>> get the right answer.
>>>>>>> You would think your type dag solves this, because you'll see they
>>>>>>> share a supertype.
>>>>>>>
>>>>>>> But it's actually worse than that:
>>>>>>>
>>>>>>> file: bar.h
>>>>>>> union u {
>>>>>>> struct A a;
>>>>>>> struct B b;
>>>>>>> };
>>>>>>>
>>>>>>> int foo(struct A *, struct B*);
>>>>>>>
>>>>>>> file: foo.c
>>>>>>>
>>>>>>> int foo(struct A *a, struct B *b) {
>>>>>>> ....
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> file: bar.c
>>>>>>>
>>>>>>> #include "bar.h"
>>>>>>>
>>>>>>> void bar() {
>>>>>>> union u whoops;
>>>>>>> foo(&whoops.a, &whoops.b);
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> You will never see compiling foo.c that tells you both types are in a
>>>>>>> union together, or how to annotate these access paths.
>>>>>>>
>>>>>>> Hence the "you need to punt on pointers without more analysis" claim I made :)
>>>>>>>
>>>>>>>
>>>>>>>> Once that is figured out, it should be easy to check alias('x', 'y').
>>>>>>>>>
>>>>>>>>> You have to track what the types are in the frontend to generate the
>>>>>>>>> access path anyway, so you can compute the offsets + sizes into those
>>>>>>>>> types at the same time.
>>>>>>>> Generating the offsets+sizes are not hard.
>>>>>>>>>
>>>>>>>>> Essentially, it it translating your representation into something that
>>>>>>>>> can be intersected in constant time, unless i'm missing something
>>>>>>>>> about how you are annotating the access paths.
>>>>>>>> If we can query alias('x','y') in constant time, irrelevant to the depth of each access, that will be great.
>>>>>>>>
>>>>>>>> -Manman
>>>>>>>>
>>>>>>
>>>>
>>
More information about the llvm-dev
mailing list