[llvm-dev] RFC: Resolving TBAA issues

Mon Aug 21 14:43:32 PDT 2017

On Mon, Aug 21, 2017 at 2:19 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> On 08/20/2017 08:47 PM, Xinliang David Li via llvm-dev wrote:
>
> Hi Ivan, thanks for writing it up. This is a pretty long thread, and there
> are many good points brought up in the heated discussions.  Here is my take
> on many of the points that have been mentioned:
>
> 1) The type based aliasing IR annotation should be generic enough to
> represent aliasing constraints for any frontend languages;
> 2) The offset based aliasing rules and type based aliasing rule have a lot
> in common. The difference is that the offset based aliasing rule require
> that the two accesses to have same base pointer value, while type based
> aliasing rules specifies whether two base pointers can possibly point to
> the same underlying object, and if yes, what are the legally allowed
> offsets at which the two base pointers can be aligned;
> 3) Type specific information is usually not interesting to offset based
> aliasing rules if offsets can be folded into constant. It is when there is
> a variable involved in the indexing that makes language guaranteed bounds
> information also useful.
>
> Given the above, the type based aliasing annotation can be represented as
> list of <base_type, offset> pairs, as well as the original base pointer
> type in the source.  Each pair represent a way to access the memory legally
> allowed by the language -- this field can be accessed with a base pointer
> of type 'base_type' at 'offset'.
>
> For instance:
>
> struct A { struct B { struct C {int m, n; } c1; int i, j; } b1, b2; int k;
> };
> struct A*ap;
> struct B *bp;
> struct C *cp;
> int *ip;
>
> (1) access ap->b2.c1.n has following annotation:
>
>   {A*, [ <A, 12>, <B, 4>, <C,4>, <int, 0>] }
>
> What it means
>
>   Access (1) may only be aliased with another access if that access's base
> pointer is type A and the offset is 12, or type B at offset 4, or type C at
> offset 4, or type int at offset 0.  Also access (1) is originally accessed
> via path <A, 12>.
>
> (2) access ap->k has only
>
>   {A*, [<A, 16>, <int, 0>}
>
> (3) access bp->c1.n has
>
>   {B*, [<B, 4>, <int, 0>]}
>
> (4) Access bp->j has
>
>   {B*, [<B, 12>, <int, 0>]}
>
> (5) access *ip (implicitly) has
>
>   {int *, <int, 0>}
>
> From the above, we can see (5) is aliased with all the above, and (3) is
> also aliased with (1).
>
>
> This sounds reasonable. It is almost exactly an encoding of our current
> TBAA scheme where, instead of causing the client to walk up the tree, we
> explicitly provide the path up the tree at each access. It should also
> handle unions naturally because, regardless of the "currently live" union
> member, all potentially-aliasing accesses will have an entry in their list
> like <U, offset>.
>

>
>
>
> With this representation, the type based alias rules implementation
> becomes: given two memory accesses, align the pointer of one access to the
> other. If it can be done, they are not aliased. If yes, the use the same
> base+offset+access_size aliasing rule to check overlapping.
>
>
> I'm not entirely sure what you mean by aligning pointers (and why aligning
> them would mean they couldn't alias). Can you please explain?
>

I had a typo there. I meant 'if it can not be done, they are not aliased
...".

By aligning pointers, I meant align the base pointer of another access with
one of the allowed types in the access list.

Given:

Access (1)   {A*, [ <A, 12>, <B, 4>, <C,4>, <int, 0>] } and access (2)
 {B*, [<B, 4>, <int, 0>]},   we know that access (2)'s base pointer type is
'B*'.  Its base pointer typed then needs to be matched up against the types
in Access (1)'s list. In this case,  it finds a match, so (2)'s base
pointer will be aligned to point to the type B subobject of the access
(1).  After alignment, we then compare offset and access size to determine
overlapping.

David

>
>
>
> For languages that have rules about access bounds of member arrays, the
> offset information can be replaced with offset + range. By default, the
> range is the size of the array type.
>
>
> Yes, and I'd certainly like to include this information going forward
> (along with some flag saying whether it's legal to over/under index the
> field).
>
> Thanks again,
> Hal
>
>
>
> David
>
>
> On Wed, Aug 16, 2017 at 11:59 AM, Ivan A. Kosarev via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hal, Daniel,
>>
>> Thanks for your responses. Here's a quick formal introduction to the
>> proposed approach follows. I also attached a couple files to put some
>> visibility on implementation details. We'd love to hear from you gentlemen
>> and LLVM community in general on how this fits what you know about TBAA.
>>
>>
>> Overview
>> ========
>>
>> With this writing we propose a new approach to the TBAA mechanism
>> designed to overcome current issues such as:
>>  - inability to represent accesses to aggregates, unions, union
>>    members, bit fields, fields of union and aggregate types,
>>    including array members and
>>  - lack of a mean to identify user types defined in different
>>    translation units.
>>
>> As of today, we have a local patch that implements this approach.
>> This new implementation is known to be at least as functionally
>> complete as the current one. Additionally, with this patch on top
>> we improved SROA to propagate TBAA information, thus making sure
>> the new features work as expected.
>>
>> We should be able to upload that patch for review in a few days.
>>
>>
>> The Approach
>> ============
>>
>> With the proposed approach we represent accesses as sequences
>> that contain all accessed types and fields in order. For example
>> for this access:
>>
>>   struct T { union U { struct S { int i1, i2; } s1, s2; } u1, u2; } t;
>>   t.u1.s1.i1
>>
>> we generate an access sequence of the form:
>>
>>   [T, T::u1, U, U::s1, S, S::i1, int]
>>
>> An array is allowed to overlap with any object of its element
>> type, including other arrays of the same element type, so an
>> access to an array element is represented as an access to an
>> object of the element type:
>>
>>   int a[7];
>>   a[5]
>>
>>   [int]
>>
>> In case of a multi-dimensional array this rule applies
>> recursively:
>>
>>   int a[7][9];
>>   a[3][5]
>>
>>   [int]
>>
>> For member arrays we specify the member field as usual so we can
>> distinct it from other array members:
>>
>>   struct S { int a[7], b[9]; } s;
>>   s.a[5]
>>
>>   [S, S::a, int]
>>
>> Similarly to the scalar and struct-path approaches, we consider
>> every type to be a member of a type group it explicitly refers
>> to. Here's how the tree that describes relations between type
>> groups would look like for the example above:
>>
>>   <tbaa_root>
>>   |- <may_alias>
>>     |- <representation_byte>
>>       |-<structure>
>>       | |- S
>>       |- int
>>
>> The <vtable_pointer> group has a special meaning and is used to
>> describe accesses to virtual table pointers. Similarly, the
>> <union> type group includes all union types and used by the TBAA
>> implementation to distinct union types from other types. The
>> <may_alias> group is technically equivalent to
>> <representation_byte> and supposed to be a group for
>> may_alias-marked types.
>>
>> For two given access sequences we can determine if the accessed
>> objects are allowed to overlap by the rules of the input
>> language. Here's the list of rules complete enough to support C
>> and C++. Ellipsis elements denote sequences of zero or more
>> elements. For other input languages more rules can be supported,
>> if necessary.
>>
>>   [X...] is allowed to overlap with [S1..., X..., S2...]
>>   and the most generic access sequence is [X...].
>>
>>   [X1..., X2...] is allowed to overlap with [S1..., X1...]
>>   with the most generic access sequence to be [X1...].
>>
>>   [X1..., U, U::m1, X2...] is allowed to overlap with
>>   [S1..., X1..., U, U::m2, S2...]
>>   for a union U of an unknown effective type, provided m1 != m2
>>   and the most generic access sequence is [X1..., U].
>>
>> If neither of the given sequences contains the leading access
>> type of the other, then they are allowed to overlap if the
>> leading access type of one sequence is a direct or indirect field
>> of the final access type of the other sequence and then the most
>> generic access sequence consists of a single element, which is
>> that final access type.
>>
>> For the purpose of determining whether one type is a direct or
>> indirect member of another type unions are considered to have no
>> members as accesses to members of unions are only allowed to
>> overlap if they have the base union object explicitly specified.
>>
>> Otherwise, given sequences overlap if there is a type group that
>> includes both the leading access types and the most generic
>> access sequence consists of the smallest common type group as its
>> only element.
>>
>> See the attached TypeBasedAliasAnalysis.cpp file and specifically
>> the MatchAccessSequences() function for how these rules can be
>> implemented.
>>
>> TBAA information is encoded as metadata nodes, as usual. Load and
>> store instructions refer to access sequences:
>>   store %struct.T* %p, %struct.T** %p.addr, align 8, !tbaa !11
>>
>> A type node is either a terminal type node that names a root type
>> group:
>>   !0 = !{ !"<tbaa_root>" }
>>
>> or a non-terminal type node that names a type and refers to a
>> type group it belongs to:
>>   !1 = !{ !0, !"int" }
>>
>> Record types also refer to their field descriptors:
>>   !3 = !{ !0, !"S", !9 }
>>
>> An access node is either a terminal access node that refers to
>> the corresponding access type:
>>   !5 = !{ !1 }
>>   !9 = !{ !3 }
>>
>> or a member node that refers to a structure/class or union field
>> descriptor and a subsequent access path node:
>>   !7 = !{ !type_group, !field_id, !field_offset, !field_size }
>>   !11 = !{ !5, !9, !7 }
>>
>> For a field node the first element refers to its type. The
>> purpose of other elements is to make the field node unique. Their
>> meaning is unspecified. Currently the other members for C and C++
>> are the field name, bit offset and bit size of the member, but
>> this may change in future and front ends for other input
>> languages may act differently, so TBAA implementation in the
>> codegen shall not rely on specific shape or meaning of these
>> elements.
>>
>> For types that are interchangeable for purposes of TBAA it is
>> important to encode them identically so that descriptors of
>> interchangeable types defined in different modules merge into
>> same metadata nodes.
>>
>> Structure/class fields are specified in the order of declaration.
>> For union fields there is a canonical order that guarantee that
>> definitions of the same union type will result in identical
>> descriptors regardless of the order of member declarations.
>> Currently we sort union fields with key (field_id, field_offset,
>> field_size).
>>
>> C++ tagged types with no linkage are to be encoded as "distinct"
>> nodes to guarantee their uniqueness.
>>
>> The support for !tbaa.struct information is to be replaced with
>> plain !tbaa tags representing accesses to the corresponding
>> record types.
>>
>> Another attached file, the tbaa.cpp one, is a test case that can
>> give an idea what encoded TBAA metadata may look like.
>>
>>
>> Space and Performance Analysis
>> ==============================
>>
>> In terms of metadata size, with the new approach we generate
>> about 15% more of metadata nodes. The ratio of the total number
>> of TBAA nodes to the amount of code remains very low, meaning the
>> size of TBAA metadata is unlikely to be a problem any time soon.
>>
>> From the performance perspective, the proposed approach differs
>> from the current one in that:
>>  - it does not traverse through record types if one of the access
>>    sequences to match contains the leading access type of the
>>    other and
>>  - it never traverses union types.
>>
>> We thus expect that the new approach is at least as efficient as
>> the current one. Our experiments do not indicate any sensible
>> difference in performance between the implementations.
>>
>> --
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>
>
> _______________________________________________
> LLVM Developers mailing listllvm-dev at lists.llvm.orghttp://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170821/1bc30b80/attachment.html>