[llvm-dev] RFC: Resolving TBAA issues

Wed Aug 16 11:59:23 PDT 2017

Hal, Daniel,

Thanks for your responses. Here's a quick formal introduction to the 
proposed approach follows. I also attached a couple files to put some 
visibility on implementation details. We'd love to hear from you 
gentlemen and LLVM community in general on how this fits what you know 
about TBAA.

Overview
========

With this writing we propose a new approach to the TBAA mechanism
designed to overcome current issues such as:
  - inability to represent accesses to aggregates, unions, union
    members, bit fields, fields of union and aggregate types,
    including array members and
  - lack of a mean to identify user types defined in different
    translation units.

As of today, we have a local patch that implements this approach.
This new implementation is known to be at least as functionally
complete as the current one. Additionally, with this patch on top
we improved SROA to propagate TBAA information, thus making sure
the new features work as expected.

We should be able to upload that patch for review in a few days.

The Approach
============

With the proposed approach we represent accesses as sequences
that contain all accessed types and fields in order. For example
for this access:

   struct T { union U { struct S { int i1, i2; } s1, s2; } u1, u2; } t;
   t.u1.s1.i1

we generate an access sequence of the form:

   [T, T::u1, U, U::s1, S, S::i1, int]

An array is allowed to overlap with any object of its element
type, including other arrays of the same element type, so an
access to an array element is represented as an access to an
object of the element type:

   int a[7];
   a[5]

   [int]

In case of a multi-dimensional array this rule applies
recursively:

   int a[7][9];
   a[3][5]

   [int]

For member arrays we specify the member field as usual so we can
distinct it from other array members:

   struct S { int a[7], b[9]; } s;
   s.a[5]

   [S, S::a, int]

Similarly to the scalar and struct-path approaches, we consider
every type to be a member of a type group it explicitly refers
to. Here's how the tree that describes relations between type
groups would look like for the example above:

   <tbaa_root>
   |- <may_alias>
     |- <representation_byte>
       |-<structure>
       | |- S
       |- int

The <vtable_pointer> group has a special meaning and is used to
describe accesses to virtual table pointers. Similarly, the
<union> type group includes all union types and used by the TBAA
implementation to distinct union types from other types. The
<may_alias> group is technically equivalent to
<representation_byte> and supposed to be a group for
may_alias-marked types.

For two given access sequences we can determine if the accessed
objects are allowed to overlap by the rules of the input
language. Here's the list of rules complete enough to support C
and C++. Ellipsis elements denote sequences of zero or more
elements. For other input languages more rules can be supported,
if necessary.

   [X...] is allowed to overlap with [S1..., X..., S2...]
   and the most generic access sequence is [X...].

   [X1..., X2...] is allowed to overlap with [S1..., X1...]
   with the most generic access sequence to be [X1...].

   [X1..., U, U::m1, X2...] is allowed to overlap with
   [S1..., X1..., U, U::m2, S2...]
   for a union U of an unknown effective type, provided m1 != m2
   and the most generic access sequence is [X1..., U].

If neither of the given sequences contains the leading access
type of the other, then they are allowed to overlap if the
leading access type of one sequence is a direct or indirect field
of the final access type of the other sequence and then the most
generic access sequence consists of a single element, which is
that final access type.

For the purpose of determining whether one type is a direct or
indirect member of another type unions are considered to have no
members as accesses to members of unions are only allowed to
overlap if they have the base union object explicitly specified.

Otherwise, given sequences overlap if there is a type group that
includes both the leading access types and the most generic
access sequence consists of the smallest common type group as its
only element.

See the attached TypeBasedAliasAnalysis.cpp file and specifically
the MatchAccessSequences() function for how these rules can be
implemented.

TBAA information is encoded as metadata nodes, as usual. Load and
store instructions refer to access sequences:
   store %struct.T* %p, %struct.T** %p.addr, align 8, !tbaa !11

A type node is either a terminal type node that names a root type
group:
   !0 = !{ !"<tbaa_root>" }

or a non-terminal type node that names a type and refers to a
type group it belongs to:
   !1 = !{ !0, !"int" }

Record types also refer to their field descriptors:
   !3 = !{ !0, !"S", !9 }

An access node is either a terminal access node that refers to
the corresponding access type:
   !5 = !{ !1 }
   !9 = !{ !3 }

or a member node that refers to a structure/class or union field
descriptor and a subsequent access path node:
   !7 = !{ !type_group, !field_id, !field_offset, !field_size }
   !11 = !{ !5, !9, !7 }

For a field node the first element refers to its type. The
purpose of other elements is to make the field node unique. Their
meaning is unspecified. Currently the other members for C and C++
are the field name, bit offset and bit size of the member, but
this may change in future and front ends for other input
languages may act differently, so TBAA implementation in the
codegen shall not rely on specific shape or meaning of these
elements.

For types that are interchangeable for purposes of TBAA it is
important to encode them identically so that descriptors of
interchangeable types defined in different modules merge into
same metadata nodes.

Structure/class fields are specified in the order of declaration.
For union fields there is a canonical order that guarantee that
definitions of the same union type will result in identical
descriptors regardless of the order of member declarations.
Currently we sort union fields with key (field_id, field_offset,
field_size).

C++ tagged types with no linkage are to be encoded as "distinct"
nodes to guarantee their uniqueness.

The support for !tbaa.struct information is to be replaced with
plain !tbaa tags representing accesses to the corresponding
record types.

Another attached file, the tbaa.cpp one, is a test case that can
give an idea what encoded TBAA metadata may look like.

Space and Performance Analysis
==============================

In terms of metadata size, with the new approach we generate
about 15% more of metadata nodes. The ratio of the total number
of TBAA nodes to the amount of code remains very low, meaning the
size of TBAA metadata is unlikely to be a problem any time soon.

 From the performance perspective, the proposed approach differs
from the current one in that:
  - it does not traverse through record types if one of the access
    sequences to match contains the leading access type of the
    other and
  - it never traverses union types.

We thus expect that the new approach is at least as efficient as
the current one. Our experiments do not indicate any sensible
difference in performance between the implementations.

-- 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: tbaa.cpp
Type: text/x-c++src
Size: 11239 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170816/09ae1d7e/attachment.cpp>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TypeBasedAliasAnalysis.cpp
Type: text/x-c++src
Size: 24293 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170816/09ae1d7e/attachment-0001.cpp>