[llvm-dev] RFC: Representing unions in TBAA

Thu Mar 9 11:41:02 PST 2017

On 03/09/2017 01:15 PM, Daniel Berlin wrote:
>
>
> On Thu, Mar 9, 2017 at 9:57 AM, Hal Finkel <hfinkel at anl.gov 
> <mailto:hfinkel at anl.gov>> wrote:
>
>
>     On 03/01/2017 05:30 PM, Daniel Berlin via llvm-dev wrote:
>>     So, https://bugs.llvm.org/show_bug.cgi?id=32056
>>     <https://bugs.llvm.org/show_bug.cgi?id=32056> is an example
>>     showing our current TBAA tree for union generation is definitely
>>     irretrievably broken.
>>     I'll be honest here. I'm pretty sure your proposal doesn't go far
>>     enough.
>>     But truthfully,  I would rather see us come closer to a
>>     representation we know works, which is GCC's.
>>     Let me try to simplify what you are suggesting, and what we have.
>>     Our current representation is basically inverted from GCC, but we
>>     don't allow things that would enable it to work.
>>
>>     Given
>>     union {int a, short b};
>>
>>     GCC's will be:
>>
>>      union
>>       /   \
>>     short int
>>
>>
>>     Everything is implicitly a subset of alias set 0 (which for C/C++
>>     is char) just to avoid representing it.
>>
>>     Our metadata has no child links, and only a single parent link.
>>
>>     You can't represent the above form because you can't make a
>>     single short node a child  of every union/struct it needs to be
>>     (lack of multiple parents means you can't just frob them all to
>>     offset zero).
>>
>>     Your suggestion is to invert this as a struct
>>
>>     short  int
>>        |    /
>>     union
>>
>>     We don't allow multiple parents, however.
>>     Because of that, you suggest we follow all nodes for unions,
>>     special casing union-type nodes somehow
>>
>>     Let me suggest something different:
>>
>>     We know the current structure fails us in a number of ways.
>>     Instead of trying to shoehorn this into our current structure, I
>>     suggest: we stop messing around and just have a GCC style tree,
>>     and represent the children instead of the parents.
>>     We make contained types descendants instead of ancestors.
>>     We allow multiple children at each offset and for scalar type
>>     nodes.x`
>>
>>     We could do that without changing to children, but honestly,  i
>>     strongly believe nobody who ever looks at this really understands
>>     it right now.
>>     (If someone really does like the ancestor form, i'd love to
>>     understand why :P)
>>
>>     So if we are going to change it, we should change it to something
>>     that is easier to understand.
>>
>>     something like:
>>
>>     scalar type node -> {name, children nodes}
>>     struct node -> {name, array of {offset, child node} }
>>
>>     Paths are separate from the tbaa tree itself, and are just:
>>     path node -> {struct node, offset} or whatever.
>>
>>     unions are just scalar type nodes with multiple children, not
>>     struct nodes with special-cased offset zero.
>>
>>     This also has a well-defined upgrade path.
>>
>>     For "old-style" DAGs, like exist now, we know we have to regen
>>     the metadata, and that it is wrong (So we could, for example,
>>     simply disable it for correctness reasons)
>>     .
>>     Anything with a "new-style" DAG, we know is usable.
>>
>>     In this representation, the most generic tbaa node is just the
>>     one that contains the other.
>>     If neither contains the other, you can just make one that has
>>     both as children and use that.
>>     (instead of now, where you'd have to have multiple parents again).
>
>     You mean making a 'scalar type node', because those represent 'or'?
>
>
> Yes.
> I would probably stop calling them scalar type nodes and struct type 
> nodes too, while we are bike-shedding a bit :)
>
> The terminology seems to be based on what we think a type system that 
> generates those kinds of nodes look like.
> But uh
> 1. That's wrong a lot of the time, because each frontend may want to 
> use different types of nodes to be conservatively correct, and because 
> different things can be represented, correctly, in multiple ways (IE, 
> as a stupid example, you could represent any type as a struct node 
> where all pieces go to the same place).
>
> In fact, it's not even true for the few things that do generate TBAA 
> trees right now.
>
> 2. It gives you no idea what they do in LLVM, which is, IMHO, what 
> anyone reading those docs cares about.
>
> I'd probably call them based on something related to their actual 
> semantic in llvm's tbaa tree is.
>
> IE i'd call scalar type nodes "or-type nodes" or "any of" nodes or 
> something that gives people an idea that it means the aliasing result 
> is may-alias if any path from that node is may-alias.
> I'd call struct type nodes "offset-based nodes", becuase that's really 
> what they are.
>
> (access-path nodes are actually access paths, so that seems find)
>

I certainly agree. We should pick terminology that makes sense 
independent of C/C++.

  -Hal

>
>
>
>      -Hal
>
>>
>>     (The above also may be better for other languages)
>>
>>     --Dan
>>
>>
>>
>>
>>     On Tue, Feb 28, 2017 at 4:44 PM, Steven Perron
>>     <perrons at ca.ibm.com <mailto:perrons at ca.ibm.com>> wrote:
>>
>>         Seems like the comments have stopped.  I'll try to get a
>>         patch together.  Then we can continue the discussion from there.
>>
>>         Hubert, as for the issue with the llvm optimizations losing
>>         the TBAA information, it should be the responsibility to make
>>         sure the aliasing is changed in the correct way.  One
>>         function related to this has already been mentioned:
>>          getMostGenericTBAA.
>>
>>         Later,
>>         Steven Perron
>>
>>
>>
>>         From: Hubert Tong <hubert.reinterpretcast at gmail.com
>>         <mailto:hubert.reinterpretcast at gmail.com>>
>>         To: Steven Perron/Toronto/IBM at IBMCA
>>         Cc: Daniel Berlin <dberlin at dberlin.org
>>         <mailto:dberlin at dberlin.org>>, llvm-dev
>>         <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>,
>>         Sanjoy Das <sanjoy at playingwithpointers.com
>>         <mailto:sanjoy at playingwithpointers.com>>
>>         Date: 2017/02/15 07:44 AM
>>         Subject: Re: [llvm-dev] RFC: Representing unions in TBAA
>>         ------------------------------------------------------------------------
>>
>>
>>
>>         On Tue, Feb 14, 2017 at 11:22 PM, Steven Perron
>>         <_perrons at ca.ibm.com_ <mailto:perrons at ca.ibm.com>> wrote:
>>         3) How should we handle a reference directly through a union,
>>         and a reference that is not through the union?
>>
>>         My solution was to look for each member of the union overlaps
>>         the given offset, and see if any of those members aliased the
>>         other reference.  If no member aliases the other reference,
>>         then the answer is no alias.  Otherwise the answer is may
>>         alias.  The run time for this would be proportional to
>>          "distance to the root" * "number of overlapping members". 
>>         This could be slow if there are unions with many members or
>>         many unions of unions.
>>
>>         Another option is to say that they do not alias.  This would
>>         mean that all references to unions must be explicitly through
>>         the union.
>>         From what I gather from the thread so far, the access through
>>         the union may be lost because of LLVM transformations. I am
>>         not sure how, in the face of that, TBAA could indicate
>>         NoAlias safely (without the risk of functional-correctness
>>         issues in correct programs) between types which overlap
>>         within any union (within some portion of the program).
>>
>>         As for the standards, it is definitely not true that all
>>         references to unions must be explicitly through the union.
>>         However, if you are trying to perform union-based type
>>         punning (under C11), then it appears that it is intended that
>>         the read must be through the union.
>>
>>         This would be the least restrictive aliasing allowing the
>>         most optimization.  The implementation would be simple.  I
>>         believe we make the parent of the TBAA node for the union to
>>         be "omnipotent char".  This might be similar to treating the
>>         union TBAA node more like a scalar node instead of a
>>         struct-path.  Then the traversal of the TBAA nodes will be
>>         quick.  I'll have to work this out a bit more, but, if this
>>         is good enough to meet the requirements of the standard, I
>>         can try to think this through a little more.  I'll need
>>         Hubert and Daniel to comment on that since I am no expert on
>>         the C and C++ standards.
>>
>>         The third option is to be pessimistic and say "may alias" all
>>         of the time (conservatively correct), and rely on other alias
>>         analysis to improve it. This will have good compile time, but
>>         could hinder optimization.  Personally I do not like this
>>         option.  Most of the time it will not have a negative effect,
>>         but there will be a reasonable number of programs where this
>>         will hurt optimization more that it needs to.
>>
>>
>>
>>
>>
>>
>>     _______________________________________________
>>     LLVM Developers mailing list
>>     llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>     <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>
>     -- 
>     Hal Finkel
>     Lead, Compiler Technology and Programming Languages
>     Leadership Computing Facility
>     Argonne National Laboratory
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170309/73547950/attachment.html>