[llvm-dev] RFC: Resolving TBAA issues

Sat Aug 19 12:00:45 PDT 2017

On 08/17/2017 11:25 PM, Daniel Berlin wrote:
>
> <just want to focus on these parts for a second. *All* of these 
> representations are really access path representations, just encoded 
> slightly different ways, and, as a result, with various parts of the 
> rules in slightly different places>
>
>     Imagine that we took the enhancement we previously discussed, but
>     instead of implementing it directly, we just directly encoded for
>     every access the path from the access type to the root. I think it
>     would look very much like this proposal.
>
>
> Something like it, yes.
> Note that this representation also has special vtable and union 
> groups, for example, and assumes very c-like rules in several places 
> around the sequencing of access types.
> Also note that  in the language of access paths, C allows overlapping 
> that does not exist in, for example, Java

Is this a statement generally favoring or disfavoring having the 
frontend encode the path explicitly?

> (This is true in both the points-to and type-based domains).
> Ada has discriminated unions (and you could not use the "union" in 
> this proposal to represent them)

Yes, but discriminated unions also have a dynamic element to them. As a 
result, I suspect we'd need a scheme that captures that (or else we'd 
need to model them like a struct with a field and something like a C union).

> etc.
>
>
>>
>>
>>         We generally explain our current TBAA rules by saying that
>>         they're generic but motivated by C/C++ rules.
>>
>>
>>     We do say that but that's not really what our implementation does
>>     in any way.
>
>     Really? I thought it was motivated by C/C++ rules. When you say
>     that it's "not really what our implementation does", is this
>     because it drops a lot of potential information in the translation
>     to our generic form?
>
>
> What we do is completely and totally unrelated to types as they exist 
> in the original language.  It is a completely and totally generic 
> implementation with no rules related to the original language.  The 
> original language, for example, has dynamic typing rules, object 
> lifetime rules, etc. We have none of these.
>
> The types do not, in fact,even always have the same relationship we 
> give them.
>
> We have a tree and some nodes, and we happen to name them after 
> types.  Like this proposal, they also represent computed access paths, 
> in a much simpler language.
>
> The nodes represent alias set members (either one or many)
> The edges represent either "access at offset" (where the default is 0 
> if unweighted).
> We have a simple rule that that processes the access paths related to 
> ancestry.

I certainly understand all of this ;)

>
> You can view it as either a grammar parsing or a graph reachability 
> problem depending on what works for you (in fact, it's really a 
> Dyck-CFL reachability problem on bidirected trees)
>
> The reachability rule is  usually given as: If node Target is 
> reachable from node Source, offset another through either it's 
> children or the upwards edges (or was it downwards, i always screw up 
> which direction we go in), they may-alias
>
> You could also implement it as a real grammar parsing problem (IE you 
> quite literally could generate strings from tag + offset, and parse 
> them against the grammar, and see if the terminal nodes contain your 
> target).  They are equivalent.
>
> This representation would be the same in that regard.
>
> Our current lowering for clang happens to kind of look like c/c++ 
> structures converted to a tree.
>
> However, this is actually just inefficient, space wise, and done 
> because it's simple to lower it this way.

I agree (although this wasn't happenstance; the representation was 
designed with an expected lowering scheme in mind, at least for C/C++).

> Because the accesses are completely unrelated to the original types, 
> and require *no* language rules to interpret, you could also just 
> partition the things that alias by the language rules in the frontend, 
> and then output a tree that represents the possible unique paths.
>
> IE figure out all the answers, then re-encode it as precisely as possible.
>
> This trades time for space.
>
> This representation is *super* generic.  That is, the language being 
> used here is super simple, and the reachability rule is super simple.
>
> I could have the frontend generate these trees based on anything i 
> like.  I could, in fact, encode steensgaard points-to results as this 
> tree without any trouble.
>
> The access path language described in this proposal is more complex 
> and complete, and directly closer to access paths you find in C/C++.  
> It has bitfields, vtables, and unions.

It is more complex, but I think this is partly in the description. 
AFAIKT, only "union" has special properties under the scheme. Bit 
fields, vtables, etc. are all just particular types with no particular 
special rules. I don't like special rules at all for any named entities, 
but as there's only one such entity right now ("union"), this is an 
encoding choice we could bikeshed.

> The reachability rules are more complex.
> Is it possible to express other languages in that set of access path 
> rules?
> Sure.  For example, you can, as above, generate the set of answers, 
> and then re-encode it into access paths.
>
> Right now, the work it takes *in the front end* is minimal, and has a 
> fairly efficient space encoding.
> if i want to say two things alias, i just gotta be able to reach one 
> from the other.
> If i want to say two things do not alias, i just gotta be able to not 
> reach one from the other only using certain types of edges
> In our current language, all that takes in a frontend is "find longest 
> no-aliasing part of tree. Go to parent, add new child".
>
> In the proposed language, the lowering is more complex.  Is it doable?
> Sure, of course, not gonna claim otherwise.  But the more "features" 
> of the access path you add and expect the middle end to handle, 
> instead of the front-end expanding them, and the more those feature's 
> reachability rules are related to a specific language the more 
> language-specific it gets.
>
> That's the *only* tradeoff we are making here, representationally.   
> How much does the frontend have to understand about how the middle end 
> walks these things, in order to generate whatever it wants as 
> precisely as possible
>
> We can make whichever we want express everything we want in N^3 time :)
> The real question is "do we try to add on to what we have in ways that 
> work for multiple languages, and are expressed neutrally in a simple 
> reachability language"
> or "do we add language-specific kinds of nodes to the grammar, and 
> have reachability rules that are fairly language specific".

I don't want language-specific kinds of nodes in the grammar, and I 
believe that it's not necessary under some reasonable variant of this 
scheme. Maybe I'm wrong.

>
> IE do you add, say, discriminated_union nodes to our current 
> representation for ada, or "vtable_access" nodes to our current 
> representation for C++ vtable accesses
> Or do you instead generate a metadata that has a unidirectional edge 
> reachability (IE up only), or whatever it takes to do vtables generically.
>
> Both are completely and totally viable paths, and it's all about which 
> way you want to go.
> But they *definitely* have a difference in terms of language-specificness.
>

To be clear, if the system is not generic, I'm far less interested.

Thanks again,
Hal

>
>
>
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170819/88b91f82/attachment-0001.html>