[PATCH] D76287: [analysis][analyzer] Introduce the skeleton of a reaching definitions calculator

Gábor Horváth via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Mar 18 10:52:41 PDT 2020


xazax.hun added a comment.

In D76287#1929221 <https://reviews.llvm.org/D76287#1929221>, @Szelethus wrote:

> //Variable// could be practically anything that can be written, but due to the nature of what we can work with, we have to make severe restrictions.
>
> - We don't really have a good points-to-analysis. This discards pointees of any kind pretty much immediately. I don't have long term plans to add support for them.


I am a bit concerned about this. So if we have something like `*x = 2`, this definition will not appear in any of the reaching definition sets? An alternative would be to include it in all the sets that are permitted by the strict-aliasing rules. The reason I am a bit concerned is that I think we should be clear about what the user can expect from this algorithm. Will we calculate a superset of the reaching definitions (over-approximate)? Will we calculate a subset (under-approximate)? Different users might have different requirements. Usually, dataflow analyses tend to over-approximate, but if we omit some elements from the reaching definitions, we will both over- and under-approximate at the same time. While this might make sense in certain use cases, it might be quite surprising for potential users of the algorithm. This is why I think it is important to set a clear high-level goal. One such goal could be that we always want to over-approximate, modulo bugs.
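
To make the concern concrete, here is a small sketch (the names are made up) of what happens if the indirect write is ignored:

  void use(int);

  void f() {
    int a = 1;   // definition D1 of 'a'
    int *x = &a;
    *x = 2;      // indirect write: in reality a definition D2 of 'a'
    use(a);      // if D2 is ignored, only D1 is reported as reaching here:
                 // D2 is missed (under-approximation) while the killed D1 is
                 // still reported (over-approximation)
  }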

> - Arrays are strongly related, and while there are studies on indexed variables (the last paper in my summary talks about such an approach), I think C++ complicates things to an uncomfortable degree. I am confident, however, that adding support in the future for the cases where the index is known at compile time (possibly with the help of the RD algorithm itself) should be doable by modifying the `Variable` class.

I think initially handling arrays as if they contained only one element should be fine.
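
For example (this is just my reading of the proposal), the whole array would be a single `Variable`:

  void g(int i) {
    int arr[4] = {};  // definition D1 of 'arr'
    arr[i] = 42;      // treated as a definition D2 of the whole 'arr';
                      // whether D2 kills D1 (strong vs. weak update) is a
                      // separate question
  }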

> So this leaves the most obvious: `VarDecl`s, and for record objects a `(VarDecl, FieldChain)` pair (for `s.x.a`, this would be `(s, [x, a])`). Setting the `VarDecl` part of the pair to `nullptr` could refer to the implicit `this` during the analysis of a C++ method.

Note that this is very similar to what the lifetime analysis is using. See the `LifetimeContractVariable` in https://reviews.llvm.org/D72810. But there we have a more refined memory location that can also include temporaries that we use during the analysis.
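
Just to illustrate the shape I have in mind (purely a sketch, not code from this patch or from the lifetime analysis):

  #include "clang/AST/Decl.h"
  #include "llvm/ADT/SmallVector.h"

  // A 'Variable' is a base declaration plus the chain of fields accessed on
  // it, so s.x.a becomes (s, [x, a]); a null Base could stand for the
  // implicit 'this' inside a C++ method.
  struct Variable {
    const clang::VarDecl *Base = nullptr;
    llvm::SmallVector<const clang::FieldDecl *, 4> Fields;
  };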

> The plan was to represent each field with a separate `Variable` object, so for this code
> 
>   struct A {
>     struct B {
>       int x, y;
>     };
>     B b;
>   };
>   
>   void foo() { A a; }
> 
> 
> The following list of `Variable` objects would be created:
> 
>   a
>   a.b
>   a.b.x
>   a.b.y
> 
> 
> The reason behind this seemingly wasteful storage is that the eventual set operations would be difficult, if not impossible, to implement if a single `Variable` object were to hold the information about its fields. Mind that each of them could have a totally different definition associated with it. I hope that by employing immutable data structures these wouldn't be terribly expensive memory-wise.

Not sure what you mean here. Basically, I think the question is how hierarchical you want your representation to be, i.e., do you want to have pointers between sub- and super-objects, or do you prefer doing lookups when you want to invalidate subobjects?
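
To spell out the two shapes I am contrasting (purely illustrative):

  #include "clang/AST/Decl.h"
  #include <vector>

  // (a) Hierarchical: each entry points at its subobjects, so invalidating
  // 'a' means following the SubObjects edges down to a.b, a.b.x, a.b.y.
  struct HierarchicalEntry {
    const clang::ValueDecl *Decl = nullptr;
    std::vector<HierarchicalEntry *> SubObjects;
  };

  // (b) Flat: every (base, field-chain) pair is an independent key in the
  // sets, so invalidating the subobjects of 'a' means looking up every key
  // whose chain starts at 'a'.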

>> Also, it would be great to have some proposal upfront about how indirect writes should behave.
> 
> I wrote a mail about this during my GSoC project: http://lists.llvm.org/pipermail/cfe-dev/2019-July/062975.html.

I think saying that `*x = 2` will just be ignored is not sufficient. See my argument above about over- and under-approximation. I think it would be great to know what the consequences of this decision are. E.g., you could check how existing analyses like the uninitialized-variables check behave. For example, one possible mitigation of the problem with pointers would be to consider `&var` a definition of `var`.
My point is, we have several ways to deal with pointers without a precise points-to analysis, and ignoring them is just one of the options. Having more options considered, with pros and cons enumerated, would greatly increase our confidence in your chosen solution.
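
To sketch the kind of conservative treatment I have in mind (illustrative only, not a concrete proposal for this patch):

  void escape(int *);
  void use(int);

  void h(int *p) {
    int a = 0;    // D1
    int b = 0;    // D2; 'b' is never address-taken
    escape(&a);   // treat taking the address as a definition of 'a'
    *p = 1;       // conservatively a may-definition of every address-taken
                  // variable, i.e. of 'a', but not of 'b'
    use(a);       // the reaching definitions of 'a' are over-approximated
    use(b);       // only D2 reaches here; no precision is lost for 'b'
  }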

> The Static Analyzer would be the number one user of the RD algorithm, as described in the summary, so I guess that you referred to the users of the GEN/KILL sets? What do you mean by configurations?

I mean both. As you mentioned, the GEN/KILL sets could be reused for other analyses. But reaching definitions can also be a building block for higher-level checks, like finding raw pointers that are owners. Consider the following example:

  void foo(int *p) {
    ...
    delete p;
  }

How do you know that the pointer `p` is an owner of the pointee? You know it because it is deleted. So if you want to classify some raw pointers as owners, one possible way to do it is to check the reaching definitions at delete statements. Clang-tidy could then try to convert some of those pointers to `unique_ptr`s. This is how this algorithm can be more generally useful as a building block for problems unrelated to the static analyzer.
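
For example, a hypothetical before/after of such a clang-tidy rewrite (the function names are made up):

  #include <memory>

  // Before: the only reaching definition of 'p' at the delete is the 'new',
  // so 'p' can be classified as an owner.
  void before() {
    int *p = new int(42);
    // ...
    delete p;
  }

  // After: once classified as an owner, the raw pointer can be turned into a
  // unique_ptr and the delete dropped.
  void after() {
    auto p = std::make_unique<int>(42);
    // ...
  }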


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D76287/new/

https://reviews.llvm.org/D76287




