[PATCH] D76287: [analysis][analyzer] Introduce the skeleton of a reaching definitions calculator

Kristóf Umann via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Mar 18 09:14:34 PDT 2020


Szelethus marked 9 inline comments as done.
Szelethus added a comment.

In D76287#1927340 <https://reviews.llvm.org/D76287#1927340>, @xazax.hun wrote:

> I think it is very crucial to make the intentions clear, how do you define `definition` and `variable`?
>  Other than assignments we might include pre/postfix increment/decrement, non-const by-ref function arguments and so on as `definitions`.
>  Also, it would be great to have some proposal upfront how indirect writes should behave.


You're totally right. The thing that I wanted to achieve, partly, is to make the implementation flexible for things I may have forgotten, or will need to add for future standards/languages. As you can see from D64991 <https://reviews.llvm.org/D64991>, I did make a lot more progress than what is shown here, so I'll explain some of my long-term plans based on the experience I gathered:

> I think it is very crucial to make the intentions clear, how do you define [...] `variable`?

//Variable// could be practically anything that can be written, but due to the nature of what we can work this, we have to make severe restrictions.

- We don't really have a good points-to-analysis. This discards pointees of any kind pretty much immediately. I don't have long term plans to add support for them.
- Arrays are strongly related, and while there are studies on indexed variables (the last paper in my summary talks about such an approach), I think the way C++ complicates things on an uncomfortable level. I do have confidence however adding support for it in the future for the cases where the index is known compile time (possibly with the help of the RD algorithm itself) should be doable by modifying the `Variable` class.

So this leaves the most obvious: `VarDecl`s, and for record objects a `(VarDecl, FieldChain)` pair (for `s.x.a`, this would be `(s, [x,a])`). Setting the `VarDecl` part of the pair to `nullptr` could refer to the implicit this during the analysis of a C++ methods.

The plan was to represent each field with a separate `Variable` object, so for this code

  struct A {
    struct B {
      int x, y;
    };
    B b;
  };
  
  void foo() { A a };

The following list of `Variable` objects would be created:

  a
  a.b
  a.b.x
  a.b.y

The reason behind this seemingly wasteful storage is that the eventual set operations would be difficult if not impossible to implement, if a single `Variable` object were to hold the information about its fields. Mind that each of them could have a totally different definition associated with it. I hope that by emloying immutable data structures these wouldn't be terribly expensive memory-wise.

> I think it is very crucial to make the intentions clear, how do you define `definition`[...]?
>  Other than assignments we might include pre/postfix increment/decrement, non-const by-ref function arguments and so on as `definitions`.

//Definition// is a statement, more specifically a `CFGStmt` (an element of a `CFGBlock`), that either writes, or could potentially write a variable. The proposed definition finding stage should be very flexible for future additions.

> Also, it would be great to have some proposal upfront how indirect writes should behave.

I wrote a mail about this during my GSoC project: http://lists.llvm.org/pipermail/cfe-dev/2019-July/062975.html.

> Basically, this algorithm is not very useful on its own, but it can be used as a building block for other kind of analyses. So it would make sense to try to look for potential users and use cases and see if they have the same requirements or do you need to introduce configurations?

The Static Analyzer would be the number one user of the RD algorithm, as described in the summary, so I guess that you referred to the users of GEN/KILL sets? What do you mean under configurations?

> Trying to rewrite some existing algorithms in terms of these sets and see if the tests pass might be a good experiment if it is not too much work. (In case we can validate the performance we could even replace the original implementations if this makes things easier.)

I'm not too sure how much work it would take (I suspect an unreasonable quantity, but I might be wrong), so I'll take a look. This is a great idea to gain a lot more knowledge about this topic, even if it eventually fails.

> I think having a reaching definition set per basic block makes perfect sense, as it should be relatively easy to calculate the reaching definition of an instruction given the set and the basic block. So maybe it is more efficient to calculate the reaching definition sets on demand for instruction rather than computing it for every single instruction.
> 
> Regarding ASTMatchers, I think they might be great for now for prototyping but I think once you want to make this more robust covering every corner case in the matcher expression will be at least as hard as writing a visitor if not harder. But we will see :)

Thank you for the detailed response! I won't update this patch for a while to leave time for others to respond, but I will try to work on the other algorithms a bit.



================
Comment at: clang/include/clang/Analysis/Analyses/ReachingDefinitions.h:40
+
+struct Variable {
+  const VarDecl *Var;
----------------
xazax.hun wrote:
> I wonder if `Variable` will be the right notion long term. Do we want to be able to represent heap locations or do we exclude them on purpose? Reasoning about the heap is quite challenging so both answers might be reasonable. But in case we try to tackle the more general problem, we basically need a more general term like `MemoryLocation`.
I don't have long term plans to reason about pointees in general. Heap in particular is probably off limits for the foreseeable future.


================
Comment at: clang/include/clang/Analysis/Analyses/ReachingDefinitions.h:60
+
+class Definition : public Variable {
+public:
----------------
xazax.hun wrote:
> Is the inheritance justified here? Is a definition a variable? Maybe having a variable member better express the relationship.
Yup, you're right. This was a quick hack while developing.


================
Comment at: clang/include/clang/Analysis/Analyses/ReachingDefinitions.h:176-179
+  //===--------------------------------------------------------------------===//
+  // Utility methods for building a GEN set. These are public because the
+  // callback objects will need to call them.
+  //===--------------------------------------------------------------------===//
----------------
whisperity wrote:
> Is this the right approach? A public method makes the user itch to call it. If `GenSetMatcherCallback` is a superclass of every possible implementation, I think adding that class as a friend here works, and you could make the methods private, or protected.
Moving this to `GenSetMatcherCallback` would indeed be a great idea :)


================
Comment at: clang/include/clang/Analysis/Analyses/ReachingDefinitions.h:236-242
+  // TODO: Should we change to ImmutableMap? Does this matter that much?
+  std::map<const CFGBlock *, GenSet> Gen;
+  std::map<const CFGBlock *, DefinitionSet> Kill;
+
+public:
+  std::map<const CFGBlock *, DefinitionSet> In;
+  std::map<const CFGBlock *, DefinitionSet> Out;
----------------
whisperity wrote:
> The immutability might not(?), but the performance aspects of the RBT behind STL `map` could? `K` is a pointer, why not use `DenseMap`? How large do you expect these containers to be when they peak out?
Measurements on real-life codebases can never come soon enough, but I fear it'll be a while before I get them.


================
Comment at: clang/lib/Analysis/ReachingDefinitions.cpp:41
+  for (const Definition &Perpetrator : Set)
+    if (Victim.Var == Perpetrator.Var)
+      return true;
----------------
xazax.hun wrote:
> In the future this will be more complicated.
> 
> For example if I assign to a struct all of its members needs to be killed. As a result, you will not only need to represent memory regions but also the relationships between them. I wonder if the analyzer's memregion classes are any good or are they too specific to the analyzer.
As described in my comment, record variables would have numerous `Variable` classes, so this function wouldn't get much more complicated in the future (as seen in D64991).


================
Comment at: clang/lib/Analysis/ReachingDefinitions.cpp:137
+  static internal::Matcher<Stmt> getMatcher() {
+    return declRefExpr(to(varDecl().bind("var")));
+  }
----------------
xazax.hun wrote:
> Note that, `memberExpr` is not supported at this point. It is not derived from `declRefExpr`. 
Indeed, that is one beast of a matcher :) You can take a sneak peak in D64991.


================
Comment at: clang/lib/Analysis/ReachingDefinitions.cpp:162
+
+  // TODO: Destructor calls? Should we be THAT conservative?
+  // TODO: Regular function calls?
----------------
whisperity wrote:
> You mean //explicit// destructor calls here, right?
Nope, implicit. I wouldn't worry much about explicit calls, I would handle them the same as I would any `CallExpr`. Implicit destructor calls could modify the global state, and are not visibly present in the code.


================
Comment at: clang/lib/Analysis/ReachingDefinitions.cpp:275
+void ReachingDefinitionsCalculator::dumpGenSet() const {
+  llvm::errs() << "GEN sets: blockid (varname [blockid, elementid])\n";
+  for (const std::pair<const CFGBlock *, GenSet> D : Gen) {
----------------
whisperity wrote:
> Instead of simply blockID, could you harmonise this output with the CFG dump and say `Bx` instead of just `X`?
What do you mean?


================
Comment at: clang/test/Analysis/dump-definitions.cpp:17-18
+
+// CHECK:      GEN sets: blockid (varname [blockid, elementid])
+// CHECK-NEXT: 1 (ptr, [1, 3]) <write>
+// CHECK-NEXT: KILL sets: blockid (varname [blockid, elementid])
----------------
whisperity wrote:
> What is an //element//? How do they get their numbers? What does the `3` mean here? I get that basic block 1 (the body of the function) writes `ptr`... but I don't understand this further from looking at the expected output.
A `CFGBlock` is a block that contains all statements that are always executed sequentially. The statements are enumerated according to the execution order. 3 here means that the definition is the 3rd element in the 1st `CFGBlock`.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D76287/new/

https://reviews.llvm.org/D76287





More information about the cfe-commits mailing list