[PATCH] D78933: [analyzer] RangeConstraintManager optimizations in comparison expressions

Wed May 6 08:02:58 PDT 2020

xazax.hun added a comment.

In D78933#2022684 <https://reviews.llvm.org/D78933#2022684>, @ASDenysPetrov wrote:

> I have never run them before. How I can do this? What project to use as a sample?

Unfortunately, we do not really have a well-defined set of benchmarks. But as a first step, you could run the analyzer on LLVM before and after applying your patch and measure the run-time. Or you could start with a smaller project (you can pick any well-known project from GitHub or other similar sites) as checking LLVM can be really time-consuming (especially in debug mode). To run the analyzer on a project I recommend using scan-build or CodeChecker. If you feel like, you can also experiment with the CSA-testbench (https://github.com/Xazax-hun/csa-testbench), but it requires quite a lot of effort to set up upfront.

>> Also, it would be useful to see how statistics like coverage patterns, number of exploded nodes etc changes.
> 
> You can check F11839257: before.html <https://reviews.llvm.org/F11839257> and F11839255: after.html <https://reviews.llvm.org/F11839255> exploded graphs, to see the number of nodes.

This is only for a small artificial example. We would like to learn more about how the node count changes for a large real-world project like LLVM. We do not need to see the actual exploded graph for those big projects as they are really hard to interpret, we just want to know the numbers.

>> See the output of `-analyzer-stats` option for details.
> 
> I didn't find this option in `clang --help`. I've never use it. Please, explain how to use it.

I forgot to mention, not all of the analyzer flags have a corresponding flag in the driver. You can check all the analyzer related flags in the frontend (as opposed to the driver) using the following command: `clang -cc1 -help | grep "analyzer"`
You can either add `-analyzer-stats` to your cc1 invocation or you can use the `-Xclang` to pass the flag through the driver this way: `-Xclang -analyzer-stats`. This will modify the output of the analyzer to include additional statistics about the analysis.

================
Comment at: clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp:588
+  // AnyX2 means that two expressions marked as `Any` are met in code,
+  // and there is a special column for that, for example:
+  // if (x >= y)
----------------
ASDenysPetrov wrote:
> xazax.hun wrote:
> > I have really hard time processing how to interpret `AnyX2`.
> > 
> > For example in the code below:
> > ```
> >    if (x >= y)
> >      if (x != y)
> >        if (x <= y)
> >          return false
> > ```
> > 
> > ```
> >    if (x >= y)
> >      if (x == y)
> >        if (x <= y)
> >          return true
> > ```
> > 
> > We would get different results for `<=`. So I do not really get how I should read the `AnyX2` column.
> I'm sorry for the obscure explanation. I was really thinking about how to explain it more clear.
> Let's consider this code:
> ```
> if (x >= y)
>   if (x != y)
>     if (x <= y)
> ```
> If 'x >= y' is //true//, then following `x <= y` is  `Any` ( can be //true// or //false// ). //See table in the summary.//
> If 'x != y' is //true//, then following `x <= y` is  `Any` ( can be //true// or //false// ).
> If 'x >= y && x != y' is //true//, then following `x <= y` is `False` only (it means that we previously met two Any states, what I've called as **AnyX2**).
> 
> 
> 
> 
Could you tell me how the reasoning would be different for 
```
if (x >= y)
  if (x == y)
    if (x <= y)
```
?

================
Comment at: clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp:622
+  auto IndexOP = IndexFromOp(OP);
+  auto LHS = SSE->getLHS();
+  auto RHS = SSE->getRHS();
----------------
ASDenysPetrov wrote:
> xazax.hun wrote:
> > Could `LHS` and `RHS` be other expressions? Does it make sense to continue executing this function if one of them is not a simple symbol?
> I think it makes sense, because `(x+3*y) > z` and `z < (x+3*y)` perfectly fits in the logic.
Makes sense,  thanks. Maybe it would be worth to rephrase the comment to note that `x` and `y` can also stand for subexpressions, not only for actual symbols.

================
Comment at: clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp:629-630
+  auto &One = BV.getValue(1, T);
+  const RangeSet TrueRangeSet(F, One, One);
+  const RangeSet FalseRangeSet(F, Zero, Zero);
+  int AnyStates = 0;
----------------
ASDenysPetrov wrote:
> xazax.hun wrote:
> > ASDenysPetrov wrote:
> > > Folk, is this a good idea to explicitly create bool ranges as a return value?
> > > As for me, comparisons like `>`, `<`, etc. can only produce bool-based ranges, otherwise it would be weird.
> > I think modeling booleans with ranges is OK. This is what the analyzer is doing at the moment. But I guess the question is about whether boolean results of comparisons is a good way to store the relationships between the symbols. I am not convinced about that see my inline comment below.
> Once I met a case when logic operator `<` retruns me int-based ranges {0,0} on the //false// branch and {1, 429496725} (instead of bool-based {1,1}) on the //true// branch. It confused me, so the question arised.
I see. I do not see any problem with that immediately, but @NoQ is much more well-versed to answer this question.

================
Comment at: clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp:635
+
+    // Let's find an expression e.g. (x < y).
+    OP = OpFromIndex(i);
----------------
ASDenysPetrov wrote:
> xazax.hun wrote:
> > I am not really familiar with how constraints are represented but I vaguely recall the analyzer sometimes normalizing some expressions like conversion `A == B` to `A - B == 0`. I am just wondering if this API to look this expression up is not the right abstraction as it might be better to handle such normalizations in a unified, central way.
> > 
> > Also, note that this method does not handle transitivity. I wonder if maintaining set of ranges is the right representation for this information at all. The ordering between the symbols will define a lattice. Representing that lattice directly instead of using ranges might be more efficient.
> >conversion A == B to A - B == 0. 
> All my samples `x == y` worked as `x == y`. I havn't seen such transformations.
> >Representing that lattice directly instead of using ranges might be more efficient.
> How do you imagine that lattice? Could you detail it?
A lattice is basically a DAG, where the nodes are the values (symbols, or subexpressions), and the edges are ordering. So it defines an ordering. E.g., A is smaller than B, if and only if there is a path from A to B. I am not sure whether this is the right representation, but it would take care of transitivity. 

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78933/new/

https://reviews.llvm.org/D78933