[PATCH] D78933: [analyzer] RangeConstraintManager optimizations in comparison expressions

Wed May 6 12:26:05 PDT 2020

xazax.hun added inline comments.

================
Comment at: clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp:588
+  // AnyX2 means that two expressions marked as `Any` are met in code,
+  // and there is a special column for that, for example:
+  // if (x >= y)
----------------
ASDenysPetrov wrote:
> xazax.hun wrote:
> > xazax.hun wrote:
> > > ASDenysPetrov wrote:
> > > > xazax.hun wrote:
> > > > > I have really hard time processing how to interpret `AnyX2`.
> > > > > 
> > > > > For example in the code below:
> > > > > ```
> > > > >    if (x >= y)
> > > > >      if (x != y)
> > > > >        if (x <= y)
> > > > >          return false
> > > > > ```
> > > > > 
> > > > > ```
> > > > >    if (x >= y)
> > > > >      if (x == y)
> > > > >        if (x <= y)
> > > > >          return true
> > > > > ```
> > > > > 
> > > > > We would get different results for `<=`. So I do not really get how I should read the `AnyX2` column.
> > > > I'm sorry for the obscure explanation. I was really thinking about how to explain it more clear.
> > > > Let's consider this code:
> > > > ```
> > > > if (x >= y)
> > > >   if (x != y)
> > > >     if (x <= y)
> > > > ```
> > > > If 'x >= y' is //true//, then following `x <= y` is  `Any` ( can be //true// or //false// ). //See table in the summary.//
> > > > If 'x != y' is //true//, then following `x <= y` is  `Any` ( can be //true// or //false// ).
> > > > If 'x >= y && x != y' is //true//, then following `x <= y` is `False` only (it means that we previously met two Any states, what I've called as **AnyX2**).
> > > > 
> > > > 
> > > > 
> > > > 
> > > Could you tell me how the reasoning would be different for 
> > > ```
> > > if (x >= y)
> > >   if (x == y)
> > >     if (x <= y)
> > > ```
> > > ?
> > Oh never mind, after `x == y` the `x <= y` would be `true`. But still, I find this logic hard to explain, understand, and prove correct.
> > I'd like to see at least some informal reasoning why will it always work. Also, if someone would like to extend this table in the future (e.g. with operator <=>), how to make it in such a way that does not break this logic.
> When I was creating a table it was just 6x6 without AnyX2 column. It should tell of what RangeSet we shall return (true, false or empty). It covers cases with only **one** preceding condition.
> Then I found that **two** certain sequential conditions can also tell us what happen to the next condition. And those two conditions happened to be described as `Any` in the table. Thus I deduced one more column for this purpose.
> In other words:
> ```
> if (x >= y)     // 3. Aha, we found `x >= y`, which is also `Any` for `x <= y` in the table
>   if (x != y)   // 2. Aha, we found `x == y`, which is `Any` for `x <= y` in the table
>     if (x <= y) // 1. Reasoning a RangeSet about this `x <= y`
>     // 4. OK, we found two `Any` states. Let's look at the value in `AnyX2` column for `x <= y`.
>     // 5. The value is `False`, so return `FalseRangeSet`, assuming CSA builds false branch only.
> ```
> You can substitute any other row from the table and corresponding Any-comparisons.
> Honestly, I don’t know how else to explain this. :)
> >I find this logic hard to explain, understand, and prove correct.
> If you come up with it, It becomes really easy and clear. You'll see. But I agree that AnyX2 can be changed to something more obvious.
Ok, in the meantime I revisited the algorithm and I think now I understand why does it work.
I think the main observation I missed was the following:  **Every row has exactly two `Any` entries, so we only need to think about one combination per row.** I think originally I interpreted your matrix wrong (the transposed one) and it did not really add up.
So I think you should document the fact that each row has exactly two unknown cases. But for each of the rows, if we have both facts, we can actually evaluate the current operator.

The goal is that if someone wants to extend this code later on (e.g. if the standard committee introduces a new operator that we handle), the next person should be able to modify this algorithm without breaking the existing code. So as you can see, adding a new column/row could break this invariant of having exactly 2 unknowns in each row rendering this code incorrect. This is why it is important, to explain what makes the algorithm work.

I might miss something, but I do not really see test cases that are covering these.

I think some renaming might also help others understanding what is going on:
1) Capitalize variable names, for example, `tristate`.
2) Chose a bit more descriptive names. `tristate` from the previous example describes the type of the variable. But variable names should describe their contents rather than their type. It can be `Result` in case it describes the result of the operation. I think in 3-valued logic we tend to use `Unknown` rather than `Any`.  You have code like `auto tristate = CmpOpTable[IndexOP][Index];`. This is not really readable as it is really hard to tell what index corresponds to what. One of the indexes correspond to the evaluated operator while the other corresponds to the query. The variable names should reflect that. 

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78933/new/

https://reviews.llvm.org/D78933