[clang] [analyzer][solver] On SymSym RelOps, check EQClass members for contradictions (PR #71284)

Mon Nov 6 02:56:20 PST 2023

steakhal wrote:

> I think every time we need to iterate over all member of an equivalence class, we might do something wrong. The point of the equivalence class would be to make sure those elements are equivalent. One way to avoid iteration would be to always use the representative of the equivalence class. E.g., each time we record a new constraint to a member of the class, we add this information to the representative element. Every time we do a query, we first look up the representative element which already should have all the info from the class and use it instead of the original symbol.
> 
> Would something like this work or am I missing something?

I had to think about it for a little while, and here are my thoughts:
In an example like here:
```c++
void gh_62215(int x, int y, int z) {
  if (x != y) return; // x == y
  if (z <= x) return; // z > x
  if (z >= y) return; // z < y
  clang_analyzer_warnIfReached(); // no-warning: This should be dead code.
  (void)(x + y + z); // keep the constraints alive.
}
```
Here, after the first if, we would have an eqclass of `{x,y}`; with a single constraint `x != y: [0,0]`.
After the second if, we would have notionally 3 eqclasses: `{x,y}`, and two trivial ones `x` and `y`.
We would also have 3 constraints: `x != y: [0,0]`, `z <= x: [0,0]`, `z <= y: [0,0]`.

Note that the keys in the constraints map (symbols) can be 'arbitrarily' complex and refer to already dead symbols.
In this example, I keep these symbols artificially alive to make the case slightly simpler to discuss.

So, if I understand you correctly, at the 3rd if statement, we should canonicalize the symbol we are constraining by walking every sub-symbol and substituting it with its equivalent counterpart (if any), by basically with its eqclass' representative symbol.
In this example, it would mean that instead of adding a constraint `z >= y: [0,0]`, we would instead `z >= x: [0,0]` (assuming that `x` is the representative symbol of eqclass `{x,y}`.
I believe this canonicalization could work (and would solve the other demonstrative limitation I added in this PR), but would also incur overhead.
And from a design point of view, an eqclass would need to keep the representative symbol alive, because otherwise, we can't substitute a symbol with the representative symbol.
(Note that for this reason, we don't actually have a representative symbol, but rather a unique integral value as an ID for the eqclass - which happens to be equal to the `SymbolRef` pointer value of the representative symbol if that's still alive.

Here is an example where the representative symbol `x` of the eqclass `{x,y}` would die, but remain the ID of the eqclass:
```c++
  // same example, but with:
  (void)(y + z); // X is not mentioned!, thus reclaimed after the 2nd if statement.
```

https://github.com/llvm/llvm-project/pull/71284