[PATCH] D132236: [analyzer] Fix liveness of LazyCompoundVals

Fri Aug 26 08:29:21 PDT 2022

steakhal added a comment.

In D132236#3747652 <https://reviews.llvm.org/D132236#3747652>, @NoQ wrote:

> Nice catch but I don't think this is the right solution.
>
> Symbol liveness corresponds to a concrete runtime property of the program: can the program obtain this value by executing further runtime instructions? The program cannot obtain a pointer to variable `n`, or even a pointer to a region `*n`, just by looking at the data in a struct that was copied from region `*n` previously. Therefore neither `n` nor `*n` should not be marked live just because they are mentioned in a base region of a lazy compound value. In particular, memory leaks should be reported against `n` if the last place it's mentioned is a base region of a lazy compound value, as the program cannot possibly free that memory. You can most likely write a MallocChecker test for that, which will become a false negative after your patch.
>
> For the same reason, for any region `R`, if `reg<R>` or `derived<R, $x>` is live, does not mean `R` is live. //The program cannot guess a memory address by only looking at a value loaded from that address.//
>
> If you look at how a simpler example works:
>
>   void clang_analyzer_warnIfReached();
>   
>   void foo(int **n) {
>     int *n2 = *n;
>     if (!**n) {
>       if (*n2) {
>         clang_analyzer_warnIfReached();
>       }
>     }
>   }
>
> You will notice that it does not try to keep VarRegion `n` or even the symbol `reg_$0<int ** n>` alive. It is *sufficient* to keep `reg_$1<int * Element{SymRegion{reg_$0<int ** n>},0 S64b,int *}>` alive in order to keep the test passing, and it does match the actual definition of live symbol (`reg_$0` can no longer be retrieved by the program but `reg_$1` can).

Nice explanation, thank you very much! I'll need some time to digest it to the full extent.
It makes sense so far, but I imagine it will become clear when I continue the investigation.

> The original code tries to do the same for lazy compound values. I suspect that you'll find the actual bug when you descend into `getInterestingValues()`.
>
> What looks fishy about `getInterestingValues()` is that it assumes that the amount of interesting values is finite. This sounds incredibly wrong to me. If a lazy compound value contains any pointer symbol `$p`, then all values in the following infinite series are interesting:
>
>   $p,  *$p,  **$p,  ***$p,  ...

Well, there two places where I suspected this bug. The place I changed and the other was `getInterestingValues()`. The first looked more probable, hence we sticked to that.
We'll continue the investigation with the `getInterestingValues()` next time.

> So I think the correct solution would be to maintain a list of explicitly-live compound values. Just like we already maintain a list of explicitly live symbols - `TheLiving`, and explicitly live regions - `RegionRoots`; neither of these lists enumerates all live symbols or regions as there are infinitely many of them, but each of these lists enumerates the ones that we've learned to be definitely live, and they're sufficient to determine liveness of any other symbol through recursive descent into the identity of that symbol. So with a list of live compound value regions, we can ask a question like "is this symbol stored in any of the known-live compound values?" to determine that symbol's liveness. This will keep the stored symbol alive without making the lazy compound value's base region alive.

Okay, I'll continue from here. Thanks for the

In D132236#3748017 <https://reviews.llvm.org/D132236#3748017>, @NoQ wrote:

> So to be clear, I think your solution is probably an ok stop-gap solution. False negatives aren't that bad, I'm more worried that existing leak reports may become less understandable because they'll be reported later than necessary, which may obscure the reason why we think they're finally leaking. But I really want us to agree on how this facility is supposed to work, and explore the perfect solution, before implementing an imperfect solution.

Unfortunately, we found many FPs introduced by this change; so this should be definitely blocked.

> Maybe we should introduce "weak region roots" that aren't necessarily live themselves but anything derived from them (any `SymbolRegionValue<R'>` or `SymbolDerived<R', $x>` for arbitrary sub-region `R'` of a weak-region-root `R`) is live. This could be pretty straightforward.

So, what you are suggesting here is an alternative solution to the one you already proposed in your last comment?

For transparency, we, unfortunately, reached the end of our timebox, and I'll work on something else for a couple of weeks.
I plan to come back to this in ~3 weeks if everything works well.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132236/new/

https://reviews.llvm.org/D132236