[PATCH] D103096: [analyzer] Implement cast for ranges of symbolic integers

Gabor Marton via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Apr 25 03:07:58 PDT 2022


martong added inline comments.


================
Comment at: clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp:2280-2290
+/// Return a symbol which is the best candidate to save in the constraint
+/// map. We should correct the symbol because in case of a truncation cast we
+/// can only reason about the truncated bytes but not the whole value. E.g.
+/// for (char)(int x), we can store constraints for the lowest byte but we
+/// still don't know the root value. Also in case of promotion or conversion
+/// we should store the root value instead of the cast symbol, because we can
+/// always get a correct range using the `castTo` method. And we are not
----------------
ASDenysPetrov wrote:
> martong wrote:
> > I think, we should definitely store the constraints as they appear in the analyzed code. This way we mix the infer logic into the constraint setting, which is bad.
> > I mean, we should simply store the constraint directly for the symbol as it is. And then only in `VisitSymbolCast` should we infer the proper value from the stored constraint (if we can).
> > 
> > (Of course, if we have related symbols (casts of the original symbol) then their constraints must be updated.)
> I see what you mean. I thought about this. Here's what I've faced:
> # Let's say you meet `(wchar_t)x > 0`, which you store as a pair {(wchar_t)x, [1,32767]}.
> # Then you meet `(short)x < 0`, where you have to answer whether it's `true` or `false`.
> # So what would be your next step? Brute-forcing all the constraint pairs to find any `x`-related symbol? Obviously, O(n) complexity for every single integral symbol is inefficient.
> What I propose is to "canonize" arbitrary types to a general form, where this form could be part of the key along with `x`, so we could get a constraint with ordinary map lookup complexity. So that:
> # You meet `(wchar_t)x > 0`, where you convert `wchar_t` to `int16` and store a pair {(int16)x, [1,32767]}.
> # Then you meet `(short)x < 0`, where you convert `short` to `int16` and get the constraint.
> That's why I've introduced `NominalTypeList`.
> But now I acknowledge your concern about arbitrarily sized integers and will redesign my solution.
> So what would be your next step? Brute-forcing all the constraint pairs to find any x-related symbol? Obviously, O(n) complexity for every single integral symbol is inefficient.

I don't think we need a brute-force search among the constraints if we have the additional `CastMap` ([[ https://reviews.llvm.org/D103096#3459049 | that I have previously proposed ]]).
So, the next step would be: look up the root symbol of `(short)x`, which is `(wchar_t)x` (supposing we have seen a declaration `wchar_t x;` first during the symbolic execution, but having a different root might work out as well).
Then from the `CastMap` we can query in O(1) the set of cast symbols related to the root symbol. From this set we query the constraints for each member via the existing equivalenceClass->constraint mapping. That's O(1) times the number of members of the cast set (which is assumed to be very small in the usual case).



================
Comment at: clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp:2346-2350
+/// FIXME: Update bigger casts. We can only reason about ranges of smaller
+/// types, because it would be too complicated to update, say, the entire
+/// `int` range if you only have knowledge that its lowest byte has been
+/// changed. So we don't touch bigger casts and they may be potentially
+/// invalid. For future, for:
----------------
ASDenysPetrov wrote:
> martong wrote:
> > Instead of a noop we should be more conservative in this case. We should invalidate (remove) the constraints of all the symbols that have more bits than the currently set symbol. However, we might be clever in cases of special values (e.g `0` or in case of the `true` rangeSet {[MIN, -1], [1, MAX]}).
> No, it's incorrect. Consider the following:
> ```
> int x;
> if (x > 1000000 || x < 100000)
>   return;
> // x: [100'000, 1'000'000]
> if ((int8)x != 42)
>   return;
> // x: [100'000, 1'000'000] && (int8)x: [42, 42]
> ```
> We can't just remove or invalidate `x: [100'000, 1'000'000]`, because this range still remains //true//.
> Strictly speaking, the range of `x` should be narrowed to the values 100394, 102442, 108586, ..., 960554 and any other value within the range whose lowest byte equals 42.
> We can't update the `RangeSet` with such a big amount of values due to performance issues. So we just accept that the result is less accurate.
Okay, this makes perfect sense, thanks for the example!


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103096/new/

https://reviews.llvm.org/D103096


