[PATCH] D25660: [Analyzer] Checker for iterators dereferenced beyond their range.

Wed Nov 9 10:46:42 PST 2016

NoQ added a comment.

Sorry for inactivity, been thinking quite a bit about this checker. The checker is very cool because it is an excellent showcase of our API problems in the realm of C++ checkers. Once the checker is committed, we could try various things to make it easier to develop other checkers like this in the future. Also the check is very useful, and improving C++ support in the analyzer is very desired, so again thank you for your work.

Right now the course of action, i think, is to

- Agree on the `evalAssume()` implementation (i'm still not quite understanding what the problem is here, see the new inline comments);
- Add some more comments into the code (especially comment up all the object-copy handling, when iterator state moves from one symbol/region to another symbol/region upon various events).

Then, i think, we should land the commit, assuming that you have a desire to address more issues in subsequent commits to eventually enable it by default.

For enabling by default, the following should most likely be addressed:

- We should probably not warn by default on unchecked std::find (see comments following the `push_back(2016)` example), unless some strong arguments against such code patterns are provided;
- A BugReporterVisitor should be added to report iterator state changes to the user across the diagnostic path;
- Our code owners often have strong opinions regarding warning message wording.

Then there are a few ideas on finding more bugs, which you shouldn't necessarily implement, that came up during the review, eg.:

- Alexey suspects that iterators implemented as plain pointers (commonly used in LLVM itself) might be unsupported;
- Alexey points out that ++/-- may be handled in a less conservative manner;
- More checks could be implemented in this checker, eg. passing `end()` as first argument of `std::find()` is an instant bug (somebody accidentally swapped begin and end?).

A list of ideas on improving core experience, mostly for myself because i seem to be the only one with strong opinions on this:

- Provide a checker callback for structure copies, which would unify the multitude of similar callbacks in this checker;
- Consider removing the conjured structure symbol hack.

Did i forget anything?

================
Comment at: lib/StaticAnalyzer/Checkers/IteratorPastEndChecker.cpp:423
+
+void IteratorPastEndChecker::handleComparison(CheckerContext &C,
+                                              const SVal &LVal,
----------------
baloghadamsoftware wrote:
> NoQ wrote:
> > a.sidorin wrote:
> > > What will happen if we write something like this:
> > > ```
> > > bool Eq1 = it1 == it2;
> > > bool Eq2 = it3 == it4;
> > > if (Eq1) {...}?
> > > ```
> > > 
> > > As I understand, we'll assume the second condition instead of first.
> > Had a look. So the problem is, we obtain the result of the comparison as a symbol, from which it is too hard to recover the operands in order to move iterator position data from one value to another.
> > 
> > Normally we obtain a simple SymbolConjured for the return value of the `operator==()` call (or, similarly, `operator!=()`). For plain-value iterators (eg. `typedef T *iterator`) we might be obtaining an actual binary symbolic expression, but even then it's not quite clear how to obtain operands (the structure of the expression might have changed due to algebraic simplifications). Additionally, LHS and RHS aren't necessarily symbols (they might be semi-concrete), so composing symbolic expressions from them in general is problematic with our symbol hierarchy, which is rarely a problem for numbers but for structural symbols it'd be a mess.
> > 
> > For now i suggest, instead of storing only the last LHS and RHS, to save a map from symbols (which are results of comparisons) to (LHS value, RHS value) pairs. This map should, apart from the obvious, be cleaned up whenever one of the iterators in the pair gets mutated (incremented or decremented). This should take care of the problem Alexey points out, and should work with semi-concrete stuff.
> > 
> > For the future i suggest to let users construct their own symbols and symbolic expressions more easily. In fact, if only we had all iterators as regions, we should have probably used SymbolMetadata for this purpose: it's easy to both recover the parent region from it and use it in symbolic expressions. We could also deprecate the confusing structural symbols (provide default-bound lazy compound values for conjured structures instead), and then it'd be possible to transition to SymbolMetadata entirely.
> Thank you for the suggestion. I am not sure if I fully understand you. If I create a map where the key is the resulting symbol of the comparison, it will not work because evalAssume is called for the innermost comparison. So if the body of operator== (or operator!=) is inlined, then I get a binary symbolic expression in evalAssume, not the SymbolConjured. This binary Symbolic expression is a comparison of the internals of the iterators, e.g. the internal pointer. So the key will not match any LHS and RHS value pair in the map. I also thought on such solution earlier but I dismissed it because of this.
Well, even if the body of the comparison operator is inlined, PreStmt/PostStmt callbacks should still work, and it doesn't really matter if there's a `SymbolConjured` or not, we can still add the symbolic expression to our map as a key.

Essentially, you ignore whatever happens in the iterator's operator==() when it's inlined (including any evalAssume events), then in PostStmt of operator==() you map the return-value symbol of the operator to the operator's arguments (operands), then whenever an assumption is being made against the return-value symbol, you carry over this assumption to the operands. I think it shouldn't really matter if the operator call was inlined.

The only unexpected thing that may happen due to inlining is if the inlined operator returns a concrete value (plain true or plain false) instead of the symbol, but in this case what we need to do is to just carry over the assumption to the operands instantly.

================
Comment at: lib/StaticAnalyzer/Checkers/IteratorPastEndChecker.cpp:580
+  C.addTransition(stateFound);
+  C.addTransition(stateNotFound);
+}
----------------
baloghadamsoftware wrote:
> NoQ wrote:
> > NoQ wrote:
> > > Ouch, i have one more concern, which can be expressed with the following false-positive test which currently fails:
> > > 
> > > ```
> > > void foo() {
> > >   std::vector<int> vec;
> > >   vec.push_back(2016);
> > >   auto i = vec.find(vec.begin(), vec.end(), 2016);
> > >   *i; // no-warning
> > > }
> > > ```
> > > 
> > > Not instantly sure what to do with this. You can avoid state splits until you are actually sure if both branches are possible, but that'd suppress a lot of useful positives. Such positives could be suppressed with assertions, of course, but i'd still hope there aren't too many of those.
> > I mean, `std::find(...` ><
> False positives can occur whenever we are sure that we will find the element so we do not check for the result to be equal with end().
Yep, so there's a bit of grey area here. The test case i wrote is very artificial, i.e. it is not idiomatic, in fact there aren't many cases when doing find() is actually useful when we're sure the element is there.

However, if we eventually enable this checker by default (move out of the alpha.* package), then i think we need to come up with a better behavior for this case:
- maybe it depends on container type (eg. for map-like containers we may know that the key is there but we don't know the value?);
- maybe it's a good idea to add a checker option to enable or disable the warning upon using unchecked find results;
- maybe we'd learn to reason about containers a bit better, even though it'd be hard.

So i've a feeling this can be moved to a FIXME/later, but it's definitely something to think about.

================
Comment at: lib/StaticAnalyzer/Checkers/IteratorPastEndChecker.cpp:204
+                                          CheckerContext &C) const {
+  const auto *ThisExpr = COCE->getArg(0);
+
----------------
This code definitely deserves comments. I managed to understand that this is a workaround for completely replacing the conjured symbol with a lazy value upon calling a method over temporary, which the core does from time to time, and i suspect that this code may break whenever more than one checker starts doing this (i.e. you'd have to skip more than one predecessor node in this case).

I still think that the root cause here is conjured structural symbols which i'd probably prefer to get rid of completely, and then this hack wouldn't be necessary.

https://reviews.llvm.org/D25660