[cfe-commits] Patch to do static analysis with range sets

Sun Feb 8 06:57:13 PST 2009

On Thu, Feb 5, 2009 at 6:55 PM, Ted Kremenek <kremenek at apple.com> wrote:
>
> On Feb 1, 2009, at 12:06 AM, Ben Laurie wrote:
>
>> Not complete - no self-tests and full of diagnostics, but I'd love to
>> get comments about the approach. Also hints on implementing tests (Ted
>> told me in IM, but I forgot), and how to preserve/add more debugging
>> stuff in an approved way, since I find that this stuff hurts my head
>> without it...
>>
>> Also, I have cases which fail because higher-level stuff seems to be
>> doing the wrong thing, but not sure how to go about debugging the rest
>> of static analysis...
>> <clang.patch>_______________________________________________
>> cfe-commits mailing list
>> cfe-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
>
> Hi Ben,
>
> Sorry for the delay.  Comments inline:
>
> Index: lib/Analysis/BasicConstraintManager.cpp
> ===================================================================
> --- lib/Analysis/BasicConstraintManager.cpp     (revision 63488)
> +++ lib/Analysis/BasicConstraintManager.cpp     (working copy)
> @@ -17,29 +17,364 @@
>  #include "clang/Analysis/PathSensitive/GRStateTrait.h"
>  #include "clang/Analysis/PathSensitive/GRTransferFuncs.h"
>  #include "llvm/Support/Compiler.h"
> +#include "llvm/ADT/FoldingSet.h"
>  #include "llvm/Support/raw_ostream.h"
>
> +#include <iostream>
>
> We try and avoid <iostream>.  For my reasons why, see the following in the
> LLVM coding standards:
>
> http://llvm.org/docs/CodingStandards.html#ll_iostream
>
> +#include <set>
>
> You're probably going to want to use ImmutableSet, since its a functional
> data structure (which works nicely when you embed sets in state objects).
>  It also doesn't have the malloc traffic that std::set<> does.
>
> +class Range : public std::pair<llvm::APSInt, llvm::APSInt> {
> +public:
> +  Range(const llvm::APSInt &from, const llvm::APSInt &to)
> +    : std::pair<llvm::APSInt, llvm::APSInt>(from, to) {
> +    std::cerr << "Range(" << from.toString(10) << ", " << to.toString(10)
> +              << ')' << std::endl;
> +    assert(from <= to);
> +  }
> +  bool Includes(const llvm::APSInt &v) const {
> +    return first <= v && v <= second;
> +  }
> +  const llvm::APSInt &From() const {
> +    return first;
> +  }
> +  const llvm::APSInt &To() const {
> +    return second;
> +  }
>
> Cute.  Does a Range represent "[a, b]" or "(a, b)".  Please document this
> with comments.
>
> I also take it that ranges are only over concrete values?  e.g., [0, 2].
>  What about ranges where:
>
> (1) one of the ends is "symbolic", e.g. [a, 2]
> (2) one of the ends is Unknown, e.g. [4, unknown].
>
> You don't have to model (1) now, but I think you probably will need to model
> (2).  Frequently we have a lower bound but no upper bound (although I
> imagine you could use maxint, minint, etc.).

I already do - for example, != k is [minint, k-1], [k+1, maxint].

> Note that this approach will inherently not cover relationships between
> symbolic values, e.g., tracking if 'a > b' when the values of 'a' and 'b'
> are symbolic.

Right, I have not attempted symbolic values - not least because none
of the rest of the code does (yet) :-)

>
> +struct RangeCmp {
> +  bool operator()(const Range &r1, const Range &r2) {
> +    if(r1.From() < r2.From()) {
> +      assert(!r1.Includes(r2.From()));
> +      assert(!r2.Includes(r1.To()));
> +      return true;
> +    } else if(r1.From() > r2.From()) {
> +      assert(!r1.Includes(r2.To()));
> +      assert(!r2.Includes(r1.From()));
> +      return false;
> +    } else
> +      assert(!"Ranges should never be equal in the same set");
> +  }
> +};
>
> I don't this will work.  The 'else' branch can be easily triggered if ranges
> overlap.

Ranges cannot overlap.

>  Also, is there a reason you are using operator() instead of
> operator>() and operator<()?

Because that's what you have to do?

> +
> +typedef std::set<Range, RangeCmp> PrimRangeSet;
>
> I think you will want to use ImmutableSet.
>
> +
> +class RangeSet;
> +std::ostream &operator<<(std::ostream &os, const RangeSet &r);
>
> Don't use std::ostream.  Use llvm::raw_ostream&.
>
> +
> +class RangeSet {
> +  PrimRangeSet ranges;
> +  bool noValues;  // if true, no value is possible
>
> My meta-level comment about this class is that isn't clear what it
> represents.  The code afterwards also seems not sure of what assumptions to
> make about RangeSet.  It seems that RangeSet represents a set of
> (non-overlapping) ranges, e.g. which represent the union of the possible
> values for some symbolic integer.  In other words, it represents a
> disjunction:
>
> (x >= set1_lb && x <= set1_ub) || (x >= set2_lb && x <= set2_ub) || ...
>
> With this formalism it should be much more straightforward to answer the
> implementation questions for the methods below.  It will also tell you what
> is the representational power of this constraint model.

Yeah. I do actually know what it means, my comments were because I was
less sure about what the calling code does...

>
> Here "lb" and "ub" mean "lower bound" and "upper bound" respectively.
>
> +
> +  static const llvm::APSInt Max(const llvm::APSInt &v) {
> +    return llvm::APSInt::getMaxValue(v.getBitWidth(), v.isUnsigned());
> +  }
> +  static const llvm::APSInt Min(const llvm::APSInt &v) {
> +     return llvm::APSInt::getMinValue(v.getBitWidth(), v.isUnsigned());
> +  }
> +  static const llvm::APSInt One(const llvm::APSInt &v) {
> +    return llvm::APSInt(llvm::APInt(v.getBitWidth(), 1), v.isUnsigned());
> +  }
> +
> +public:
> +  RangeSet() : noValues(false) {
> +  }
> +  void Profile(llvm::FoldingSetNodeID &ID) const {
> +    for(PrimRangeSet::const_iterator i = ranges.begin() ; i != ranges.end()
> ; ++i) {
> +      i->Profile(ID);
> +    }
> +  }
>
> This kind of Profile implementation won't be necessary if you use
> ImmutableSet.  It will do it for you. Moreover, with ImmutableSet, two sets
> (created using the same ImmutableSet::Factory) will always be referentially
> equivalent, meaning they are a quick pointer comparison.  ImmutableSets also
> use techniques similar to hash-consing to share data between sets.

Cool.

>
> +
> +  bool CouldBeNE(const llvm::APSInt &ne) const {
> +    std::cerr << "CouldBeNE(" << ne.toString(10) << ") " << *this <<
> std::endl;
>
> Please don't have debugging messages like this enabled by default.  Either
> comment it out or #ifdef it.
>
> +    // FIXME: do I mean this? empty means all values are possible,
> therefore != possible, right?
> +    if(ranges.empty())
> +      return false;
>
> I think to answer the questions of these FIXMEs you just need to define what
> the invariants are for PrimRangeSet and what it represents.
>
> +    // FIXME: if there are no possible values, then can't be !=, surely?
> +    if(noValues)
> +      return true;
> +    for(PrimRangeSet::const_iterator i = ranges.begin() ; i != ranges.end()
> ; ++i)
> +      if(i->Includes(ne))
> +        return false;
> +    return true;
> +  }
>
> +
> +  bool CouldBeEQ(const llvm::APSInt &eq) const {
> +    std::cerr << "CouldBeEQ(" << eq.toString(10) << ") " << *this <<
> std::endl;
> +    if(ranges.empty())
> +      return true;
> +    // FIXME: if there are no possible values, then can't be ==, surely?
> +    if(noValues)
> +      return true;
> +    for(PrimRangeSet::const_iterator i = ranges.begin() ; i != ranges.end()
> ; ++i)
> +      if(i->Includes(eq))
> +        return true;
> +    return false;
> +  }
>
> Same comments with CouldBeNE, CouldBeLT, etc..
>
> ....
>
> +  // Make all existing ranges fall within this new range
> +  void Restrict(const llvm::APSInt &from, const llvm::APSInt &to) {
>
> The comment for this method is a little too terse for me to understand your
> algorithm.  What is it trying to do?

This one is actually unused - it seemed like a good idea at the time -
I guess a cast could invoke it.

>  It also looks like you are modifying
> the RangeSet object in place. In the static analyzer, we use functional
> data structures that preserve old data so that each GRState object (which
> represents the analysis state at a particular point within a path) never
> have their values mutated after the GRState object is created.  This allows
> us to faithfully keep full path information.  This allows us to report full
> path diagnostics when reporting bugs but also (from a correctness
> standpoint) allows path caching/merging to work at all.

Wow. Doesn't it turn out to be horribly expensive?

>
> Without making comments on the rest of the patch (which looks similar to the
> other parts), I think there just needs to be a high-level comment of what is
> the algorithmic model for the constraints, how they are resolved, what is
> represented, etc.  Right now it is difficult for me to evaluate the approach
> itself since I don't really understand the algorithm.  Also, I think that
> the implementation itself should come in a straightforward manner once a few
> key methods (e.g., "CouldBeEQ") get fleshed out.  I'm more than happy to
> help with this process.

How would that work in practice? Without switching to mercurial, that is?

Anyway, will work on a new version of the patch.