[cfe-dev] [analyzer] Random SVal hierarchy questions.

Mon Nov 9 17:46:13 PST 2015

Hi Artem,

Sorry it took me so long to get back to you.

> On Nov 2, 2015, at 7:06 AM, Artem Dergachev via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> 
> I've got a few quick random questions about SVal/MemRegion/SymExpr heirarchy in the Static Analyzer. In most cases, i'm looking for design ideas behind particular values, so it's sort of theoretical.
> 
> 
> == SymbolConjured of a record type ==
> 
> Usually symbols have integral types. However, if a function returns, say, "struct S" by value, and its definition is not available, then its return value is described as conj_$N{S}. I'm surprised to see a symbol of such type, because it doesn't behave like other symbols - eg. you cannot compose a SymIntExpr from it, or assume range constraints on it. Is this behavior intentional, or it's rather worth it try replacing it with a more structure-like value? For example, a LazyCompoundValue with a NULL Store could describe an unknown structure pretty well. Is it worth it to *assert* that all symbols have numeric or pointer types only?

One thing conjured symbols of record type let you do is reason about equality of MemRegions and of values in the fields of those regions independently of range constraints. So, for example, conjuring a symbol for the result of getS() below let’s the analyzer establish that x1 == x2:

struct S {
  int f;
};

struct S getS();

void foo() {
  struct S s = getS();

  int x1 = s.f;
  int x2 = s.f;

  // Store Bindings at this program point:
  //   (s,0,default) : conj_$2{struct S}
  //   (x1,0,direct) : derived_$3{conj_$2{struct S},s->f}
  //   (x2,0,direct) : derived_$3{conj_$2{struct S},s->f}

  clang_analyzer_eval(x1 == x2); // Yields TRUE
}

> == SymbolicRegion in different memory spaces ==
> 
> If two SubRegions have different super regions, in particular if they have different memory spaces, they are considered to be different (i.e., have different hash). This makes sense for, say, discriminating between VarRegions of ParmVarDecls in different stack frame contexts (different calls to the same function). However, for SymbolicRegion based on a pointer symbol, it does not make sense to discriminate between the symbolic region in Unknown memory space and the symbolic region in Heap memory space (eg. constructed by MallocChecker). The analyzer currently considers these regions to be two different regions, even if they are based on the same pointer. Is this problem worth fixing by removing the memory space from the hash function of SymbolicRegion?

Regions in the unknown space can alias with any space, not just the heap space. In general, the analyzer makes fairly optimistic assumptions about dis-aliasing. Consider the following code:

int *bar(int *p);

void foo(int param) {
  int *ptr = bar(&param);
  *ptr = 8;
  param = 9;

  // Store Bindings at this program point:
  //   (param,0,direct) : 9 S32b
  //  (ptr,0,direct) : &SymRegion{conj_$3{int *}}
  //  (SymRegion{conj_$3{int *}},0,direct) : 8 S32b

  clang_analyzer_eval(*ptr == 9); // Yields FALSE
}

Here, the analyzer has SymRegion{conj_$3{int *}} is in the unknown memory space but it is entirely possible that bar() returns its parameter. In this case, updating *ptr would change the value stored for param. The analyzer (unsoundly) assumes this can’t happen. It could be more conservative and case split on all possible aliasing relationships, but this would be very expensive! Alternatively, it could invalidate all other storage on writes to the unknown space — but this would be quite imprecise, invalidating storage for locations that probably aren’t aliased. The assumption that symbolic regions in the unknown space don’t alias parameters isn’t sound, but it does cover the common case. 

Similarly, the analyzer makes optimistic dis-aliasing assumptions about regions with two different conjured symbols:

int *bar();

void foo() {
  int *ptrUnknown1 = bar();
  int *ptrUnknown2 = bar();
  *ptrUnknown1 = 8;
  *ptrUnknown2 = 9;

// Store Bindings at this program point:
//   (ptrUnknown1,0,direct) : &SymRegion{conj_$2{int *}}
//   (SymRegion{conj_$2{int *}},0,direct) : 8 S32b
//   (ptrUnknown2,0,direct) : &SymRegion{conj_$5{int *}}
//   (SymRegion{conj_$5{int *}},0,direct) : 9 S32b
}

(This optimistic assumption of one of many things that make the “general” (precondition of true) approach to summaries quite challenging. The analyzer could generate a summary making the optimistic assumption that two parameter pointers are dis-aliased only to discover when trying to apply the summary that the two pointers point to the same memory!)

One thing to note is that the analyzer’s assumption of dis-aliasing is sound for regions within the heap space. This is because it only produces symbolic regions in that space when it actually observes the allocation site (e.g., malloc) of the region. For example, if the two calls to bar() above were malloc() instead, then the analyzer would produce regions in the heap space and could safely assume that the two pointers point to distinct memory.

In contrast with the optimistic approach the analyzer takes for aliasing between symbolic regions, the analyzer is more pessimistic about symbolic *offsets* into regions. As Jordan describes in clang/docs/analyzer/RegionStore.txt (Binding Invalidation), binding to a region with a symbolic offset invalidates the entire region. For example:

void foo(int i, int j) {
  int *ptr = (int *)malloc(2*sizeof(int));

  ptr[i] = 8;
  ptr[j] = 9;

// Store Bindings at this program point:
//   (ptr,0,direct) : &element{SymRegion{conj_$2{void *}},0 S32b,int}
//   (SymRegion{conj_$2{void *}},0,default) : Unknown
//   (element{SymRegion{conj_$2{void *}},reg_$4<j>,int},direct) : 9 S32b

For symbolic offsets, the analyzer assumes that p[i] and p[j] may alias and so invalidates p[i] when p[j] is bound. This is potentially imprecise (it may be the case that i != j), but as a practical heuristic it is much more likely that two symbolic offsets within the same region will alias than that two symbolic regions themselves will alias.

> == CXXTempObjectRegion and block counts ==
> 
> CXXTempObjectRegion is constructed with expression that caused it to appear, a memory space, and that's it. I wonder if, during lifetime of a CXXTempObjectRegion we can accidentally reach the same expression again and produce the same region twice and, say, overwrite its bindings. In other words, does CXXTempObjectRegion require some sort of block-count, like SymbolConjured? This time, i'd rather believe that it doesn't, and that the current behavior is correct, so i wonder if there's an easy way to explain why is it correct.

I discussed this with Jordan and we’re pretty sure this isn’t an issue (although it is a good catch!). Our reasoning is that the only legal way to execute the same expression twice while the result of the previous one is still around is recursion — and you get different CXXTempObjectRegions for different stack frames (see MemRegionManager::getCXXTempObjectRegion()). In all other cases, any const reference that extends the lifetime of the temporary would have to have ended its lifetime as well before the expression could be executed again. That said, there are also sorts of gross things the programmer could do with undefined behavior to get at the same TempObjectRegion (e.g., life-time extending a temporary in a loop and storing the the address of the const reference to it in a pointer). We could potentially add a block count here to provide less unexpected analyzer behavior in these cases.

> == SymbolRegionValue after a deleted binding ==
> 
> As far as i understand, SymbolRegionValue can be described as a symbol that denotes the unknown value of a region *in the beginning* of the analysis. It is produced by the RegionStore whenever there is no other binding that has overwritten this value *during* the analysis. However, when we do invalidation, we can sometimes remove bindings from regions. Is it intended that the same SymbolRegionValue may possibly re-appear after invalidation, simply because the binding was removed, even if the original SymbolRegionValue is considered dead, and thus all constraints and checker data was erased, and hence there's no harm in reusing the symbol? Or we'd better try our best to ensure that some sort of SymbolDerived would always be created in this scenario, and the original SymbolRegionValue never gets resurrected?

When invalidating regions with symbolic base regions, the analyzer tries to adds a default binding to a fresh conjured symbols (see void invalidateRegionsWorker::VisitCluster(const MemRegion *, const ClusterBindings *)).
For example:

struct S {
  int f;
};

void invalidate(struct S *s);

void foo(struct S *s) {
  s->f = 10;
// Store Bindings at this program point:
//   (SymRegion{reg_$0<s>},0,direct) : 10 S32b

  invalidate(s);

// Store Bindings at this program point:
//   (SymRegion{reg_$0<s>},0,default) : conj_$1{int}
}

When invalidating ’s', the analyzer removes the direct binding storing the concrete value ’10’ at off set 0 and adds a default binding storing the fresh conjured value conj_$1{int}.

Are you seeing cases where the analyzer is removing bindings but not adding a default binding to a conjured symbol?

Devin