[PATCH] D35216: [analyzer] Escape symbols when creating std::initializer_list.

Wed Jul 19 05:57:14 PDT 2017

NoQ added a comment.

In https://reviews.llvm.org/D35216#814124, @dcoughlin wrote:

> In this case, I would be fine with some sort of AbstractStorageMemoryRegion that meant "here is a memory region and somewhere reachable from here exists another region of type T". Or even multiple regions with different identifiers. This wouldn't specify how the memory is reachable, but it would allow for transfer functions to get at those regions and it would allow for invalidation.

Yeah, this is what we can easily implement now as a symbolic-region-based-on-a-metadata-symbol (though we can make a new region class for that if we eg. want it typed). The problem is that the relation between such storage region and its parent object region is essentially immaterial, similarly to the relation between SymbolRegionValue and its parent region. Region contents are mutable: today the abstract storage is reachable from its parent object, tomorrow it's not, and maybe something else becomes reachable, something that isn't even abstract. So the parent region for the abstract storage is most of the time at best a "nice to know" thing - we cannot rely on it to do any actual work. We'd anyway need to rely on the checker to do the job.

> For std::initializer_list this reachable region would the region for the backing array and the transfer functions for begin() and end() yield the beginning and end element regions for it.

So maybe in fact for std::initializer_list it may work fine because you cannot change the data after the object is constructed - so this region's contents are essentially immutable. For the future, i feel as if it is a dead end.

I'd like to consider another funny example. Suppose we're trying to model std::unique_ptr. Consider:

  void bar(const std::unique_ptr<int> &x);

  void foo(std::unique_ptr<int> &x) {
    int *a = x.get();   // (a, 0, direct): &AbstractStorageRegion
    *a = 1;             // (AbstractStorageRegion, 0, direct): 1 S32b
    int *b = new int;
    *b = 2;             // (SymRegion{conj_$0<int *>}, 0 ,direct): 2 S32b
    x.reset(b);         // Checker map: x -> SymRegion{conj_$0<int *>}
    bar(x);             // 'a' doesn't escape (the pointer was unique), 'b' does.
    clang_analyzer_eval(*a == 1); // Making this true is up to the checker.
    clang_analyzer_eval(*b == 2); // Making this unknown is up to the checker.
  }

The checker doesn't totally need to ensure that `*a == 1` passes - even though the pointer was unique, it could theoretically have `.get()`-ed above and the code could of course break the uniqueness invariant (though we'd probably want it). The checker can say that "even if `*a` did escape, it was not because it was stuffed directly into `bar()`".

The checker's direct responsibility, however, is to solve the `*b == 2` thing (which is in fact the problem we're dealing with in this patch - escaping the storage region of the object).

So we're talking about one more operation over the program state (scanning reachable symbols and regions) that cannot work without checker support.

We can probably add a new callback "checkReachableSymbols" to solve this. This is in fact also related to the dead symbols problem (we're scanning for live symbols in the store and in the checkers separately, but we need to do so simultaneously with a single worklist). Hmm, in fact this sounds like a good idea; we can replace checkLiveSymbols with checkReachableSymbols.

Or we could just have ghost member variables, and no checker support required at all. For ghost member variables, the relation with their parent region (which would be their superregion) is actually useful, the mutability of their contents is expressed naturally, and the store automagically sees reachable symbols, live symbols, escapes, invalidations, whatever.

> In my view this differs from ghost variables in that (1) this storage does actually exist (it is just a library implementation detail where that storage lives) and (2) it is perfectly valid for a pointer into that storage to be returned and for another part of the program to read or write from that storage. (Well, in this case just read since it is allowed to be read-only memory).
> 
> What I'm not OK with is modeling abstract analysis state (for example, the count of a NSMutableArray or the typestate of a file handle) as a value stored in some ginned up region in the store.This takes an easy problem that the analyzer does well at (modeling typestate) and turns it into a hard one that the analyzer is bad at (reasoning about the contents of the heap).

Yeah, i tend to agree on that. For simple typestates, this is probably an overkill, so let's definitely put aside the idea of "ghost symbolic regions" that i had earlier.

But, to summarize a bit, in our current case, however, the typestate we're looking for //**is**// the contents of the heap. And when we try to model such typestates (complex in this specific manner, i.e. heap-like) in any checker, we have a choice between re-doing this modeling in every such checker (which is something analyzer is indeed good at, but at a price of making checkers heavy) or instead relying on the Store to do exactly what it's designed to do.

> I think the key criterion here is: "is the region accessible from outside the library". That is, does the library expose the region as a pointer that can be read to or written from in the client program? If so, then it makes sense for this to be in the store: we are modeling reachable storage as storage. But if we're just modeling arbitrary analysis facts that need to be invalidated when a pointer escapes then we shouldn't try to gin up storage for them just to get invalidation for free.

As a metaphor, i'd probably compare it to body farms - the difference between ghost member variables and metadata symbols seems to me like the difference between body farms and evalCall. Both are nice to have, and body farms are very pleasant to work with, even if not omnipotent. I think it's fine for a `FunctionDecl`'s body in a body farm to have a local variable, even if such variable doesn't actually exist, even if it cannot be seen from outside the function call. I'm not seeing immediate practical difference between "it does actually exist" and "it doesn't actually exist, just a handy abstraction". Similarly, i think it's fine if we have a `CXXRecordDecl` with implementation-defined contents, and try to farm up a member variable as a handy abstraction (we don't even need to know its name or offset, only that it's there somewhere).

https://reviews.llvm.org/D35216