[PATCH] D103605: [analyzer] Introduce a new interface for tracking

Mon Jun 7 03:50:08 PDT 2021

vsavchenko added a comment.

In D103605#2801681 <https://reviews.llvm.org/D103605#2801681>, @NoQ wrote:

> Inlined defensive check suppressions are indeed pretty ugly; this problem should ideally be solved in a completely different manner.
>
> That said, I strongly disagree that "when the tracker stops" is a hard-to-understand concept. It's an extremely natural and intuitive question to ask:
>
> //"Where does the value come from?"//
>
> It's fairly reasonable to wonder what happened during analysis that caused us to produce this value for the first time, which then traveled down to our bug point, and under what circumstances did it happen. Not much precise reasoning can be based on this information but it's a very good piece of information to drive imprecise reasoning and heuristics, such as (but not limited to) inlined defensive check suppressions (where it bears a relatively precise meaning that corresponds exactly to "state splits in nested stack frames are intrinsically less reliable as they can't ever be derived from presumption of no dead code").
>
> Another example: a value of a top-level parameter variable is much more likely to be arbitrary than a return value of an unknown function call (if the user wanted pre-conditions they could have added an assert), which in turn is much more likely to be arbitrary than a value loaded from a heap region after store invalidation from unknown function call (we're 99% sure that the function doesn't touch this particular heap region so it's probably just the old value). In this sense, I'm pretty tempted to suppress bug paths where two branch conditions (ultimately going into opposite directions) are based on different values of the same heap variable after different number of invalidations. But I'm certainly against doing so for two different top-level parameter values. You may say that in these cases the information is most likely available from the structure of the symbol (`SymbolRegionValue` vs. `SymbolConjured` vs. `SymbolDerived`) but it only confirms my point: origins of values are so important that we even carry them as part of the value's identity! Direct access to the structure of the symbol for determining origins is also used a few times (eg., see call sites of `getOriginRegion()`).

OK, let's make a pause right here.  We again start to go into other topics.  You come from a pure perspective that `track` explains how the value got here and does only that.  In this picture of the world, `track` has one starting point and one ending point.  If it ends to early (due to the lack of domain knowledge), the checker can pick up and restart the tracking if needed.  If it ends for real, this is the earliest where we got this value from and it can tell us about the warning and situation we are in.  The problem is that tracking is fundamentally **dirty** the way it is.  It is not linear, and when we track one simple value, it actually spawns a gigantic tree of traces.  Thus when we get to the point where it all ended, it actually ended 100 times.  It ended in multiple different visitors that were tracking all kinds of things not limited to the value itself.  `trackExpressionValue` is hard also because of this reason.  Inline defensive checks don't belong there and clutter it.

I DON'T KNOW how to simplify this ATM.  Probably we can decouple ONE VALUE TRACKING from the rest of it, but it's not what I'm trying to do.  I want to refactor the existing solution (with all of its inherent design flaws) into something more extensible.  And I think that again ATM, with the current state of things, the concept of "tracking stopped here" is doomed.

> What I'm trying to do here is to see a general direction in which we're moving. We have the benefit of being able to predict things. We don't have to be agile; our problem is well-defined from the start and will never really change. There's a bit of flexibility required to support a variety of different future checkers, yes, but that's about it. And then, again, it's also not entirely arbitrary: we have a pretty good amount of data on what our checkers typically do and it's unlikely to expand drastically in the near future. So we don't always need to go for the most flexible solution, but there's often benefit in going for a simpler solution that simply covers our predictable needs and nothing else.
>
> As a historical example of such process, currently bug visitors perform a single pass over the bug report, from bottom to top. This wasn't always this way: originally visitors traversed the bug report multiple times (occasionally 30+ times), adding new visitors along the way, until the set of visitors attached to the bug report reached a fixed point. The original solution was very flexible; it allowed us to backreference values at the bottom in a flash of hindsight and have all visitors restarted gracefully. But in the entirety of our foreseeable circumstances this wasn't necessary because the ultimate purpose of having visitors was to answer the question "why do we think this situation has indeed occured in the program?" by adding notes to the path, and the answer to that question is //always above// the current point in the path. Such simplification really helped.
>
> This is why i'm trying to understand the entire thing before opting into a system. Especially when it comes to checker APIs for which overengineering is particularly dangerous - and I'm still not sure how exactly the new checker APIs are intended to work after a few pages of discussion. I don't care how they work as much as I care about knowing how they are ultimately intended to work and whether it's simple enough to understand and explain.

It's not that it came from out of nowhere.  That I just decided that `trackExpressionValue` is not extensible enough and put this all together.  We saw with our own eyes that it didn't work for `RetainCountChecker` or `SmartPtrChecker` and started to come up with ridiculous workarounds to address it.  I also see all the different pieces of other checkers that live inside of `BugReportersVisitors.cpp`.  And they are there not because of checker developers failing to understand checker APIs, but rather tracking APIs failing them.

> May i ask you to sketch the intended solution for @RedDocMD's problem and/or for your RetainCountChecker problem, i.e. how it'll look in the checker code? Or in any imaginary checker of your choice? Like, what's the motivating example? That could help me a lot with understanding your approach.

It is a bit weird to me after you correctly named all the properties of the proposed solution and described how it will look like in the new architecture, but OK.

Right now, if we go up the chain of tracking calls, we get to the point where tracker gets stuck or stops.
Instead of understanding that it stops, and modifying a hell-lot of code, and passing callbacks literally everywhere, I suggest we don't let it stop in the first place.
Let's say we have this example:

  ...
  d = foo(e);
  c = d;
  Type b = c;
  a = b;
  reportPoint(a);

Here `foo` is either `dyn_cast` for `RetainCountChecker` or `e.get()` in `SmartPtrChecker`.
`FindLastBRVisitor` gets to `d = foo(e)` understands that this is assignment (maybe even can say that it gets the value stored into `e`), calls `trackExpressionValue` for `foo(e)` and it does nothing.

It is pretty high up the chain, so we probably don't want to repeat the logic of getting up there (@RedDocMD attempted that), but we couldn't insert our logic into this intimate dance of `trackExpressionValue` and `FindLastBRVisitor`.

With the new solution, if you want the most simple ways of addressing it, you create a handler:

  class FooHandler : public ExpressionHandler {
  public:
    using ExpressionHandler::ExpressionHandler;

    Tracker::Result handle(const Expr *E, const ExplodedNode *InputNode,
                           const ExplodedNode *ExprNode,
                           TrackingOptions Opts) {
      if (auto *Arg = getFooArgument(E)) {
        return getParentTracker().track(E, ExprNode, Opts);
      }
      return {};
  };

Essentially these handlers is my way of first introducing new logic for these two checkers, and second to extract similar logic that exists there from all other checkers.

Then, let's say, I decided that I don't like the note "d is assigned here".  It doesn't seem very specific and adding more notes on top will overwhelm our poor user even more.  So, I decided to make notes: "d is foo'ed from e", "g is initialized to foo from e", and "p's foo is used as the 1st argument".
Right now, there is literally no way to do this except for adding this directly to already bloated `FindLastStoreBRVisitor`.  Why do you think it already produces notes about "garbage values" and "null/nil pointers"?

So, the checker implements a `StoreHandler` where they handle all the cases.  It is important to have it exactly like this because `FindLastStoreBRVisitor` has non-trivial logic of catching "stores" and we don't want this logic to get copied into different checkers.

And if we are talking where it can go further, this particular approach, is defining more "events" in the tracking process and decoupling "event identification" from the process of generating notes for such events.

> Interesting, can you give an example (say, in case of @RedDocMD's problem)  where `isTracked()` would give an incorrect result but "I am tracking this value" would give a correct result? I guess such situations will arise more often as we add more tracking, but still.

I'm not super sure, but I think smth like this will suffice:

  // a and b are completely unrelated smart pointers
  raw = a.get();
  if (!b.get() && !a.get()) {
    use(*raw);
  }

We can attempt to track `b.get()` because of the conditions.  Checker wouldn't get that we are actually interested in conditions and will initiate a full-blown back-trace for `b` even though we don't need it.

> I think it's a good time to step back and talk about //statefulness// of this whole thing.
>
> Path sensitive checkers are stateless; their state is instead encoded in the `ProgramState` structure. This is essential because execution paths are branchy and nodes are modeled in a fairly arbitrary order.
>
> Bug visitors are stateful. This is fine because every bug path is linear and traversed in order. Value tracks are linear too - whelp, most of the time, I suppose, but we don't mind basically forking our entire stateful machine whenever this is violated.
>
> Note tags, on the other hand, are stateless again by design. This is why they are designed to rely on information in the bug report, such as interestingness, for communication across tags. Interestingness is an older concept that is simply re-used in note tags but from the perspective of note tags interestingness is a very simple thing: it's their //state//.
>
> Information attached to the bug report is arbitrary. We can extend it as much as we want, make a data structure equivalent of the GenericDataMap that would track checker-specific variations over the concept of interestingness. It could decorate the specific role of every value as much as necessary. If we need to specify that this value is the same value that's seen by that other note tag (or participated in the emission of the warning itself in a certain manner), we can totally do that.
>
> Hence the question, do we really need to preserve this whole imperative concept of "I" in "I am tracking this value", or is the functional concept of "This value is playing a certain role" entirely sufficient for our purposes?

I'm glad that you brought this up and don't understand why you don't see it, really.  Visitors are stateful.  Hmm, I don't like it because it can be confusing because of program state and so on.  So, let's say "Visitors are mutable".  They can actually change, carry one piece of data.  Tracker/handler mechanism is the same, you are allowed to carry some mutable data through the process.  Maybe it is bad, but I don't see the way to get `isTracked` to solve all the problems that I want to solve (it is probably trying to solve bigger problems).  Can we move inline defensive checks out of the `BugReportVisitors.cpp` with `isTracked` only?

And yes, checkers should be able to say "I track this value".  It is the operation that makes sense only if it is helpful for the checker, and the checker should be able to customize it however the checker pleases.

================
Comment at: clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h:147
+class ExpressionHandler;
+class StoreHandler;
+
----------------
xazax.hun wrote:
> vsavchenko wrote:
> > xazax.hun wrote:
> > > During path sensitive analysis we already have a callback for stores. We kind of replicating this logic for bug paths. 
> > > So my questions are:
> > > * Do we expect to have additional information here that were not available during the analysis earlier?
> > > * Do we want to make this as similar to the forward analysis part as possible for developer familiarity?
> > Oof, I actually think that this one is trickier.  If the checker modeled some operation (most likely `evalCall`), it can bind a value to the region by itself.  And this handler triggers for such events as well because it simply finds the place where the switch was flipped and the region changed its binding.
> > If the checker modeled some operation (most likely evalCall), it can bind a value to the region by itself. And this handler triggers for such events as well because it simply finds the place where the switch was flipped and the region changed its binding.
> 
> This is such a great point, I'd like that to be captured as a comment somewhere. 
Gotcha!  I'll add it!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103605/new/

https://reviews.llvm.org/D103605