[PATCH] D103605: [analyzer] Introduce a new interface for tracking

Mon Jun 7 20:01:30 PDT 2021

NoQ added a comment.

In D103605#2802240 <https://reviews.llvm.org/D103605#2802240>, @vsavchenko wrote:

> In D103605#2801681 <https://reviews.llvm.org/D103605#2801681>, @NoQ wrote:
>
>> 
>
> OK, let's make a pause right here.  We again start to go into other topics.

These are not other topics. We're discussing the overall direction into which this patchset is a large step. I absolutely welcome this step and am fascinated to see how it turns out but I want us to agree on the overall direction first.

> You come from a pure perspective that `track` explains how the value got here and does only that.  In this picture of the world, `track` has one starting point and one ending point.  If it ends to early (due to the lack of domain knowledge), the checker can pick up and restart the tracking if needed.  If it ends for real, this is the earliest where we got this value from and it can tell us about the warning and situation we are in.  The problem is that tracking is fundamentally **dirty** the way it is.  It is not linear, and when we track one simple value, it actually spawns a gigantic tree of traces.  Thus when we get to the point where it all ended, it actually ended 100 times.  It ended in multiple different visitors that were tracking all kinds of things not limited to the value itself.  `trackExpressionValue` is hard also because of this reason.  Inline defensive checks don't belong there and clutter it.
>
> I DON'T KNOW how to simplify this ATM.  Probably we can decouple ONE VALUE TRACKING from the rest of it, but it's not what I'm trying to do.  I want to refactor the existing solution (with all of its inherent design flaws) into something more extensible.  And I think that again ATM, with the current state of things, the concept of "tracking stopped here" is doomed.

The reason why I come from this perspective is because that's how I think it is //supposed// to work and I'm not aware of a single good reason why it can't. I think the reason why it's this way is that the original authors seem to have bombarded the code with extra clauses until it starts demonstrating the more or less expected behavior, without trying to solve the problem on paper first. Like with visitor fixpoint iteration, it sounds like an unnecessary piece of complexity that we will be ultimately able to eliminate.

Except, well, the distant constraints example a-la @RedDocMD's problem but for raw pointers. This is indeed an exceptional case which may cause two tracking efforts to run in parallel. This can be probably generalized to any constraint-like traits: if we want to explain a constraint, we'll have to be prepared that the constraint isn't part of the value track that we already have. This is the first time the above assumption was challenged and I definitely need time to process this but I still think there shouldn't be any other situations like this and there should be something special about this situation that will allow us to incorporate it into the linear-track solution.

"At the moment such refactoring is difficult to achieve" is a perfectly satisfying answer to me with respect to the current state of things. But I want us to agree as early as possible about whether we want to achieve this. Simply so that as we make every step, we keep this in mind and evaluate every step as taking us either closer to this goal or farther from this goal and try to avoid the latter. And I don't think that anything prevents us from discussing this goal right now; the problem we're trying to solve isn't really ever going to change and we already have a fairly large amount of data.

> With the new solution, if you want the most simple ways of addressing it, you create a handler:
>
>   class FooHandler : public ExpressionHandler {
>   public:
>     using ExpressionHandler::ExpressionHandler;
>   
>     Tracker::Result handle(const Expr *E, const ExplodedNode *InputNode,
>                            const ExplodedNode *ExprNode,
>                            TrackingOptions Opts) {
>       if (auto *Arg = getFooArgument(E)) {
>         return getParentTracker().track(E, ExprNode, Opts);
>       }
>       return {};
>   };
>
> Essentially these handlers is my way of first introducing new logic for these two checkers, and second to extract similar logic that exists there from all other checkers.

Ok, do I understand correctly that:

- Functionality-wise, this is entirely equivalent to my NoteTag solution with "`isTracked()`";
- Boilerplate-wise it makes the `isTracked()` check implicit ("we were invoked at all therefore the value of the expression is tracked by... somebody") at the cost of asking the user to define a class;
- Convenience-wise, it potentially requires the developer to reverse-engineer the symbolic execution step when the logic gets more complicated;
- Flexibility-wise... I guess you can make custom trackers now, which will track the value in a different manner, and communicate with the tracker in order to produce different notes?

> Then, let's say, I decided that I don't like the note "d is assigned here".  It doesn't seem very specific and adding more notes on top will overwhelm our poor user even more.  So, I decided to make notes: "d is foo'ed from e", "g is initialized to foo from e", and "p's foo is used as the 1st argument".
> Right now, there is literally no way to do this except for adding this directly to already bloated `FindLastStoreBRVisitor`.  Why do you think it already produces notes about "garbage values" and "null/nil pointers"?
>
> So, the checker implements a `StoreHandler` where they handle all the cases.  It is important to have it exactly like this because `FindLastStoreBRVisitor` has non-trivial logic of catching "stores" and we don't want this logic to get copied into different checkers.
>
> And if we are talking where it can go further, this particular approach, is defining more "events" in the tracking process and decoupling "event identification" from the process of generating notes for such events.

Yeah, the store handler part seems perfectly legit because *I* don't see any better approaches here :) It's indeed a big problem that we can't customize existing notes and this sounds like the right data structure to solve the problem.

>> Interesting, can you give an example (say, in case of @RedDocMD's problem)  where `isTracked()` would give an incorrect result but "I am tracking this value" would give a correct result? I guess such situations will arise more often as we add more tracking, but still.
>
> I'm not super sure, but I think smth like this will suffice:
>
>   // a and b are completely unrelated smart pointers
>   raw = a.get();
>   if (!b.get() && !a.get()) {
>     use(*raw);
>   }
>
> We can attempt to track `b.get()` because of the conditions.  Checker wouldn't get that we are actually interested in conditions and will initiate a full-blown back-trace for `b` even though we don't need it.
> (...)
> And yes, checkers should be able to say "I track this value". It is the operation that makes sense only if it is helpful for the checker, and the checker should be able to customize it however the checker pleases.

Ok, I don't understand! If we have anyway tracked the value back to `b.get()`, why not track further through `b`? What makes `b.get()` the exact place at which we want to stop tracking the value? If we wanted to know why we choose the true path at the branch, shouldn't we track further through `b` as well in order to answer that question? If we didn't want to know why we choose the true path, why did we track the value back to `b.get()` in the first place?

I guess one possible difference between `a` and `b` here would be in messaging, i.e. how we refer to that value. For example, we could refer to `a` as "the pointer value" (implying that this is the null-dereferenced pointer value) and refer to `b` as "a pointer value" (implying that it's some other value, maybe the one that participates in condition later). But even then, this distinction is much more coarse-grained than "I track the value" vs. "Someone else tracks the value". It's more like "Buggy value" vs. "Condition value" which is a distinction we already "almost" make with `TrackingKind`. Even if we wanted to identify our values as "pointer #1", "pointer #2", etc., we could totally do with a sufficiently advanced definition of `isTracked()`.

> Can we move inline defensive checks out of the BugReportVisitors.cpp with `isTracked()` only?

My original thinking was along the lines of, eg.,:

  void ExprEngine::processBranch(...) {
    Bldr.addTransition(..., getNoteTag([] (BugReport &BR) {
      if (BR.isTracked(Condition) && !N->getLocationContext().inTopFrame())
        BR.markInvalid();
    }));
  }

And this is how, basically, the entirety of `trackExpressionValue()` could be decomposed into individual stateless note tags, with all the necessary state stuffed into the bug report. Including the tracking process itself: if note tags are sufficient to help with tracking, they're sufficient to do the tracking from the start.

This code could also be moved into a checker's `checkBranchCondition()` so that to make it truly checker-specific. That would come at the cost of creating a lot of redundant exploded nodes just for the sake of attaching the tag that will probably never be used so this is rather bad, wouldn't recommend.

I still think the solution that relies on well-defined end of tracking is significantly more elegant. Which, again, could be implemented with note tags: we could have a note tag that'd call a callback stuffed directly into the bug report.

So I think it can work both ways given that the tags solution would rely on having a sufficiently flexible mutable state as part of the `BugReport` object.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103605/new/

https://reviews.llvm.org/D103605