[PATCH] D103605: [analyzer] Introduce a new interface for tracking

Tue Jun 8 06:58:15 PDT 2021

vsavchenko added a comment.

In D103605#2804421 <https://reviews.llvm.org/D103605#2804421>, @NoQ wrote:

> These are not other topics. We're discussing the overall direction into which this patchset is a large step. I absolutely welcome this step and am fascinated to see how it turns out but I want us to agree on the overall direction first.

OK, fair point, but for further plan, we need to have an incremental model of how we want to achieve it.

> The reason why I come from this perspective is because that's how I think it is //supposed// to work and I'm not aware of a single good reason why it can't. I think the reason why it's this way is that the original authors seem to have bombarded the code with extra clauses until it starts demonstrating the more or less expected behavior, without trying to solve the problem on paper first. Like with visitor fixpoint iteration, it sounds like an unnecessary piece of complexity that we will be ultimately able to eliminate.

Right now it is all very tightly coupled.  It is a lot of moving pieces that I don't think anyone fully understands.  If we take time and decompose existing tracking into handlers, it can be a step in getting to know some implicit connections between different tracking activities we have right now.

> Except, well, the distant constraints example a-la @RedDocMD's problem but for raw pointers. This is indeed an exceptional case which may cause two tracking efforts to run in parallel. This can be probably generalized to any constraint-like traits: if we want to explain a constraint, we'll have to be prepared that the constraint isn't part of the value track that we already have. This is the first time the above assumption was challenged and I definitely need time to process this but I still think there shouldn't be any other situations like this and there should be something special about this situation that will allow us to incorporate it into the linear-track solution.
>
> "At the moment such refactoring is difficult to achieve" is a perfectly satisfying answer to me with respect to the current state of things. But I want us to agree as early as possible about whether we want to achieve this. Simply so that as we make every step, we keep this in mind and evaluate every step as taking us either closer to this goal or farther from this goal and try to avoid the latter. And I don't think that anything prevents us from discussing this goal right now; the problem we're trying to solve isn't really ever going to change and we already have a fairly large amount of data.

I get it, but I think we got stuck and should put goal first and think about what can be done to achieve this.  Is it bad that tracking exists and the analyzer reverse engineers what it did itself? - No doubt!  Is note mechanism a good alternative? - Sure!  Is it enough right now? - No.  Let's think how we can get there with incremental changes.

> Ok, do I understand correctly that:
>
> - Functionality-wise, this is entirely equivalent to my NoteTag solution with "`isTracked()`";

Not really, you also get the context of tracking: two nodes and options (so you can figure out if it's being tracked thoroughly or as a condition).

> - Boilerplate-wise it makes the `isTracked()` check implicit ("we were invoked at all therefore the value of the expression is tracked by... somebody") at the cost of asking the user to define a class;

I suppose we can put it this way.

> - Convenience-wise, it potentially requires the developer to reverse-engineer the symbolic execution step when the logic gets more complicated;

I don't see ANY difference with the situation where we use `isTracked`.  If you talk about notes, you can access them here as well.  If you don't, then both ways require the same.

> - Flexibility-wise... I guess you can make custom trackers now, which will track the value in a different manner, and communicate with the tracker in order to produce different notes?

Exactly, tracker object is the thing that is guaranteed to stay whenever you go with tracking, and since you can customize it, you can affect every part of the tracking process.

> Yeah, the store handler part seems perfectly legit because *I* don't see any better approaches here :) It's indeed a big problem that we can't customize existing notes and this sounds like the right data structure to solve the problem.

And we do need to have a central customizable object (aka Tracker) that we pass around the tracking process to make it happen, right?

> Ok, I don't understand! If we have anyway tracked the value back to `b.get()`, why not track further through `b`? What makes `b.get()` the exact place at which we want to stop tracking the value? If we wanted to know why we choose the true path at the branch, shouldn't we track further through `b` as well in order to answer that question? If we didn't want to know why we choose the true path, why did we track the value back to `b.get()` in the first place?
>
> I guess one possible difference between `a` and `b` here would be in messaging, i.e. how we refer to that value.

Bingo!

> For example, we could refer to `a` as "the pointer value" (implying that this is the null-dereferenced pointer value) and refer to `b` as "a pointer value" (implying that it's some other value, maybe the one that participates in condition later). But even then, this distinction is much more coarse-grained than "I track the value" vs. "Someone else tracks the value". It's more like "Buggy value" vs. "Condition value" which is a distinction we already "almost" make with `TrackingKind`. Even if we wanted to identify our values as "pointer #1", "pointer #2", etc., we could totally do with a sufficiently advanced definition of `isTracked()`.

Correct!

>> Can we move inline defensive checks out of the BugReportVisitors.cpp with `isTracked()` only?
>
> My original thinking was along the lines of, eg.,:
>
>   void ExprEngine::processBranch(...) {
>     Bldr.addTransition(..., getNoteTag([] (BugReport &BR) {
>       if (BR.isTracked(Condition) && !N->getLocationContext().inTopFrame())
>         BR.markInvalid();
>     }));
>   }
>
> And this is how, basically, the entirety of `trackExpressionValue()` could be decomposed into individual stateless note tags, with all the necessary state stuffed into the bug report. Including the tracking process itself: if note tags are sufficient to help with tracking, they're sufficient to do the tracking from the start.

Maybe, or maybe not.  Maybe it takes more context than we think.  You essentially assume that the only thing that we need to reimplement is one boolean variable `isTracked` and the expression that is being tracked.  I doubt it.
And it is harder to debug polling approach as opposed to the callback approach.

> This code could also be moved into a checker's `checkBranchCondition()` so that to make it truly checker-specific. That would come at the cost of creating a lot of redundant exploded nodes just for the sake of attaching the tag that will probably never be used so this is rather bad, wouldn't recommend.

Why not model defensive checks in the checker, so we don't make undesired assumptions in the first place?

> I still think the solution that relies on well-defined end of tracking is significantly more elegant. Which, again, could be implemented with note tags: we could have a note tag that'd call a callback stuffed directly into the bug report.

I'm not sure I understood the last thing, can you please elaborate?

> So I think it can work both ways given that the tags solution would rely on having a sufficiently flexible mutable state as part of the `BugReport` object.

OK, let me re-iterate to get some things straight.
Refactoring of `trackExpressionValue` is good, but we need to think further (and even further).  You have a dream of getting rid of visitors and tracking altogether so that users don't need to reverse engineer their own checkers (preferably using tags).  As a matter of transformation, you see that tracking should provide two new concepts: `isTracking` predicate (similar to the existing interesting-ness mechanism) and the way to detect where the tracking stopped.  You also don't want tracking to be a gigantic tree and have everything, but to have one source of value without anything else.  All other parts should be decoupled and maybe even given a different name.  We still want to remove all different parts of other checkers from `BugReporterVisitors.cpp` and probably can do it using just one abstraction of `isTracking` to do that.

And here is my take on things (it is a bit chaotic, sorry about that):

- If we want to move to `isTracking`, we still need a central component that participates in the tracking or something that is shared through all tracking logic.  In my patches it is a `Tracker` object.  It can be done differently by passing such object directly into every `trackExpressionValue` and its sub-visitors, but probably something more extensible would be a preference here.
- If we want to split `trackExpressionValue` and extract checker-specific parts, we need a good and reliable way of splitting it.  I used a time-tested approach of function extraction: if we have `foo(A a, B b, C c) { ... }` with N independent parts, we can refactor them into `foo(A a, B b, C c) { foo1(a, b, c); foo2(a, b, c); ... fooN(a, b, c); }`. If one of the `fooK` is big, we can repeat the same procedure. Instead of doing it plainly, I decided that the user might be interested in putting their own `fooK` into the mix.
- Can we implement `isTracked` first and then split `trackExpressionValue`?  My answer is no.  It implies bigger refactoring because `isTracking` doesn't provide all the bits of `A a, B b, C c` that the logic used to receive before.  One example here is the `SuppressInlineDefensiveChecksVisitor`.  When we add it, we give it a node where the tracking of the expression was initiated, not the node with expression itself.  We also have a very specific comment there that this is a requirement and expression own node might be "too late".  If you use `isTracked` in the Visitor you will be automatically in the "too late" situation.  And it is only one example.  So, I insist that `isTracked` couldn't be the initial step for refactoring.
- Can we implement "tracking has stopped mechanism" first and then split `trackExpressionValue`?  My answer is no.  Tracking right now is a tree and it is very convoluted.  There is no one single point where it stops, it stops 100 of times.  The concept of "stopped" in this situation is not useful (I'm still not convinced that it's useful at all, but this is a part of a different conversation).  In order to fix that we need to split `trackExpressionValue`.  Contradiction.
- Can we implement `isTracked` afterwards?  Yes, we can.  Implementation of `isTracked` on it's own is fairly simple.  And we can have users of `isTracked` and `ExpressionHandler` at the same time.  Maybe, once we migrate all uses, `ExpressionHandler` can be safely retired.
- Can we implement "tracking has stopped mechanism".  Probably.  This is something that I doubt in usefulness, and don't really want to plan.
- Is there a way to get rid of tracking?  Honestly, I don't think so.  I do believe, though, that it is possible to stop reverse engineering ourselves.  While the analyzer explores different paths, both the core and checkers somehow affect the state.  We can cal such a change of the state as "event".  Events can be both relevant and irrelevant (e.g. happened to a different smart pointer) to the issue that is found.  Many of these events are pivotal to the user's understanding of the issue and we should explain them with notes.  The way to understand that it is relevant is through tracking.  We might want to keep these traces as we go, but it might be some extra price that we don't want to pay.
- Let's get back to the events!  Can we build our entire solution around them?  I do think so, actually.  I think that the analyzer can produce a bunch of events (maybe implemented as note tags).  Then each bug report can provide a visitor, not exploded graph visitor as it is now, but event visitor.  It will be automatically called for relevant events only (with tracking hidden under the hood).  Many events will have reasonable defaults so you don't need to write your own implementations every time (e.g. "a is assigned with b here"), but you can always override them.  This is something that I started with `StoreHandler` because "store" is one of those events.  It is different though from the perfect vision because we don't want to reverse engineer what the engine/the checker did before, perfectly we would have this information in the node (or on the edge) that "store event happened".   Tracker will be very trivial in this plan.
- Is it possible to be done incrementally?  I do think so.  We can extract different events from `trackExpressionValue` and change them into various `EventHandler`s (like `StoreHandler`).  Probably, in the process, we will understand what kind of data describes each type of events.  Another step is pretty hard, we need to design how we are going to generate (easier) and store (harder) events during analysis.  It should be pretty implicit, so it has a lot of sensible defaults produced by the core.  And checker developers should be able to generate their own events (maybe even for other checkers, if they model something that everyone can benefit from).  I think it will be possible to generate one event type at a time, and change how we handle it in the tracker, i.e. move from reverse engineering to detecting relevant events in the graph.  Once we are finished with this, we can extract a separate thing called `EventVisitor` and migrate a big chunk of `Tracker` and `EventHandler` interfaces there.
- Why do we even need to introduce tracker and event handlers if we are going to hide/delete them in this plan?  The trick is about making this whole plan manageable.  One enormous refactoring is hard to plan and test.  Big work can demotivate and drain if you don't see the end of it.  I believe that we can make great things if we take a lot of baby steps.

Phew, I hope this answers it.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103605/new/

https://reviews.llvm.org/D103605