[PATCH] D103605: [analyzer] Introduce a new interface for tracking

Fri Jun 4 15:10:28 PDT 2021

vsavchenko added a comment.

In D103605#2799853 <https://reviews.llvm.org/D103605#2799853>, @NoQ wrote:

> In D103605#2798581 <https://reviews.llvm.org/D103605#2798581>, @vsavchenko wrote:
>
>> In D103605#2798171 <https://reviews.llvm.org/D103605#2798171>, @NoQ wrote:
>>
>>> - Trackers are attached to bug reports. A single bug report may have multiple trackers.
>>>   - Why not have only one tracker per report?
>>
>> That's a good question, we can actually even make it a part of `BugReport` for that matter.  I can though think of one exotic case, where you wait till you encounter certain weird condition, where you need to spawn a very specific customized tracker only starting from that point.
>
> Ok, let's take a simpler example:
>
>   int foo(int D) {
>     int A = 1;
>     int B = 1;
>     int C = B - A;
>     return D / C; // division by zero
>   }
>
> And suppose we want to track `A` and `B` so that to explain why we think that `C` is zero. What are pros and cons of:
> (1) tracking `A` and `B` within the same tracker that already helped us reach `C`?
> (2) terminating the current tracker as it reaches `C` and spawning two separate trackers for `A` and `B`?
>
>>> - The termination condition is as usual: no further visitors attached.
>>>   - It sounds like this patchset is orthogonal to the problem of detecting the tracking termination point(?) It does make it easier to implement customized reactions to such termination though.
>>
>> Actually, tracker dies when this happens.  But hey, can you provide a use case when the checker really needs to know when this happens?
>
> Inlined defensive check suppressions are a major example. Basically null dereference warnings are suppressed based on properties of the program point at which the tracker terminates. If we know when this happens, we can refactor inlined defensive check suppressions into a checker specific handler.

It seems to me that it is a hard concept to wrap your mind around.  And even if I would tell someone how we suppress null pointer inline checks, I wouldn't use "when the tracker stops" as a description of the event.  It is very vague and tightly coupled with the implementation of tracker stopping, and it is not connected semantically to the problem we are trying to solve.  Correct me if I'm wrong, but better description would be track when we constrained the pointer value to null, and if it in the child stack frame, drop it.
So, maybe instead of very hard to understand "when the tracker stops", we can use "when the constraint changes" or "when the constraint became this".  This is more similar to store handlers in their nature, we find a pair of states where some condition stops holding (man, that can have a generic implementation!), describe it, and call the handler.

>>> - However, instead of adding notes directly, they simply invoke `track()`. All note-attaching work is moved into additional default handlers.
>>
>> No, they call `track` where they used to call `trackExpressionValue` or add `FindLastStoreBRVisitor`.
>> Right now, only stores received their special customization and can produce notes based on tracker customization.
>
> What's the ultimate plan though, do we want to fully move all notes into handlers?

There was no such plan, honestly.  I had one problem in mind: it is hard to detect stores yourself, `FindLastStoreBRVisitor` is very tightly coupled with `trackExpressionValue`, and you either put your code in it, or... forget about. Yep, there is no other option!
And it's not unheard of that checkers want to know how the value got into this very region, and it is not unheard of wanting to change messages on those.  This component seems to be generic enough in identifying stores and classifying them, but horrible in the way that it gives you a fixed set of notes: leave it or take it.
I don't instantly see how other visitors from that set have necessities to be customized, but we already found one about changing constraints to be useful.

>>> - Let's consider an example. Suppose you're implementing `evalCall` for `llvm::dyn_cast`. You could have your expression handler detect a modeled `dyn_cast` call and implement its `handle()` method to emit a checker-specific note and start tracking the argument.
>>
>> So, in here it would look like this: If I want to support `dyn_cast` for tracking, I will add extra `ExpressionHandler` that calls `track` on the expression inside of `llvm::dyn_cast`.  Additionally, if you want to change the note's message for situations like `a = dyn_cast<Ty>(b)`, you can add a `StoreHandler` that will produce a note like "a is casted from b".
>>
>>> - With this implementation such note will only be emitted if the value is tracked back to `dyn_cast`. Not every modeled `dyn_cast` will receive a note but only the one that, say, caused a null dereference on its return value when it failed.
>>
>> Correct.
>>
>>> - Now, how does this interact with note tags? The idea behind note tags is so that visitors didn't need to reverse-engineer the analysis.
>>
>> I mean, it's just an orthogonal mechanism.
>>
>>> - It sounds like the handler is a step back to the visitor world: it doesn't know how exactly was `dyn_cast` modeled during analysis, so it has to reverse-engineer what happened.
>>
>> Correct, but during the analysis, checker could've modeled a zillion of different `dyn_casts`, and we don't want to have notes on all of them.
>>
>>> - For example, in order to find out whether `dyn_cast` succeeded or failed, it has to explore the program state to find out which values are passed into the cast and returned from the cast.
>>> - Whereas a note tag would have this information accessible directly.
>>
>> Maybe we can access the note tag from the handler and kill two birds with one stone?
>
> If we can, one way or another, I believe we absolutely should kill two birds with one stone. We shouldn't have two mutually exclusive solutions for two aspects of the same problem. In order to produce good notes, it's necessary to *both* know what happened during analysis *and* know whether the value is tracked.

I don't think that `isTracked` predicate is better than `isInteresting`.  It is yet another way to mark things, and if we go this way, we should fix interesting-ness instead.
I still think that there is a use-case for expression handlers: custom handler -> custom visitor -> track -> custom handler -> custom visitor -> ...
This back-and-forth recursive addition of visitors is impossible right now.  If the checker wants to implement their own `FindLastStoreBRVisitor`, they need to put it into `trackExpressionValue`.

>>> - Hence a suggestion. Maybe instead of having a system of handlers, allow visitors to //ask// the tracker whether `track()` was invoked at this node on a certain expression or region?
>>
>> There are still situations when you don't have tags, but want to hook into the tracking process, so you can add your custom feature.  Let's say you modeled a system call `foo(a, b)` and you want to track `a` and `b`.  Or more generally if you want to help tracking when it's stuck.
>
> Yes, so like in my `a = dyn_cast<...>(b)` example the proposed solution would be like:
>
>   if (isTracked(a)) {
>     emitNote();
>     track(b);
>   }
>
> Such code can be put into either a visitor or a note tag depending on what you have. It is entirely equivalent to the current prevalent idiom
>
>   if (isInteresting(a)) {
>     emitNote();
>     track(b);
>   }
>
> except it's asking the right question, with your facility allowing to answer it. This approach is very easy to understand for checker developers, basically adds no extra code (doesn't require people to inherit classes).

The existing pattern doesn't work, though.  I honestly don't like how this tracking (as a fundamental process) is organized at all.  It doesn't really matter what predicate or callback we introduce, it is still vague.  It should be way more specific, like "I am tracking this value".  Tracking involves a hell-lot of other things that we get "for free", but don't really understand them.
For this reason, "interesting-ness" and "being tracked" have (at least for now) exactly the same property, we get them "for free", but don't really understand.  Not that callbacks don't have this problem, but they seem to be more flexible at least.

>> Additionally, handlers help to tear existing `trackExpressionValue` implementation apart.
>>
>>> - I.e., polling instead of callbacks.
>>> - This would naturally integrate with note tags.
>>
>> My counter-suggestion is to create a flag on `NoteTag` (maybe `Prunable` already can fit into that, I'm not very familiar with this interface) that means "this tag should appear only when part of the tracking trace", and make a handler-visitor pair that will produce notes with messages in those tags.
>
> I think this works too. I like the idea that like note tags are for //execution paths//, we could make "track tags" for //value paths// (or, well, "value tracks"? - we should really use this concept more often, it's very effective; this patchset separates these two notions of a path that we were previously confusing).

We can sketch a plan for that.

> It sounds a bit more restrictive on what kinds of conditions are allowed. For example, if we have `auto [a, b] = foo(c)`, we may decide to track `c` either when both `a` and `b` are tracked, or when only one of them is tracked; I think both approaches could be reasonable.

We need to get these straight before we start changing things.

> In particular, a simple flag (a-la `Prunable`) is not enough. There needs to be a way to communicate *which* value in the state is of interest. The data structure describing such conditions functionally may potentially grow pretty large and this may be a reason to go imperative but I don't think we'll actually run into this problem soon.
>
>> IMO, it will prevent a ton of checkers repeating very similar code: make visitor that will check if some expression is tracked and generate a note out of tag
>
> Yes, absolutely, any solution must eliminate the need for such visitors.

I want to add something (because I might seem defensive).  I do want to make the world peace, but this series of patchiest is addressing a lot of headache me (and Deep) had because of the way `trackExpressionValue` is organized.
If you think about it, all these patches just do one thing - extract piece of code into a separate function, class, or abstraction.
Fundamentally it is all the same.

I think we can discuss future plans, but also try to be realistic about the scope of this refactoring/redesign, since I'm trying to morph existing code into something a little better.

================
Comment at: clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h:127
+    /// into a special block region.
+    BlockCapture
+  };
----------------
NoQ wrote:
> xazax.hun wrote:
> > There are some other places were we might have effects that are very similar to stores, like invalidations during conservative function evaluation. 
> Yes, this would be really interesting.  Like, we probably don't want to add notes saying "we invalidated the Store so here's why we're now assuming stuff that seemingly contradicts our previous assumptions". But we may want to add //suppressions// based on this info, eg. "our control flow dependencies are based on invalidation, maybe let's try to find a different path to the same bug?"
Sure, nothing actually prevents us from extending this enum (and I actually do this, since I forgot call arguments here).
However, at this point this models and tears apart `FindLastStoreBRVisitor`.

The main idea here was that many checkers are benefiting from understanding how certain value got bound with a certain region.  But it is so intertwined with `trackExpressionValue` that you couldn't do anything like this in your own code.  And even simply the code that understands the node where the store happens is hard enough to make it abstract it away.

================
Comment at: clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h:147
+class ExpressionHandler;
+class StoreHandler;
+
----------------
xazax.hun wrote:
> During path sensitive analysis we already have a callback for stores. We kind of replicating this logic for bug paths. 
> So my questions are:
> * Do we expect to have additional information here that were not available during the analysis earlier?
> * Do we want to make this as similar to the forward analysis part as possible for developer familiarity?
Oof, I actually think that this one is trickier.  If the checker modeled some operation (most likely `evalCall`), it can bind a value to the region by itself.  And this handler triggers for such events as well because it simply finds the place where the switch was flipped and the region changed its binding.

================
Comment at: clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h:210
+  ///        much.
+  virtual Result track(KnownSVal V, const MemRegion *R,
+                       TrackingOptions Opts = {},
----------------
NoQ wrote:
> xazax.hun wrote:
> > Not directly related to this patch, but I wonder if we want to have unknown values with identities at some point, so we can track them.
> `UnknownVal` is a stop-gap for situations when we've screwed up so badly we don't even have types anymore. The only thing I'd ever want from them is to disappear :)
> 
> I guess this could be useful for a debug checker that could explain where does an `UnknownVal` come from. In this case unknown values don't need identities; we can track other values without identities, such as null pointers, just fine.
+1 for not caring about `UnknownVal`.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103605/new/

https://reviews.llvm.org/D103605