[PATCH] D103605: [analyzer] Introduce a new interface for tracking

Artem Dergachev via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Sun Jun 6 18:58:40 PDT 2021


NoQ added a comment.

In D103605#2800101 <https://reviews.llvm.org/D103605#2800101>, @vsavchenko wrote:

>>> Actually, tracker dies when this happens.  But hey, can you provide a use case when the checker really needs to know when this happens?
>>
>> Inlined defensive check suppressions are a major example. Basically null dereference warnings are suppressed based on properties of the program point at which the tracker terminates. If we know when this happens, we can refactor inlined defensive check suppressions into a checker specific handler.
>
> It seems to me that it is a hard concept to wrap your mind around.  And even if I would tell someone how we suppress null pointer inline checks, I wouldn't use "when the tracker stops" as a description of the event.  It is very vague and tightly coupled with the implementation of tracker stopping, and it is not connected semantically to the problem we are trying to solve.  Correct me if I'm wrong, but better description would be track when we constrained the pointer value to null, and if it in the child stack frame, drop it.
> So, maybe instead of very hard to understand "when the tracker stops", we can use "when the constraint changes" or "when the constraint became this".  This is more similar to store handlers in their nature, we find a pair of states where some condition stops holding (man, that can have a generic implementation!), describe it, and call the handler.

Inlined defensive check suppressions are indeed pretty ugly; this problem should ideally be solved in a completely different manner.

That said, I strongly disagree that "when the tracker stops" is a hard-to-understand concept. It's an extremely natural and intuitive question to ask:

//"Where does the value come from?"//

It's fairly reasonable to wonder what happened during analysis that caused us to produce this value for the first time, which then traveled down to our bug point, and under what circumstances did it happen. Not much precise reasoning can be based on this information but it's a very good piece of information to drive imprecise reasoning and heuristics, such as (but not limited to) inlined defensive check suppressions (where it bears a relatively precise meaning that corresponds exactly to "state splits in nested stack frames are intrinsically less reliable as they can't ever be derived from presumption of no dead code").

Another example: a value of a top-level parameter variable is much more likely to be arbitrary than a return value of an unknown function call (if the user wanted pre-conditions they could have added an assert), which in turn is much more likely to be arbitrary than a value loaded from a heap region after store invalidation from unknown function call (we're 99% sure that the function doesn't touch this particular heap region so it's probably just the old value). In this sense, I'm pretty tempted to suppress bug paths where two branch conditions (ultimately going into opposite directions) are based on different values of the same heap variable after different number of invalidations. But I'm certainly against doing so for two different top-level parameter values. You may say that in these cases the information is most likely available from the structure of the symbol (`SymbolRegionValue` vs. `SymbolConjured` vs. `SymbolDerived`) but it only confirms my point: origins of values are so important that we even carry them as part of the value's identity! Direct access to the structure of the symbol for determining origins is also used a few times (eg., see call sites of `getOriginRegion()`).

> I want to add something (because I might seem defensive).  I do want to make the world peace, but this series of patchiest is addressing a lot of headache me (and Deep) had because of the way `trackExpressionValue` is organized.
> If you think about it, all these patches just do one thing - extract piece of code into a separate function, class, or abstraction.
> Fundamentally it is all the same.
>
> I think we can discuss future plans, but also try to be realistic about the scope of this refactoring/redesign, since I'm trying to morph existing code into something a little better.

What I'm trying to do here is to see a general direction in which we're moving. We have the benefit of being able to predict things. We don't have to be agile; our problem is well-defined from the start and will never really change. There's a bit of flexibility required to support a variety of different future checkers, yes, but that's about it. And then, again, it's also not entirely arbitrary: we have a pretty good amount of data on what our checkers typically do and it's unlikely to expand drastically in the near future. So we don't always need to go for the most flexible solution, but there's often benefit in going for a simpler solution that simply covers our predictable needs and nothing else.

As a historical example of such process, currently bug visitors perform a single pass over the bug report, from bottom to top. This wasn't always this way: originally visitors traversed the bug report multiple times (occasionally 30+ times), adding new visitors along the way, until the set of visitors attached to the bug report reached a fixed point. The original solution was very flexible; it allowed us to backreference values at the bottom in a flash of hindsight and have all visitors restarted gracefully. But in the entirety of our foreseeable circumstances this wasn't necessary because the ultimate purpose of having visitors was to answer the question "why do we think this situation has indeed occured in the program?" by adding notes to the path, and the answer to that question is //always above// the current point in the path. Such simplification really helped.

This is why i'm trying to understand the entire thing before opting into a system. Especially when it comes to checker APIs for which overengineering is particularly dangerous - and I'm still not sure how exactly the new checker APIs are intended to work after a few pages of discussion. I don't care how they work as much as I care about knowing how they are ultimately intended to work and whether it's simple enough to understand and explain.

May i ask you to sketch the intended solution for @RedDocMD's problem and/or for your RetainCountChecker problem, i.e. how it'll look in the checker code? Or in any imaginary checker of your choice? Like, what's the motivating example? That could help me a lot with understanding your approach.

> The existing pattern doesn't work, though. I honestly don't like how this tracking (as a fundamental process) is organized at all. It doesn't really matter what predicate or callback we introduce, it is still vague. It should be way more specific, like "I am tracking this value". Tracking involves a hell-lot of other things that we get "for free", but don't really understand them.

Interesting, can you give an example (say, in case of @RedDocMD's problem)  where `isTracked()` would give an incorrect result but "I am tracking this value" would give a correct result? I guess such situations will arise more often as we add more tracking, but still.

I think it's a good time to step back and talk about //statefulness// of this whole thing.

Path sensitive checkers are stateless; their state is instead encoded in the `ProgramState` structure. This is essential because execution paths are branchy and nodes are modeled in a fairly arbitrary order.

Bug visitors are stateful. This is fine because every bug path is linear and traversed in order. Value tracks are linear too - whelp, most of the time, I suppose, but we don't mind basically forking our entire stateful machine whenever this is violated.

Note tags, on the other hand, are stateless again by design. This is why they are designed to rely on information in the bug report, such as interestingness, for communication across tags. Interestingness is an older concept that is simply re-used in note tags but from the perspective of note tags interestingness is a very simple thing: it's their //state//.

Information attached to the bug report is arbitrary. We can extend it as much as we want, make a data structure equivalent of the GenericDataMap that would track checker-specific variations over the concept of interestingness. It could decorate the specific role of every value as much as necessary. If we need to specify that this value is the same value that's seen by that other note tag (or participated in the emission of the warning itself in a certain manner), we can totally do that.

Hence the question, do we really need to preserve this whole imperative concept of "I" in "I am tracking this value", or is the functional concept of "This value is playing a certain role" entirely sufficient for our purposes?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103605/new/

https://reviews.llvm.org/D103605



More information about the cfe-commits mailing list