[cfe-dev] [analyzer][RFC] Design idea: separate modelling from checking

Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org
Sat Dec 16 09:40:53 PST 2017


In general, I like the idea of separating modeling and diagnostics.

On 16 December 2017 at 00:40, Artem Dergachev <noqnoqneo at gmail.com> wrote:

> First things first: right now literally nothing prevents us from re-using
> state traits from multiple checkers or translation units. As in:
>   // Model.cpp
>   REGISTER_MAP_WITH_PROGRAMSTATE(StreamState, SymbolRef, int)
>   namespace stream_state {
>     int get(ProgramStateRef State, SymbolRef Sym) {
>       return State->get<StreamState>(Sym);
>     }
>     ProgramStateRef set(ProgramStateRef State, SymbolRef Sym, int Val) {
>       return State->set<StreamState>(Sym, Val);
>     }
>   }
>   // Model.h
>   namespace stream_state {
>     int get(ProgramStateRef State, SymbolRef Sym);
>     ProgramStateRef set(ProgramStateRef State, SymbolRef Sym, int Val);
>   }
>   // Checker.cpp
>   #include "Model.h"
>   void Checker::checkPreCall(const CallEvent &Call, CheckerContext &C) {
>     SymbolRef Sym = Call.getArgSVal(0).getAsSymbol();
>     int Val = stream_state::get(C.getState(), Sym);
>   }
> This approach is, i guess, even already used, for dynamic type info. I
> guess nobody paid attention to that, so i was constantly running around and
> whining how much we need it, and also taint checker wasn't made this way
> for the same reason, i guess. One more item to refactor into a proper trait
> that way would be region extent, which is essentially a trait over the
> region - and it is probably even a mutable trait (see jump-over-VLA bug).
> The only downside i see with the implementation above is that we no longer
> enjoy our nice object oriented syntax, i.e. get(State, Key) instead of
> State->get(Key). We could probably improve upon that. And, yeah, there's
> one more function call that wouldn't be inlined, to it may be a tiny bit
> slower.
> So if we are mostly concerned about huge overgrown and
> poorly-interoperating checkers (MallocChecker, CStringChecker,
> RetainCountChecker, GenericTaintChecker), we already have everything we
> need, and i'm all for splitting them up.
> ~~~
> Now, for when we want our "models" (checkers that only do modeling) to be
> turned off. I think we never want them to be turned off, unless our driver
> says they aren't applicable (i.e. C++-specific model in plain C, or
> Windows-specific checks under Unix-like OSes), so we can delegate this to
> the Driver(?). The reason why i think they should be never turned off is
> because they, well, *improve modeling*. They are useful for cutting off
> infeasible paths, or doing state splits properly when those are indeed
> required. Even the taint checker - taint is essentially a huge state split
> into all possible values. For instance, we should probably disable inlined
> defensive check suppressions over tainted values.
> Though leaving an emergency interface around for turning off broken models
> would be nice.

I think it is important to have such interface. One might have modeling
that is applicable only to some of the projects and not the others (e.g.:
specific to a library or an STL implementation).

> ~~~
> Now, yeah, we can already do manual checker dependencies, as in
>   registerMyChecker(CheckerManager &Mgr) {
>     Mgr.registerChecker<MyModel>();
>     Mgr.registerChecker<MyChecker>();
>   }
> This also guarantees that all models are loaded before the checkers that
> need them, but not necessarily before other checkers. Not sure if we care.

Does the order of loading the checks determine the order of callbacks? So
does this solve the racing between checks?

> Also in this case disabling the model is hard - all checkers that rely on
> this model would need to be disabled as well.
> As an alternative, we could make checker manager refuse to register stuff
> that's explicitly disabled:
>   registerMyChecker(CheckerManager &Mgr) {
>     if (Mgr.registerChecker<MyModel>())
>       Mgr.registerChecker<MyChecker>();
>   }
> This is unlikely to cause confusion when people try to enable checker but
> it doesn't get enabled because model is off, because
>   1. models are always on, and
>   2. when they're off it is an emergency that the user is already aware of.
> So i guess it'd be better this way.

I think it would be great to have good warning messages for the user. I.e.:
warning: explicitly enabled checker X cannot be turned on due to its
dependency Y is explicitly disabled.

Otherwise it might be surprising that a check does not give any warning
even is it is turned on.

> ~~~
> Regarding races between checkers who model the same stuff - i agree that
> this is a problem, regardless of how we separate models and checkers, no
> idea what to do.
> I think that at least in the case of the Environment we can protect
> ourselves from races by just making its entries unmodifiable. When the
> expression is computed, it gets placed into the Environment. When it leaves
> the scope, it is automatically garbage-collected from the environment. And
> i think we should completely disallow replacing existing values in the
> environment - no matter who tries to do that, be it core or checkers -
> unless we're in for a full-scale renaming of one symbolic value into
> another symbolic value (we aren't). The problem is that we do a lot of that
> even in the core, so i'd really like to have this fixed.
> For the Store this seems harder, because Store is definitely designed to
> be mutable, and modeling writes is quite a normal course of behavior for
> the checkers. So i've no idea how to protect ourselves from races here.
> Because traits are often Store-like, we'd have these problems with them as
> well.
> ~~~
> Now, the really big reason why we wanted to have shared checker states is
> that we wanted to let the core inspect them and perhaps automate some
> routine work in the checkers. For that purpose, each checker giving access
> to their state with the help of custom getters and setters is definitely
> not sufficient. Even if getter/setter interface is standardized, the core
> would never guess how to actually operate on it.
> On that topic i don't have much updates yet.
> It is obvious that we need to restrict checkers in what they can do, so
> that the core could comprehend it. We have this idea around to make traits
> part of the Store, but the Store is too smart and unreliable, and in many
> cases it would be an overkill. The alternative is to make a list of simple
> kinds-of-traits that we support ("abstract models" that know how to model
> but don't know what to model, eg. an abstract model of "some trait over a
> symbol that represents the status of an object identified by that symbol"
> that would be used by MallocChecker and StreamChecker), and provide
> standardized interfaces for the model developers to make use of them for
> their specific models, while teaching the core how to do maintenance on
> them; this would increase the number of levels of abstractions to 4: core
> <-> abstract models <- specific models <- checkers. But the fatter our
> abstract models become, the easier would it be to develop specific checkers.
> Which is kinda nice.
> On 15/12/2017 6:23 AM, Aleksei Sidorin wrote:
>> It seems like it is the analyzer RFC week so I'll contribute a bit. :)
>> TL;DR: I think we may need to split up checkers from execution modeling
>> in order to resolve unclear dependencies between checkers, improve checker
>> development process and resolve some non-trivial issues. There is no any
>> proof-of-concept implementation, just some thoughts I (and our team) wish
>> to summarize and to describe the issue. Feel free to share your opinion or
>> correct me if I made wrong assumptions or wrong conclusions.
>> 0. Pre-history.
>> There were some thoughts on this topic shared personally during meetings.
>> I and Artem Dergachev agreed on the fact that checker capabilities to do
>> everything that make implementation of some common approaches to simplify
>> the checker API much harder; unfortunately, I don't know if any decision
>> about this was made on latest LLVM devmeeting because its results were not
>> shared.
>> =============================================================
>>                       1. Problem
>> =============================================================
>> Now, many checkers perform modelling of events interesting for them (most
>> of these events are function calls). They may bind a return value to the
>> expression, perform Store binding, and split states with generating
>> multiple nodes. This approach has some pros:
>> 1. If a checker is disabled, modelling is not performed. So, we save some
>> computation time.
>> 2. It may look like kind of encapsulation for modelling operations.
>> However, I think this approach has several disadvantages.
>> 1. It introduces dependencies between checkers. If we need to ensure that
>> return value of malloc() is null or not in our checker, we need to turn on
>> MallocChecker and:
>>   1) keep its traits in ProgramState (memory consumption);
>>   2) suffer from its checking operations that consume time without any
>> reason for us;
>>   3) watch its warnings that are not useful for our checker.
>> The story with GenericTaint checker (which is strongly needed for many
>> security checks) is even more painful: in addition to this, we also need to
>> keep separate functions directly in ProgramState to access taint
>> information.
>> The requirement to always enable 'core' checkers was a some kind of
>> warning bell because it exposes the problem directly to users. This
>> requirement may be non-evident for analyzer users that may need even to
>> silence 'core' warnings. There may also be modelers appearing later that
>> are not in 'core' so this mandatory list will only grow. And, finally, the
>> main problem is that switching off 'core' can lead to assertion failures
>> which are definitely not the best way to inform user about need to enable
>> these checkers.
>> Another result of this is having checkers that don't throw warnings at
>> all. Such checkers as core.NoReturn and core.LibraryFunctions are strongly
>> required to use because they greatly improve analysis quality but actually,
>> they don't check anything.
>> 2. Result ExplodedGraph depends on checker execution order. For example,
>> if a checker splits state into two with adding transitions, another checker
>> will be launched twice if it is executed after the splitter but only once
>> if it is executed before. The state splitting itself is a very sharp tool
>> that we may need to hide from novice users as it doubles branch analysis
>> time, but that's another story. The problem is that checker can meet
>> different constraints on the same code depending not only on checkers
>> enabled or disabled but even on their order so even testing of such
>> dependencies becomes much harder. Such interaction may potentially result
>> in lost warnings or false positives.
>> Checkers can also write to same Environment or Store entries erasing each
>> other's data. That kind of interaction is usually easy to debug but is
>> still an issue we may wish to avoid: testing checker sets is much harder
>> than testing a single checker due to lots of combinations.
>> 3. What is more (maybe, even the most) important is that as the result of
>> (2) we may have non-consistent checker warnings if some checkers are
>> enabled or disabled. If analysis of some large function is terminated by
>> max-nodes limit, the reachability of different nodes depends on state
>> splitting. So, a warning may appear if a checker is the only enabled, but
>> it may disappear if there are more checkers that split state --- we may
>> just don't get to the point where a warning occurs. Such inconsistency may
>> become a problem both for developers and for warning tracking systems.
>> 4. This approach forces developers to do a work that, ideally, should be
>> done by analyzer engine, in the checker. So, the results are not shared
>> with other checkers or are shared in some non-trivial way (see GenericTaint
>> example).
I like the idea of not putting everything in the analyzer engine, but we
definitely need a nicer infrastructure for that.

>> =============================================================
>>                       2. Possible approaches
>> =============================================================
>> While dependencies and broken encapsulation are something we can
>> potentially deal with, non-consistency needs some more than just testing
>> and verification. And currently I don't see the approach other that
>> forbidding checkers to perform modeling operations.
>> As I understand, to achieve consistency, checkers should always get same
>> ExplodedGraph independently on checkers enabled or disabled. The only way
>> to do it is to build ExplodedGraph without interactions with checkers. So,
>> we should run modelers independently and then use their results in
>> checkers. Checkers are only permitted to read from Environment/Store and to
>> operate with GDM.
>> There are several options.
>> 1. Extract all modellers and launch them before checker run.
This does not solve the problem of possible interdependent modellers.

>>    The problem here is how to select the modellers we need to run.
>>    1) We can launch them all. So, the graph will be bigger. But it will
>> always be the same, independently on how many checkers are running. So,
>> there is no any inconsistency. Another advantage is that all statements
>> that depend on checker modelling are modelled correctly: for example, the
>> return result of 'malloc()' will be in HeapSpace independently on memory
>> checks enabled/disabled.
>>    2) Launch only modellers that are required for checkers mentioned in
>> command line. This will keep resources but inconsistency will still be a
>> problem here.
>> The problem here is when should we launch modellers. Usually, we have
>> three evaluation phases: PreStmt (checkers), modelling by engine and
>> PostStmt. And it is not evident for me which phase is suitable for this.
>> For example, core.CallAndMessage checker can generate sink nodes on PreStmt
>> and if it is disabled, analyzer engine can crash. Some modellers need to
>> constraint the return value and need to be launched after engine finished
>> the modelling.
>> 2. Silence the checkers that are used as dependencies of other checkers
>> and not mentioned in the run line. So, 'core' is always launched mandatory
>> but is just silenced if the user doesn't want to see warnings. This seems
>> to be the easiest way; however, it doesn't solve the problem with
>> inconsistent warnings. The only way to get consistent ExplodedGraph here is
>> to launch all checkers and then silence disabled ones.
>> 3. Move the modelling into the ExprEngine. So, ExprEngine will generate
>> events that other checkers can subscribe on. This is mostly OK but some
>> problems are also present: they are lack of incapsulation and ExprEngine
>> growth. There is still a problem with plugins: to modify modelling,
>> developer has to re-compile the analyzer or... write a checker.
>> Personally, I prefer option 1.1. Initially, it can even be implemented
>> with existing checker interface. What we should do here is to strictly
>> specify the order in which modellers run in order to avoid problems with
>> unexpected interactions.
>> However, there can be options and problems I did not consider. Does
>> somebody have any opinion on this? It will also be cool if somebody will
>> share the result of latest LLVM devmeeting: as I remember, there should
>> have been a BoF about checkers API.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20171216/4c6dc5df/attachment-0001.html>

More information about the cfe-dev mailing list