[cfe-dev] [analyzer] An idea regarding our data map.

Tue May 31 06:53:38 PDT 2016

Hi!

On 26 May 2016 at 14:56, Artem Dergachev via cfe-dev <cfe-dev at lists.llvm.org
> wrote:

> I've got an idea. The idea comes as a replacement for fixing current
> issues with the GDM (sharing traits across checkers, automating the routine
> work on checker side). The idea is: GDM is not the right place to model
> most of the checker-controlled program state traits, instead we should
> encourage checkers to use the Store, which already has most of the
> necessary features.
>

Sharing traits and automating clean-up and other tedious tasks would be
awesome. However isn't a dependency graph of checkers (to restrict the
order of execution) is a prerequirement for sharing traits?

>
> Specifically, whenever the checker tries to model a certain data
> structure, which exists (or could have existed) somewhere in the program
> (be it user or kernel memory, or CPU registers, or shadow memory of some
> sanitizer-like tool which could have been run alongside the program), then
> the checker should model this data structure directly in the Store(?)
>

How would you implement this? Wouldn't this kind of approach make the store
heavier? I think the common case should be that, no extra data is
associated with the live "objects". Making the common case more heavy
weight might imply a performance problem.

Also note that, checkers are not always modelling a data structure. It
might model a property of a path e.g.: was something locked on this path?
Of course, you could still attach this kind of information to a mutex or
something like that, but if you are only curious whether a certain method
was invoked in a critical session, it is easier to only track the lock
deepness for each path, this way you only have one integer to deal with in
the state instead of several on/or switches for each mutexes along the
paths.

>
> This approach may be thought of as bringing the idea of "ghost code" from
> model checking into our analyzer. We're supposed construct ghost structures
> during analysis, which mimic real or imaginary things that happen during
> program execution.
>
> With the current region hierarchy, it's a bit hard to follow this
> approach, because memory regions are hard to construct. So my proposal is
> to add a few more region classes, which would allow checkers to cheaply,
> probably even transparently, construct the necessary structures in the
> Store.
>
> A few reasons why i ended up liking this idea:
>
> 1. It's natural. After all, the program doesn't have anything like GDM
> when it executes dynamically - it only has memory, which means that the
> Store, being our memory model, should easily take care of modeling anything
> that is going on in the program, and useful features of the Store are still
> useful for such modeling, so why don't we use them. In other words, the
> most natural way to explain how exactly should the smart data map automate
> routine tasks would be to say that it should behave exactly like the Store.
> So why don't we just use the Store.
>

They have a lot in common indeed, it is a good observation! However I think
there are some subtle details. The lifetime of the information that is
attached to a data structure might not be the same as the lifetime of the
data structure itself. E.g.: one might use the return value of strlen after
the argument of strlen is already garbage collected. With some checker
participation though this might be worked around (in fact there are
workarounds for this same problem with GDM as fas as I remember).

>
> 2. There's no problem with sharing the trait. Any checker can see what's
> going on with a plain getSVal(). Of course, we can hide the key for
> getSVal() in anonymous namespace and probably provide getters/setters
> instead, which wouldn't stop the Store from iterating over bindings.
>

In order to truly be able to participate, guarantees for execution order
should be introduced, se one of my previous comments.

>
> 3. There's no problem with routine work and code duplication in checkers,
> which makes writing checkers a lot easier:
>   a. Store transparently takes care of cleaning up live/dead symbols,
> there's no more problems with different worklists for store and checkers.
>

This would be awesome!

>   b. The store takes care of invalidation, eg. checkRegionChanges could
> probably be automated with this approach, by simply choosing a good parent
> region for our checker data structures.
>

The invalidation rules might be different for the associated information
than for the object. E.g.: one might assume that a private property is only
modified by friends (even tough it is possible in C++ to modify private
members, it is very rare). Is it possible to also have custom invalidation
schemes for these traits? Or the checkers would fall back to GDM in those
cases? (You think of this as a GDM alternative or replacement?)

>   c. Also, Store provides us with SymbolRegionValue-class and
> SymbolDerived-class symbols for our structures when values in them are
> unknown; this way we can reason if it's the same unknown or a
> probably-different unknown.
>

This is a very interesting property! Tracking this would be automatic? Do
you know how much extra overhead would that be?

>   d. The store can be easily dump()'ed, we no longer need hooks for
> printing the program state on checker side.
>

I think still need a way to stringify the metadata we attached to the
"objects".

>   e. In the future, we can probably add alpha-renaming, and then problems
> and solutions would most likely be exactly the same for the Store itself
> and for checker data structures.
>

Good point!

>
> 4. This approach is very easy to implement and to explain. Most of the
> time, we just use the standard getSVal()/bindLoc() API for everything,
> probably some overrides.
>

If you think of this as an alternative to GDM, a guide explaining when to
choose the GDM would be awesome.

>
> A few arguments against this approach:
>
> 1. Some program state traits do not try to model data structures that
> already exist in the program. The best example is probably range
> constraints. Constraints are immaterial for dynamic execution - they're a
> meta-information about chosen execution path. For such traits, i do not yet
> present a solution, though i have a couple of thoughts - if anybody's
> curious, i can dump. In particular, i feel as if constraint-like traits can
> be easily supported with a separate generic approach, and it'd cover all of
> our problems.
>

What kind of ranges are we talking about? Aren't we only store ranges for
symbols using the constraint manager?

>
> 2. GDM is done in terms of sets and maps, however Store doesn't have sets
> or maps, it only works with C-like memory with at best 1-dimensional
> arrays. Of course we can model maps of ints to ints as ghost arrays with
> symbolic bindings to elements, and it's a safe and sane implementation. For
> a map from int to set of ints, it gets harder. But i think it's a
> non-issue: it's still more important to figure out how do we model more
> complicated containers (i still dream of container symbols and container
> constraints, which would reduce this issue to issue '1.'), and then we
> could re-visit this issue.
>

We should solve this problem without too much execution overhead on the
common case (no information stored by a checker).

>
> I've made proof-of-concept patchs to SimpleStreamChecker (two different
> approaches) and PthreadLockChecker. The patch introduces four kinds of
> ghost regions - each "smart trait" owns its own memory space which doesn't
> get invalidated (for things that don't need to be invalidated, or for
> manual control), and allows creating one variable inside that (or any
> other) memory space, one field inside any region, or one symbol-based
> region for any symbol (unlike SymbolicRegion, the ghost symbol-based region
> doesn't need to be based on a pointer-type symbol, any symbol will do fine).
>
> The patches are on my github because i'd re-arrange them if i was to put
> them on phabricator, and not sure what to do with multiple checker
> variants. So i'm asking to have a look at the checkers and see if you want
> to write checkers *that* way, instead of the traditional GDM, then i'd
> focus on fixing smaller issues, or i'd have to come up with something
> different.
>
> SimpleStreamCheckerV2 defines symbol-based ghost regions for file
> descriptor symbols and stores constant values (0 or 1) inside them,
> indicating opened and closed streams. It doesn't need to store the
> "unknown" state - it is represented by non-constant values, eg.
> SymbolRegionValue of the ghost region. Link: https://githubnValue of the
> ghost r.com/haoNoQ/clang/commit/578e79009bbac81c129daff19e14dbad16e2e43d
> <https://github.com/haoNoQ/clang/commit/578e79009bbac81c129daff19e14dbad16e2e43d>

As a patch against an existing checker this would be easier to evaluate
what the differences are. But at first glance I did like the UX of checker
development this provides except for registering the trait :)

>
>
> SimpleStreamCheckerV3 defines a ghost field inside each symbolic region
> corresponding to the each stream's pointer symbol, and stores the same
> constants inside these fields. It has troubles suppressing invalidation of
> these fields, which, in my opinion, should ideally be solved through
> evalcall/bodyfarm of library functions. Both versions do not suffer from
> the "zombie symbols" issue. Link:
> https://github.com/haoNoQ/clang/commit/bb63471fa6d70502e61728052b6b7a649c4d80d1
>
> PthreadLockCheckerV2 defines a ghost field inside each mutex, indicating
> one of the four states (Initialized, Locked, Unlocked, Destroyed), which
> allows it to find double locks/unlocks etc. Additionally, it models the
> stack of locks in order to find lock order reversals. This is modeled
> through a single ghost variable of type `void**', which represents the
> stack pointer - the checker dereferences this pointer to access or modify
> the stack and increments/decrements it. Link:
> https://github.com/haoNoQ/clang/commit/ab581e4da9c3ef2ddfc0fe75a20ffa9ce789fe8d
>
> None of these checker's code actually deals with ghost regions through
> MemRegionManager - they always use wrappers in the ProgramState class,
> which in my opinion are easy and intuitive, though probably more such
> wrappers need to be introduced.
>
> All the newly introduced terminology is worth discussing, i don't really
> like it.
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160531/8a945d29/attachment.html>