[cfe-dev] [RFC] A dataflow analysis framework for Clang AST

Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org
Mon Oct 18 12:03:53 PDT 2021


Hi Dmitri,

Thanks for the answers, I have some further questions inline.

On Mon, 18 Oct 2021 at 10:35, Dmitri Gribenko <gribozavr at gmail.com> wrote:

> Hi Gábor,
>
> On Tue, Oct 12, 2021 at 9:01 PM Gábor Horváth <xazax.hun at gmail.com> wrote:
>
>> Hi Yitzhak,
>>
>> Yay, finally happening! I 100% support these efforts and feel free to add
>> me as a reviewer to the patches, design documents etc.
>>
>
> Thank you so much! We will!
>
>
>> Another example where the forward symbolic execution of CSA falls short
>> is backwards reasoning, e.g.:
>>
>> void f(int *p)
>> {
>>   *p = 5;
>>   if (p) // Should warn: redundant null check or null dereference above.
>>   {
>>     ...
>>   }
>> }
>>
>> CSA will not warn on the code snippet above because it is assumed p is
>> never null at the first dereference (to avoid noise) and it is really hard
>> to revise old assumptions given new evidence. Such problems can be found
>> using dataflow analysis.
>>
>
> Agreed, that is a good example!
>
>
>> What kind of control will the user have over the environment? E.g. could
>> we add checks/extensions that model the effects of certain functions (e.g.
>> std::swap) in the environment?
>>
>
> This is an open question. Already, users can model the effects of
> particular function calls. They would add a case for such calls in their
> transfer function, where they would update the environment accordingly. The
> full API for manipulating the environment is available to the user. But, we
> expect there's more room for supporting this kind of domain-specific
> semantics, either a simpler API or, likely, a richer one.
>

One of the key learnings for the static analyzer was that we want a strict
separation between checks that issue warnings and "checks" that model
behavior. E.g., when a user sees false positives from a null dereference
checker, it is very likely they still want to have the modeling from the
same check that can exclude many infeasible paths. So for the best possible
user experience it should be possible for the users to only disable the
relevant warnings while still getting the benefits of modeling.Also, it can
be useful to disable some modelling without affecting the emitted warnings.
On the flip side, the users can shoot themselves in the foot. E.g., even
though running the Static Analyzer with the core modelling checks off is
not supported, we still get crash reports from such scenarios (that we will
not fix).



>
>
>> Do you plan to provide building blocks to help users build lattices?
>> E.g.,  I really like SPARTA <https://github.com/facebook/SPARTA>. They
>> added some helpers to encode finite abstract domains, see SPARTA/FiniteAbstractDomain.h
>> at main · facebook/SPARTA (github.com)
>> <https://github.com/facebook/SPARTA/blob/main/include/FiniteAbstractDomain.h> for
>> details.
>>
>
> We have a goal of making dataflow analysis to become as accessible as AST
> matchers are, so we do support adding whatever building blocks necessary to
> free the checks of boilerplate and make them easy to write. Whether it
> would be similar to SPARTA or not, I think is yet to be seen.
>
> I am a bit opposed to using the term path-condition, as users might
>> confuse this with similar terms used in path-sensitive analyses. What about
>> flow-condition?
>>
>
> Renaming sounds good, let's paint this bikeshed in code review.
>
> Could you elaborate on the storage model? How do you plan to deal with
>> aliasing? How do you represent arrays (with potentially unknown lengths)?
>>
>
> A limiting factor for the storage model in particular is that we try to
> keep the framework accessible to non-experts. It is a limiting factor for
> all design choices, but it hurts the storage model more than anything else.
>
> At the moment we represent the stack and heap memory at each program point
> as a couple of maps:
>
>   llvm::DenseMap<const ValueDecl*, StorageLocation*> decl_to_loc_;
>   llvm::DenseMap<const Expr*, StorageLocation*> expr_to_loc_;
>   llvm::DenseMap<const StorageLocation*, Value*> loc_to_value_;
>
> ValueDecl pointers are the Clang AST nodes for global variables, function
> parameter variables, and local variables.
>

The store is likely to change very little in each of the transitions. The
static analyzer tries to leverage this fact by using immutable data
structures, so we can share the common parts between states. This can help
reduce memory consumption. And the analyzer favors significantly smaller
memory use to slightly better performance because this lets users
leverage parallelism across TUs better with the same amount of memory. I
was wondering whether you considered using immutable data structures
(like llvm/ADT/ImmutableMap.h) or is this just a first version?

Also, how will this model handle expressions that evaluate to rvalues? Do
you have locations for rvalues?


>
> Expr pointers are the Clang AST nodes for the expressions whose values
> have been already computed and are accessible at this program point.
>
> StorageLocation represents a memory location. A memory location stores a
> Value. A Value knows its type, and one can retrieve its sub-objects as
> Values. Nevertheless, the Value is merely a name of a symbolic variable and
> does not store the approximation of the runtime value computed by the
> analysis so far. Value objects participate in the path condition, and
> that's where they get constrained.
>
> We create the first map when we initialize the analysis (before running
> the dataflow fixpoint algorithm). **We are making an unsound assumption
> that each Value that has a pointer or reference type points to a distinct
> object.**
>

The analyzer is making a very similar assumption and it seems to work
reasonably well. There are certain exceptions. E.g., for overloaded
operator= implementations, it will simulate the method both with this !=
&other and this == &other assumptions. This will double the analysis time
for these methods, but it can help finding errors where users forgot to
handle the latter case. The APIs of the analyzer is flexible enough that
this split can be done in a checker.


> We find this assumption to be useful in practice, however it is not
> essential for the framework. We would like all unsound behavior, no matter
> how useful, to be eventually guarded by flags that each checker can set in
> a way that works best for it.
>

I am perfectly comfortable with unsound behavior by default, there is a
whole movement behind this approach: Soundiness Home Page
<http://soundiness.org/> :)


> Furthermore, we think that the framework could support different levels of
> detail in representing storage, and a checker could similarly request the
> necessary abstraction level.
>
> A limiting factor for representing aliasing is that we would probably have
> to rely on an SMT (not just a SAT) solver to make queries to the path
> condition if we track that two pointers may (but not necessarily must)
> alias.
>
> We compute the other two maps for every program point. Assignments, no-op
> casts, temporary materialization etc. can forward the input Value objects
> to their destination StorageLocations, allowing us to symbolically track
> the data flow.
>
> The environment is of course subject to the join operation. To join
> distinct Value objects, we have a built-in rule for booleans (we build a
> new Value that is constrained to be a disjunction of the values being
> joined), and we allow the user-defined checker to handle joining values of
> domain-specific types. Expanding the built-in rules to cover more types is
> desirable, but would likely create a hard dependency on an SMT (not just a
> SAT) solver.
>

When a pointer point to two different locations in two different branches
do you join them to a new location and add constraints just like you do for
booleans?


>
> As for arrays, we don't model them right now and we don't immediately have
> a good idea of how we could do it in a simple way.
>
>
>> What representation do you use for the path condition? Using the
>> syntactic forms, like `x > 5` might not be sufficient as `x` might have
>> different values at different program points.  E.g.,
>>
>> void f(int x)
>> {
>>   assert(x > 0);
>>   ...
>>   x = -1;
>>   funcWithPreconditionXIsPositive(x); // Wants to execute the SAT query
>> here.
>> }
>>
>
> We incorporate the symbolic value of the expression into the path
> condition. Since we don't support integer comparisons, let's use a boolean:
>

> void f(std::optional<int> opt) {
>   bool is_valid = opt.has_value();
>   if (is_valid) {
>     is_valid = false;
>     use(opt.value());
>   }
> }
>
> When we evaluate the conditional branch, we add to the path condition the
> symbolic value stored in the storage location of is_valid at the program
> point of the branch. Therefore the subsequent overwrite of is_valid does
> not interfere with our ability to check the path condition when we evaluate
> opt.value().
>

This is awesome, the static analyzer works exactly the same way! :)


>
> For those boolean conditions that we don't understand, the expression is
> associated with a fresh boolean value, and that value is incorporated into
> the path condition.  Practically, this means a very conservative
> interpretation of the boolean "x > 5": every appearance in the code is
> handled uniquely, even if x has not changed. So, in the following code:
>
> { ...
>   if (x > 5) p = new ...
>   if (x > 5) f(*p);
> }
>
> we can't at the moment deduce that in `*p` the pointer is never null. But
> we can deduce that here:
>
> { ...
>   if (x > 5) {
>     b = true;
>     p = new ...;
>   }
>   if (b) f(*p);
> }
>
> It is certainly desirable to improve this.
>

Agreed. I suspect the proper solution would involve many non-trivial steps
including escape analysis.


>
>
>> Is inter-procedural analysis out of scope (e.g., using summaries)?
>>
>
> Not at all! It's just well ahead of where we are now. That said, we think
> it's important to consider sooner rather than later, since design decisions
> now could block good interprocedural design later. We are happy to discuss
> proposals, we're just prioritizing getting the initial framework upstreamed.
>
> Dmitri
>
> --
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20211018/f5300a23/attachment-0001.html>


More information about the cfe-dev mailing list