[cfe-dev] [RFC] A dataflow analysis framework for Clang AST

Mon Oct 18 10:35:02 PDT 2021

Hi Gábor,

On Tue, Oct 12, 2021 at 9:01 PM Gábor Horváth <xazax.hun at gmail.com> wrote:

> Hi Yitzhak,
>
> Yay, finally happening! I 100% support these efforts and feel free to add
> me as a reviewer to the patches, design documents etc.
>

Thank you so much! We will!

> Another example where the forward symbolic execution of CSA falls short is
> backwards reasoning, e.g.:
>
> void f(int *p)
> {
>   *p = 5;
>   if (p) // Should warn: redundant null check or null dereference above.
>   {
>     ...
>   }
> }
>
> CSA will not warn on the code snippet above because it is assumed p is
> never null at the first dereference (to avoid noise) and it is really hard
> to revise old assumptions given new evidence. Such problems can be found
> using dataflow analysis.
>

Agreed, that is a good example!

> What kind of control will the user have over the environment? E.g. could
> we add checks/extensions that model the effects of certain functions (e.g.
> std::swap) in the environment?
>

This is an open question. Already, users can model the effects of
particular function calls. They would add a case for such calls in their
transfer function, where they would update the environment accordingly. The
full API for manipulating the environment is available to the user. But, we
expect there's more room for supporting this kind of domain-specific
semantics, either a simpler API or, likely, a richer one.

> Do you plan to provide building blocks to help users build lattices?
> E.g.,  I really like SPARTA <https://github.com/facebook/SPARTA>. They
> added some helpers to encode finite abstract domains, see SPARTA/FiniteAbstractDomain.h
> at main · facebook/SPARTA (github.com)
> <https://github.com/facebook/SPARTA/blob/main/include/FiniteAbstractDomain.h> for
> details.
>

We have a goal of making dataflow analysis to become as accessible as AST
matchers are, so we do support adding whatever building blocks necessary to
free the checks of boilerplate and make them easy to write. Whether it
would be similar to SPARTA or not, I think is yet to be seen.

I am a bit opposed to using the term path-condition, as users might confuse
> this with similar terms used in path-sensitive analyses. What about
> flow-condition?
>

Renaming sounds good, let's paint this bikeshed in code review.

Could you elaborate on the storage model? How do you plan to deal with
> aliasing? How do you represent arrays (with potentially unknown lengths)?
>

A limiting factor for the storage model in particular is that we try to
keep the framework accessible to non-experts. It is a limiting factor for
all design choices, but it hurts the storage model more than anything else.

At the moment we represent the stack and heap memory at each program point
as a couple of maps:

  llvm::DenseMap<const ValueDecl*, StorageLocation*> decl_to_loc_;
  llvm::DenseMap<const Expr*, StorageLocation*> expr_to_loc_;
  llvm::DenseMap<const StorageLocation*, Value*> loc_to_value_;

ValueDecl pointers are the Clang AST nodes for global variables, function
parameter variables, and local variables.

Expr pointers are the Clang AST nodes for the expressions whose values have
been already computed and are accessible at this program point.

StorageLocation represents a memory location. A memory location stores a
Value. A Value knows its type, and one can retrieve its sub-objects as
Values. Nevertheless, the Value is merely a name of a symbolic variable and
does not store the approximation of the runtime value computed by the
analysis so far. Value objects participate in the path condition, and
that's where they get constrained.

We create the first map when we initialize the analysis (before running the
dataflow fixpoint algorithm). **We are making an unsound assumption that
each Value that has a pointer or reference type points to a distinct
object.** We find this assumption to be useful in practice, however it is
not essential for the framework. We would like all unsound behavior, no
matter how useful, to be eventually guarded by flags that each checker can
set in a way that works best for it. Furthermore, we think that the
framework could support different levels of detail in representing storage,
and a checker could similarly request the necessary abstraction level.

A limiting factor for representing aliasing is that we would probably have
to rely on an SMT (not just a SAT) solver to make queries to the path
condition if we track that two pointers may (but not necessarily must)
alias.

We compute the other two maps for every program point. Assignments, no-op
casts, temporary materialization etc. can forward the input Value objects
to their destination StorageLocations, allowing us to symbolically track
the data flow.

The environment is of course subject to the join operation. To join
distinct Value objects, we have a built-in rule for booleans (we build a
new Value that is constrained to be a disjunction of the values being
joined), and we allow the user-defined checker to handle joining values of
domain-specific types. Expanding the built-in rules to cover more types is
desirable, but would likely create a hard dependency on an SMT (not just a
SAT) solver.

As for arrays, we don't model them right now and we don't immediately have
a good idea of how we could do it in a simple way.

> What representation do you use for the path condition? Using the syntactic
> forms, like `x > 5` might not be sufficient as `x` might have different
> values at different program points.  E.g.,
>
> void f(int x)
> {
>   assert(x > 0);
>   ...
>   x = -1;
>   funcWithPreconditionXIsPositive(x); // Wants to execute the SAT query
> here.
> }
>

We incorporate the symbolic value of the expression into the path
condition. Since we don't support integer comparisons, let's use a boolean:

void f(std::optional<int> opt) {
  bool is_valid = opt.has_value();
  if (is_valid) {
    is_valid = false;
    use(opt.value());
  }
}

When we evaluate the conditional branch, we add to the path condition the
symbolic value stored in the storage location of is_valid at the program
point of the branch. Therefore the subsequent overwrite of is_valid does
not interfere with our ability to check the path condition when we evaluate
opt.value().

For those boolean conditions that we don't understand, the expression is
associated with a fresh boolean value, and that value is incorporated into
the path condition.  Practically, this means a very conservative
interpretation of the boolean "x > 5": every appearance in the code is
handled uniquely, even if x has not changed. So, in the following code:

{ ...
  if (x > 5) p = new ...
  if (x > 5) f(*p);
}

we can't at the moment deduce that in `*p` the pointer is never null. But
we can deduce that here:

{ ...
  if (x > 5) {
    b = true;
    p = new ...;
  }
  if (b) f(*p);
}

It is certainly desirable to improve this.

> Is inter-procedural analysis out of scope (e.g., using summaries)?
>

Not at all! It's just well ahead of where we are now. That said, we think
it's important to consider sooner rather than later, since design decisions
now could block good interprocedural design later. We are happy to discuss
proposals, we're just prioritizing getting the initial framework upstreamed.

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20211018/0a13788e/attachment-0001.html>