[cfe-dev] A survey of dataflow analyses in Clang

Tue Oct 6 11:40:04 PDT 2020

Hi!

Thanks for this survey! I am just adding some random notes inline that
might or might not be useful. Some of them include questions and musings.

On Sun, 4 Oct 2020 at 21:43, Kristóf Umann <dkszelethus at gmail.com> wrote:

>
>
>
> ---  Reads/writes ---
>
> A common theme among data flow algorithms is looking for gens
> (generations, or in other words "reads" or "loads") or kills (in other
> words "writes" or "stores") of either variables or values.
>

One of the advantages of GEN/KILL based analyses is that they are
distributive. This is a class of analyses that are well studied and there
is a rich literature on how to efficiently evaluate them (like IFDS).
Also, the sets are often represented using bit-vectors where each position
corresponds to a variable/register. Rust also uses this approach for
liveness and some other analyses.

>
> In literature, most data flow algorithms are defined and showcased on
> either some simple pseudo code, or an instruction set in a three address
> form:
>
> x <- y + z
>

Clang's AST and CFG is a very heavy weight for most dataflow analyses that
deal with low-level concepts like liveness, or definition/use. I'd love to
see a lightweight overlay on top of the Clang AST that can be used to
categorize all nodes into some basic operations. For example, there is a
large class of dataflow analyses that only care about function calls,
reads, and writes. For those analyses, it is just noise to distinguish an
overloaded operator call from a constructor call and so on. Currently,
every dataflow analysis is doing an ad-hoc abstraction over the AST/CFG.
The question is, is it viable to come up with something that fits most
analyses, so people don't have to reinvent those abstractions?

Would MLIR make it easier to create such an overlay?

>
> In clang however, where transfer functions are implemented with statement
> visitors, the fact that we need substantial contextual information for any
> given statement is sometimes a problem.
>

Is this need for context coming from the use of linearized CFGs? (I.e.
every element of a basic block is a subexpression, not a whole one).

>
>
> --- Classic data flow issues: Aliasing, function calls, unfeasible
> execution paths ---
>
> Aliasing is an often discussed, unavoidable obstacle for C/C++, and
> intraprocedural of most dataflow algorithms is also problematic.
>
>
LLVM IR passes can query alias analyses. As dataflow algorithms are ad-hoc
in Clang, we do not have an infrastructure for these analyses to depend on
each other. In an ideal world, we could create an alias analysis that
always returns MayAlias and write other dataflow analyses against its
interface and once that analysis becomes smarter, all the other dataflow
analyses start to improve in precision. While I do agree, we cannot map
back analysis results from LLVM
IR to the source code in general (especially, when we want to generate
meaningful diagnostics), but I wonder whether it is actually possible for
some restricted cases like alias analysis.

>
>
> --- Records, fields, methods ---
>
>
There are some well-known abstractions for field sensitive analyses. Those
include access paths and store-based models. Due to the heap and loops,
these models are potentially unbounded and there are several methods to
summarize those cases including k-limiting. Here is a survey of some
methods: https://arxiv.org/abs/1403.4910

>
> ---=== 1. LivenessAnalysis ===---
>
>
C++ has many peculiarities with liveness analysis. For example, if we have
a local variable with a dtor, it is live for its whole lexical scope as the
dtor might read its fields. So it looks like we have both lexical and
non-lexical liveness, and this can be very confusing.

>
> If a DeclRefExpr pops up in any other context, it is seen as a read, even
> if its passed variables to function call as a non-const reference, etc.
>

This is a serious problem in the C++ language. Parameter passing describes
the HOW (e.g. by reference) instead of the WHAT (e.g. to read).  So seeing
a non-const reference we can never be sure what the intention of the
developer was. Is this a read and the user forgot the const? Is it a read
followed by a write? Is this a write-only?

> Also, the algorithm doesn't respect aliasing, and doesn't regard the
> fields of a record as a distinct, writable memory regions.
>

Depending on the analysis client's need, this might be OK. E.g. do we want
under- or overapproximation? What is the consequence of imprecision at the
client? Making an analysis more precise only makes sense if the client can
benefit more from the additional precision compared to what we lose to
achieve this precision in the first place.

>
> ---=== 2. UninitializedVariables ===---
>
> --- Third pass ---
>
> Though technically this is an implementation detail, a third pass is used
> to emit diagnostics.
>

I believe that for every monotone analysis we can emit the warnings during
the analysis, so we can spare the last pass (at the cost of making sure to
not emit duplicated warnings). I am not sure, however, which method would
be more performant or cleaner in this particular case.

>
> ---=== 3. Thread Safety analysis ===---
>
> --- Thread Safety's own intermediate language ---
>
> Thread Safety Analysis lowers Clang expressions, implementing its very own
> IR, a Typed Intermediate Language, or TIL.
>

The existence of TIL just amplifies what I wrote earlier, the AST/CFG is
too verbose for dataflow analysis. A lot of the information available there
is just noise and having some abstractions over them could make things
significantly easier. Having a more principled way to organize those
abstractions would be great. Currently every analysis has its own ad-hoc
methods.

>
>
> ---=== 5. Lifetime analysis ===---
>
> Asserts pose a serious problem for lifetime analysis in particular.
> Suppose you have the following assert:
>
>
I imagine that this assert problem could be applicable for many other
analyses. I'd love to see a CFG option to specifically build a CFG that
does not expose this problem to the clients. Unfortunatly, CFG already has
a really large number of options, to the point that I have no confidence
that every reasonable combinations of the configurations are tested.
Getting rid of the unused config options (probably there are quite a few)
would be a huge contribution. But unfortunately, we could never know about
potential downstream users.

>
> Lifetime analysis was conceptualized with strong C++ support from the
> get-go.
>

Lifetime analysis also suffers from the design problems of C++ language I
mentioned earlier. One of the bigger ones, parameter passing describes the
HOW rather thant he intention.

>
> The domain of this analysis is entirely different than liveness or
> uninitialized variables, so what was previously mentioned as GEN/KILL sets
> don't make sense here.
>

Right, bit-vector analyses are distributive. Lifetime analysis is not.

Cheers,
Gabor
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20201006/c92b069f/attachment-0001.html>