[cfe-dev] [analyzer] Using reaching definitions analysis for bug report construction

Tue Jul 16 15:49:50 PDT 2019

Hi!

In this mail, I'll briefly summarize what I already achieved during my GSoC
and discuss what I plan to do moving forward. Feel free to skip ahead if
you feel up to date!

*What happened so far?*

I proposed to enhance bug reports in the Static Analyzer [1]. I
specifically targeted reports that were incomplete in a sense that the
report didn't contain information to explain why we concluded that the bug
report is a true positive. I've shared the following code snippet numerous
times, as it captures several different problems the analyzer can't deal
with yet:

01 int flag;

02 bool coin();

03

04 void foo() {

05   flag = coin(); // no note, but there should be one

06 }

07

08 int main() {

09   int *x = 0; // x initialized to 0

10   flag = 1;

11   foo(); // no note, but there should be one

12   if (flag) // assumed false

13     x = new int;

14   foo(); // no note, but there should be one

15

16   if (flag) // assumed true

17     *x = 5; // warn

18 }

In my first update mail [2] I explained that the assumption on line 12 is
very important, because we wouldn't find a nullpointer dereference had flag
been true. Similarly, the assumption of line 16 is important. You can see
that following the notes in the bug path, despite setting flag to 1 on line
10, we assume it to to be zero, and to be non-zero again.

The goal is to teach the analyzer that the flag's value on both line 16 and
12 is important, and not to prune the calls to foo, where we should note
that flag's value has been reset. The standard way to achieve this is to
*track* flag in these location (similarly to how x is already tracked).
When a variable is tracked, the analyzer constructs notes that explain why
believe that it's value is what believe it to be at the tracking location.

This implies that analyzer can argue about data dependencies quite well (it
can use the information it gathered during symbolic analysis by inspecting
the *exploded graph *(all the information gathered during symbolic
analysis), specifically the *bugpath* (the linear path from the *error node*
[where the error was emitted] to the *root* [where analysis begun]).

The technique I came up with to help on this was inspired by *static
backward program slicing *[3]. Program slicing is essentially creating a
subset of the program (*program slice*) relevant to a (statement, {set of
variables}) pair (*slicing criterion*).

void f() {

int n; read(n);

int sum = 0; int product = 1; for (int i = 0; i < n; ++i) { sum += i;

product *= i; } write(sum);

write(product);

}

For the slicing criterion (write(product), {product}) the following program
slice would be created:

void f() {

int n; read(n);

int product = 1; for (int i = 0; i < n; ++i) {

product *= i; }

write(product);

}

Program slicing, roughly speaking, works by collecting a set of relevant
variables and a set of relevant statements by tracing data and control
dependencies [3].

Since the analyzer is also in need of this functionality (realizing which
parts of the program is relevant to the bug's occurrence), and since data
dependencies are already understood, I modified LLVM's *Iterated Dominance
Frontier* calculator to be used with clang's CFG, which can be used to
calculate control dependencies, assisting bug report generation.

So far, I used this to realize that the condition on line 16 is important,
and start tracking it. The patch that implemented it works by tracking the
conditions of control dependency blocks of already tracked variables: since
x is tracked on line 17, and the block on line 16 is a control dependency
of it, we track the condition expression of that block (flag).

It was quickly discovered that it's a bad approach to use the same tracking
for both error-causing variables and conditions, as it lead to a bunch of
uninteresting notes being inserted to bug reports -- we mainly want the bug
reports to be untouched, and only expand it with information previously not
present that made it potentially impossible to understand how the bug
report came about (like flag's value changing). Ideas for this shared in a
mail [4], where the consensus was to basically prune every note that didn't
happen in a nested stackframe (the stackframe where flag's value changed
(foo()) is nested in the frame where we started tracking it (main())).

I gathered data on several large, open source C++ projects, and I feel
confident that there are only so many loose ends left to be tied: don't
track asserts conditions, calls to operator bool, and make the extra notes
state clearly that we're talking about "just a condition", rather then a
bug causing variable.

In my first mail [2] I also detailed that realizing that line 16 is
important is a vastly different case then the one on line 12: the bugpath
doesn't contain line 13 (since we assumed flag to be false on line 12), so
using BugReporterVisitors wouldn't be sufficient: we called there cases
*must-have-happened*, where the information we need to realize that a
control dependency is important is contained in the bugpath, and
*should-not-have-happened* when it isn't.

In followup mail to that thread [5] explained that the
should-not-have-happened case can be further divided into *inlined category*
(initially referred to as category 1), where the information we need isn't
in the bugpath, but is in function that the analyzer visited, allowing us
to resolve parameters and have access to a variety of tools, and the *not
inlined category* (initially referred to as category 2) where the
infromation lays in a non-inlined function.

Inlined category (bar is inlined, but we don't explore line 10):

01 int flag;

02 bool coin();

03

04 void foo() {

05   flag = coin(); // no note, but there should be one

06 }

07

08 void bar(int &i) {

09  if (flag) // assumed false

10    i = new int;

11 }

12

13 int main() {

14   int *x = 0; // x initialized to 0

15   flag = 1;

16   foo(); // no note, but there should be one

17   bar(x);

18   foo(); // no note, but there should be one

19

20   if (flag) // assumed true

21     *x = 5; // warn

22 }

Not inlined category (bar isn't inlined at all):

01 int flag;

02 bool coin();

03

04 void foo() {

05   flag = coin(); // no note, but there should be one

06 }

07

08 void bar(int &i) {

09  i = new int;

10 }

11

12 int main() {

13   int *x = 0; // x initialized to 0

14   flag = 1;

15   foo(); // no note, but there should be one

16 if (flag) // assumed false

17    bar(x);

18   foo(); // no note, but there should be one

19

20   if (flag) // assumed true

21     *x = 5; // warn

22 }

The not inlined category seems to be far more difficult, like, not even
solvable probably with the amount of time I have left, so I only plan to
aim for the inlined category cases within the should-not-have-happened
problem. I shared a sketch algorithm in a followup mail [6], but I it was
brought to my attention that using a reaching definition algorithm might be
a superior approach.

There were a variety of other things I did under the name of the project,
but they aren't as exciting to be put in a summary ;)

*Using reaching definitions analysis to solve the should-not-have-happened
inline category cases*

I am no expert on this topic, please feel free to correct me if I write
something dumb.

The reaching definitions analysis was originally made for instruction, not
C++ code: "a reaching definition for a given instruction is an earlier
instruction whose target variable can reach (be assigned to) the given one
without an intervening assignment." [7].

When we say *reaching definition**s *we refer to the analysis: the
REACHin[S] set for basic block S is the set of definitions reaching S, and
REACHout[S] are all reaching definitions of its predecessors minus those
reaching definitions whose variable is killed by S (contains an intervening
assignment) plus any new definitions generated within S.

Now, when we're talking about *variables,* the shortcomings of this
algorithm when it comes to C++ are obvious: aliasing. I guess we can always
just try to be conservative in one way or another: no longer look for
reaching definitions after a call to a function where the variable was
passed as a non-const reference, or a write to reference of the same type.

There are plenty of big unknown here -- I'm inexperienced with implementing
such algorithms, and I fear some esoteric AST elements in Clang may need to
be a source of a lot of headache.

Here's an idea to how we could use reaching definitions analysis to
discover important statements (like the one on line 13 in the first example
of this mail), and track it's control dependencies:

// Traverses the bugpath to discover all reaching definitions, even

// in nested stackframes. Returns true if any reaching definition is

// found.

function conditionTracker(N : ExplodedNode, X : variable,

                          Report : BugReport, OriginB : OriginB)

  SetOfReachingDefinitions = reaching definitions to X at N

  while N is not null do

    // Look for the function call to F

    N := N->getFirstPred()

    if not N->getLocationAs<CallExit>() then

      continue

    fi

    B := CFGBlock associated with N

    if there is no path from B to OriginB then

      continue

    fi

    CallEnterN := N’s matching CallEnter

    if CallEnterN is null then

      continue // and assert that we’re in a top frame

    fi

    // CallEnterN’s pred should be in the same CFG as OriginB.

    CalleeE := CFGElement associated with CallEnterN->getFirstPred()

    ParamSet := set of F’s parameters at CallEnterN to which X

is passed to as a non-const reference

    for all NewX variables in ParamSet do

      if conditionTracker(N, NewX, Report, F’s exit block) then

        SetOfReachingDefinitions += CalleeE

      fi

    od

    // X is global/static, or a field and F is a method, etc...

    if F would have access to X regardless then

      if conditionTracker(N, X, Report, F’s exit block) then

        SetOfReachingDefinitions += CalleeE

      fi

    fi

  od

  for all RD CFGElementRefs in SetOfReachingDefinitions do

    track control dependencies of RD

  od

  return not whether SetOfReachingDefinitions is empty

*Should we just traverse the ExplodedGraph?*

In a mail sent by Artem [8], another possible application of the reaching
definition algorithm was discussed. However, during our meeting, it was
mentioned that maybe for that specific use case, traversing the
ExplodedGraph might be a better approach. Should I approach the
should-not-have-happened case with this technique as well, maybe it could
be reused.

I don't fancy this. First off, generalizing in the future to also cover the
not inlined category is I think more difficult with this approach. Second,
algorithms like reaching definitions are set in stone, and coming up with a
reliable, correct way to handle the ExplodedGraph may be too difficult and
time consuming for me at this point.

Any feedback is appreciated!

Cheers,

Kristóf

[1]
https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit

[2] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062535.html

[3] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062613.html

[4] Tip, Frank. A survey of program slicing techniques. Centrum voor
Wiskunde en Informatica, 1994.

[5] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062540.html

[6] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062566.html

[7] https://en.wikipedia.org/wiki/Reaching_definition

[8] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062727.html
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190717/04bf10ed/attachment.html>