[cfe-dev] [analyzer] Using reaching definitions analysis for bug report construction

Wed Jul 24 06:27:52 PDT 2019

After talking to Gábor and some other folks I've come to the conclusion
that indeed, a reaching definitions algorithm is both more reasonable
finish in the timeframe I have left, and more reliable.

I chose the implement the following algorithm I found on wikipedia [1]. I
also read the related part of "Compilers: Principles, techniques, and tools
second edition" [2], which gave me great insights on special cases related
to C/C++. Here are a couple things that I tweaked on, and things I have
left to do:

   - Clang's CFG doesn't contain the parameter declarations, so I manually
   inspected the FunctionDecl, and added each parameter to the entry blocks
   GEN set. Now, traditionally, the entry block is skipped altogether, but I
   doubt this matters much.
   - In general, I wouldn't like to reason about reaching definitions to
   pointees.
   - Function calls are hard to reason about other than inspecting the
   CallExpr's arguments, so let's use invalidation (or, in other words,
   regarded them as definitions):
      - Global and static variables are invalidated on each function call.
      (I've manage to implement this one)
      - If a variable is passed as a non-const reference/pointer to a
      function, invalidate it. If that variable is of a record type, invalidate
      it if it's passed as value, and the copy constructor is user provided.
      - Presume the function doesn't use offsetof to modify other parts of
      the a record object, when only a field is passed to it as a parameter.
      - If a record object is invalidated, invalidate all of its fields.
      - Invalidate the the record object on a non-static, non-const method
      call.
   - I haven't touched the aliasing problem yet either (in all cases, I
   mean that a T * typed object isn't CXXThis):
      - Presume that the code follows strict aliasing rules.
      - If the pointee of a pointer/reference of type T * is written or
      invalidated, invalidate all T typed (regardless of signedness) variables.
      - If a record object that contains a T * or void * field is
      invalidated, all T typed variables should be invalidated.
      - When it comes to polymorphism, a write to a base pointer typed
      variable should invalidate all derived typed variables.
      - Something I'm quite unsure about: char * can alias everything,
      should it invalidate everything? Is this too conservative? Is it possible
      that we don't even write char * that often so such an
invalidation wouldn't
      be a big deal anyways?

I'm not yet sure how I'll end up handling record objects in general, but
I'm reasonably comfortable saying that I'll be able to finish at least a
good subset of this before wrapping the project up. I got a WIP patch going
on phabricator [3]. After that, I don't see big unknowns using the gathered
data in a visitor.

[1] https://en.wikipedia.org/wiki/Reaching_definition
[2] Aho, A. V., Lam, M. S., Sethi, R., & Ullman, J. D. (2007). Compilers:
Principles, techniques, and tools second edition.
[3] https://reviews.llvm.org/D64991

On Wed, 17 Jul 2019 at 00:49, Kristóf Umann <dkszelethus at gmail.com> wrote:

> Hi!
>
> In this mail, I'll briefly summarize what I already achieved during my
> GSoC and discuss what I plan to do moving forward. Feel free to skip ahead
> if you feel up to date!
>
> *What happened so far?*
>
> I proposed to enhance bug reports in the Static Analyzer [1]. I
> specifically targeted reports that were incomplete in a sense that the
> report didn't contain information to explain why we concluded that the bug
> report is a true positive. I've shared the following code snippet numerous
> times, as it captures several different problems the analyzer can't deal
> with yet:
>
> 01 int flag;
>
> 02 bool coin();
>
> 03
>
> 04 void foo() {
>
> 05   flag = coin(); // no note, but there should be one
>
> 06 }
>
> 07
>
> 08 int main() {
>
> 09   int *x = 0; // x initialized to 0
>
> 10   flag = 1;
>
> 11   foo(); // no note, but there should be one
>
> 12   if (flag) // assumed false
>
> 13     x = new int;
>
> 14   foo(); // no note, but there should be one
>
> 15
>
> 16   if (flag) // assumed true
>
> 17     *x = 5; // warn
>
> 18 }
>
>
> In my first update mail [2] I explained that the assumption on line 12 is
> very important, because we wouldn't find a nullpointer dereference had flag
> been true. Similarly, the assumption of line 16 is important. You can see
> that following the notes in the bug path, despite setting flag to 1 on line
> 10, we assume it to to be zero, and to be non-zero again.
>
>
> The goal is to teach the analyzer that the flag's value on both line 16
> and 12 is important, and not to prune the calls to foo, where we should
> note that flag's value has been reset. The standard way to achieve this is
> to *track* flag in these location (similarly to how x is already
> tracked). When a variable is tracked, the analyzer constructs notes that
> explain why believe that it's value is what believe it to be at the
> tracking location.
>
>
> This implies that analyzer can argue about data dependencies quite well
> (it can use the information it gathered during symbolic analysis by
> inspecting the *exploded graph *(all the information gathered during
> symbolic analysis), specifically the *bugpath* (the linear path from the *error
> node* [where the error was emitted] to the *root* [where analysis begun]).
>
>
> The technique I came up with to help on this was inspired by *static
> backward program slicing *[3]. Program slicing is essentially creating a
> subset of the program (*program slice*) relevant to a (statement, {set of
> variables}) pair (*slicing criterion*).
>
>
> void f() {
>
> int n; read(n);
>
> int sum = 0; int product = 1; for (int i = 0; i < n; ++i) { sum += i;
>
> product *= i; } write(sum);
>
> write(product);
>
> }
>
>
> For the slicing criterion (write(product), {product}) the following
> program slice would be created:
>
>
> void f() {
>
> int n; read(n);
>
> int product = 1; for (int i = 0; i < n; ++i) {
>
>
> product *= i; }
>
>
> write(product);
>
> }
>
>
> Program slicing, roughly speaking, works by collecting a set of relevant
> variables and a set of relevant statements by tracing data and control
> dependencies [3].
>
>
> Since the analyzer is also in need of this functionality (realizing which
> parts of the program is relevant to the bug's occurrence), and since data
> dependencies are already understood, I modified LLVM's *Iterated
> Dominance Frontier* calculator to be used with clang's CFG, which can be
> used to calculate control dependencies, assisting bug report generation.
>
>
> So far, I used this to realize that the condition on line 16 is important,
> and start tracking it. The patch that implemented it works by tracking the
> conditions of control dependency blocks of already tracked variables: since
> x is tracked on line 17, and the block on line 16 is a control dependency
> of it, we track the condition expression of that block (flag).
>
>
> It was quickly discovered that it's a bad approach to use the same
> tracking for both error-causing variables and conditions, as it lead to a
> bunch of uninteresting notes being inserted to bug reports -- we mainly
> want the bug reports to be untouched, and only expand it with information
> previously not present that made it potentially impossible to understand
> how the bug report came about (like flag's value changing). Ideas for this
> shared in a mail [4], where the consensus was to basically prune every note
> that didn't happen in a nested stackframe (the stackframe where flag's
> value changed (foo()) is nested in the frame where we started tracking it
> (main())).
>
>
> I gathered data on several large, open source C++ projects, and I feel
> confident that there are only so many loose ends left to be tied: don't
> track asserts conditions, calls to operator bool, and make the extra notes
> state clearly that we're talking about "just a condition", rather then a
> bug causing variable.
>
>
> In my first mail [2] I also detailed that realizing that line 16 is
> important is a vastly different case then the one on line 12: the bugpath
> doesn't contain line 13 (since we assumed flag to be false on line 12), so
> using BugReporterVisitors wouldn't be sufficient: we called there cases
> *must-have-happened*, where the information we need to realize that a
> control dependency is important is contained in the bugpath, and
> *should-not-have-happened* when it isn't.
>
>
> In followup mail to that thread [5] explained that the
> should-not-have-happened case can be further divided into *inlined
> category* (initially referred to as category 1), where the information we
> need isn't in the bugpath, but is in function that the analyzer visited,
> allowing us to resolve parameters and have access to a variety of tools,
> and the *not inlined category* (initially referred to as category 2)
> where the infromation lays in a non-inlined function.
>
>
> Inlined category (bar is inlined, but we don't explore line 10):
>
>
> 01 int flag;
>
> 02 bool coin();
>
> 03
>
> 04 void foo() {
>
> 05   flag = coin(); // no note, but there should be one
>
> 06 }
>
> 07
>
> 08 void bar(int &i) {
>
> 09  if (flag) // assumed false
>
> 10    i = new int;
>
> 11 }
>
> 12
>
> 13 int main() {
>
> 14   int *x = 0; // x initialized to 0
>
> 15   flag = 1;
>
> 16   foo(); // no note, but there should be one
>
> 17   bar(x);
>
> 18   foo(); // no note, but there should be one
>
> 19
>
> 20   if (flag) // assumed true
>
> 21     *x = 5; // warn
>
> 22 }
>
>
> Not inlined category (bar isn't inlined at all):
>
>
> 01 int flag;
>
> 02 bool coin();
>
> 03
>
> 04 void foo() {
>
> 05   flag = coin(); // no note, but there should be one
>
> 06 }
>
> 07
>
> 08 void bar(int &i) {
>
> 09  i = new int;
>
> 10 }
>
> 11
>
> 12 int main() {
>
> 13   int *x = 0; // x initialized to 0
>
> 14   flag = 1;
>
> 15   foo(); // no note, but there should be one
>
> 16 if (flag) // assumed false
>
> 17    bar(x);
>
> 18   foo(); // no note, but there should be one
>
> 19
>
> 20   if (flag) // assumed true
>
> 21     *x = 5; // warn
>
> 22 }
>
>
> The not inlined category seems to be far more difficult, like, not even
> solvable probably with the amount of time I have left, so I only plan to
> aim for the inlined category cases within the should-not-have-happened
> problem. I shared a sketch algorithm in a followup mail [6], but I it was
> brought to my attention that using a reaching definition algorithm might be
> a superior approach.
>
>
> There were a variety of other things I did under the name of the project,
> but they aren't as exciting to be put in a summary ;)
>
>
> *Using reaching definitions analysis to solve the should-not-have-happened
> inline category cases*
>
>
> I am no expert on this topic, please feel free to correct me if I write
> something dumb.
>
>
> The reaching definitions analysis was originally made for instruction, not
> C++ code: "a reaching definition for a given instruction is an earlier
> instruction whose target variable can reach (be assigned to) the given one
> without an intervening assignment." [7].
>
>
> When we say *reaching definition**s *we refer to the analysis: the
> REACHin[S] set for basic block S is the set of definitions reaching S,
> and REACHout[S] are all reaching definitions of its predecessors minus
> those reaching definitions whose variable is killed by S (contains an
> intervening assignment) plus any new definitions generated within S.
>
>
> Now, when we're talking about *variables,* the shortcomings of this
> algorithm when it comes to C++ are obvious: aliasing. I guess we can always
> just try to be conservative in one way or another: no longer look for
> reaching definitions after a call to a function where the variable was
> passed as a non-const reference, or a write to reference of the same type.
>
>
> There are plenty of big unknown here -- I'm inexperienced with
> implementing such algorithms, and I fear some esoteric AST elements in
> Clang may need to be a source of a lot of headache.
>
>
> Here's an idea to how we could use reaching definitions analysis to
> discover important statements (like the one on line 13 in the first example
> of this mail), and track it's control dependencies:
>
>
> // Traverses the bugpath to discover all reaching definitions, even
>
> // in nested stackframes. Returns true if any reaching definition is
>
> // found.
>
> function conditionTracker(N : ExplodedNode, X : variable,
>
>                           Report : BugReport, OriginB : OriginB)
>
>
>
>   SetOfReachingDefinitions = reaching definitions to X at N
>
>
>   while N is not null do
>
>
>     // Look for the function call to F
>
>     N := N->getFirstPred()
>
>     if not N->getLocationAs<CallExit>() then
>
>       continue
>
>     fi
>
>
>     B := CFGBlock associated with N
>
>     if there is no path from B to OriginB then
>
>       continue
>
>     fi
>
>
>     CallEnterN := N’s matching CallEnter
>
>     if CallEnterN is null then
>
>       continue // and assert that we’re in a top frame
>
>     fi
>
>
>
>     // CallEnterN’s pred should be in the same CFG as OriginB.
>
>     CalleeE := CFGElement associated with CallEnterN->getFirstPred()
>
>
>     ParamSet := set of F’s parameters at CallEnterN to which X
>
> is passed to as a non-const reference
>
>
>
>     for all NewX variables in ParamSet do
>
>       if conditionTracker(N, NewX, Report, F’s exit block) then
>
>         SetOfReachingDefinitions += CalleeE
>
>       fi
>
>     od
>
>
>     // X is global/static, or a field and F is a method, etc...
>
>     if F would have access to X regardless then
>
>       if conditionTracker(N, X, Report, F’s exit block) then
>
>         SetOfReachingDefinitions += CalleeE
>
>       fi
>
>     fi
>
>   od
>
>
>   for all RD CFGElementRefs in SetOfReachingDefinitions do
>
>     track control dependencies of RD
>
>   od
>
>
>   return not whether SetOfReachingDefinitions is empty
>
>
> *Should we just traverse the ExplodedGraph?*
>
>
> In a mail sent by Artem [8], another possible application of the reaching
> definition algorithm was discussed. However, during our meeting, it was
> mentioned that maybe for that specific use case, traversing the
> ExplodedGraph might be a better approach. Should I approach the
> should-not-have-happened case with this technique as well, maybe it could
> be reused.
>
>
> I don't fancy this. First off, generalizing in the future to also cover
> the not inlined category is I think more difficult with this approach.
> Second, algorithms like reaching definitions are set in stone, and coming
> up with a reliable, correct way to handle the ExplodedGraph may be too
> difficult and time consuming for me at this point.
>
>
> Any feedback is appreciated!
>
> Cheers,
>
> Kristóf
>
>
> [1]
> https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit
>
> [2] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062535.html
>
> [3] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062613.html
>
> [4] Tip, Frank. A survey of program slicing techniques. Centrum voor
> Wiskunde en Informatica, 1994.
>
> [5] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062540.html
>
> [6] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062566.html
>
> [7] https://en.wikipedia.org/wiki/Reaching_definition
>
> [8] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062727.html
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190724/07b34036/attachment.html>