[cfe-dev] [analyzer] Using reaching definitions analysis for bug report construction

Thu Jul 25 01:11:05 PDT 2019

We've had a lovely meeting about this, during which I've been made aware of
type based alias analysis, something I'll need to research a little more.
We also talked about implementing the rules with an invalidation
propagating algorithm.

I'll lay this part of the project aside until we can turn the condition
tracking feature of the analyzer on by default, just wanted to have this
out in the open :).

On Wed, 24 Jul 2019 at 15:27, Kristóf Umann <dkszelethus at gmail.com> wrote:

> After talking to Gábor and some other folks I've come to the conclusion
> that indeed, a reaching definitions algorithm is both more reasonable
> finish in the timeframe I have left, and more reliable.
>
> I chose the implement the following algorithm I found on wikipedia [1]. I
> also read the related part of "Compilers: Principles, techniques, and
> tools second edition" [2], which gave me great insights on special cases
> related to C/C++. Here are a couple things that I tweaked on, and things I
> have left to do:
>
>    - Clang's CFG doesn't contain the parameter declarations, so I
>    manually inspected the FunctionDecl, and added each parameter to the entry
>    blocks GEN set. Now, traditionally, the entry block is skipped altogether,
>    but I doubt this matters much.
>    - In general, I wouldn't like to reason about reaching definitions to
>    pointees.
>    - Function calls are hard to reason about other than inspecting the
>    CallExpr's arguments, so let's use invalidation (or, in other words,
>    regarded them as definitions):
>       - Global and static variables are invalidated on each function
>       call. (I've manage to implement this one)
>       - If a variable is passed as a non-const reference/pointer to a
>       function, invalidate it. If that variable is of a record type, invalidate
>       it if it's passed as value, and the copy constructor is user provided.
>       - Presume the function doesn't use offsetof to modify other parts
>       of the a record object, when only a field is passed to it as a parameter.
>       - If a record object is invalidated, invalidate all of its fields.
>       - Invalidate the the record object on a non-static, non-const
>       method call.
>    - I haven't touched the aliasing problem yet either (in all cases, I
>    mean that a T * typed object isn't CXXThis):
>       - Presume that the code follows strict aliasing rules.
>       - If the pointee of a pointer/reference of type T * is written or
>       invalidated, invalidate all T typed (regardless of signedness) variables.
>       - If a record object that contains a T * or void * field is
>       invalidated, all T typed variables should be invalidated.
>       - When it comes to polymorphism, a write to a base pointer typed
>       variable should invalidate all derived typed variables.
>       - Something I'm quite unsure about: char * can alias everything,
>       should it invalidate everything? Is this too conservative? Is it possible
>       that we don't even write char * that often so such an invalidation wouldn't
>       be a big deal anyways?
>
> I'm not yet sure how I'll end up handling record objects in general, but
> I'm reasonably comfortable saying that I'll be able to finish at least a
> good subset of this before wrapping the project up. I got a WIP patch going
> on phabricator [3]. After that, I don't see big unknowns using the gathered
> data in a visitor.
>
> [1] https://en.wikipedia.org/wiki/Reaching_definition
> [2] Aho, A. V., Lam, M. S., Sethi, R., & Ullman, J. D. (2007). Compilers:
> Principles, techniques, and tools second edition.
> [3] https://reviews.llvm.org/D64991
>
> On Wed, 17 Jul 2019 at 00:49, Kristóf Umann <dkszelethus at gmail.com> wrote:
>
>> Hi!
>>
>> In this mail, I'll briefly summarize what I already achieved during my
>> GSoC and discuss what I plan to do moving forward. Feel free to skip ahead
>> if you feel up to date!
>>
>> *What happened so far?*
>>
>> I proposed to enhance bug reports in the Static Analyzer [1]. I
>> specifically targeted reports that were incomplete in a sense that the
>> report didn't contain information to explain why we concluded that the bug
>> report is a true positive. I've shared the following code snippet numerous
>> times, as it captures several different problems the analyzer can't deal
>> with yet:
>>
>> 01 int flag;
>>
>> 02 bool coin();
>>
>> 03
>>
>> 04 void foo() {
>>
>> 05   flag = coin(); // no note, but there should be one
>>
>> 06 }
>>
>> 07
>>
>> 08 int main() {
>>
>> 09   int *x = 0; // x initialized to 0
>>
>> 10   flag = 1;
>>
>> 11   foo(); // no note, but there should be one
>>
>> 12   if (flag) // assumed false
>>
>> 13     x = new int;
>>
>> 14   foo(); // no note, but there should be one
>>
>> 15
>>
>> 16   if (flag) // assumed true
>>
>> 17     *x = 5; // warn
>>
>> 18 }
>>
>>
>> In my first update mail [2] I explained that the assumption on line 12 is
>> very important, because we wouldn't find a nullpointer dereference had flag
>> been true. Similarly, the assumption of line 16 is important. You can see
>> that following the notes in the bug path, despite setting flag to 1 on line
>> 10, we assume it to to be zero, and to be non-zero again.
>>
>>
>> The goal is to teach the analyzer that the flag's value on both line 16
>> and 12 is important, and not to prune the calls to foo, where we should
>> note that flag's value has been reset. The standard way to achieve this is
>> to *track* flag in these location (similarly to how x is already
>> tracked). When a variable is tracked, the analyzer constructs notes that
>> explain why believe that it's value is what believe it to be at the
>> tracking location.
>>
>>
>> This implies that analyzer can argue about data dependencies quite well
>> (it can use the information it gathered during symbolic analysis by
>> inspecting the *exploded graph *(all the information gathered during
>> symbolic analysis), specifically the *bugpath* (the linear path from the *error
>> node* [where the error was emitted] to the *root* [where analysis
>> begun]).
>>
>>
>> The technique I came up with to help on this was inspired by *static
>> backward program slicing *[3]. Program slicing is essentially creating a
>> subset of the program (*program slice*) relevant to a (statement, {set
>> of variables}) pair (*slicing criterion*).
>>
>>
>> void f() {
>>
>> int n; read(n);
>>
>> int sum = 0; int product = 1; for (int i = 0; i < n; ++i) { sum += i;
>>
>> product *= i; } write(sum);
>>
>> write(product);
>>
>> }
>>
>>
>> For the slicing criterion (write(product), {product}) the following
>> program slice would be created:
>>
>>
>> void f() {
>>
>> int n; read(n);
>>
>> int product = 1; for (int i = 0; i < n; ++i) {
>>
>>
>> product *= i; }
>>
>>
>> write(product);
>>
>> }
>>
>>
>> Program slicing, roughly speaking, works by collecting a set of relevant
>> variables and a set of relevant statements by tracing data and control
>> dependencies [3].
>>
>>
>> Since the analyzer is also in need of this functionality (realizing which
>> parts of the program is relevant to the bug's occurrence), and since data
>> dependencies are already understood, I modified LLVM's *Iterated
>> Dominance Frontier* calculator to be used with clang's CFG, which can be
>> used to calculate control dependencies, assisting bug report generation.
>>
>>
>> So far, I used this to realize that the condition on line 16 is
>> important, and start tracking it. The patch that implemented it works by
>> tracking the conditions of control dependency blocks of already tracked
>> variables: since x is tracked on line 17, and the block on line 16 is a
>> control dependency of it, we track the condition expression of that block
>> (flag).
>>
>>
>> It was quickly discovered that it's a bad approach to use the same
>> tracking for both error-causing variables and conditions, as it lead to a
>> bunch of uninteresting notes being inserted to bug reports -- we mainly
>> want the bug reports to be untouched, and only expand it with information
>> previously not present that made it potentially impossible to understand
>> how the bug report came about (like flag's value changing). Ideas for this
>> shared in a mail [4], where the consensus was to basically prune every note
>> that didn't happen in a nested stackframe (the stackframe where flag's
>> value changed (foo()) is nested in the frame where we started tracking it
>> (main())).
>>
>>
>> I gathered data on several large, open source C++ projects, and I feel
>> confident that there are only so many loose ends left to be tied: don't
>> track asserts conditions, calls to operator bool, and make the extra notes
>> state clearly that we're talking about "just a condition", rather then a
>> bug causing variable.
>>
>>
>> In my first mail [2] I also detailed that realizing that line 16 is
>> important is a vastly different case then the one on line 12: the bugpath
>> doesn't contain line 13 (since we assumed flag to be false on line 12), so
>> using BugReporterVisitors wouldn't be sufficient: we called there cases
>> *must-have-happened*, where the information we need to realize that a
>> control dependency is important is contained in the bugpath, and
>> *should-not-have-happened* when it isn't.
>>
>>
>> In followup mail to that thread [5] explained that the
>> should-not-have-happened case can be further divided into *inlined
>> category* (initially referred to as category 1), where the information
>> we need isn't in the bugpath, but is in function that the analyzer visited,
>> allowing us to resolve parameters and have access to a variety of tools,
>> and the *not inlined category* (initially referred to as category 2)
>> where the infromation lays in a non-inlined function.
>>
>>
>> Inlined category (bar is inlined, but we don't explore line 10):
>>
>>
>> 01 int flag;
>>
>> 02 bool coin();
>>
>> 03
>>
>> 04 void foo() {
>>
>> 05   flag = coin(); // no note, but there should be one
>>
>> 06 }
>>
>> 07
>>
>> 08 void bar(int &i) {
>>
>> 09  if (flag) // assumed false
>>
>> 10    i = new int;
>>
>> 11 }
>>
>> 12
>>
>> 13 int main() {
>>
>> 14   int *x = 0; // x initialized to 0
>>
>> 15   flag = 1;
>>
>> 16   foo(); // no note, but there should be one
>>
>> 17   bar(x);
>>
>> 18   foo(); // no note, but there should be one
>>
>> 19
>>
>> 20   if (flag) // assumed true
>>
>> 21     *x = 5; // warn
>>
>> 22 }
>>
>>
>> Not inlined category (bar isn't inlined at all):
>>
>>
>> 01 int flag;
>>
>> 02 bool coin();
>>
>> 03
>>
>> 04 void foo() {
>>
>> 05   flag = coin(); // no note, but there should be one
>>
>> 06 }
>>
>> 07
>>
>> 08 void bar(int &i) {
>>
>> 09  i = new int;
>>
>> 10 }
>>
>> 11
>>
>> 12 int main() {
>>
>> 13   int *x = 0; // x initialized to 0
>>
>> 14   flag = 1;
>>
>> 15   foo(); // no note, but there should be one
>>
>> 16 if (flag) // assumed false
>>
>> 17    bar(x);
>>
>> 18   foo(); // no note, but there should be one
>>
>> 19
>>
>> 20   if (flag) // assumed true
>>
>> 21     *x = 5; // warn
>>
>> 22 }
>>
>>
>> The not inlined category seems to be far more difficult, like, not even
>> solvable probably with the amount of time I have left, so I only plan to
>> aim for the inlined category cases within the should-not-have-happened
>> problem. I shared a sketch algorithm in a followup mail [6], but I it was
>> brought to my attention that using a reaching definition algorithm might be
>> a superior approach.
>>
>>
>> There were a variety of other things I did under the name of the project,
>> but they aren't as exciting to be put in a summary ;)
>>
>>
>> *Using reaching definitions analysis to solve the
>> should-not-have-happened inline category cases*
>>
>>
>> I am no expert on this topic, please feel free to correct me if I write
>> something dumb.
>>
>>
>> The reaching definitions analysis was originally made for instruction,
>> not C++ code: "a reaching definition for a given instruction is an earlier
>> instruction whose target variable can reach (be assigned to) the given one
>> without an intervening assignment." [7].
>>
>>
>> When we say *reaching definition**s *we refer to the analysis: the
>> REACHin[S] set for basic block S is the set of definitions reaching S,
>> and REACHout[S] are all reaching definitions of its predecessors minus
>> those reaching definitions whose variable is killed by S (contains an
>> intervening assignment) plus any new definitions generated within S.
>>
>>
>> Now, when we're talking about *variables,* the shortcomings of this
>> algorithm when it comes to C++ are obvious: aliasing. I guess we can always
>> just try to be conservative in one way or another: no longer look for
>> reaching definitions after a call to a function where the variable was
>> passed as a non-const reference, or a write to reference of the same type.
>>
>>
>> There are plenty of big unknown here -- I'm inexperienced with
>> implementing such algorithms, and I fear some esoteric AST elements in
>> Clang may need to be a source of a lot of headache.
>>
>>
>> Here's an idea to how we could use reaching definitions analysis to
>> discover important statements (like the one on line 13 in the first example
>> of this mail), and track it's control dependencies:
>>
>>
>> // Traverses the bugpath to discover all reaching definitions, even
>>
>> // in nested stackframes. Returns true if any reaching definition is
>>
>> // found.
>>
>> function conditionTracker(N : ExplodedNode, X : variable,
>>
>>                           Report : BugReport, OriginB : OriginB)
>>
>>
>>
>>   SetOfReachingDefinitions = reaching definitions to X at N
>>
>>
>>   while N is not null do
>>
>>
>>     // Look for the function call to F
>>
>>     N := N->getFirstPred()
>>
>>     if not N->getLocationAs<CallExit>() then
>>
>>       continue
>>
>>     fi
>>
>>
>>     B := CFGBlock associated with N
>>
>>     if there is no path from B to OriginB then
>>
>>       continue
>>
>>     fi
>>
>>
>>     CallEnterN := N’s matching CallEnter
>>
>>     if CallEnterN is null then
>>
>>       continue // and assert that we’re in a top frame
>>
>>     fi
>>
>>
>>
>>     // CallEnterN’s pred should be in the same CFG as OriginB.
>>
>>     CalleeE := CFGElement associated with CallEnterN->getFirstPred()
>>
>>
>>     ParamSet := set of F’s parameters at CallEnterN to which X
>>
>> is passed to as a non-const reference
>>
>>
>>
>>     for all NewX variables in ParamSet do
>>
>>       if conditionTracker(N, NewX, Report, F’s exit block) then
>>
>>         SetOfReachingDefinitions += CalleeE
>>
>>       fi
>>
>>     od
>>
>>
>>     // X is global/static, or a field and F is a method, etc...
>>
>>     if F would have access to X regardless then
>>
>>       if conditionTracker(N, X, Report, F’s exit block) then
>>
>>         SetOfReachingDefinitions += CalleeE
>>
>>       fi
>>
>>     fi
>>
>>   od
>>
>>
>>   for all RD CFGElementRefs in SetOfReachingDefinitions do
>>
>>     track control dependencies of RD
>>
>>   od
>>
>>
>>   return not whether SetOfReachingDefinitions is empty
>>
>>
>> *Should we just traverse the ExplodedGraph?*
>>
>>
>> In a mail sent by Artem [8], another possible application of the reaching
>> definition algorithm was discussed. However, during our meeting, it was
>> mentioned that maybe for that specific use case, traversing the
>> ExplodedGraph might be a better approach. Should I approach the
>> should-not-have-happened case with this technique as well, maybe it could
>> be reused.
>>
>>
>> I don't fancy this. First off, generalizing in the future to also cover
>> the not inlined category is I think more difficult with this approach.
>> Second, algorithms like reaching definitions are set in stone, and coming
>> up with a reliable, correct way to handle the ExplodedGraph may be too
>> difficult and time consuming for me at this point.
>>
>>
>> Any feedback is appreciated!
>>
>> Cheers,
>>
>> Kristóf
>>
>>
>> [1]
>> https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit
>>
>> [2] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062535.html
>>
>> [3] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062613.html
>>
>> [4] Tip, Frank. A survey of program slicing techniques. Centrum voor
>> Wiskunde en Informatica, 1994.
>>
>> [5] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062540.html
>>
>> [6] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062566.html
>>
>> [7] https://en.wikipedia.org/wiki/Reaching_definition
>>
>> [8] http://lists.llvm.org/pipermail/cfe-dev/2019-June/062727.html
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190725/9421ca74/attachment.html>