[cfe-dev] [analyzer] Rough sketch of the algorithm for "Enhancing bug reporting with static backward program slicing"

Fri May 24 14:31:55 PDT 2019

Ok, i think i have a little bit of clarity.

Let's first talk about must-have-happened problems. For instance,

   int *foo() {
       return coin() ? new int : nullptr; // needs note
   }
   void bar() {
     int *x = foo();
     *x = 1; // warning: null dereference
   }

In this case trackExpressionValue solves the "must-have-happened" 
problem of "x must-have been assigned a null pointer value" by 
displaying a note when foo() returns nullptr. This currently more or 
less works correctly - there are a lot of bugs but the overall idea 
behind trackExpressionValue is correct.

Example 1 in your document is another example of a must-have-happened 
problem: in order for the report to be valid, we need to show that 
'flag' must-have-changed between line 17 and line 13. That's something 
that the Static Analyzer currently cannot handle and you plan to improve 
upon it.

Hʏᴘᴏᴛʜᴇsɪs1. All "must-have-happened" problems should be solved with a 
bug visitor. We don't need any new analysis algorithm for that.

In particular, i believe that Example 1 can be solved by extending 
trackExpressionValue() with a bug visitor that detects 
control-flow-dependencies via this approach of yours:

 > "Check whether the statement is a control dependency of Statement 18 
with dominator sets: 18 doesn't post dominate 17, but every statement in 
between them does, meaning that 17 is a control dependency of 18."

I.e., while ascending through ExplodedGraph, we pick interesting 
statements and perform this domination check on their predecessor nodes 
until we reach the control dependency, then we track the control 
dependency recursively by adding more visitors.

I think this is the first thing we should try on our GSoC.

The reason why i believe that Hypothesis 1 is true is that all the 
information that we need is fully contained in the bug path. If 
something must have happened on the path, then it's probably in the path 
and we can figure it out by looking at the path. If something must have 
happened on the path and it's not in the path, then why did we emit a 
warning in the first place?

==============

Now, to should-not-have-happened problems. The main motivating example is:

   void foo(int *y) {
       if (coin()) // needs note
         *y = 1;
   }
   void bar() {
     int x;
     foo(&x);
     use(x); // warning: use of uninitialized value
   }

In order to make the warning comprehensible, we have to explain to the 
user why do we think that 'x' is uninitialized, as it clearly 
"should-not-have-been" initialized for the warning to be correct. For 
that purpose it is absolutely essential to display the execution path 
within the inlined call to foo().

One way to achieve that would be to display *all* execution path and 
leave it up to the user to see that the event that 
should-not-have-happened has indeed never happened.

That, however, would be overwhelming, so we use "path pruning" that cuts 
away execution paths in inlined calls that are known to have no 
interesting events happening. Which, in turn, makes us incapable of 
solving should-not-have-happened problems, as it's the whole point of 
the problem to display execution path in inlined functions in which 
nothing has happened.

In order to figure this out, we need to learn how to discriminate 
between inlined calls in which nothing interesting *could* have happened 
in the first place and inlined calls in which something interesting 
could have happened but *didn't*.

NoStoreFuncVisitor solves this problem for the example above. The 
visitor notices that the local variable is passed by non-const pointer 
into the inlined call and the call syntactically contains assignments 
through this pointer, therefore this call *could* have initialized the 
value. The visitor then suppresses pruning of this call by emitting an 
interesting note within the call ("returning without initializing '*y'"...).

However, AST-based analysis in NoStoreFuncVisitor is very primitive and 
easy to trick. A more sophisticated analysis is clearly necessary.

The second thing we can try in our GSoC is to come up with a better 
analysis specifically for the uninitialized value checker, because we 
already have a place to stick it into and we know how exactly do we want 
to consume the result of such analysis and evaluate it.

The third thing we can try in our GSoC is to come up with more such 
analysis for other checkers and other should-have-happened problems and 
see if a reusable framework for creating such analyses can be implemented.

On 5/23/19 1:25 PM, Kristóf Umann wrote:
>
>
> On Thu, 23 May 2019 at 21:58, Kristóf Umann <dkszelethus at gmail.com 
> <mailto:dkszelethus at gmail.com>> wrote:
>
>     Please let me know if I didn't address something you mentioned!
>
>     On Thu, 23 May 2019 at 02:29, Artem Dergachev <noqnoqneo at gmail.com
>     <mailto:noqnoqneo at gmail.com>> wrote:
>
>         My primary question is, how do you plan to use the data that
>         you obtain via your analysis?
>
>
>      Gather relevant statements. Keep ExplodedNodes in the bugpath for
>     which PathDiagnosticLocation::getStmt() is in the set (or,
>     something like that at least). But honestly, I didn't think too
>     much about that yet :^) I guess we could just gather ExplodedNodes
>     rather than statements, we'll see.
>
>         Like, do you plan to backtrack the value of all relevant
>         variables and/or expressions in the final bug report that were
>         also encountered along the bug path? If yes, then is it
>         possible that the slicing criterion gets updated in the
>         process and you'll have to re-run the CFG-based analysis to
>         take it into account?
>
>
>     No, the slicing criterion is what it is, and will not change. The
>     set of relevant statements to the slicing criterion define the
>     actual program slice, which is basically the thing we're going for
>     in this project. When it turns out that a value of a variable
>     directly affects a variable in the slicing criterion (either
>     through data or control dependency), we just add it to the set of
>     relevant variables, and then if something in the set of relevant
>     variables is affected (some statements or variables turn out to be
>     a control/data dependencies to them), then it's added to the
>     respective relevancy sets. Should we only traverse the basic
>     blocks the analyzer visited, I think we can pull off the slicing
>     with a single pass.
>
>
> We'll see about that single pass thing though. I'm gathering some 
> tricky examples in the document I shared and working on a neat algorithm.
>
>         > What would also be really great is to assist this traversal
>         with the information the analyzer already has, essentially
>         only inspecting basic blocks the analyzer actually visits.
>
>         You mean visits on the current bug path or visits on the
>         equivalence class of bugs or visits in general during analysis?
>
>         Regardless, for many problems ideally we should traverse basic
>         blocks that the analyzer *didn't* visit (eg., if the function
>         didn't initialize the variable on the current path, we're
>         interested in parts of the code in which it *did* initialize
>         the variable, even though we didn't necessarily have time to
>         visit them).
>
>
>     Well I mean slicing by definition isn't intended to do that. The
>     entire point of it is to gather the smallest slice related to the
>     slicing criterion, and my project aims to fill in the gaps where
>     we don't provide enough information.
>
>         It actually sounds as if all problems that we're trying to
>         solve here can be classified into "must-have-happened"
>         problems (eg., "variable must-have been declared without
>         initialization", "variable must-have been set to nullptr",
>         "memory must-have been allocated") and
>         "should-not-have-happened" problems (eg., "variable
>         should-not-have been initialized", "null value should-not-have
>         been overwritten", "pointer should-not-have escaped").
>
>         For must-have-happened problems, i'm recently under an
>         impression that we should suppress the bug report entirely
>         when we fail to solve them (i.e., if fail to explain to the
>         user where exactly does this happen, then the report is
>         impossible to understand anyway). This is currently a very big
>         problem for our null dereference checker on some codebases,
>         especially because it uses this tracking for suppressions as
>         well (aka inlined defensive checks), so when it fails to track
>         the variable it's likely a legit false positive as well, not
>         simply a hard-to-understand report.
>
>
>     I think this set calculating approach inherently gathers far more
>     information, allowing us to make better judgement on whether we
>     should suppress the report.
>
>         For should-not-have-happened problems i'm much more confused.
>         We're talking about looking for places where it *could* have
>         happened and then trying to explain to the user why none of
>         them have actually happened. I'm not sure what are the
>         consequences of failing to explain to the user why didn't a
>         particular piece of code do something, because i've no idea
>         how do users intuitively figure out which code *could* have
>         done these things and which clearly couldn't.
>
>
>     Im really at a loss here :) Can you provide some example as to
>     what a problematic "must-have-happened" bug report would look
>     like, and how would you like to see it improved? Same for
>     "should-not-have-happened". Because as I understand it, and I
>     might be wrong, you have problems with this:
>
>     int a; // declaring a without initializing
>
>     if (rand()) // assuming that the condition is false
>       a = 5;
>
>     print(a); // a is undef
>
>     And prefer to see this:
>
>     int a; // declaring a without initializing
>
>     if (rand()) // should this condition be false, a's value will be
>     indeterministic
>       a = 5; // writing to a skipped
>
>     print(a); // a is undef
>
>     and I just don't see how this would be possible with slicing at
>     all. Also, I can't see how this would scale to real-world
>     production code.
>
>
>         On 5/22/19 4:11 PM, Kristóf Umann wrote:
>>
>>         Hi!
>>
>>
>>         I'd like to share some thoughts about my GSoC project,
>>         "Enhancing bug reporting with static backward program
>>         slicing"[1].
>>
>>
>>         My proposal is very specific about whatI'm aiming to improve
>>         on, but vague on the howpart of it. This mail aims to add
>>         clarify some of this.
>>
>>
>>         Program slicing is essentially a technique of discovering
>>         data and control dependencies to the slicing criterion, which
>>         is a (statement, {set of variables}) pair. Fortunately, tools
>>         for control dependencies, namely, dominator set calculations
>>         are already implemented, but seems to be unstable with
>>         clang's CFG. It would be a great tool if I were able to fix it.
>>
>>
>>         While my proposal states that I plan to implement an
>>         AST-based solution, I'm actually not sure that this is the
>>         optimal approach. We could instead inspect CFG blocks in a
>>         backwards manner (coupled with the analyzer's call stack),
>>         and gather relevant variable in the meanwhile.
>>
>>
>>         What would also be really great is to assist this traversal
>>         with the information the analyzer already has, essentially
>>         only inspecting basic blocks the analyzer actually visits.
>>
>>
>>         Here’s my idea for an algorithm (from the point of the
>>         slicing criterion already being constructed):
>>
>>         https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing
>>
>>
>>
>>         Please note that this is a veeery rough sketch, I didn't
>>         think about every edge case that exists, but I think it's an
>>         okay start.
>>
>>
>>         [1]
>>         https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190524/bee1c277/attachment.html>