[cfe-dev] [analyzer] Rough sketch of the algorithm for "Enhancing bug reporting with static backward program slicing"

Mon May 27 10:47:57 PDT 2019

On 5/27/19 10:12 AM, Gábor Horváth wrote:
> Hi!
>
> I wanted to share some of my points as well but had little time to do so.
>
> Artem identified two kinds of issues. One of which has all the 
> information available in the bug path (in his terminology: 
> must-have-happened) and the other which has not ( 
> should-not-have-happened). The goal is the same in both of the cases, 
> identify all dependencies including control and data dependencies. The 
> main difference is that in one case we could use all the information 
> available in the bug path while in the other we cannot.
>
> 1. must-have-happened:
> Since the bug-path contains all the information we need, I do agree 
> with Artem using bug path visitors might be sufficient. I do not know 
> how reliable `trackExpressionValue` is in identifying data 
> dependencies, but revising it could be one step in GSoC. For example, 
> after quickly skimming through the implementation I am not sure if it 
> is able to track the source of a constant value that was constant 
> folded during the symbolic execution from different values.

It should be able to, because that's how inlined defensive check 
suppressions work.

> Only if we had a way to test this functionality in isolation quickly 
> :) I think making that possible would also be a valuable addition.
>
> For control dependencies, it is great that LLVM already has some 
> algorithms to calculate dominator sets and it might be possible to 
> make that work for the Clang CFG.
>
> In my opinion, if you only tackle these problems by the end of the 
> summer your project already brought great value to the community.
>
> 2. should-not-have-happened
> This category cannot be solved only using visitors for sure. So we 
> might end up with an implementation for this problem that is 
> completely separate from the previous one (maybe except for getting 
> control dependencies?). As Artem pointed out, we might query some 
> information from the analyzer e.g. which functions were inlined. One 
> option is to use a slicing algorithm here. However, we should make 
> sure that the way a function is conservatively evaluated by the 
> slicing algorithm and the analyzer is compatible. E.g.: if the 
> analyzer does not invalidate a memory region during evaluating the 
> call, the statement should not be considered as relevant for the 
> variable in question. We have several challenges here, the analyzer 
> reasons about memory regions while the slicing algorithm will probably 
> not. Also, we need to do something with pointers, since we do not want 
> to implement a full-fledged pointer analysis. We also need to decide 
> how to represent arrays, structures etc.
>
> I skimmed through the algorithm in the google docs, and I think there 
> are some other details we need to work out. For example, you never 
> remove anything from the set of relevant variables. Consider the 
> following example:
> x = 5;
> y = x + 2;
> y = 2;
> z = y + 3;
>
> Whit the slicing criterion (z = y + 3). Here, once y is overwritten 
> with 2, it is no longer relevant, sox should not end up in the slice. 
> The algorithm you proposed will not handle this correctly (yet). Also 
> I am not sure if we actually need therelevant_variable_map or having a 
> set will be sufficient.
>
> Regards,
> Gabor
>
> On Mon, 27 May 2019 at 12:52, Kristóf Umann <dkszelethus at gmail.com 
> <mailto:dkszelethus at gmail.com>> wrote:
>
>     + Ádám Balogh
>
>     On Fri, 24 May 2019 at 23:31, Artem Dergachev <noqnoqneo at gmail.com
>     <mailto:noqnoqneo at gmail.com>> wrote:
>
>         Ok, i think i have a little bit of clarity.
>
>         Let's first talk about must-have-happened problems. For instance,
>
>           int *foo() {
>               return coin() ? new int : nullptr; // needs note
>           }
>           void bar() {
>             int *x = foo();
>             *x = 1; // warning: null dereference
>           }
>
>         In this case trackExpressionValue solves the
>         "must-have-happened" problem of "x must-have been assigned a
>         null pointer value" by displaying a note when foo() returns
>         nullptr. This currently more or less works correctly - there
>         are a lot of bugs but the overall idea behind
>         trackExpressionValue is correct.
>
>         Example 1 in your document is another example of a
>         must-have-happened problem: in order for the report to be
>         valid, we need to show that 'flag' must-have-changed between
>         line 17 and line 13. That's something that the Static Analyzer
>         currently cannot handle and you plan to improve upon it.
>
>         Hʏᴘᴏᴛʜᴇsɪs1. All "must-have-happened" problems should be
>         solved with a bug visitor. We don't need any new analysis
>         algorithm for that.
>
>         In particular, i believe that Example 1 can be solved by
>         extending trackExpressionValue() with a bug visitor that
>         detects control-flow-dependencies via this approach of yours:
>
>         > "Check whether the statement is a control dependency of
>         Statement 18 with dominator sets: 18 doesn't post dominate 17,
>         but every statement in between them does, meaning that 17 is a
>         control dependency of 18."
>
>         I.e., while ascending through ExplodedGraph, we pick
>         interesting statements and perform this domination check on
>         their predecessor nodes until we reach the control dependency,
>         then we track the control dependency recursively by adding
>         more visitors.
>
>         I think this is the first thing we should try on our GSoC.
>
>         The reason why i believe that Hypothesis 1 is true is that all
>         the information that we need is fully contained in the bug
>         path. If something must have happened on the path, then it's
>         probably in the path and we can figure it out by looking at
>         the path. If something must have happened on the path and it's
>         not in the path, then why did we emit a warning in the first
>         place?
>
>         ==============
>
>         Now, to should-not-have-happened problems. The main motivating
>         example is:
>
>           void foo(int *y) {
>               if (coin()) // needs note
>                 *y = 1;
>           }
>           void bar() {
>             int x;
>             foo(&x);
>             use(x); // warning: use of uninitialized value
>           }
>
>         In order to make the warning comprehensible, we have to
>         explain to the user why do we think that 'x' is uninitialized,
>         as it clearly "should-not-have-been" initialized for the
>         warning to be correct. For that purpose it is absolutely
>         essential to display the execution path within the inlined
>         call to foo().
>
>         One way to achieve that would be to display *all* execution
>         path and leave it up to the user to see that the event that
>         should-not-have-happened has indeed never happened.
>
>         That, however, would be overwhelming, so we use "path pruning"
>         that cuts away execution paths in inlined calls that are known
>         to have no interesting events happening. Which, in turn, makes
>         us incapable of solving should-not-have-happened problems, as
>         it's the whole point of the problem to display execution path
>         in inlined functions in which nothing has happened.
>
>         In order to figure this out, we need to learn how to
>         discriminate between inlined calls in which nothing
>         interesting *could* have happened in the first place and
>         inlined calls in which something interesting could have
>         happened but *didn't*.
>
>         NoStoreFuncVisitor solves this problem for the example above.
>         The visitor notices that the local variable is passed by
>         non-const pointer into the inlined call and the call
>         syntactically contains assignments through this pointer,
>         therefore this call *could* have initialized the value. The
>         visitor then suppresses pruning of this call by emitting an
>         interesting note within the call ("returning without
>         initializing '*y'"...).
>
>         However, AST-based analysis in NoStoreFuncVisitor is very
>         primitive and easy to trick. A more sophisticated analysis is
>         clearly necessary.
>
>         The second thing we can try in our GSoC is to come up with a
>         better analysis specifically for the uninitialized value
>         checker, because we already have a place to stick it into and
>         we know how exactly do we want to consume the result of such
>         analysis and evaluate it.
>
>         The third thing we can try in our GSoC is to come up with more
>         such analysis for other checkers and other
>         should-have-happened problems and see if a reusable framework
>         for creating such analyses can be implemented.
>
>
>         On 5/23/19 1:25 PM, Kristóf Umann wrote:
>>
>>
>>         On Thu, 23 May 2019 at 21:58, Kristóf Umann
>>         <dkszelethus at gmail.com <mailto:dkszelethus at gmail.com>> wrote:
>>
>>             Please let me know if I didn't address something you
>>             mentioned!
>>
>>             On Thu, 23 May 2019 at 02:29, Artem Dergachev
>>             <noqnoqneo at gmail.com <mailto:noqnoqneo at gmail.com>> wrote:
>>
>>                 My primary question is, how do you plan to use the
>>                 data that you obtain via your analysis?
>>
>>
>>              Gather relevant statements. Keep ExplodedNodes in the
>>             bugpath for which PathDiagnosticLocation::getStmt() is in
>>             the set (or, something like that at least). But honestly,
>>             I didn't think too much about that yet :^) I guess we
>>             could just gather ExplodedNodes rather than statements,
>>             we'll see.
>>
>>                 Like, do you plan to backtrack the value of all
>>                 relevant variables and/or expressions in the final
>>                 bug report that were also encountered along the bug
>>                 path? If yes, then is it possible that the slicing
>>                 criterion gets updated in the process and you'll have
>>                 to re-run the CFG-based analysis to take it into account?
>>
>>
>>             No, the slicing criterion is what it is, and will not
>>             change. The set of relevant statements to the slicing
>>             criterion define the actual program slice, which is
>>             basically the thing we're going for in this project. When
>>             it turns out that a value of a variable directly affects
>>             a variable in the slicing criterion (either through data
>>             or control dependency), we just add it to the set of
>>             relevant variables, and then if something in the set of
>>             relevant variables is affected (some statements or
>>             variables turn out to be a control/data dependencies to
>>             them), then it's added to the respective relevancy sets.
>>             Should we only traverse the basic blocks the analyzer
>>             visited, I think we can pull off the slicing with a
>>             single pass.
>>
>>
>>         We'll see about that single pass thing though. I'm gathering
>>         some tricky examples in the document I shared and working on
>>         a neat algorithm.
>>
>>                 > What would also be really great is to assist this
>>                 traversal with the information the analyzer already
>>                 has, essentially only inspecting basic blocks the
>>                 analyzer actually visits.
>>
>>                 You mean visits on the current bug path or visits on
>>                 the equivalence class of bugs or visits in general
>>                 during analysis?
>>
>>                 Regardless, for many problems ideally we should
>>                 traverse basic blocks that the analyzer *didn't*
>>                 visit (eg., if the function didn't initialize the
>>                 variable on the current path, we're interested in
>>                 parts of the code in which it *did* initialize the
>>                 variable, even though we didn't necessarily have time
>>                 to visit them).
>>
>>
>>             Well I mean slicing by definition isn't intended to do
>>             that. The entire point of it is to gather the smallest
>>             slice related to the slicing criterion, and my project
>>             aims to fill in the gaps where we don't provide enough
>>             information.
>>
>>                 It actually sounds as if all problems that we're
>>                 trying to solve here can be classified into
>>                 "must-have-happened" problems (eg., "variable
>>                 must-have been declared without initialization",
>>                 "variable must-have been set to nullptr", "memory
>>                 must-have been allocated") and
>>                 "should-not-have-happened" problems (eg., "variable
>>                 should-not-have been initialized", "null value
>>                 should-not-have been overwritten", "pointer
>>                 should-not-have escaped").
>>
>>                 For must-have-happened problems, i'm recently under
>>                 an impression that we should suppress the bug report
>>                 entirely when we fail to solve them (i.e., if fail to
>>                 explain to the user where exactly does this happen,
>>                 then the report is impossible to understand anyway).
>>                 This is currently a very big problem for our null
>>                 dereference checker on some codebases, especially
>>                 because it uses this tracking for suppressions as
>>                 well (aka inlined defensive checks), so when it fails
>>                 to track the variable it's likely a legit false
>>                 positive as well, not simply a hard-to-understand report.
>>
>>
>>             I think this set calculating approach inherently gathers
>>             far more information, allowing us to make better
>>             judgement on whether we should suppress the report.
>>
>>                 For should-not-have-happened problems i'm much more
>>                 confused. We're talking about looking for places
>>                 where it *could* have happened and then trying to
>>                 explain to the user why none of them have actually
>>                 happened. I'm not sure what are the consequences of
>>                 failing to explain to the user why didn't a
>>                 particular piece of code do something, because i've
>>                 no idea how do users intuitively figure out which
>>                 code *could* have done these things and which clearly
>>                 couldn't.
>>
>>
>>             Im really at a loss here :) Can you provide some example
>>             as to what a problematic "must-have-happened" bug report
>>             would look like, and how would you like to see it
>>             improved? Same for "should-not-have-happened". Because as
>>             I understand it, and I might be wrong, you have problems
>>             with this:
>>
>>             int a; // declaring a without initializing
>>
>>             if (rand()) // assuming that the condition is false
>>             a = 5;
>>
>>             print(a); // a is undef
>>
>>             And prefer to see this:
>>
>>             int a; // declaring a without initializing
>>
>>             if (rand()) // should this condition be false, a's value
>>             will be indeterministic
>>             a = 5; // writing to a skipped
>>
>>             print(a); // a is undef
>>
>>             and I just don't see how this would be possible with
>>             slicing at all. Also, I can't see how this would scale to
>>             real-world production code.
>>
>>
>>                 On 5/22/19 4:11 PM, Kristóf Umann wrote:
>>>
>>>                 Hi!
>>>
>>>
>>>                 I'd like to share some thoughts about my GSoC
>>>                 project, "Enhancing bug reporting with static
>>>                 backward program slicing"[1].
>>>
>>>
>>>                 My proposal is very specific about whatI'm aiming to
>>>                 improve on, but vague on the howpart of it. This
>>>                 mail aims to add clarify some of this.
>>>
>>>
>>>                 Program slicing is essentially a technique of
>>>                 discovering data and control dependencies to the
>>>                 slicing criterion, which is a (statement, {set of
>>>                 variables}) pair. Fortunately, tools for control
>>>                 dependencies, namely, dominator set calculations are
>>>                 already implemented, but seems to be unstable with
>>>                 clang's CFG. It would be a great tool if I were able
>>>                 to fix it.
>>>
>>>
>>>                 While my proposal states that I plan to implement an
>>>                 AST-based solution, I'm actually not sure that this
>>>                 is the optimal approach. We could instead inspect
>>>                 CFG blocks in a backwards manner (coupled with the
>>>                 analyzer's call stack), and gather relevant variable
>>>                 in the meanwhile.
>>>
>>>
>>>                 What would also be really great is to assist this
>>>                 traversal with the information the analyzer already
>>>                 has, essentially only inspecting basic blocks the
>>>                 analyzer actually visits.
>>>
>>>
>>>                 Here’s my idea for an algorithm (from the point of
>>>                 the slicing criterion already being constructed):
>>>
>>>                 https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing
>>>
>>>
>>>
>>>                 Please note that this is a veeery rough sketch, I
>>>                 didn't think about every edge case that exists, but
>>>                 I think it's an okay start.
>>>
>>>
>>>                 [1]
>>>                 https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit
>>>
>>
>