[cfe-dev] [analyzer] Rough sketch of the algorithm for "Enhancing bug reporting with static backward program slicing"
Artem Dergachev via cfe-dev
cfe-dev at lists.llvm.org
Mon May 27 10:47:57 PDT 2019
On 5/27/19 10:12 AM, Gábor Horváth wrote:
> Hi!
>
> I wanted to share some of my points as well but had little time to do so.
>
> Artem identified two kinds of issues. One of which has all the
> information available in the bug path (in his terminology:
> must-have-happened) and the other which has not (
> should-not-have-happened). The goal is the same in both of the cases,
> identify all dependencies including control and data dependencies. The
> main difference is that in one case we could use all the information
> available in the bug path while in the other we cannot.
>
> 1. must-have-happened:
> Since the bug-path contains all the information we need, I do agree
> with Artem using bug path visitors might be sufficient. I do not know
> how reliable `trackExpressionValue` is in identifying data
> dependencies, but revising it could be one step in GSoC. For example,
> after quickly skimming through the implementation I am not sure if it
> is able to track the source of a constant value that was constant
> folded during the symbolic execution from different values.
It should be able to, because that's how inlined defensive check
suppressions work.
> Only if we had a way to test this functionality in isolation quickly
> :) I think making that possible would also be a valuable addition.
>
> For control dependencies, it is great that LLVM already has some
> algorithms to calculate dominator sets and it might be possible to
> make that work for the Clang CFG.
>
> In my opinion, if you only tackle these problems by the end of the
> summer your project already brought great value to the community.
>
> 2. should-not-have-happened
> This category cannot be solved only using visitors for sure. So we
> might end up with an implementation for this problem that is
> completely separate from the previous one (maybe except for getting
> control dependencies?). As Artem pointed out, we might query some
> information from the analyzer e.g. which functions were inlined. One
> option is to use a slicing algorithm here. However, we should make
> sure that the way a function is conservatively evaluated by the
> slicing algorithm and the analyzer is compatible. E.g.: if the
> analyzer does not invalidate a memory region during evaluating the
> call, the statement should not be considered as relevant for the
> variable in question. We have several challenges here, the analyzer
> reasons about memory regions while the slicing algorithm will probably
> not. Also, we need to do something with pointers, since we do not want
> to implement a full-fledged pointer analysis. We also need to decide
> how to represent arrays, structures etc.
>
> I skimmed through the algorithm in the google docs, and I think there
> are some other details we need to work out. For example, you never
> remove anything from the set of relevant variables. Consider the
> following example:
> x = 5;
> y = x + 2;
> y = 2;
> z = y + 3;
>
> Whit the slicing criterion (z = y + 3). Here, once y is overwritten
> with 2, it is no longer relevant, sox should not end up in the slice.
> The algorithm you proposed will not handle this correctly (yet). Also
> I am not sure if we actually need therelevant_variable_map or having a
> set will be sufficient.
>
> Regards,
> Gabor
>
> On Mon, 27 May 2019 at 12:52, Kristóf Umann <dkszelethus at gmail.com
> <mailto:dkszelethus at gmail.com>> wrote:
>
> + Ádám Balogh
>
> On Fri, 24 May 2019 at 23:31, Artem Dergachev <noqnoqneo at gmail.com
> <mailto:noqnoqneo at gmail.com>> wrote:
>
> Ok, i think i have a little bit of clarity.
>
> Let's first talk about must-have-happened problems. For instance,
>
> int *foo() {
> return coin() ? new int : nullptr; // needs note
> }
> void bar() {
> int *x = foo();
> *x = 1; // warning: null dereference
> }
>
> In this case trackExpressionValue solves the
> "must-have-happened" problem of "x must-have been assigned a
> null pointer value" by displaying a note when foo() returns
> nullptr. This currently more or less works correctly - there
> are a lot of bugs but the overall idea behind
> trackExpressionValue is correct.
>
> Example 1 in your document is another example of a
> must-have-happened problem: in order for the report to be
> valid, we need to show that 'flag' must-have-changed between
> line 17 and line 13. That's something that the Static Analyzer
> currently cannot handle and you plan to improve upon it.
>
> Hʏᴘᴏᴛʜᴇsɪs1. All "must-have-happened" problems should be
> solved with a bug visitor. We don't need any new analysis
> algorithm for that.
>
> In particular, i believe that Example 1 can be solved by
> extending trackExpressionValue() with a bug visitor that
> detects control-flow-dependencies via this approach of yours:
>
> > "Check whether the statement is a control dependency of
> Statement 18 with dominator sets: 18 doesn't post dominate 17,
> but every statement in between them does, meaning that 17 is a
> control dependency of 18."
>
> I.e., while ascending through ExplodedGraph, we pick
> interesting statements and perform this domination check on
> their predecessor nodes until we reach the control dependency,
> then we track the control dependency recursively by adding
> more visitors.
>
> I think this is the first thing we should try on our GSoC.
>
> The reason why i believe that Hypothesis 1 is true is that all
> the information that we need is fully contained in the bug
> path. If something must have happened on the path, then it's
> probably in the path and we can figure it out by looking at
> the path. If something must have happened on the path and it's
> not in the path, then why did we emit a warning in the first
> place?
>
> ==============
>
> Now, to should-not-have-happened problems. The main motivating
> example is:
>
> void foo(int *y) {
> if (coin()) // needs note
> *y = 1;
> }
> void bar() {
> int x;
> foo(&x);
> use(x); // warning: use of uninitialized value
> }
>
> In order to make the warning comprehensible, we have to
> explain to the user why do we think that 'x' is uninitialized,
> as it clearly "should-not-have-been" initialized for the
> warning to be correct. For that purpose it is absolutely
> essential to display the execution path within the inlined
> call to foo().
>
> One way to achieve that would be to display *all* execution
> path and leave it up to the user to see that the event that
> should-not-have-happened has indeed never happened.
>
> That, however, would be overwhelming, so we use "path pruning"
> that cuts away execution paths in inlined calls that are known
> to have no interesting events happening. Which, in turn, makes
> us incapable of solving should-not-have-happened problems, as
> it's the whole point of the problem to display execution path
> in inlined functions in which nothing has happened.
>
> In order to figure this out, we need to learn how to
> discriminate between inlined calls in which nothing
> interesting *could* have happened in the first place and
> inlined calls in which something interesting could have
> happened but *didn't*.
>
> NoStoreFuncVisitor solves this problem for the example above.
> The visitor notices that the local variable is passed by
> non-const pointer into the inlined call and the call
> syntactically contains assignments through this pointer,
> therefore this call *could* have initialized the value. The
> visitor then suppresses pruning of this call by emitting an
> interesting note within the call ("returning without
> initializing '*y'"...).
>
> However, AST-based analysis in NoStoreFuncVisitor is very
> primitive and easy to trick. A more sophisticated analysis is
> clearly necessary.
>
> The second thing we can try in our GSoC is to come up with a
> better analysis specifically for the uninitialized value
> checker, because we already have a place to stick it into and
> we know how exactly do we want to consume the result of such
> analysis and evaluate it.
>
> The third thing we can try in our GSoC is to come up with more
> such analysis for other checkers and other
> should-have-happened problems and see if a reusable framework
> for creating such analyses can be implemented.
>
>
> On 5/23/19 1:25 PM, Kristóf Umann wrote:
>>
>>
>> On Thu, 23 May 2019 at 21:58, Kristóf Umann
>> <dkszelethus at gmail.com <mailto:dkszelethus at gmail.com>> wrote:
>>
>> Please let me know if I didn't address something you
>> mentioned!
>>
>> On Thu, 23 May 2019 at 02:29, Artem Dergachev
>> <noqnoqneo at gmail.com <mailto:noqnoqneo at gmail.com>> wrote:
>>
>> My primary question is, how do you plan to use the
>> data that you obtain via your analysis?
>>
>>
>> Gather relevant statements. Keep ExplodedNodes in the
>> bugpath for which PathDiagnosticLocation::getStmt() is in
>> the set (or, something like that at least). But honestly,
>> I didn't think too much about that yet :^) I guess we
>> could just gather ExplodedNodes rather than statements,
>> we'll see.
>>
>> Like, do you plan to backtrack the value of all
>> relevant variables and/or expressions in the final
>> bug report that were also encountered along the bug
>> path? If yes, then is it possible that the slicing
>> criterion gets updated in the process and you'll have
>> to re-run the CFG-based analysis to take it into account?
>>
>>
>> No, the slicing criterion is what it is, and will not
>> change. The set of relevant statements to the slicing
>> criterion define the actual program slice, which is
>> basically the thing we're going for in this project. When
>> it turns out that a value of a variable directly affects
>> a variable in the slicing criterion (either through data
>> or control dependency), we just add it to the set of
>> relevant variables, and then if something in the set of
>> relevant variables is affected (some statements or
>> variables turn out to be a control/data dependencies to
>> them), then it's added to the respective relevancy sets.
>> Should we only traverse the basic blocks the analyzer
>> visited, I think we can pull off the slicing with a
>> single pass.
>>
>>
>> We'll see about that single pass thing though. I'm gathering
>> some tricky examples in the document I shared and working on
>> a neat algorithm.
>>
>> > What would also be really great is to assist this
>> traversal with the information the analyzer already
>> has, essentially only inspecting basic blocks the
>> analyzer actually visits.
>>
>> You mean visits on the current bug path or visits on
>> the equivalence class of bugs or visits in general
>> during analysis?
>>
>> Regardless, for many problems ideally we should
>> traverse basic blocks that the analyzer *didn't*
>> visit (eg., if the function didn't initialize the
>> variable on the current path, we're interested in
>> parts of the code in which it *did* initialize the
>> variable, even though we didn't necessarily have time
>> to visit them).
>>
>>
>> Well I mean slicing by definition isn't intended to do
>> that. The entire point of it is to gather the smallest
>> slice related to the slicing criterion, and my project
>> aims to fill in the gaps where we don't provide enough
>> information.
>>
>> It actually sounds as if all problems that we're
>> trying to solve here can be classified into
>> "must-have-happened" problems (eg., "variable
>> must-have been declared without initialization",
>> "variable must-have been set to nullptr", "memory
>> must-have been allocated") and
>> "should-not-have-happened" problems (eg., "variable
>> should-not-have been initialized", "null value
>> should-not-have been overwritten", "pointer
>> should-not-have escaped").
>>
>> For must-have-happened problems, i'm recently under
>> an impression that we should suppress the bug report
>> entirely when we fail to solve them (i.e., if fail to
>> explain to the user where exactly does this happen,
>> then the report is impossible to understand anyway).
>> This is currently a very big problem for our null
>> dereference checker on some codebases, especially
>> because it uses this tracking for suppressions as
>> well (aka inlined defensive checks), so when it fails
>> to track the variable it's likely a legit false
>> positive as well, not simply a hard-to-understand report.
>>
>>
>> I think this set calculating approach inherently gathers
>> far more information, allowing us to make better
>> judgement on whether we should suppress the report.
>>
>> For should-not-have-happened problems i'm much more
>> confused. We're talking about looking for places
>> where it *could* have happened and then trying to
>> explain to the user why none of them have actually
>> happened. I'm not sure what are the consequences of
>> failing to explain to the user why didn't a
>> particular piece of code do something, because i've
>> no idea how do users intuitively figure out which
>> code *could* have done these things and which clearly
>> couldn't.
>>
>>
>> Im really at a loss here :) Can you provide some example
>> as to what a problematic "must-have-happened" bug report
>> would look like, and how would you like to see it
>> improved? Same for "should-not-have-happened". Because as
>> I understand it, and I might be wrong, you have problems
>> with this:
>>
>> int a; // declaring a without initializing
>>
>> if (rand()) // assuming that the condition is false
>> a = 5;
>>
>> print(a); // a is undef
>>
>> And prefer to see this:
>>
>> int a; // declaring a without initializing
>>
>> if (rand()) // should this condition be false, a's value
>> will be indeterministic
>> a = 5; // writing to a skipped
>>
>> print(a); // a is undef
>>
>> and I just don't see how this would be possible with
>> slicing at all. Also, I can't see how this would scale to
>> real-world production code.
>>
>>
>> On 5/22/19 4:11 PM, Kristóf Umann wrote:
>>>
>>> Hi!
>>>
>>>
>>> I'd like to share some thoughts about my GSoC
>>> project, "Enhancing bug reporting with static
>>> backward program slicing"[1].
>>>
>>>
>>> My proposal is very specific about whatI'm aiming to
>>> improve on, but vague on the howpart of it. This
>>> mail aims to add clarify some of this.
>>>
>>>
>>> Program slicing is essentially a technique of
>>> discovering data and control dependencies to the
>>> slicing criterion, which is a (statement, {set of
>>> variables}) pair. Fortunately, tools for control
>>> dependencies, namely, dominator set calculations are
>>> already implemented, but seems to be unstable with
>>> clang's CFG. It would be a great tool if I were able
>>> to fix it.
>>>
>>>
>>> While my proposal states that I plan to implement an
>>> AST-based solution, I'm actually not sure that this
>>> is the optimal approach. We could instead inspect
>>> CFG blocks in a backwards manner (coupled with the
>>> analyzer's call stack), and gather relevant variable
>>> in the meanwhile.
>>>
>>>
>>> What would also be really great is to assist this
>>> traversal with the information the analyzer already
>>> has, essentially only inspecting basic blocks the
>>> analyzer actually visits.
>>>
>>>
>>> Here’s my idea for an algorithm (from the point of
>>> the slicing criterion already being constructed):
>>>
>>> https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing
>>>
>>>
>>>
>>> Please note that this is a veeery rough sketch, I
>>> didn't think about every edge case that exists, but
>>> I think it's an okay start.
>>>
>>>
>>> [1]
>>> https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit
>>>
>>
>
More information about the cfe-dev
mailing list