[cfe-dev] [analyzer] Rough sketch of the algorithm for "Enhancing bug reporting with static backward program slicing"
Kristóf Umann via cfe-dev
cfe-dev at lists.llvm.org
Thu May 30 16:48:55 PDT 2019
Allow me to elaborate.
*Example 1.:*
01 int flag;
02 bool coin();
03
04 void foo() {
05 flag = coin(); // no note
06 }
07
08 int main() {
09 int *x = 0; // x initialized to 0
10 flag = 1;
11 foo();
12 if (flag) // assumed false
13 x = new int;
14 foo();
15
16 if (flag) // assumed true
17 *x = 5; // warn
18 }
*Example 2.: (note the change on line 15)*
01 int flag;
02 bool coin();
03
04 void foo() {
05 flag = coin(); // no note
06 }
07
08 int main() {
09 int *x = 0;
10 flag = 1;
11 foo();
12 if (flag) // assumed false
13 x = new int;
14 foo();
15 x = 0; // x set to 0
16 if (flag) // assumed true
17 *x = 5; // warn
18 }
By tweaking trackExpressionValue a bit:
bool trackExpressionValue(const ExplodedNode *N,
const Stmt *S, BugReport &R,
bool IsArg=false,
bool EnableNullFPSuppression=true);
Should a block B in the bug path be a control dependency of N, track the
expression value in the condition of B. With that, I believe we should get
a note for statement 14. For statement 10, we still wouldn’t get a note
saying that flag’s value was invalidated (I fear), but it’s more of a
should-not-have-happened case. For Example 2 however, we’d get a pretty
much perfect bug report.
On Fri, 31 May 2019 at 00:16, Kristóf Umann <dkszelethus at gmail.com> wrote:
>
>
> On Mon, 27 May 2019 at 19:48, Artem Dergachev <noqnoqneo at gmail.com> wrote:
>
>>
>>
>> On 5/27/19 10:12 AM, Gábor Horváth wrote:
>> > Hi!
>> >
>> > I wanted to share some of my points as well but had little time to do
>> so.
>> >
>> > Artem identified two kinds of issues. One of which has all the
>> > information available in the bug path (in his terminology:
>> > must-have-happened) and the other which has not (
>> > should-not-have-happened). The goal is the same in both of the cases,
>> > identify all dependencies including control and data dependencies. The
>> > main difference is that in one case we could use all the information
>> > available in the bug path while in the other we cannot.
>> >
>> > 1. must-have-happened:
>> > Since the bug-path contains all the information we need, I do agree
>> > with Artem using bug path visitors might be sufficient. I do not know
>> > how reliable `trackExpressionValue` is in identifying data
>> > dependencies, but revising it could be one step in GSoC. For example,
>> > after quickly skimming through the implementation I am not sure if it
>> > is able to track the source of a constant value that was constant
>> > folded during the symbolic execution from different values.
>>
>> It should be able to, because that's how inlined defensive check
>> suppressions work.
>>
>
> I recently finished (not yet published) my work on implementing control
> dependencies for clang's CFG via LLVM's IDFCalculator. Theoretically
> speaking, by simply tracking the variables in the block that is a control
> dependency of the block that is being tracked, we should be able to get a
> proper bug report for Example 1? From afar, it seems reasonable.
>
>
>> > Only if we had a way to test this functionality in isolation quickly
>> > :) I think making that possible would also be a valuable addition.
>> >
>> > For control dependencies, it is great that LLVM already has some
>> > algorithms to calculate dominator sets and it might be possible to
>> > make that work for the Clang CFG.
>> >
>> > In my opinion, if you only tackle these problems by the end of the
>> > summer your project already brought great value to the community.
>> >
>> > 2. should-not-have-happened
>> > This category cannot be solved only using visitors for sure. So we
>> > might end up with an implementation for this problem that is
>> > completely separate from the previous one (maybe except for getting
>> > control dependencies?). As Artem pointed out, we might query some
>> > information from the analyzer e.g. which functions were inlined. One
>> > option is to use a slicing algorithm here. However, we should make
>> > sure that the way a function is conservatively evaluated by the
>> > slicing algorithm and the analyzer is compatible. E.g.: if the
>> > analyzer does not invalidate a memory region during evaluating the
>> > call, the statement should not be considered as relevant for the
>> > variable in question. We have several challenges here, the analyzer
>> > reasons about memory regions while the slicing algorithm will probably
>> > not. Also, we need to do something with pointers, since we do not want
>> > to implement a full-fledged pointer analysis. We also need to decide
>> > how to represent arrays, structures etc.
>> >
>> > I skimmed through the algorithm in the google docs, and I think there
>> > are some other details we need to work out. For example, you never
>> > remove anything from the set of relevant variables. Consider the
>> > following example:
>> > x = 5;
>> > y = x + 2;
>> > y = 2;
>> > z = y + 3;
>> >
>> > Whit the slicing criterion (z = y + 3). Here, once y is overwritten
>> > with 2, it is no longer relevant, sox should not end up in the slice.
>> > The algorithm you proposed will not handle this correctly (yet). Also
>> > I am not sure if we actually need therelevant_variable_map or having a
>> > set will be sufficient.
>> >
>> > Regards,
>> > Gabor
>> >
>> > On Mon, 27 May 2019 at 12:52, Kristóf Umann <dkszelethus at gmail.com
>> > <mailto:dkszelethus at gmail.com>> wrote:
>> >
>> > + Ádám Balogh
>> >
>> > On Fri, 24 May 2019 at 23:31, Artem Dergachev <noqnoqneo at gmail.com
>> > <mailto:noqnoqneo at gmail.com>> wrote:
>> >
>> > Ok, i think i have a little bit of clarity.
>> >
>> > Let's first talk about must-have-happened problems. For
>> instance,
>> >
>> > int *foo() {
>> > return coin() ? new int : nullptr; // needs note
>> > }
>> > void bar() {
>> > int *x = foo();
>> > *x = 1; // warning: null dereference
>> > }
>> >
>> > In this case trackExpressionValue solves the
>> > "must-have-happened" problem of "x must-have been assigned a
>> > null pointer value" by displaying a note when foo() returns
>> > nullptr. This currently more or less works correctly - there
>> > are a lot of bugs but the overall idea behind
>> > trackExpressionValue is correct.
>> >
>> > Example 1 in your document is another example of a
>> > must-have-happened problem: in order for the report to be
>> > valid, we need to show that 'flag' must-have-changed between
>> > line 17 and line 13. That's something that the Static Analyzer
>> > currently cannot handle and you plan to improve upon it.
>> >
>> > Hʏᴘᴏᴛʜᴇsɪs1. All "must-have-happened" problems should be
>> > solved with a bug visitor. We don't need any new analysis
>> > algorithm for that.
>> >
>> > In particular, i believe that Example 1 can be solved by
>> > extending trackExpressionValue() with a bug visitor that
>> > detects control-flow-dependencies via this approach of yours:
>> >
>> > > "Check whether the statement is a control dependency of
>> > Statement 18 with dominator sets: 18 doesn't post dominate 17,
>> > but every statement in between them does, meaning that 17 is a
>> > control dependency of 18."
>> >
>> > I.e., while ascending through ExplodedGraph, we pick
>> > interesting statements and perform this domination check on
>> > their predecessor nodes until we reach the control dependency,
>> > then we track the control dependency recursively by adding
>> > more visitors.
>> >
>> > I think this is the first thing we should try on our GSoC.
>> >
>> > The reason why i believe that Hypothesis 1 is true is that all
>> > the information that we need is fully contained in the bug
>> > path. If something must have happened on the path, then it's
>> > probably in the path and we can figure it out by looking at
>> > the path. If something must have happened on the path and it's
>> > not in the path, then why did we emit a warning in the first
>> > place?
>> >
>> > ==============
>> >
>> > Now, to should-not-have-happened problems. The main motivating
>> > example is:
>> >
>> > void foo(int *y) {
>> > if (coin()) // needs note
>> > *y = 1;
>> > }
>> > void bar() {
>> > int x;
>> > foo(&x);
>> > use(x); // warning: use of uninitialized value
>> > }
>> >
>> > In order to make the warning comprehensible, we have to
>> > explain to the user why do we think that 'x' is uninitialized,
>> > as it clearly "should-not-have-been" initialized for the
>> > warning to be correct. For that purpose it is absolutely
>> > essential to display the execution path within the inlined
>> > call to foo().
>> >
>> > One way to achieve that would be to display *all* execution
>> > path and leave it up to the user to see that the event that
>> > should-not-have-happened has indeed never happened.
>> >
>> > That, however, would be overwhelming, so we use "path pruning"
>> > that cuts away execution paths in inlined calls that are known
>> > to have no interesting events happening. Which, in turn, makes
>> > us incapable of solving should-not-have-happened problems, as
>> > it's the whole point of the problem to display execution path
>> > in inlined functions in which nothing has happened.
>> >
>> > In order to figure this out, we need to learn how to
>> > discriminate between inlined calls in which nothing
>> > interesting *could* have happened in the first place and
>> > inlined calls in which something interesting could have
>> > happened but *didn't*.
>> >
>> > NoStoreFuncVisitor solves this problem for the example above.
>> > The visitor notices that the local variable is passed by
>> > non-const pointer into the inlined call and the call
>> > syntactically contains assignments through this pointer,
>> > therefore this call *could* have initialized the value. The
>> > visitor then suppresses pruning of this call by emitting an
>> > interesting note within the call ("returning without
>> > initializing '*y'"...).
>> >
>> > However, AST-based analysis in NoStoreFuncVisitor is very
>> > primitive and easy to trick. A more sophisticated analysis is
>> > clearly necessary.
>> >
>> > The second thing we can try in our GSoC is to come up with a
>> > better analysis specifically for the uninitialized value
>> > checker, because we already have a place to stick it into and
>> > we know how exactly do we want to consume the result of such
>> > analysis and evaluate it.
>> >
>> > The third thing we can try in our GSoC is to come up with more
>> > such analysis for other checkers and other
>> > should-have-happened problems and see if a reusable framework
>> > for creating such analyses can be implemented.
>> >
>> >
>> > On 5/23/19 1:25 PM, Kristóf Umann wrote:
>> >>
>> >>
>> >> On Thu, 23 May 2019 at 21:58, Kristóf Umann
>> >> <dkszelethus at gmail.com <mailto:dkszelethus at gmail.com>> wrote:
>> >>
>> >> Please let me know if I didn't address something you
>> >> mentioned!
>> >>
>> >> On Thu, 23 May 2019 at 02:29, Artem Dergachev
>> >> <noqnoqneo at gmail.com <mailto:noqnoqneo at gmail.com>> wrote:
>> >>
>> >> My primary question is, how do you plan to use the
>> >> data that you obtain via your analysis?
>> >>
>> >>
>> >> Gather relevant statements. Keep ExplodedNodes in the
>> >> bugpath for which PathDiagnosticLocation::getStmt() is in
>> >> the set (or, something like that at least). But honestly,
>> >> I didn't think too much about that yet :^) I guess we
>> >> could just gather ExplodedNodes rather than statements,
>> >> we'll see.
>> >>
>> >> Like, do you plan to backtrack the value of all
>> >> relevant variables and/or expressions in the final
>> >> bug report that were also encountered along the bug
>> >> path? If yes, then is it possible that the slicing
>> >> criterion gets updated in the process and you'll have
>> >> to re-run the CFG-based analysis to take it into
>> account?
>> >>
>> >>
>> >> No, the slicing criterion is what it is, and will not
>> >> change. The set of relevant statements to the slicing
>> >> criterion define the actual program slice, which is
>> >> basically the thing we're going for in this project. When
>> >> it turns out that a value of a variable directly affects
>> >> a variable in the slicing criterion (either through data
>> >> or control dependency), we just add it to the set of
>> >> relevant variables, and then if something in the set of
>> >> relevant variables is affected (some statements or
>> >> variables turn out to be a control/data dependencies to
>> >> them), then it's added to the respective relevancy sets.
>> >> Should we only traverse the basic blocks the analyzer
>> >> visited, I think we can pull off the slicing with a
>> >> single pass.
>> >>
>> >>
>> >> We'll see about that single pass thing though. I'm gathering
>> >> some tricky examples in the document I shared and working on
>> >> a neat algorithm.
>> >>
>> >> > What would also be really great is to assist this
>> >> traversal with the information the analyzer already
>> >> has, essentially only inspecting basic blocks the
>> >> analyzer actually visits.
>> >>
>> >> You mean visits on the current bug path or visits on
>> >> the equivalence class of bugs or visits in general
>> >> during analysis?
>> >>
>> >> Regardless, for many problems ideally we should
>> >> traverse basic blocks that the analyzer *didn't*
>> >> visit (eg., if the function didn't initialize the
>> >> variable on the current path, we're interested in
>> >> parts of the code in which it *did* initialize the
>> >> variable, even though we didn't necessarily have time
>> >> to visit them).
>> >>
>> >>
>> >> Well I mean slicing by definition isn't intended to do
>> >> that. The entire point of it is to gather the smallest
>> >> slice related to the slicing criterion, and my project
>> >> aims to fill in the gaps where we don't provide enough
>> >> information.
>> >>
>> >> It actually sounds as if all problems that we're
>> >> trying to solve here can be classified into
>> >> "must-have-happened" problems (eg., "variable
>> >> must-have been declared without initialization",
>> >> "variable must-have been set to nullptr", "memory
>> >> must-have been allocated") and
>> >> "should-not-have-happened" problems (eg., "variable
>> >> should-not-have been initialized", "null value
>> >> should-not-have been overwritten", "pointer
>> >> should-not-have escaped").
>> >>
>> >> For must-have-happened problems, i'm recently under
>> >> an impression that we should suppress the bug report
>> >> entirely when we fail to solve them (i.e., if fail to
>> >> explain to the user where exactly does this happen,
>> >> then the report is impossible to understand anyway).
>> >> This is currently a very big problem for our null
>> >> dereference checker on some codebases, especially
>> >> because it uses this tracking for suppressions as
>> >> well (aka inlined defensive checks), so when it fails
>> >> to track the variable it's likely a legit false
>> >> positive as well, not simply a hard-to-understand
>> report.
>> >>
>> >>
>> >> I think this set calculating approach inherently gathers
>> >> far more information, allowing us to make better
>> >> judgement on whether we should suppress the report.
>> >>
>> >> For should-not-have-happened problems i'm much more
>> >> confused. We're talking about looking for places
>> >> where it *could* have happened and then trying to
>> >> explain to the user why none of them have actually
>> >> happened. I'm not sure what are the consequences of
>> >> failing to explain to the user why didn't a
>> >> particular piece of code do something, because i've
>> >> no idea how do users intuitively figure out which
>> >> code *could* have done these things and which clearly
>> >> couldn't.
>> >>
>> >>
>> >> Im really at a loss here :) Can you provide some example
>> >> as to what a problematic "must-have-happened" bug report
>> >> would look like, and how would you like to see it
>> >> improved? Same for "should-not-have-happened". Because as
>> >> I understand it, and I might be wrong, you have problems
>> >> with this:
>> >>
>> >> int a; // declaring a without initializing
>> >>
>> >> if (rand()) // assuming that the condition is false
>> >> a = 5;
>> >>
>> >> print(a); // a is undef
>> >>
>> >> And prefer to see this:
>> >>
>> >> int a; // declaring a without initializing
>> >>
>> >> if (rand()) // should this condition be false, a's value
>> >> will be indeterministic
>> >> a = 5; // writing to a skipped
>> >>
>> >> print(a); // a is undef
>> >>
>> >> and I just don't see how this would be possible with
>> >> slicing at all. Also, I can't see how this would scale to
>> >> real-world production code.
>> >>
>> >>
>> >> On 5/22/19 4:11 PM, Kristóf Umann wrote:
>> >>>
>> >>> Hi!
>> >>>
>> >>>
>> >>> I'd like to share some thoughts about my GSoC
>> >>> project, "Enhancing bug reporting with static
>> >>> backward program slicing"[1].
>> >>>
>> >>>
>> >>> My proposal is very specific about whatI'm aiming to
>> >>> improve on, but vague on the howpart of it. This
>> >>> mail aims to add clarify some of this.
>> >>>
>> >>>
>> >>> Program slicing is essentially a technique of
>> >>> discovering data and control dependencies to the
>> >>> slicing criterion, which is a (statement, {set of
>> >>> variables}) pair. Fortunately, tools for control
>> >>> dependencies, namely, dominator set calculations are
>> >>> already implemented, but seems to be unstable with
>> >>> clang's CFG. It would be a great tool if I were able
>> >>> to fix it.
>> >>>
>> >>>
>> >>> While my proposal states that I plan to implement an
>> >>> AST-based solution, I'm actually not sure that this
>> >>> is the optimal approach. We could instead inspect
>> >>> CFG blocks in a backwards manner (coupled with the
>> >>> analyzer's call stack), and gather relevant variable
>> >>> in the meanwhile.
>> >>>
>> >>>
>> >>> What would also be really great is to assist this
>> >>> traversal with the information the analyzer already
>> >>> has, essentially only inspecting basic blocks the
>> >>> analyzer actually visits.
>> >>>
>> >>>
>> >>> Here’s my idea for an algorithm (from the point of
>> >>> the slicing criterion already being constructed):
>> >>>
>> >>>
>> https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing
>> >>>
>> >>>
>> >>>
>> >>> Please note that this is a veeery rough sketch, I
>> >>> didn't think about every edge case that exists, but
>> >>> I think it's an okay start.
>> >>>
>> >>>
>> >>> [1]
>> >>>
>> https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit
>> >>>
>> >>
>> >
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190531/07796e61/attachment.html>
More information about the cfe-dev
mailing list