[cfe-dev] [analyzer] Rough sketch of the algorithm for "Enhancing bug reporting with static backward program slicing"
Artem Dergachev via cfe-dev
cfe-dev at lists.llvm.org
Thu May 30 17:55:38 PDT 2019
Interesting.
Condition on line 12 would not be our control dependency - neither in
example 1 nor in example 2, simply because there's nothing interesting
within either branches of that if-statement. The 'new int' assignment
would have been interesting if we solved the should-not-have-happened
problem of "x should not have been overwritten", but for now we don't do
that.
At the same time, condition on line 16 would still be a control
dependency of our report in both examples, because this is where the
actual crash happens. So we will have a note on line 05 in both cases,
even if it's completely unnecessary in the second example.
We would indeed never have a note on line 10. This is, in fact, a dead
store, this value is never used, so we wouldn't be able to track
anything back to it.
On 5/30/19 4:48 PM, Kristóf Umann wrote:
> Allow me to elaborate.
>
> *Example 1.:*
>
> 01 int flag;
>
> 02 bool coin();
>
> 03
>
> 04 void foo() {
>
> 05 flag = coin(); // no note
>
> 06 }
>
> 07
>
> 08 int main() {
>
> 09 int *x = 0; // x initialized to 0
>
> 10 flag = 1;
>
> 11 foo();
>
> 12 if (flag) // assumed false
>
> 13 x = new int;
>
> 14 foo();
>
> 15
>
> 16 if (flag) // assumed true
>
> 17 *x = 5; // warn
>
> 18 }
>
>
> *Example 2.: (note the change on line 15)*
>
> 01 int flag;
>
> 02 bool coin();
>
> 03
>
> 04 void foo() {
>
> 05 flag = coin(); // no note
>
> 06 }
>
> 07
>
> 08 int main() {
>
> 09 int *x = 0;
>
> 10 flag = 1;
>
> 11 foo();
>
> 12 if (flag) // assumed false
>
> 13 x = new int;
>
> 14 foo();
>
> 15 x = 0; // x set to 0
>
> 16 if (flag) // assumed true
>
> 17 *x = 5; // warn
>
> 18 }
>
>
> By tweaking trackExpressionValue a bit:
>
> bool trackExpressionValue(const ExplodedNode *N,
>
> const Stmt *S, BugReport &R,
>
> bool IsArg=false,
>
> bool EnableNullFPSuppression=true);
>
>
> Should a block Bin the bug path be a control dependency of N, track
> the expression value in the condition of B. With that, I believe we
> should get a note for statement 14. For statement 10, we still
> wouldn’t get a note saying that flag’s value was invalidated (I fear),
> but it’s more of a should-not-have-happened case. For Example 2
> however, we’d get a pretty much perfect bug report.
>
> On Fri, 31 May 2019 at 00:16, Kristóf Umann <dkszelethus at gmail.com
> <mailto:dkszelethus at gmail.com>> wrote:
>
>
>
> On Mon, 27 May 2019 at 19:48, Artem Dergachev <noqnoqneo at gmail.com
> <mailto:noqnoqneo at gmail.com>> wrote:
>
>
>
> On 5/27/19 10:12 AM, Gábor Horváth wrote:
> > Hi!
> >
> > I wanted to share some of my points as well but had little
> time to do so.
> >
> > Artem identified two kinds of issues. One of which has all the
> > information available in the bug path (in his terminology:
> > must-have-happened) and the other which has not (
> > should-not-have-happened). The goal is the same in both of
> the cases,
> > identify all dependencies including control and data
> dependencies. The
> > main difference is that in one case we could use all the
> information
> > available in the bug path while in the other we cannot.
> >
> > 1. must-have-happened:
> > Since the bug-path contains all the information we need, I
> do agree
> > with Artem using bug path visitors might be sufficient. I do
> not know
> > how reliable `trackExpressionValue` is in identifying data
> > dependencies, but revising it could be one step in GSoC. For
> example,
> > after quickly skimming through the implementation I am not
> sure if it
> > is able to track the source of a constant value that was
> constant
> > folded during the symbolic execution from different values.
>
> It should be able to, because that's how inlined defensive check
> suppressions work.
>
>
> I recently finished (not yet published) my work on implementing
> control dependencies for clang's CFG via LLVM's IDFCalculator.
> Theoretically speaking, by simply tracking the variables in the
> block that is a control dependency of the block that is being
> tracked, we should be able to get a proper bug report for Example
> 1? From afar, it seems reasonable.
>
> > Only if we had a way to test this functionality in isolation
> quickly
> > :) I think making that possible would also be a valuable
> addition.
> >
> > For control dependencies, it is great that LLVM already has
> some
> > algorithms to calculate dominator sets and it might be
> possible to
> > make that work for the Clang CFG.
> >
> > In my opinion, if you only tackle these problems by the end
> of the
> > summer your project already brought great value to the
> community.
> >
> > 2. should-not-have-happened
> > This category cannot be solved only using visitors for sure.
> So we
> > might end up with an implementation for this problem that is
> > completely separate from the previous one (maybe except for
> getting
> > control dependencies?). As Artem pointed out, we might query
> some
> > information from the analyzer e.g. which functions were
> inlined. One
> > option is to use a slicing algorithm here. However, we
> should make
> > sure that the way a function is conservatively evaluated by the
> > slicing algorithm and the analyzer is compatible. E.g.: if the
> > analyzer does not invalidate a memory region during
> evaluating the
> > call, the statement should not be considered as relevant for
> the
> > variable in question. We have several challenges here, the
> analyzer
> > reasons about memory regions while the slicing algorithm
> will probably
> > not. Also, we need to do something with pointers, since we
> do not want
> > to implement a full-fledged pointer analysis. We also need
> to decide
> > how to represent arrays, structures etc.
> >
> > I skimmed through the algorithm in the google docs, and I
> think there
> > are some other details we need to work out. For example, you
> never
> > remove anything from the set of relevant variables. Consider
> the
> > following example:
> > x = 5;
> > y = x + 2;
> > y = 2;
> > z = y + 3;
> >
> > Whit the slicing criterion (z = y + 3). Here, once y is
> overwritten
> > with 2, it is no longer relevant, sox should not end up in
> the slice.
> > The algorithm you proposed will not handle this correctly
> (yet). Also
> > I am not sure if we actually need therelevant_variable_map
> or having a
> > set will be sufficient.
> >
> > Regards,
> > Gabor
> >
> > On Mon, 27 May 2019 at 12:52, Kristóf Umann
> <dkszelethus at gmail.com <mailto:dkszelethus at gmail.com>
> > <mailto:dkszelethus at gmail.com
> <mailto:dkszelethus at gmail.com>>> wrote:
> >
> > + Ádám Balogh
> >
> > On Fri, 24 May 2019 at 23:31, Artem Dergachev
> <noqnoqneo at gmail.com <mailto:noqnoqneo at gmail.com>
> > <mailto:noqnoqneo at gmail.com
> <mailto:noqnoqneo at gmail.com>>> wrote:
> >
> > Ok, i think i have a little bit of clarity.
> >
> > Let's first talk about must-have-happened problems.
> For instance,
> >
> > int *foo() {
> > return coin() ? new int : nullptr; // needs note
> > }
> > void bar() {
> > int *x = foo();
> > *x = 1; // warning: null dereference
> > }
> >
> > In this case trackExpressionValue solves the
> > "must-have-happened" problem of "x must-have been
> assigned a
> > null pointer value" by displaying a note when foo()
> returns
> > nullptr. This currently more or less works correctly
> - there
> > are a lot of bugs but the overall idea behind
> > trackExpressionValue is correct.
> >
> > Example 1 in your document is another example of a
> > must-have-happened problem: in order for the report
> to be
> > valid, we need to show that 'flag' must-have-changed
> between
> > line 17 and line 13. That's something that the
> Static Analyzer
> > currently cannot handle and you plan to improve upon it.
> >
> > Hʏᴘᴏᴛʜᴇsɪs1. All "must-have-happened" problems should be
> > solved with a bug visitor. We don't need any new
> analysis
> > algorithm for that.
> >
> > In particular, i believe that Example 1 can be solved by
> > extending trackExpressionValue() with a bug visitor that
> > detects control-flow-dependencies via this approach
> of yours:
> >
> > > "Check whether the statement is a control
> dependency of
> > Statement 18 with dominator sets: 18 doesn't post
> dominate 17,
> > but every statement in between them does, meaning
> that 17 is a
> > control dependency of 18."
> >
> > I.e., while ascending through ExplodedGraph, we pick
> > interesting statements and perform this domination
> check on
> > their predecessor nodes until we reach the control
> dependency,
> > then we track the control dependency recursively by
> adding
> > more visitors.
> >
> > I think this is the first thing we should try on our
> GSoC.
> >
> > The reason why i believe that Hypothesis 1 is true
> is that all
> > the information that we need is fully contained in
> the bug
> > path. If something must have happened on the path,
> then it's
> > probably in the path and we can figure it out by
> looking at
> > the path. If something must have happened on the
> path and it's
> > not in the path, then why did we emit a warning in
> the first
> > place?
> >
> > ==============
> >
> > Now, to should-not-have-happened problems. The main
> motivating
> > example is:
> >
> > void foo(int *y) {
> > if (coin()) // needs note
> > *y = 1;
> > }
> > void bar() {
> > int x;
> > foo(&x);
> > use(x); // warning: use of uninitialized value
> > }
> >
> > In order to make the warning comprehensible, we have to
> > explain to the user why do we think that 'x' is
> uninitialized,
> > as it clearly "should-not-have-been" initialized for the
> > warning to be correct. For that purpose it is absolutely
> > essential to display the execution path within the
> inlined
> > call to foo().
> >
> > One way to achieve that would be to display *all*
> execution
> > path and leave it up to the user to see that the
> event that
> > should-not-have-happened has indeed never happened.
> >
> > That, however, would be overwhelming, so we use
> "path pruning"
> > that cuts away execution paths in inlined calls that
> are known
> > to have no interesting events happening. Which, in
> turn, makes
> > us incapable of solving should-not-have-happened
> problems, as
> > it's the whole point of the problem to display
> execution path
> > in inlined functions in which nothing has happened.
> >
> > In order to figure this out, we need to learn how to
> > discriminate between inlined calls in which nothing
> > interesting *could* have happened in the first place and
> > inlined calls in which something interesting could have
> > happened but *didn't*.
> >
> > NoStoreFuncVisitor solves this problem for the
> example above.
> > The visitor notices that the local variable is passed by
> > non-const pointer into the inlined call and the call
> > syntactically contains assignments through this pointer,
> > therefore this call *could* have initialized the
> value. The
> > visitor then suppresses pruning of this call by
> emitting an
> > interesting note within the call ("returning without
> > initializing '*y'"...).
> >
> > However, AST-based analysis in NoStoreFuncVisitor is
> very
> > primitive and easy to trick. A more sophisticated
> analysis is
> > clearly necessary.
> >
> > The second thing we can try in our GSoC is to come
> up with a
> > better analysis specifically for the uninitialized value
> > checker, because we already have a place to stick it
> into and
> > we know how exactly do we want to consume the result
> of such
> > analysis and evaluate it.
> >
> > The third thing we can try in our GSoC is to come up
> with more
> > such analysis for other checkers and other
> > should-have-happened problems and see if a reusable
> framework
> > for creating such analyses can be implemented.
> >
> >
> > On 5/23/19 1:25 PM, Kristóf Umann wrote:
> >>
> >>
> >> On Thu, 23 May 2019 at 21:58, Kristóf Umann
> >> <dkszelethus at gmail.com
> <mailto:dkszelethus at gmail.com> <mailto:dkszelethus at gmail.com
> <mailto:dkszelethus at gmail.com>>> wrote:
> >>
> >> Please let me know if I didn't address
> something you
> >> mentioned!
> >>
> >> On Thu, 23 May 2019 at 02:29, Artem Dergachev
> >> <noqnoqneo at gmail.com
> <mailto:noqnoqneo at gmail.com> <mailto:noqnoqneo at gmail.com
> <mailto:noqnoqneo at gmail.com>>> wrote:
> >>
> >> My primary question is, how do you plan to
> use the
> >> data that you obtain via your analysis?
> >>
> >>
> >> Gather relevant statements. Keep ExplodedNodes
> in the
> >> bugpath for which
> PathDiagnosticLocation::getStmt() is in
> >> the set (or, something like that at least). But
> honestly,
> >> I didn't think too much about that yet :^) I
> guess we
> >> could just gather ExplodedNodes rather than
> statements,
> >> we'll see.
> >>
> >> Like, do you plan to backtrack the value of all
> >> relevant variables and/or expressions in
> the final
> >> bug report that were also encountered along
> the bug
> >> path? If yes, then is it possible that the
> slicing
> >> criterion gets updated in the process and
> you'll have
> >> to re-run the CFG-based analysis to take it
> into account?
> >>
> >>
> >> No, the slicing criterion is what it is, and
> will not
> >> change. The set of relevant statements to the
> slicing
> >> criterion define the actual program slice, which is
> >> basically the thing we're going for in this
> project. When
> >> it turns out that a value of a variable
> directly affects
> >> a variable in the slicing criterion (either
> through data
> >> or control dependency), we just add it to the
> set of
> >> relevant variables, and then if something in
> the set of
> >> relevant variables is affected (some statements or
> >> variables turn out to be a control/data
> dependencies to
> >> them), then it's added to the respective
> relevancy sets.
> >> Should we only traverse the basic blocks the
> analyzer
> >> visited, I think we can pull off the slicing with a
> >> single pass.
> >>
> >>
> >> We'll see about that single pass thing though. I'm
> gathering
> >> some tricky examples in the document I shared and
> working on
> >> a neat algorithm.
> >>
> >> > What would also be really great is to
> assist this
> >> traversal with the information the analyzer
> already
> >> has, essentially only inspecting basic
> blocks the
> >> analyzer actually visits.
> >>
> >> You mean visits on the current bug path or
> visits on
> >> the equivalence class of bugs or visits in
> general
> >> during analysis?
> >>
> >> Regardless, for many problems ideally we should
> >> traverse basic blocks that the analyzer
> *didn't*
> >> visit (eg., if the function didn't
> initialize the
> >> variable on the current path, we're
> interested in
> >> parts of the code in which it *did*
> initialize the
> >> variable, even though we didn't necessarily
> have time
> >> to visit them).
> >>
> >>
> >> Well I mean slicing by definition isn't
> intended to do
> >> that. The entire point of it is to gather the
> smallest
> >> slice related to the slicing criterion, and my
> project
> >> aims to fill in the gaps where we don't provide
> enough
> >> information.
> >>
> >> It actually sounds as if all problems that
> we're
> >> trying to solve here can be classified into
> >> "must-have-happened" problems (eg., "variable
> >> must-have been declared without
> initialization",
> >> "variable must-have been set to nullptr",
> "memory
> >> must-have been allocated") and
> >> "should-not-have-happened" problems (eg.,
> "variable
> >> should-not-have been initialized", "null value
> >> should-not-have been overwritten", "pointer
> >> should-not-have escaped").
> >>
> >> For must-have-happened problems, i'm
> recently under
> >> an impression that we should suppress the
> bug report
> >> entirely when we fail to solve them (i.e.,
> if fail to
> >> explain to the user where exactly does this
> happen,
> >> then the report is impossible to understand
> anyway).
> >> This is currently a very big problem for
> our null
> >> dereference checker on some codebases,
> especially
> >> because it uses this tracking for
> suppressions as
> >> well (aka inlined defensive checks), so
> when it fails
> >> to track the variable it's likely a legit false
> >> positive as well, not simply a
> hard-to-understand report.
> >>
> >>
> >> I think this set calculating approach
> inherently gathers
> >> far more information, allowing us to make better
> >> judgement on whether we should suppress the report.
> >>
> >> For should-not-have-happened problems i'm
> much more
> >> confused. We're talking about looking for
> places
> >> where it *could* have happened and then
> trying to
> >> explain to the user why none of them have
> actually
> >> happened. I'm not sure what are the
> consequences of
> >> failing to explain to the user why didn't a
> >> particular piece of code do something,
> because i've
> >> no idea how do users intuitively figure out
> which
> >> code *could* have done these things and
> which clearly
> >> couldn't.
> >>
> >>
> >> Im really at a loss here :) Can you provide
> some example
> >> as to what a problematic "must-have-happened"
> bug report
> >> would look like, and how would you like to see it
> >> improved? Same for "should-not-have-happened".
> Because as
> >> I understand it, and I might be wrong, you have
> problems
> >> with this:
> >>
> >> int a; // declaring a without initializing
> >>
> >> if (rand()) // assuming that the condition is false
> >> a = 5;
> >>
> >> print(a); // a is undef
> >>
> >> And prefer to see this:
> >>
> >> int a; // declaring a without initializing
> >>
> >> if (rand()) // should this condition be false,
> a's value
> >> will be indeterministic
> >> a = 5; // writing to a skipped
> >>
> >> print(a); // a is undef
> >>
> >> and I just don't see how this would be possible
> with
> >> slicing at all. Also, I can't see how this
> would scale to
> >> real-world production code.
> >>
> >>
> >> On 5/22/19 4:11 PM, Kristóf Umann wrote:
> >>>
> >>> Hi!
> >>>
> >>>
> >>> I'd like to share some thoughts about my GSoC
> >>> project, "Enhancing bug reporting with static
> >>> backward program slicing"[1].
> >>>
> >>>
> >>> My proposal is very specific about whatI'm
> aiming to
> >>> improve on, but vague on the howpart of
> it. This
> >>> mail aims to add clarify some of this.
> >>>
> >>>
> >>> Program slicing is essentially a technique of
> >>> discovering data and control dependencies
> to the
> >>> slicing criterion, which is a (statement,
> {set of
> >>> variables}) pair. Fortunately, tools for
> control
> >>> dependencies, namely, dominator set
> calculations are
> >>> already implemented, but seems to be
> unstable with
> >>> clang's CFG. It would be a great tool if I
> were able
> >>> to fix it.
> >>>
> >>>
> >>> While my proposal states that I plan to
> implement an
> >>> AST-based solution, I'm actually not sure
> that this
> >>> is the optimal approach. We could instead
> inspect
> >>> CFG blocks in a backwards manner (coupled
> with the
> >>> analyzer's call stack), and gather
> relevant variable
> >>> in the meanwhile.
> >>>
> >>>
> >>> What would also be really great is to
> assist this
> >>> traversal with the information the
> analyzer already
> >>> has, essentially only inspecting basic
> blocks the
> >>> analyzer actually visits.
> >>>
> >>>
> >>> Here’s my idea for an algorithm (from the
> point of
> >>> the slicing criterion already being
> constructed):
> >>>
> >>>
> https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing
> >>>
> >>>
> >>>
> >>> Please note that this is a veeery rough
> sketch, I
> >>> didn't think about every edge case that
> exists, but
> >>> I think it's an okay start.
> >>>
> >>>
> >>> [1]
> >>>
> https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit
> >>>
> >>
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190530/3fa39394/attachment.html>
More information about the cfe-dev
mailing list