[cfe-dev] [analyzer] Rough sketch of the algorithm for "Enhancing bug reporting with static backward program slicing"

Thu May 30 17:55:38 PDT 2019

Interesting.

Condition on line 12 would not be our control dependency - neither in 
example 1 nor in example 2, simply because there's nothing interesting 
within either branches of that if-statement. The 'new int' assignment 
would have been interesting if we solved the should-not-have-happened 
problem of "x should not have been overwritten", but for now we don't do 
that.

At the same time, condition on line 16 would still be a control 
dependency of our report in both examples, because this is where the 
actual crash happens. So we will have a note on line 05 in both cases, 
even if it's completely unnecessary in the second example.

We would indeed never have a note on line 10. This is, in fact, a dead 
store, this value is never used, so we wouldn't be able to track 
anything back to it.

On 5/30/19 4:48 PM, Kristóf Umann wrote:
> Allow me to elaborate.
>
> *Example 1.:*
>
> 01 int flag;
>
> 02 bool coin();
>
> 03
>
> 04 void foo() {
>
> 05   flag = coin(); // no note
>
> 06 }
>
> 07
>
> 08 int main() {
>
> 09   int *x = 0; // x initialized to 0
>
> 10   flag = 1;
>
> 11   foo();
>
> 12   if (flag) // assumed false
>
> 13     x = new int;
>
> 14   foo();
>
> 15
>
> 16   if (flag) // assumed true
>
> 17     *x = 5; // warn
>
> 18 }
>
>
> *Example 2.: (note the change on line 15)*
>
> 01 int flag;
>
> 02 bool coin();
>
> 03
>
> 04 void foo() {
>
> 05   flag = coin(); // no note
>
> 06 }
>
> 07
>
> 08 int main() {
>
> 09   int *x = 0;
>
> 10   flag = 1;
>
> 11   foo();
>
> 12   if (flag) // assumed false
>
> 13     x = new int;
>
> 14   foo();
>
> 15   x = 0; // x set to 0
>
> 16   if (flag) // assumed true
>
> 17     *x = 5; // warn
>
> 18 }
>
>
> By tweaking trackExpressionValue a bit:
>
> bool trackExpressionValue(const ExplodedNode *N,
>
>                          const Stmt *S, BugReport &R,
>
>                          bool IsArg=false,
>
>                          bool EnableNullFPSuppression=true);
>
>
> Should a block Bin the bug path be a control dependency of N, track 
> the expression value in the condition of B. With that, I believe we 
> should get a note for statement 14. For statement 10, we still 
> wouldn’t get a note saying that flag’s value was invalidated (I fear), 
> but it’s more of a should-not-have-happened case. For Example 2 
> however, we’d get a pretty much perfect bug report.
>
> On Fri, 31 May 2019 at 00:16, Kristóf Umann <dkszelethus at gmail.com 
> <mailto:dkszelethus at gmail.com>> wrote:
>
>
>
>     On Mon, 27 May 2019 at 19:48, Artem Dergachev <noqnoqneo at gmail.com
>     <mailto:noqnoqneo at gmail.com>> wrote:
>
>
>
>         On 5/27/19 10:12 AM, Gábor Horváth wrote:
>         > Hi!
>         >
>         > I wanted to share some of my points as well but had little
>         time to do so.
>         >
>         > Artem identified two kinds of issues. One of which has all the
>         > information available in the bug path (in his terminology:
>         > must-have-happened) and the other which has not (
>         > should-not-have-happened). The goal is the same in both of
>         the cases,
>         > identify all dependencies including control and data
>         dependencies. The
>         > main difference is that in one case we could use all the
>         information
>         > available in the bug path while in the other we cannot.
>         >
>         > 1. must-have-happened:
>         > Since the bug-path contains all the information we need, I
>         do agree
>         > with Artem using bug path visitors might be sufficient. I do
>         not know
>         > how reliable `trackExpressionValue` is in identifying data
>         > dependencies, but revising it could be one step in GSoC. For
>         example,
>         > after quickly skimming through the implementation I am not
>         sure if it
>         > is able to track the source of a constant value that was
>         constant
>         > folded during the symbolic execution from different values.
>
>         It should be able to, because that's how inlined defensive check
>         suppressions work.
>
>
>     I recently finished (not yet published) my work on implementing
>     control dependencies for clang's CFG via LLVM's IDFCalculator.
>     Theoretically speaking, by simply tracking the variables in the
>     block that is a control dependency of the block that is being
>     tracked, we should be able to get a proper bug report for Example
>     1? From afar, it seems reasonable.
>
>         > Only if we had a way to test this functionality in isolation
>         quickly
>         > :) I think making that possible would also be a valuable
>         addition.
>         >
>         > For control dependencies, it is great that LLVM already has
>         some
>         > algorithms to calculate dominator sets and it might be
>         possible to
>         > make that work for the Clang CFG.
>         >
>         > In my opinion, if you only tackle these problems by the end
>         of the
>         > summer your project already brought great value to the
>         community.
>         >
>         > 2. should-not-have-happened
>         > This category cannot be solved only using visitors for sure.
>         So we
>         > might end up with an implementation for this problem that is
>         > completely separate from the previous one (maybe except for
>         getting
>         > control dependencies?). As Artem pointed out, we might query
>         some
>         > information from the analyzer e.g. which functions were
>         inlined. One
>         > option is to use a slicing algorithm here. However, we
>         should make
>         > sure that the way a function is conservatively evaluated by the
>         > slicing algorithm and the analyzer is compatible. E.g.: if the
>         > analyzer does not invalidate a memory region during
>         evaluating the
>         > call, the statement should not be considered as relevant for
>         the
>         > variable in question. We have several challenges here, the
>         analyzer
>         > reasons about memory regions while the slicing algorithm
>         will probably
>         > not. Also, we need to do something with pointers, since we
>         do not want
>         > to implement a full-fledged pointer analysis. We also need
>         to decide
>         > how to represent arrays, structures etc.
>         >
>         > I skimmed through the algorithm in the google docs, and I
>         think there
>         > are some other details we need to work out. For example, you
>         never
>         > remove anything from the set of relevant variables. Consider
>         the
>         > following example:
>         > x = 5;
>         > y = x + 2;
>         > y = 2;
>         > z = y + 3;
>         >
>         > Whit the slicing criterion (z = y + 3). Here, once y is
>         overwritten
>         > with 2, it is no longer relevant, sox should not end up in
>         the slice.
>         > The algorithm you proposed will not handle this correctly
>         (yet). Also
>         > I am not sure if we actually need therelevant_variable_map
>         or having a
>         > set will be sufficient.
>         >
>         > Regards,
>         > Gabor
>         >
>         > On Mon, 27 May 2019 at 12:52, Kristóf Umann
>         <dkszelethus at gmail.com <mailto:dkszelethus at gmail.com>
>         > <mailto:dkszelethus at gmail.com
>         <mailto:dkszelethus at gmail.com>>> wrote:
>         >
>         >     + Ádám Balogh
>         >
>         >     On Fri, 24 May 2019 at 23:31, Artem Dergachev
>         <noqnoqneo at gmail.com <mailto:noqnoqneo at gmail.com>
>         >     <mailto:noqnoqneo at gmail.com
>         <mailto:noqnoqneo at gmail.com>>> wrote:
>         >
>         >         Ok, i think i have a little bit of clarity.
>         >
>         >         Let's first talk about must-have-happened problems.
>         For instance,
>         >
>         >           int *foo() {
>         >               return coin() ? new int : nullptr; // needs note
>         >           }
>         >           void bar() {
>         >             int *x = foo();
>         >             *x = 1; // warning: null dereference
>         >           }
>         >
>         >         In this case trackExpressionValue solves the
>         >         "must-have-happened" problem of "x must-have been
>         assigned a
>         >         null pointer value" by displaying a note when foo()
>         returns
>         >         nullptr. This currently more or less works correctly
>         - there
>         >         are a lot of bugs but the overall idea behind
>         >         trackExpressionValue is correct.
>         >
>         >         Example 1 in your document is another example of a
>         >         must-have-happened problem: in order for the report
>         to be
>         >         valid, we need to show that 'flag' must-have-changed
>         between
>         >         line 17 and line 13. That's something that the
>         Static Analyzer
>         >         currently cannot handle and you plan to improve upon it.
>         >
>         >         Hʏᴘᴏᴛʜᴇsɪs1. All "must-have-happened" problems should be
>         >         solved with a bug visitor. We don't need any new
>         analysis
>         >         algorithm for that.
>         >
>         >         In particular, i believe that Example 1 can be solved by
>         >         extending trackExpressionValue() with a bug visitor that
>         >         detects control-flow-dependencies via this approach
>         of yours:
>         >
>         >         > "Check whether the statement is a control
>         dependency of
>         >         Statement 18 with dominator sets: 18 doesn't post
>         dominate 17,
>         >         but every statement in between them does, meaning
>         that 17 is a
>         >         control dependency of 18."
>         >
>         >         I.e., while ascending through ExplodedGraph, we pick
>         >         interesting statements and perform this domination
>         check on
>         >         their predecessor nodes until we reach the control
>         dependency,
>         >         then we track the control dependency recursively by
>         adding
>         >         more visitors.
>         >
>         >         I think this is the first thing we should try on our
>         GSoC.
>         >
>         >         The reason why i believe that Hypothesis 1 is true
>         is that all
>         >         the information that we need is fully contained in
>         the bug
>         >         path. If something must have happened on the path,
>         then it's
>         >         probably in the path and we can figure it out by
>         looking at
>         >         the path. If something must have happened on the
>         path and it's
>         >         not in the path, then why did we emit a warning in
>         the first
>         >         place?
>         >
>         >         ==============
>         >
>         >         Now, to should-not-have-happened problems. The main
>         motivating
>         >         example is:
>         >
>         >           void foo(int *y) {
>         >               if (coin()) // needs note
>         >                 *y = 1;
>         >           }
>         >           void bar() {
>         >             int x;
>         >             foo(&x);
>         >             use(x); // warning: use of uninitialized value
>         >           }
>         >
>         >         In order to make the warning comprehensible, we have to
>         >         explain to the user why do we think that 'x' is
>         uninitialized,
>         >         as it clearly "should-not-have-been" initialized for the
>         >         warning to be correct. For that purpose it is absolutely
>         >         essential to display the execution path within the
>         inlined
>         >         call to foo().
>         >
>         >         One way to achieve that would be to display *all*
>         execution
>         >         path and leave it up to the user to see that the
>         event that
>         >         should-not-have-happened has indeed never happened.
>         >
>         >         That, however, would be overwhelming, so we use
>         "path pruning"
>         >         that cuts away execution paths in inlined calls that
>         are known
>         >         to have no interesting events happening. Which, in
>         turn, makes
>         >         us incapable of solving should-not-have-happened
>         problems, as
>         >         it's the whole point of the problem to display
>         execution path
>         >         in inlined functions in which nothing has happened.
>         >
>         >         In order to figure this out, we need to learn how to
>         >         discriminate between inlined calls in which nothing
>         >         interesting *could* have happened in the first place and
>         >         inlined calls in which something interesting could have
>         >         happened but *didn't*.
>         >
>         >         NoStoreFuncVisitor solves this problem for the
>         example above.
>         >         The visitor notices that the local variable is passed by
>         >         non-const pointer into the inlined call and the call
>         >         syntactically contains assignments through this pointer,
>         >         therefore this call *could* have initialized the
>         value. The
>         >         visitor then suppresses pruning of this call by
>         emitting an
>         >         interesting note within the call ("returning without
>         >         initializing '*y'"...).
>         >
>         >         However, AST-based analysis in NoStoreFuncVisitor is
>         very
>         >         primitive and easy to trick. A more sophisticated
>         analysis is
>         >         clearly necessary.
>         >
>         >         The second thing we can try in our GSoC is to come
>         up with a
>         >         better analysis specifically for the uninitialized value
>         >         checker, because we already have a place to stick it
>         into and
>         >         we know how exactly do we want to consume the result
>         of such
>         >         analysis and evaluate it.
>         >
>         >         The third thing we can try in our GSoC is to come up
>         with more
>         >         such analysis for other checkers and other
>         >         should-have-happened problems and see if a reusable
>         framework
>         >         for creating such analyses can be implemented.
>         >
>         >
>         >         On 5/23/19 1:25 PM, Kristóf Umann wrote:
>         >>
>         >>
>         >>         On Thu, 23 May 2019 at 21:58, Kristóf Umann
>         >>         <dkszelethus at gmail.com
>         <mailto:dkszelethus at gmail.com> <mailto:dkszelethus at gmail.com
>         <mailto:dkszelethus at gmail.com>>> wrote:
>         >>
>         >>             Please let me know if I didn't address
>         something you
>         >>             mentioned!
>         >>
>         >>             On Thu, 23 May 2019 at 02:29, Artem Dergachev
>         >>             <noqnoqneo at gmail.com
>         <mailto:noqnoqneo at gmail.com> <mailto:noqnoqneo at gmail.com
>         <mailto:noqnoqneo at gmail.com>>> wrote:
>         >>
>         >>                 My primary question is, how do you plan to
>         use the
>         >>                 data that you obtain via your analysis?
>         >>
>         >>
>         >>              Gather relevant statements. Keep ExplodedNodes
>         in the
>         >>             bugpath for which
>         PathDiagnosticLocation::getStmt() is in
>         >>             the set (or, something like that at least). But
>         honestly,
>         >>             I didn't think too much about that yet :^) I
>         guess we
>         >>             could just gather ExplodedNodes rather than
>         statements,
>         >>             we'll see.
>         >>
>         >>                 Like, do you plan to backtrack the value of all
>         >>                 relevant variables and/or expressions in
>         the final
>         >>                 bug report that were also encountered along
>         the bug
>         >>                 path? If yes, then is it possible that the
>         slicing
>         >>                 criterion gets updated in the process and
>         you'll have
>         >>                 to re-run the CFG-based analysis to take it
>         into account?
>         >>
>         >>
>         >>             No, the slicing criterion is what it is, and
>         will not
>         >>             change. The set of relevant statements to the
>         slicing
>         >>             criterion define the actual program slice, which is
>         >>             basically the thing we're going for in this
>         project. When
>         >>             it turns out that a value of a variable
>         directly affects
>         >>             a variable in the slicing criterion (either
>         through data
>         >>             or control dependency), we just add it to the
>         set of
>         >>             relevant variables, and then if something in
>         the set of
>         >>             relevant variables is affected (some statements or
>         >>             variables turn out to be a control/data
>         dependencies to
>         >>             them), then it's added to the respective
>         relevancy sets.
>         >>             Should we only traverse the basic blocks the
>         analyzer
>         >>             visited, I think we can pull off the slicing with a
>         >>             single pass.
>         >>
>         >>
>         >>         We'll see about that single pass thing though. I'm
>         gathering
>         >>         some tricky examples in the document I shared and
>         working on
>         >>         a neat algorithm.
>         >>
>         >>                 > What would also be really great is to
>         assist this
>         >>                 traversal with the information the analyzer
>         already
>         >>                 has, essentially only inspecting basic
>         blocks the
>         >>                 analyzer actually visits.
>         >>
>         >>                 You mean visits on the current bug path or
>         visits on
>         >>                 the equivalence class of bugs or visits in
>         general
>         >>                 during analysis?
>         >>
>         >>                 Regardless, for many problems ideally we should
>         >>                 traverse basic blocks that the analyzer
>         *didn't*
>         >>                 visit (eg., if the function didn't
>         initialize the
>         >>                 variable on the current path, we're
>         interested in
>         >>                 parts of the code in which it *did*
>         initialize the
>         >>                 variable, even though we didn't necessarily
>         have time
>         >>                 to visit them).
>         >>
>         >>
>         >>             Well I mean slicing by definition isn't
>         intended to do
>         >>             that. The entire point of it is to gather the
>         smallest
>         >>             slice related to the slicing criterion, and my
>         project
>         >>             aims to fill in the gaps where we don't provide
>         enough
>         >>             information.
>         >>
>         >>                 It actually sounds as if all problems that
>         we're
>         >>                 trying to solve here can be classified into
>         >>                 "must-have-happened" problems (eg., "variable
>         >>                 must-have been declared without
>         initialization",
>         >>                 "variable must-have been set to nullptr",
>         "memory
>         >>                 must-have been allocated") and
>         >>                 "should-not-have-happened" problems (eg.,
>         "variable
>         >>                 should-not-have been initialized", "null value
>         >>                 should-not-have been overwritten", "pointer
>         >>                 should-not-have escaped").
>         >>
>         >>                 For must-have-happened problems, i'm
>         recently under
>         >>                 an impression that we should suppress the
>         bug report
>         >>                 entirely when we fail to solve them (i.e.,
>         if fail to
>         >>                 explain to the user where exactly does this
>         happen,
>         >>                 then the report is impossible to understand
>         anyway).
>         >>                 This is currently a very big problem for
>         our null
>         >>                 dereference checker on some codebases,
>         especially
>         >>                 because it uses this tracking for
>         suppressions as
>         >>                 well (aka inlined defensive checks), so
>         when it fails
>         >>                 to track the variable it's likely a legit false
>         >>                 positive as well, not simply a
>         hard-to-understand report.
>         >>
>         >>
>         >>             I think this set calculating approach
>         inherently gathers
>         >>             far more information, allowing us to make better
>         >>             judgement on whether we should suppress the report.
>         >>
>         >>                 For should-not-have-happened problems i'm
>         much more
>         >>                 confused. We're talking about looking for
>         places
>         >>                 where it *could* have happened and then
>         trying to
>         >>                 explain to the user why none of them have
>         actually
>         >>                 happened. I'm not sure what are the
>         consequences of
>         >>                 failing to explain to the user why didn't a
>         >>                 particular piece of code do something,
>         because i've
>         >>                 no idea how do users intuitively figure out
>         which
>         >>                 code *could* have done these things and
>         which clearly
>         >>                 couldn't.
>         >>
>         >>
>         >>             Im really at a loss here :) Can you provide
>         some example
>         >>             as to what a problematic "must-have-happened"
>         bug report
>         >>             would look like, and how would you like to see it
>         >>             improved? Same for "should-not-have-happened".
>         Because as
>         >>             I understand it, and I might be wrong, you have
>         problems
>         >>             with this:
>         >>
>         >>             int a; // declaring a without initializing
>         >>
>         >>             if (rand()) // assuming that the condition is false
>         >>             a = 5;
>         >>
>         >>             print(a); // a is undef
>         >>
>         >>             And prefer to see this:
>         >>
>         >>             int a; // declaring a without initializing
>         >>
>         >>             if (rand()) // should this condition be false,
>         a's value
>         >>             will be indeterministic
>         >>             a = 5; // writing to a skipped
>         >>
>         >>             print(a); // a is undef
>         >>
>         >>             and I just don't see how this would be possible
>         with
>         >>             slicing at all. Also, I can't see how this
>         would scale to
>         >>             real-world production code.
>         >>
>         >>
>         >>                 On 5/22/19 4:11 PM, Kristóf Umann wrote:
>         >>>
>         >>>                 Hi!
>         >>>
>         >>>
>         >>>                 I'd like to share some thoughts about my GSoC
>         >>>                 project, "Enhancing bug reporting with static
>         >>>                 backward program slicing"[1].
>         >>>
>         >>>
>         >>>                 My proposal is very specific about whatI'm
>         aiming to
>         >>>                 improve on, but vague on the howpart of
>         it. This
>         >>>                 mail aims to add clarify some of this.
>         >>>
>         >>>
>         >>>                 Program slicing is essentially a technique of
>         >>>                 discovering data and control dependencies
>         to the
>         >>>                 slicing criterion, which is a (statement,
>         {set of
>         >>>                 variables}) pair. Fortunately, tools for
>         control
>         >>>                 dependencies, namely, dominator set
>         calculations are
>         >>>                 already implemented, but seems to be
>         unstable with
>         >>>                 clang's CFG. It would be a great tool if I
>         were able
>         >>>                 to fix it.
>         >>>
>         >>>
>         >>>                 While my proposal states that I plan to
>         implement an
>         >>>                 AST-based solution, I'm actually not sure
>         that this
>         >>>                 is the optimal approach. We could instead
>         inspect
>         >>>                 CFG blocks in a backwards manner (coupled
>         with the
>         >>>                 analyzer's call stack), and gather
>         relevant variable
>         >>>                 in the meanwhile.
>         >>>
>         >>>
>         >>>                 What would also be really great is to
>         assist this
>         >>>                 traversal with the information the
>         analyzer already
>         >>>                 has, essentially only inspecting basic
>         blocks the
>         >>>                 analyzer actually visits.
>         >>>
>         >>>
>         >>>                 Here’s my idea for an algorithm (from the
>         point of
>         >>>                 the slicing criterion already being
>         constructed):
>         >>>
>         >>>
>         https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing
>         >>>
>         >>>
>         >>>
>         >>>                 Please note that this is a veeery rough
>         sketch, I
>         >>>                 didn't think about every edge case that
>         exists, but
>         >>>                 I think it's an okay start.
>         >>>
>         >>>
>         >>>                 [1]
>         >>>
>         https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit
>         >>>
>         >>
>         >
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190530/3fa39394/attachment.html>