<div dir="ltr"><div>Hi!</div><div><br></div><div>I wanted to share some of my points as well but had little time to do so.</div><div><br></div><div>Artem identified two kinds of issues. One of which has all the information available in the bug path (in his terminology: must-have-happened) and the other which has not ( should-not-have-happened). The goal is the same in both of the cases, identify all dependencies including control and data dependencies. The main difference is that in one case we could use all the information available in the bug path while in the other we cannot. <br></div><div><br></div><div>1. must-have-happened:</div><div>Since the bug-path contains all the information we need, I do agree with Artem using bug path visitors might be sufficient. I do not know how reliable `trackExpressionValue` is in identifying data dependencies, but revising it could be one step in GSoC. For example, after quickly skimming through the implementation I am not sure if it is able to track the source of a constant value that was constant folded during the symbolic execution from different values. Only if we had a way to test this functionality in isolation quickly :) I think making that possible would also be a valuable addition. <br></div><div><br></div><div>For control dependencies, it is great that LLVM already has some algorithms to calculate dominator sets and it might be possible to make that work for the Clang CFG. <br></div><div><br></div><div>In my opinion, if you only tackle these problems by the end of the summer your project already brought great value to the community.</div><div><br></div><div>2. should-not-have-happened <br></div><div>This category cannot be solved only using visitors for sure. So we might end up with an implementation for this problem that is completely separate from the previous one (maybe except for getting control dependencies?). As Artem pointed out, we might query some information from the analyzer e.g. which functions were inlined. One option is to use a slicing algorithm here. However, we should make sure that the way a function is conservatively evaluated by the slicing algorithm and the analyzer is compatible. E.g.: if the analyzer does not invalidate a memory region during evaluating the call, the statement should not be considered as relevant for the variable in question. We have several challenges here, the analyzer reasons about memory regions while the slicing algorithm will probably not. Also, we need to do something with pointers, since we do not want to implement a full-fledged pointer analysis. We also need to decide how to represent arrays, structures etc.</div><div><br></div><div>I skimmed through the algorithm in the google docs, and I think there are some other details we need to work out. For example, you never remove anything from the set of relevant variables. Consider the following example:</div><div><span style="font-family:courier new,monospace">x = 5;</span></div><div><span style="font-family:courier new,monospace">y = x + 2;</span></div><div><span style="font-family:courier new,monospace">y = 2;</span></div><div><span style="font-family:courier new,monospace">z = y + 3;</span><br></div><div><br></div><div>Whit the slicing criterion (<span style="font-family:courier new,monospace">z = y + 3</span>). Here, once <span style="font-family:courier new,monospace">y</span> is overwritten with <span style="font-family:courier new,monospace">2</span>, it is no longer relevant, so<span style="font-family:courier new,monospace"> x</span> should not end up in the slice. The algorithm you proposed will not handle this correctly (yet). Also I am not sure if we actually need the<span style="font-family:courier new,monospace"> relevant_variable_map</span> or having a set will be sufficient.<br></div><div><br></div><div>Regards,<br></div><div>Gabor</div><div><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 27 May 2019 at 12:52, Kristóf Umann <<a href="mailto:dkszelethus@gmail.com">dkszelethus@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">+ Ádám Balogh</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 24 May 2019 at 23:31, Artem Dergachev <<a href="mailto:noqnoqneo@gmail.com" target="_blank">noqnoqneo@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
Ok, i think i have a little bit of clarity.<br>
<br>
Let's first talk about must-have-happened problems. For instance,<br>
<br>
int *foo() {<br>
return coin() ? new int : nullptr; // needs note<br>
}<br>
void bar() {<br>
int *x = foo();<br>
*x = 1; // warning: null dereference<br>
}<br>
<br>
In this case trackExpressionValue solves the "must-have-happened"
problem of "x must-have been assigned a null pointer value" by
displaying a note when foo() returns nullptr. This currently more or
less works correctly - there are a lot of bugs but the overall idea
behind trackExpressionValue is correct.<br>
<br>
Example 1 in your document is another example of a
must-have-happened problem: in order for the report to be valid, we
need to show that 'flag' must-have-changed between line 17 and line
13. That's something that the Static Analyzer currently cannot
handle and you plan to improve upon it.<br>
<br>
Hʏᴘᴏᴛʜᴇsɪs<span class="gmail-m_-3323113253945746305gmail-m_-2632453692712164457value_inner gmail-m_-3323113253945746305gmail-m_-2632453692712164457times"></span>
1. All "must-have-happened" problems should be solved with a bug
visitor. We don't need any new analysis algorithm for that.<br>
<br>
In particular, i believe that Example 1 can be solved by extending
trackExpressionValue() with a bug visitor that detects
control-flow-dependencies via this approach of yours:<br>
<br>
> "Check whether the statement is a control dependency of
Statement 18 with dominator sets: 18 doesn't post dominate 17, but
every statement in between them does, meaning that 17 is a control
dependency of 18."<br>
<br>
I.e., while ascending through ExplodedGraph, we pick interesting
statements and perform this domination check on their predecessor
nodes until we reach the control dependency, then we track the
control dependency recursively by adding more visitors.<br>
<br>
I think this is the first thing we should try on our GSoC.<br>
<br>
The reason why i believe that Hypothesis 1 is true is that all the
information that we need is fully contained in the bug path. If
something must have happened on the path, then it's probably in the
path and we can figure it out by looking at the path. If something
must have happened on the path and it's not in the path, then why
did we emit a warning in the first place?<br>
<br>
==============<br>
<br>
Now, to should-not-have-happened problems. The main motivating
example is:<br>
<br>
void foo(int *y) {<br>
if (coin()) // needs note<br>
*y = 1;<br>
}<br>
void bar() {<br>
int x;<br>
foo(&x);<br>
use(x); // warning: use of uninitialized value<br>
}<br>
<br>
In order to make the warning comprehensible, we have to explain to
the user why do we think that 'x' is uninitialized, as it clearly
"should-not-have-been" initialized for the warning to be correct.
For that purpose it is absolutely essential to display the execution
path within the inlined call to foo().<br>
<br>
One way to achieve that would be to display *all* execution path and
leave it up to the user to see that the event that
should-not-have-happened has indeed never happened. <br>
<br>
That, however, would be overwhelming, so we use "path pruning" that
cuts away execution paths in inlined calls that are known to have no
interesting events happening. Which, in turn, makes us incapable of
solving should-not-have-happened problems, as it's the whole point
of the problem to display execution path in inlined functions in
which nothing has happened.<br>
<br>
In order to figure this out, we need to learn how to discriminate
between inlined calls in which nothing interesting *could* have
happened in the first place and inlined calls in which something
interesting could have happened but *didn't*.<br>
<br>
NoStoreFuncVisitor solves this problem for the example above. The
visitor notices that the local variable is passed by non-const
pointer into the inlined call and the call syntactically contains
assignments through this pointer, therefore this call *could* have
initialized the value. The visitor then suppresses pruning of this
call by emitting an interesting note within the call ("returning
without initializing '*y'"...).<br>
<br>
However, AST-based analysis in NoStoreFuncVisitor is very primitive
and easy to trick. A more sophisticated analysis is clearly
necessary.<br>
<br>
The second thing we can try in our GSoC is to come up with a better
analysis specifically for the uninitialized value checker, because
we already have a place to stick it into and we know how exactly do
we want to consume the result of such analysis and evaluate it.<br>
<br>
The third thing we can try in our GSoC is to come up with more such
analysis for other checkers and other should-have-happened problems
and see if a reusable framework for creating such analyses can be
implemented.<br>
<br>
<br>
<div class="gmail-m_-3323113253945746305gmail-m_-2632453692712164457moz-cite-prefix">On 5/23/19 1:25 PM, Kristóf Umann
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, 23 May 2019 at
21:58, Kristóf Umann <<a href="mailto:dkszelethus@gmail.com" target="_blank">dkszelethus@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div>Please let me know if I didn't address something you
mentioned!</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, 23 May 2019 at
02:29, Artem Dergachev <<a href="mailto:noqnoqneo@gmail.com" target="_blank">noqnoqneo@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> My primary question is, how do
you plan to use the data that you obtain via your
analysis?</div>
</blockquote>
<div><br>
</div>
<div> Gather relevant statements. Keep ExplodedNodes in
the bugpath for which
PathDiagnosticLocation::getStmt() is in the set (or,
something like that at least). But honestly, I didn't
think too much about that yet :^) I guess we could
just gather ExplodedNodes rather than statements,
we'll see.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> Like, do you plan to backtrack
the value of all relevant variables and/or
expressions in the final bug report that were also
encountered along the bug path? If yes, then is it
possible that the slicing criterion gets updated in
the process and you'll have to re-run the CFG-based
analysis to take it into account?<br>
</div>
</blockquote>
<div><br>
</div>
<div>No, the slicing criterion is what it is, and will
not change. The set of relevant statements to the
slicing criterion define the actual program slice,
which is basically the thing we're going for in this
project. When it turns out that a value of a variable
directly affects a variable in the slicing criterion
(either through data or control dependency), we just
add it to the set of relevant variables, and then if
something in the set of relevant variables is affected
(some statements or variables turn out to be a
control/data dependencies to them), then it's added to
the respective relevancy sets. Should we only traverse
the basic blocks the analyzer visited, I think we can
pull off the slicing with a single pass.<br>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>We'll see about that single pass thing though. I'm
gathering some tricky examples in the document I shared and
working on a neat algorithm.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> > What would also be really
great is to assist this traversal with the
information the analyzer already has, essentially
only inspecting basic blocks the analyzer actually
visits.<br>
<br>
You mean visits on the current bug path or visits on
the equivalence class of bugs or visits in general
during analysis?<br>
<br>
Regardless, for many problems ideally we should
traverse basic blocks that the analyzer *didn't*
visit (eg., if the function didn't initialize the
variable on the current path, we're interested in
parts of the code in which it *did* initialize the
variable, even though we didn't necessarily have
time to visit them).
</div>
</blockquote>
<div><br>
</div>
<div>Well I mean slicing by definition isn't intended to
do that. The entire point of it is to gather the
smallest slice related to the slicing criterion, and
my project aims to fill in the gaps where we don't
provide enough information.</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> It actually sounds as if all
problems that we're trying to solve here can be
classified into "must-have-happened" problems (eg.,
"variable must-have been declared without
initialization", "variable must-have been set to
nullptr", "memory must-have been allocated") and
"should-not-have-happened" problems (eg., "variable
should-not-have been initialized", "null value
should-not-have been overwritten", "pointer
should-not-have escaped").<br>
<br>
For must-have-happened problems, i'm recently under
an impression that we should suppress the bug report
entirely when we fail to solve them (i.e., if fail
to explain to the user where exactly does this
happen, then the report is impossible to understand
anyway). This is currently a very big problem for
our null dereference checker on some codebases,
especially because it uses this tracking for
suppressions as well (aka inlined defensive checks),
so when it fails to track the variable it's likely a
legit false positive as well, not simply a
hard-to-understand report.<br>
</div>
</blockquote>
<div><br>
</div>
<div>I think this set calculating approach inherently
gathers far more information, allowing us to make
better judgement on whether we should suppress the
report.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> For should-not-have-happened
problems i'm much more confused. We're talking about
looking for places where it *could* have happened
and then trying to explain to the user why none of
them have actually happened. I'm not sure what are
the consequences of failing to explain to the user
why didn't a particular piece of code do something,
because i've no idea how do users intuitively figure
out which code *could* have done these things and
which clearly couldn't.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Im really at a loss here :) Can you provide some
example as to what a problematic "must-have-happened"
bug report would look like, and how would you like to
see it improved? Same for "should-not-have-happened".
Because as I understand it, and I might be wrong, you
have problems with this:</div>
<div><br>
<font face="courier new, monospace">int a; //
declaring a without initializing<br>
<br>
</font></div>
<div><font face="courier new, monospace">if (rand()) //
assuming that the condition is false</font></div>
<div><font face="courier new, monospace"> a = 5;</font></div>
<div><font face="courier new, monospace"><br>
</font></div>
<div><font face="courier new, monospace">print(a); // a
is undef</font></div>
<div><br>
And prefer to see this:<br>
<br>
<font face="courier new, monospace">int a; //
declaring a without initializing<br>
<br>
</font></div>
<div><font face="courier new, monospace">if (rand()) //
should this condition be false, a's value will be
indeterministic</font></div>
<div><font face="courier new, monospace"> a = 5; //
writing to a skipped</font></div>
<div><font face="courier new, monospace"><br>
</font></div>
<div><font face="courier new, monospace">print(a); // a
is undef</font></div>
<div><br>
</div>
<div>and I just don't see how this would be possible
with slicing at all. Also, I can't see how this would
scale to real-world production code.</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> <br>
<div class="gmail-m_-3323113253945746305gmail-m_-2632453692712164457gmail-m_-2008566417967606901m_-7889642444734106153gmail-m_-8947792374282452487moz-cite-prefix">On
5/22/19 4:11 PM, Kristóf Umann wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr"><span id="gmail-m_-3323113253945746305gmail-m_-2632453692712164457gmail-m_-2008566417967606901m_-7889642444734106153gmail-m_-8947792374282452487gmail-docs-internal-guid-3c74b999-7fff-e99a-77f6-4bd5385fb78f">
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Hi!</span></p>
<br>
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">I'd like to share some thoughts about my GSoC project, "Enhancing bug reporting with static backward program slicing"[1].</span></p>
<br>
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">My proposal is very specific about </span><span style="font-family:Arial;background-color:transparent;font-style:italic;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">what</span><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> I'm aiming to improve on, but vague on the </span><span style="font-family:Arial;background-color:transparent;font-style:italic;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">how</span><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> part of it. This mail aims to add clarify some of this.</span></p>
<br>
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Program slicing is essentially a technique of discovering data and control dependencies to the slicing criterion, which is a (statement, {set of variables}) pair. Fortunately, tools for control dependencies, namely, dominator set calculations are already implemented, but seems to be unstable with clang's CFG. It would be a great tool if I were able to fix it.</span></p>
<br>
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">While my proposal states that I plan to implement an AST-based solution, I'm actually not sure that this is the optimal approach. We could instead inspect CFG blocks in a backwards manner (coupled with the analyzer's call stack), and gather relevant variable in the meanwhile.</span></p>
<br>
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">What would also be really great is to assist this traversal with the information the analyzer already has, essentially only inspecting basic blocks the analyzer actually visits.</span></p>
<br>
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Here’s my idea for an algorithm (from the point of the slicing criterion already being constructed):</span></p>
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">
</span></p>
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing" target="_blank">https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing</a>
</span></p>
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><br>
</p>
<p style="line-height:1.38;margin-top:0pt;margin-bottom:0pt">Please
note that this is a veeery rough sketch,
I didn't think about every edge case
that exists, but I think it's an okay
start.</p>
<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><br>
</p>
</span><span id="gmail-m_-3323113253945746305gmail-m_-2632453692712164457gmail-m_-2008566417967606901m_-7889642444734106153gmail-m_-8947792374282452487gmail-docs-internal-guid-5954bf5e-7fff-362e-5b73-2faf8cd77d3f"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">[1] <a href="https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit" target="_blank">https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit</a></span></span> <br class="gmail-m_-3323113253945746305gmail-m_-2632453692712164457gmail-m_-2008566417967606901m_-7889642444734106153gmail-m_-8947792374282452487gmail-Apple-interchange-newline">
</div>
</div>
</div>
</blockquote>
<br>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<br>
</div>
</blockquote></div>
</blockquote></div></div>