[cfe-dev] [analyzer] Rough sketch of the algorithm for "Enhancing bug reporting with static backward program slicing"

Thu May 23 12:58:18 PDT 2019

Please let me know if I didn't address something you mentioned!

On Thu, 23 May 2019 at 02:29, Artem Dergachev <noqnoqneo at gmail.com> wrote:

> My primary question is, how do you plan to use the data that you obtain
> via your analysis?
>

 Gather relevant statements. Keep ExplodedNodes in the bugpath for which
PathDiagnosticLocation::getStmt() is in the set (or, something like that at
least). But honestly, I didn't think too much about that yet :^) I guess we
could just gather ExplodedNodes rather than statements, we'll see.

> Like, do you plan to backtrack the value of all relevant variables and/or
> expressions in the final bug report that were also encountered along the
> bug path? If yes, then is it possible that the slicing criterion gets
> updated in the process and you'll have to re-run the CFG-based analysis to
> take it into account?
>

No, the slicing criterion is what it is, and will not change. The set of
relevant statements to the slicing criterion define the actual program
slice, which is basically the thing we're going for in this project. When
it turns out that a value of a variable directly affects a variable in the
slicing criterion (either through data or control dependency), we just add
it to the set of relevant variables, and then if something in the set of
relevant variables is affected (some statements or variables turn out to be
a control/data dependencies to them), then it's added to the respective
relevancy sets. Should we only traverse the basic blocks the analyzer
visited, I think we can pull off the slicing with a single pass.

> > What would also be really great is to assist this traversal with the
> information the analyzer already has, essentially only inspecting basic
> blocks the analyzer actually visits.
>
> You mean visits on the current bug path or visits on the equivalence class
> of bugs or visits in general during analysis?
>
> Regardless, for many problems ideally we should traverse basic blocks that
> the analyzer *didn't* visit (eg., if the function didn't initialize the
> variable on the current path, we're interested in parts of the code in
> which it *did* initialize the variable, even though we didn't necessarily
> have time to visit them).
>

Well I mean slicing by definition isn't intended to do that. The entire
point of it is to gather the smallest slice related to the slicing
criterion, and my project aims to fill in the gaps where we don't provide
enough information.

It actually sounds as if all problems that we're trying to solve here can
> be classified into "must-have-happened" problems (eg., "variable must-have
> been declared without initialization", "variable must-have been set to
> nullptr", "memory must-have been allocated") and "should-not-have-happened"
> problems (eg., "variable should-not-have been initialized", "null value
> should-not-have been overwritten", "pointer should-not-have escaped").
>
> For must-have-happened problems, i'm recently under an impression that we
> should suppress the bug report entirely when we fail to solve them (i.e.,
> if fail to explain to the user where exactly does this happen, then the
> report is impossible to understand anyway). This is currently a very big
> problem for our null dereference checker on some codebases, especially
> because it uses this tracking for suppressions as well (aka inlined
> defensive checks), so when it fails to track the variable it's likely a
> legit false positive as well, not simply a hard-to-understand report.
>

I think this set calculating approach inherently gathers far more
information, allowing us to make better judgement on whether we should
suppress the report.

> For should-not-have-happened problems i'm much more confused. We're
> talking about looking for places where it *could* have happened and then
> trying to explain to the user why none of them have actually happened. I'm
> not sure what are the consequences of failing to explain to the user why
> didn't a particular piece of code do something, because i've no idea how do
> users intuitively figure out which code *could* have done these things and
> which clearly couldn't.
>

Im really at a loss here :) Can you provide some example as to what a
problematic "must-have-happened" bug report would look like, and how would
you like to see it improved? Same for "should-not-have-happened". Because
as I understand it, and I might be wrong, you have problems with this:

int a; // declaring a without initializing

if (rand()) // assuming that the condition is false
  a = 5;

print(a); // a is undef

And prefer to see this:

int a; // declaring a without initializing

if (rand()) // should this condition be false, a's value will be
indeterministic
  a = 5; // writing to a skipped

print(a); // a is undef

and I just don't see how this would be possible with slicing at all. Also,
I can't see how this would scale to real-world production code.

>
> On 5/22/19 4:11 PM, Kristóf Umann wrote:
>
> Hi!
>
> I'd like to share some thoughts about my GSoC project, "Enhancing bug
> reporting with static backward program slicing"[1].
>
> My proposal is very specific about what I'm aiming to improve on, but
> vague on the how part of it. This mail aims to add clarify some of this.
>
> Program slicing is essentially a technique of discovering data and control
> dependencies to the slicing criterion, which is a (statement, {set of
> variables}) pair. Fortunately, tools for control dependencies, namely,
> dominator set calculations are already implemented, but seems to be
> unstable with clang's CFG. It would be a great tool if I were able to fix
> it.
>
> While my proposal states that I plan to implement an AST-based solution,
> I'm actually not sure that this is the optimal approach. We could instead
> inspect CFG blocks in a backwards manner (coupled with the analyzer's call
> stack), and gather relevant variable in the meanwhile.
>
> What would also be really great is to assist this traversal with the
> information the analyzer already has, essentially only inspecting basic
> blocks the analyzer actually visits.
>
> Here’s my idea for an algorithm (from the point of the slicing criterion
> already being constructed):
>
>
> https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing
>
>
> Please note that this is a veeery rough sketch, I didn't think about every
> edge case that exists, but I think it's an okay start.
>
>
> [1]
> https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190523/8a552489/attachment.html>