[cfe-dev] [analyzer] Tracking values for generating bug reports

Tue Jun 18 11:47:18 PDT 2019

Hi!

This is an update on my GSoC project, "Enhancing bug reports in the Clang
Static Analyzer". We had some discussions on phabricator, and some in
private meetings, and I think it's due to share how things are looking
right now.

In my previous mail [1], I detailed the two distinct categories we defined,
the "must-have-happened" case where every information is already in the bug
path, and the "should-not-have-happened" case when the information isn't
found in the bugpath. I divided the latter into two subcategories, the
"Inlined category" (initially referred to as "category 1"), and the "Not
inlined category" (initally referred to as "category 2").

These categorizations are used to teach the analyzer incrementally about
which nodes in the bugpath deserve special attention. For now, I plan to
use this information exclusively for finding control dependencies to these
nodes, and tracking their condition. Now, what we mean under "tracking an
expression value", is that to add notes to the bug report relevant to that
expression value.

Ultimately, this means that my project depends greatly on condition
tracking yielding meaningful addition of notes to the bug report, without
adding unimportant ones. Since I more-or-less finished my work on the
must-have-happened case (meaning that the analyzer can now figure out
control dependencies to nodes contained in the bugpath), I'd like to detail
how I plan to work on this.

While evaluating an early prototype solution to the "must-have-happened"
case where the same expression value tracking was used for both the
bug-causing variable and for the conditions, I found that in many cases,
the growth of bug length was intolerable. This is, in part, caused by
conditions being tracked to a condition recursively, the conditions of
asserts being tracked, and that notes about a condition are not as
interesting as notes about the bug causing variable (calls to operator bool
for instance).

Fixing any of these requires me to teach the analyzer the difference in
between "THE value" and "just a condition". The details are a little more
complicated, so I'll show some examples that point out certain cases:

*Example 1.:*

01 int flag;

02 bool coin();

03

04 void foo() {

05   flag = coin();

06 }

07

08 int main() {

09   int *x = 0;

10   foo();

11   if (flag) // assumed true

12     *x = 5; // warn

13 }

In this example, it'd be great to see notes placed on line 10 and 5,
because if flag wasn't invalidated, the bug would not have occurred (since
flag is global, and is initialized to 0). The analyzer normally wouldn't
place notes there, so we definitely should track flag up to line 5.

*Example 2.:*

01 int flag;

02 bool coin();

03

04 void foo() {

05   coin();

06 }

07

08 int main() {

09   int *x = 0;

10   foo();

11   if (flag) // assumed true

12     *x = 5; // warn

13 }

This case is very similar, with the only difference being that the analyzer
conservatively assumed that coin may have written flag (as it's a global
variable). We should track until line 5.

*Example 3.:*

01 void f(int flag) {

02   int *x = 0;

03   if (flag) // assumed true

04     *x = 5; // warn

05 }

Here, the user could simply follow the arrows that shows the path of
execution: it isn't really important what flag was initialized on the first
line.

*Example 4.:*

01 int flag;

02

03 int getInt();

04

05 int main() {

06   int *x = 0;

07   int y = getInt();

08   flag = y;

09   if (flag) // assumed true

10     *x = 5; // warn

11 }

Again, the user could see there was a write made to flag on line 8 -- the
question is, whether we should explain y better. Right now, we're thinking
that we shouldn't, leading to the conclusion that flag here shouldn't be
tracked at all.

*Example 5.:*

01 int flag;

02

03 int getInt();

04

05 void foo() {

06   int y = getInt();

07   flag = y;

08 }

09

10 int main() {

11   int *x = 0;

12 foo();

13   if (flag) // assumed true

14     *x = 5; // warn

15 }

Like Example 1-2, we should explain that flag was written on line 7, but
like in Example 4., we shouldn't track y.

*Example 6.:*

01 void f(int flag) {

02   int *x = 0;

03 assert(flag);

04   if (flag) // assumed true

05     *x = 5; // warn

06 }

Currently, we mention nothing about line 3, yet we say "Taking the true
branch" on line 4, rather then "Assuming the condition is true". This is
because the collapse point (point at which we constrain flag's value to be
true or false) isn't on line 4: the analysis only continued past the assert
since the analyzer assumed flag to be non-zero. In this case, we would like
to track flag to the assert to explain why we are so confident on line 5
that flag is non-zero.

*Example 7.:*

01 int getInt();

02
03 void f(int flag) {

04   int *x = 0;

05 flag = getInt();

06 assert(flag);

07   if (flag) // assumed true

08     *x = 5; // warn

09 }

Like Example 6, we should explain why know that flag is non-zero on line 7
by tracking it back to line 6. Like in the case of Example 4., the user
could see where flag was written, so we wouldn't like to see a note on line
5.

So what's the takeaway?

After teaching the analyzer the difference in between a condition and a
"regularly tracked expression", I plan to implement the following two rules:

Track a condition only if
a.) The collapse point doesn't coincide with the condition point
b.) It was written in a nested stack frame.

We hope that by implementing this, tracking conditions to conditions would
be kept at bay without a counter to limit the depth of recursion, and the
intolerable growth of bug length with drastically shorten. I do expect
skeletons to fall out of the closet, but I am confident that this is a good
initial approach.

As explained earlier, this is mission number one, so I'll prioritize
getting it right before pursuing the "should-not-have-happened" case.

One thing I did not touch on just yet, is the case where an assert was
(correctly, by the way) regarded as a control dependency, and it's
condition was tracked. This is undoubtedly undesirable, but figuring out
whether the condition is found in an assert is rather difficult. Asserts
are often implemented as a macro, and could have a very, for a lack of a
better word, esoteric implementations on certain platforms. We discussed
trying to tinker with the control dependency calculator, namely skipping
over nodes that have two successors and one of them leads to noreturn node,
but I still need time to figure out something reliable.

Thanks to everyone who helped me along: Artem Dergachev, Gábor Horváth,
Ádám Balogh, Jakub Kuderski, and Zoltán Porkoláb!

Cheers,
Kristóf

[1]  http://lists.llvm.org/pipermail/cfe-dev/2019-June/062535.html
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190618/a741def4/attachment.html>