[cfe-dev] [analyzer] Tracking values for generating bug reports

Kristóf Umann via cfe-dev cfe-dev at lists.llvm.org
Tue Jun 18 17:02:36 PDT 2019


On Wed, 19 Jun 2019 at 01:14, Artem Dergachev <noqnoqneo at gmail.com> wrote:

>
>
> On 6/18/19 3:44 PM, Gábor Horváth wrote:
>
> Hi Kristóf,
>
> Thanks for the report. I have been thinking about what kind of user
> experience should we pursue. While my ideas might be a bit opinionated, see
> my comments inline.
>
> On Tue, 18 Jun 2019 at 20:47, Kristóf Umann <dkszelethus at gmail.com> wrote:
>
>> Hi!
>>
>> This is an update on my GSoC project, "Enhancing bug reports in the Clang
>> Static Analyzer". We had some discussions on phabricator, and some in
>> private meetings, and I think it's due to share how things are looking
>> right now.
>>
>> In my previous mail [1], I detailed the two distinct categories we
>> defined, the "must-have-happened" case where every information is already
>> in the bug path, and the "should-not-have-happened" case when the
>> information isn't found in the bugpath. I divided the latter into two
>> subcategories, the "Inlined category" (initially referred to as "category
>> 1"), and the "Not inlined category" (initally referred to as "category 2").
>>
>> These categorizations are used to teach the analyzer incrementally about
>> which nodes in the bugpath deserve special attention. For now, I plan to
>> use this information exclusively for finding control dependencies to these
>> nodes, and tracking their condition. Now, what we mean under "tracking an
>> expression value", is that to add notes to the bug report relevant to that
>> expression value.
>>
>> Ultimately, this means that my project depends greatly on condition
>> tracking yielding meaningful addition of notes to the bug report, without
>> adding unimportant ones. Since I more-or-less finished my work on the
>> must-have-happened case (meaning that the analyzer can now figure out
>> control dependencies to nodes contained in the bugpath), I'd like to detail
>> how I plan to work on this.
>>
>> While evaluating an early prototype solution to the "must-have-happened"
>> case where the same expression value tracking was used for both the
>> bug-causing variable and for the conditions, I found that in many cases,
>> the growth of bug length was intolerable. This is, in part, caused by
>> conditions being tracked to a condition recursively, the conditions of
>> asserts being tracked, and that notes about a condition are not as
>> interesting as notes about the bug causing variable (calls to operator bool
>> for instance).
>>
>> Fixing any of these requires me to teach the analyzer the difference in
>> between "THE value" and "just a condition". The details are a little more
>> complicated, so I'll show some examples that point out certain cases:
>>
>> *Example 1.:*
>>
>> 01 int flag;
>>
>> 02 bool coin();
>>
>> 03
>>
>> 04 void foo() {
>>
>> 05   flag = coin();
>>
>> 06 }
>>
>> 07
>>
>> 08 int main() {
>>
>> 09   int *x = 0;
>>
>> 10   foo();
>>
>> 11   if (flag) // assumed true
>>
>> 12     *x = 5; // warn
>>
>> 13 }
>>
>> In this example, it'd be great to see notes placed on line 10 and 5,
>> because if flag wasn't invalidated, the bug would not have occurred (since
>> flag is global, and is initialized to 0). The analyzer normally wouldn't
>> place notes there, so we definitely should track flag up to line 5.
>>
>> *Example 2.:*
>>
>> 01 int flag;
>>
>> 02 bool coin();
>>
>> 03
>>
>> 04 void foo() {
>>
>> 05   coin();
>>
>> 06 }
>>
>> 07
>>
>> 08 int main() {
>>
>> 09   int *x = 0;
>>
>> 10   foo();
>>
>> 11   if (flag) // assumed true
>>
>> 12     *x = 5; // warn
>>
>> 13 }
>>
>> This case is very similar, with the only difference being that the
>> analyzer conservatively assumed that coin may have written flag (as it's a
>> global variable). We should track until line 5.
>>
>
> I am not sure about this example. While the invalidation of the global
> flag is definitely playing a role in this bug report the point where the
> flag is actually invalidated could seem quite arbitrary for the user. We
> might get the invalidation when we first see a function that is not defined
> in this TU, when we reach the maximum call stack depth and so on. My point
> is, I am not sure if we actually help the user by pinpointing the first
> function call where the invalidation happens. A more user friendly way to
> think about this problem is how could the user solve this false positive?
> As far as I understand the cannonical solution would be to add an
> assertion, so it would be better to put a node close to the place where the
> user would change the code, so I would put it in the stack frame of the
> bug. For example we could generate a note for the call foo(), that
> somewhere in that call stack the analyzer could no longer track the value
> of flag, thus invalidated its contents. @Artem Dergachev
> <adergachev at apple.com> what do you think?
>
>
> This is definitely one of the most annoying kinds of false positives we
> have due to infeasible paths and over-eager assumption chains. Like, you
> can add an assertion here, but it's going to look super ugly:
>
>   bool tmp_flag = flag;
>   coin(); // why would anybody believe it touches flag in the first
> place???
>   assert(flag == tmp_flag && "sanity check");
>
> And even if the user would actually like to add something like that to his
> code, it would require a powerful constraint solver to even handle this
> assertion correctly.
>
> These are not very common, but they're so annoying that i probably
> wouldn't be against marking reports as invalid immediately when we see that
> an important control dependency relies on such invalidation (we'll have to
> gather experimental data on that, of course).
>
>
Hmm, yea, good point. But if I coin were to take flag as a non-const
reference (other than the fact that flag is global) I guess we would like
to keep the report and place notes all the way there? I agree with Artem,
we really need to see the real world impact of this.

>
>
>>
>> *Example 3.:*
>>
>> 01 void f(int flag) {
>>
>> 02   int *x = 0;
>>
>> 03   if (flag) // assumed true
>>
>> 04     *x = 5; // warn
>>
>> 05 }
>>
>> Here, the user could simply follow the arrows that shows the path of
>> execution: it isn't really important what flag was initialized on the first
>> line.
>>
>> *Example 4.:*
>>
>> 01 int flag;
>>
>> 02
>>
>> 03 int getInt();
>>
>> 04
>>
>> 05 int main() {
>>
>> 06   int *x = 0;
>>
>> 07   int y = getInt();
>>
>> 08   flag = y;
>>
>> 09   if (flag) // assumed true
>>
>> 10     *x = 5; // warn
>>
>> 11 }
>>
>> Again, the user could see there was a write made to flag on line 8 -- the
>> question is, whether we should explain y better. Right now, we're thinking
>> that we shouldn't, leading to the conclusion that flag here shouldn't be
>> tracked at all.
>>
>
> Finding the place where flag was modified in a really big function can
> still be challenging. While generating a note would be an overkill I wonder
> if some middle ground like highlighting the statement without additional
> text would actually make sense for this case.
>
> I think its a matter of a simple comparison of the stack frames that would
decide whether we only want to track the condition if it was in between the
collapse point and the initialization in a nested stack frame, or
regardless of the stack frame. Let's see just see which yields the better
result!

>
> When it's in the same function, you can simply search for the variable
> name to see all writes into it. It's not that hard. Even if the variable is
> written into by pointer, you can see that its address is taken and search
> for the pointer variable as well (though annoying if there are too many
> levels of indirection).
>
> But, hmm, if the address is taken in a different stack frame, this becomes
> much harder. I guess, writes that are made through pointers should be
> highlighted, and we should also do our tracking to explain why do we think
> that this pointer points to that variable.
>
>
Ah yes, our arch enemy, pointer analysis showed up again. I guess I agree
with you, but I plan on approaching aliasing related issues when I got most
of the other stuff working.

>
>
>
>>
>> *Example 5.:*
>>
>> 01 int flag;
>>
>> 02
>>
>> 03 int getInt();
>>
>> 04
>>
>> 05 void foo() {
>>
>> 06   int y = getInt();
>>
>> 07   flag = y;
>>
>> 08 }
>>
>> 09
>>
>> 10 int main() {
>>
>> 11   int *x = 0;
>>
>> 12 foo();
>>
>> 13   if (flag) // assumed true
>>
>> 14     *x = 5; // warn
>>
>> 15 }
>>
>> Like Example 1-2, we should explain that flag was written on line 7, but
>> like in Example 4., we shouldn't track y.
>>
>
> Again, from the user's point of view, I think the ultimate solution would
> be to have an interface, where we does not present the part where y was
> tracked by default, but the user could click on a button to actually expand
> that part of the path in case she consider it interesting. This, of course,
> is not part of you GSoC, I am just wondering what is the ideal that we
> would like to pursue in the long run.
>
>
> A more powerful UI could have indeed solved a lot of these problems by
> telling users to do our work. But given that it comes with a fairly large
> cost of developing such UIs for every tool that integrates the Static
> Analyzer, i'd definitely try to push our own effort of stuffing exactly as
> much information as necessary into the report as far as possible.
>
>
I actually had a number of discussions with Dániel Krupp about this, and
folks in Ericsson seem to be very interested in this direction. This could
be a great compliment to my GSoC as a followup project.

>
>
>>
>> * Example 6.:*
>>
>> 01 void f(int flag) {
>>
>> 02   int *x = 0;
>>
>> 03 assert(flag);
>>
>> 04   if (flag) // assumed true
>>
>> 05     *x = 5; // warn
>>
>> 06 }
>>
>> Currently, we mention nothing about line 3, yet we say "Taking the true
>> branch" on line 4, rather then "Assuming the condition is true". This is
>> because the collapse point (point at which we constrain flag's value to be
>> true or false) isn't on line 4: the analysis only continued past the assert
>> since the analyzer assumed flag to be non-zero. In this case, we would like
>> to track flag to the assert to explain why we are so confident on line 5
>> that flag is non-zero.
>>
>
> What do you mean by tracking tha flag back to the assert? The reason why a
> note is useful because in this case either the assert is wrong or the bug
> is a true positive. But again, when the assert is in the same function we
> might not want to generate additional note for this as this info might
> easily be inferred by following the arrows in the report (but we might
> highlight the assertion line, see my previous comments). In case the assert
> is in a separate (inlined) function it would make much more sense to
> generate a note.
>
>
> Wouldn't the assert already contain a (prunable) note saying "Assuming
> 'flag' is true"? I guess the only problem here is that the note is
> prunable, so it won't be shown if the assert is wrapped into a nested
> function call.
>
> Yup, with -analyzer-config prune-paths=false, you get the following
report, but it's still not explained well why flag is non-zero (the thing
that I'm missing is the connection in between b and flag):

$ cat example_6.cpp

[[noreturn]] void halt();

void assert(int b) {
  if (!b)
    halt();
}

void f(int flag) {
  int *x = 0;
  assert(flag);
  if (flag) // assumed true
    *x = 5; // warn
}

$ clang -cc1 -analyze -analyzer-output=text -analyzer-checker=core
example_6.cpp -analyzer-config prune-paths=false

example_6.cpp:10:3: note: 'x' initialized to a null pointer value
  int *x = 0;
  ^~~~~~
example_6.cpp:11:3: note: Calling 'assert'
  assert(flag);
  ^~~~~~~~~~~~
example_6.cpp:5:7: note: Assuming 'b' is not equal to 0
  if (!b)
      ^~
example_6.cpp:5:3: note: Taking false branch
  if (!b)
  ^
example_6.cpp:11:3: note: Returning from 'assert'
  assert(flag);
  ^~~~~~~~~~~~
example_6.cpp:12:7: note: 'flag' is not equal to 0
  if (flag) // assumed true
      ^~~~
example_6.cpp:12:3: note: Taking true branch
  if (flag) // assumed true
  ^
example_6.cpp:13:8: note: Dereference of null pointer (loaded from variable
'x')
    *x = 5; // warn
     ~ ^

As you said, this example is interesting because assert isn't a macro here,
and flag's collapse point really happens in a nested stack frame. Am I
understanding correctly that maybe we shouldn't pursue tracking a variable
if the collapse point happened in the same stack frame?

>
>
>>
>> * Example 7.:*
>>
>> 01 int getInt();
>>
>> 02 03 void f(int flag) {
>>
>> 04   int *x = 0;
>>
>> 05 flag = getInt();
>>
>> 06 assert(flag);
>>
>> 07   if (flag) // assumed true
>>
>> 08     *x = 5; // warn
>>
>> 09 }
>>
>> Like Example 6, we should explain why know that flag is non-zero on line
>> 7 by tracking it back to line 6. Like in the case of Example 4., the user
>> could see where flag was written, so we wouldn't like to see a note on line
>> 5.
>>
>> So what's the takeaway?
>>
>> After teaching the analyzer the difference in between a condition and a
>> "regularly tracked expression", I plan to implement the following two rules:
>>
>> Track a condition only if
>> a.) The collapse point doesn't coincide with the condition point
>> b.) It was written in a nested stack frame.
>>
>
> Do you mane a) or b), or a) and b)? Also, what to do with "multiple
> collapse point" events like:
>
> If either happens.

>
> if ( a < 10 ) ;
> if ( a < 5) // Here we do add new constraints to a, but we also had other
> constraints before. Do you consider this to coincide or not?
>  ...
>
>
> I'd rather track the condition *until* we cover those points (which
> implies not tracking it at all when there are no b.)-points and the
> a.)-point coincide with the current terminator), rather than track it
> forever if we have those points.
>
> By collapse point i meant the point where the (bool)condition collapses to
> a constant-false or a constant-true (in the true case the condition itself
> doesn't necessarily collapse to a constant 1). It doesn't happen every time
> we introduce a constraint.
>
> Yup, that's what I was planning to go for.

>
>
>
>>
>> We hope that by implementing this, tracking conditions to conditions
>> would be kept at bay without a counter to limit the depth of recursion, and
>> the intolerable growth of bug length with drastically shorten. I do expect
>> skeletons to fall out of the closet, but I am confident that this is a good
>> initial approach.
>>
>> As explained earlier, this is mission number one, so I'll prioritize
>> getting it right before pursuing the "should-not-have-happened" case.
>>
>> One thing I did not touch on just yet, is the case where an assert was
>> (correctly, by the way) regarded as a control dependency, and it's
>> condition was tracked. This is undoubtedly undesirable, but figuring out
>> whether the condition is found in an assert is rather difficult. Asserts
>> are often implemented as a macro, and could have a very, for a lack of a
>> better word, esoteric implementations on certain platforms. We discussed
>> trying to tinker with the control dependency calculator, namely skipping
>> over nodes that have two successors and one of them leads to noreturn node,
>> but I still need time to figure out something reliable.
>>
>
> Some asserts consists of more basic blocks. Having two successors where
> one of them is a noreturn block is not a sufficient condition. Consider
> for example assert(a && b); Here the basic block for evaluating a will
> either go to the next line after assertion or to the evaluation of b
> which is not a noreturn block. While it is true that from node a we will
> either go to the next non-assert block or end up in a noreturn block, the
> noreturn block might not be an immediate successor of node a.
>
>
> Forgot to mention, i have a history with the CFG for asserts that's mostly
> contained in https://reviews.llvm.org/D28023 and
> https://reviews.llvm.org/D35673. This is the reason why i believe that
> some implementation of assert() have a lot of CFG blocks on their own,
> regardless of how many CFG blocks does it take to evaluate the assert
> condition.
>
> Thanks! :D Right now, incorrect tracking of assert conditions isn't on the
top of my priority list, but I'll think about this when I'm sitting on the
bus or watering the garden.

>
> Regards,
> Gabor
>
>
>>
>> Thanks to everyone who helped me along: Artem Dergachev, Gábor Horváth,
>> Ádám Balogh, Jakub Kuderski, and Zoltán Porkoláb!
>>
>> Cheers,
>> Kristóf
>>
>> [1]  http://lists.llvm.org/pipermail/cfe-dev/2019-June/062535.html
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190619/6d8bf68f/attachment.html>


More information about the cfe-dev mailing list