[cfe-dev] [analyzer] Tracking values for generating bug reports

Tue Jun 18 16:04:27 PDT 2019

Sorry for the amount of mails. I just wanted to share a random idea. What
if you cut every (inbound) edge of all noreturn basic blocks temporarily
before calculating the control dependencies and restore the edges
afterwards. I did not work through any examples as it is 1 am here,
nevertheless, this might worth exploring.

Gábor Horváth <xazax.hun at gmail.com> ezt írta (időpont: 2019. jún. 19., Sze
0:44):

> Hi Kristóf,
>
> Thanks for the report. I have been thinking about what kind of user
> experience should we pursue. While my ideas might be a bit opinionated, see
> my comments inline.
>
> On Tue, 18 Jun 2019 at 20:47, Kristóf Umann <dkszelethus at gmail.com> wrote:
>
>> Hi!
>>
>> This is an update on my GSoC project, "Enhancing bug reports in the Clang
>> Static Analyzer". We had some discussions on phabricator, and some in
>> private meetings, and I think it's due to share how things are looking
>> right now.
>>
>> In my previous mail [1], I detailed the two distinct categories we
>> defined, the "must-have-happened" case where every information is already
>> in the bug path, and the "should-not-have-happened" case when the
>> information isn't found in the bugpath. I divided the latter into two
>> subcategories, the "Inlined category" (initially referred to as "category
>> 1"), and the "Not inlined category" (initally referred to as "category 2").
>>
>> These categorizations are used to teach the analyzer incrementally about
>> which nodes in the bugpath deserve special attention. For now, I plan to
>> use this information exclusively for finding control dependencies to these
>> nodes, and tracking their condition. Now, what we mean under "tracking an
>> expression value", is that to add notes to the bug report relevant to that
>> expression value.
>>
>> Ultimately, this means that my project depends greatly on condition
>> tracking yielding meaningful addition of notes to the bug report, without
>> adding unimportant ones. Since I more-or-less finished my work on the
>> must-have-happened case (meaning that the analyzer can now figure out
>> control dependencies to nodes contained in the bugpath), I'd like to detail
>> how I plan to work on this.
>>
>> While evaluating an early prototype solution to the "must-have-happened"
>> case where the same expression value tracking was used for both the
>> bug-causing variable and for the conditions, I found that in many cases,
>> the growth of bug length was intolerable. This is, in part, caused by
>> conditions being tracked to a condition recursively, the conditions of
>> asserts being tracked, and that notes about a condition are not as
>> interesting as notes about the bug causing variable (calls to operator bool
>> for instance).
>>
>> Fixing any of these requires me to teach the analyzer the difference in
>> between "THE value" and "just a condition". The details are a little more
>> complicated, so I'll show some examples that point out certain cases:
>>
>> *Example 1.:*
>>
>> 01 int flag;
>>
>> 02 bool coin();
>>
>> 03
>>
>> 04 void foo() {
>>
>> 05   flag = coin();
>>
>> 06 }
>>
>> 07
>>
>> 08 int main() {
>>
>> 09   int *x = 0;
>>
>> 10   foo();
>>
>> 11   if (flag) // assumed true
>>
>> 12     *x = 5; // warn
>>
>> 13 }
>>
>> In this example, it'd be great to see notes placed on line 10 and 5,
>> because if flag wasn't invalidated, the bug would not have occurred (since
>> flag is global, and is initialized to 0). The analyzer normally wouldn't
>> place notes there, so we definitely should track flag up to line 5.
>>
>> *Example 2.:*
>>
>> 01 int flag;
>>
>> 02 bool coin();
>>
>> 03
>>
>> 04 void foo() {
>>
>> 05   coin();
>>
>> 06 }
>>
>> 07
>>
>> 08 int main() {
>>
>> 09   int *x = 0;
>>
>> 10   foo();
>>
>> 11   if (flag) // assumed true
>>
>> 12     *x = 5; // warn
>>
>> 13 }
>>
>> This case is very similar, with the only difference being that the
>> analyzer conservatively assumed that coin may have written flag (as it's a
>> global variable). We should track until line 5.
>>
>
> I am not sure about this example. While the invalidation of the global
> flag is definitely playing a role in this bug report the point where the
> flag is actually invalidated could seem quite arbitrary for the user. We
> might get the invalidation when we first see a function that is not defined
> in this TU, when we reach the maximum call stack depth and so on. My point
> is, I am not sure if we actually help the user by pinpointing the first
> function call where the invalidation happens. A more user friendly way to
> think about this problem is how could the user solve this false positive?
> As far as I understand the cannonical solution would be to add an
> assertion, so it would be better to put a node close to the place where the
> user would change the code, so I would put it in the stack frame of the
> bug. For example we could generate a note for the call foo(), that
> somewhere in that call stack the analyzer could no longer track the value
> of flag, thus invalidated its contents. @Artem Dergachev
> <adergachev at apple.com> what do you think?
>
>
>>
>> *Example 3.:*
>>
>> 01 void f(int flag) {
>>
>> 02   int *x = 0;
>>
>> 03   if (flag) // assumed true
>>
>> 04     *x = 5; // warn
>>
>> 05 }
>>
>> Here, the user could simply follow the arrows that shows the path of
>> execution: it isn't really important what flag was initialized on the first
>> line.
>>
>> *Example 4.:*
>>
>> 01 int flag;
>>
>> 02
>>
>> 03 int getInt();
>>
>> 04
>>
>> 05 int main() {
>>
>> 06   int *x = 0;
>>
>> 07   int y = getInt();
>>
>> 08   flag = y;
>>
>> 09   if (flag) // assumed true
>>
>> 10     *x = 5; // warn
>>
>> 11 }
>>
>> Again, the user could see there was a write made to flag on line 8 -- the
>> question is, whether we should explain y better. Right now, we're thinking
>> that we shouldn't, leading to the conclusion that flag here shouldn't be
>> tracked at all.
>>
>
> Finding the place where flag was modified in a really big function can
> still be challenging. While generating a note would be an overkill I wonder
> if some middle ground like highlighting the statement without additional
> text would actually make sense for this case.
>
>
>>
>> *Example 5.:*
>>
>> 01 int flag;
>>
>> 02
>>
>> 03 int getInt();
>>
>> 04
>>
>> 05 void foo() {
>>
>> 06   int y = getInt();
>>
>> 07   flag = y;
>>
>> 08 }
>>
>> 09
>>
>> 10 int main() {
>>
>> 11   int *x = 0;
>>
>> 12 foo();
>>
>> 13   if (flag) // assumed true
>>
>> 14     *x = 5; // warn
>>
>> 15 }
>>
>>
>> Like Example 1-2, we should explain that flag was written on line 7, but
>> like in Example 4., we shouldn't track y.
>>
>
> Again, from the user's point of view, I think the ultimate solution would
> be to have an interface, where we does not present the part where y was
> tracked by default, but the user could click on a button to actually expand
> that part of the path in case she consider it interesting. This, of course,
> is not part of you GSoC, I am just wondering what is the ideal that we
> would like to pursue in the long run.
>
>
>>
>> *Example 6.:*
>>
>>
>> 01 void f(int flag) {
>>
>> 02   int *x = 0;
>>
>> 03 assert(flag);
>>
>> 04   if (flag) // assumed true
>>
>> 05     *x = 5; // warn
>>
>> 06 }
>>
>> Currently, we mention nothing about line 3, yet we say "Taking the true
>> branch" on line 4, rather then "Assuming the condition is true". This is
>> because the collapse point (point at which we constrain flag's value to be
>> true or false) isn't on line 4: the analysis only continued past the assert
>> since the analyzer assumed flag to be non-zero. In this case, we would like
>> to track flag to the assert to explain why we are so confident on line 5
>> that flag is non-zero.
>>
>
> What do you mean by tracking tha flag back to the assert? The reason why a
> note is useful because in this case either the assert is wrong or the bug
> is a true positive. But again, when the assert is in the same function we
> might not want to generate additional note for this as this info might
> easily be inferred by following the arrows in the report (but we might
> highlight the assertion line, see my previous comments). In case the assert
> is in a separate (inlined) function it would make much more sense to
> generate a note.
>
>
>>
>> *Example 7.:*
>>
>>
>> 01 int getInt();
>>
>> 02
>> 03 void f(int flag) {
>>
>> 04   int *x = 0;
>>
>> 05 flag = getInt();
>>
>> 06 assert(flag);
>>
>> 07   if (flag) // assumed true
>>
>> 08     *x = 5; // warn
>>
>> 09 }
>>
>> Like Example 6, we should explain why know that flag is non-zero on line
>> 7 by tracking it back to line 6. Like in the case of Example 4., the user
>> could see where flag was written, so we wouldn't like to see a note on line
>> 5.
>>
>> So what's the takeaway?
>>
>> After teaching the analyzer the difference in between a condition and a
>> "regularly tracked expression", I plan to implement the following two rules:
>>
>> Track a condition only if
>> a.) The collapse point doesn't coincide with the condition point
>> b.) It was written in a nested stack frame.
>>
>
> Do you mane a) or b), or a) and b)? Also, what to do with "multiple
> collapse point" events like:
>
> if ( a < 10 ) ;
> if ( a < 5) // Here we do add new constraints to a, but we also had other
> constraints before. Do you consider this to coincide or not?
>  ...
>
>
>>
>> We hope that by implementing this, tracking conditions to conditions
>> would be kept at bay without a counter to limit the depth of recursion, and
>> the intolerable growth of bug length with drastically shorten. I do expect
>> skeletons to fall out of the closet, but I am confident that this is a good
>> initial approach.
>>
>> As explained earlier, this is mission number one, so I'll prioritize
>> getting it right before pursuing the "should-not-have-happened" case.
>>
>> One thing I did not touch on just yet, is the case where an assert was
>> (correctly, by the way) regarded as a control dependency, and it's
>> condition was tracked. This is undoubtedly undesirable, but figuring out
>> whether the condition is found in an assert is rather difficult. Asserts
>> are often implemented as a macro, and could have a very, for a lack of a
>> better word, esoteric implementations on certain platforms. We discussed
>> trying to tinker with the control dependency calculator, namely skipping
>> over nodes that have two successors and one of them leads to noreturn node,
>> but I still need time to figure out something reliable.
>>
>
> Some asserts consists of more basic blocks. Having two successors where
> one of them is a noreturn block is not a sufficient condition. Consider
> for example assert(a && b); Here the basic block for evaluating a will
> either go to the next line after assertion or to the evaluation of b
> which is not a noreturn block. While it is true that from node a we will
> either go to the next non-assert block or end up in a noreturn block, the
> noreturn block might not be an immediate successor of node a.
>
> Regards,
> Gabor
>
>
>>
>> Thanks to everyone who helped me along: Artem Dergachev, Gábor Horváth,
>> Ádám Balogh, Jakub Kuderski, and Zoltán Porkoláb!
>>
>> Cheers,
>> Kristóf
>>
>> [1]  http://lists.llvm.org/pipermail/cfe-dev/2019-June/062535.html
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190619/62ce65dc/attachment.html>