[cfe-dev] [analyzer] exploration strategies and paths
George Karpenkov via cfe-dev
cfe-dev at lists.llvm.org
Thu Feb 1 11:41:37 PST 2018
> On Jan 31, 2018, at 10:15 PM, Anna Zaks <ganna at apple.com> wrote:
>
>
>
>> On Jan 31, 2018, at 9:59 PM, George Karpenkov <ekarpenkov at apple.com <mailto:ekarpenkov at apple.com>> wrote:
>>
>> On sqlite there are 8 new reports and one removed reports.
>>
>> I think it’s natural that the number is high. In many cases we are saving 3x of the analyzer budget,
>> which is then spent finding new bugs.
>
> A high jump in basic blocks coverage could be explained by your argument. (By the way, have you measured that?)
>
> However, whenever I see that we start reporting hundreds of new bugs on a single codebase it looks suspicious. Will the developers really want and need to fix more than 3 hundred bugs on postgresql?
Right, I see your point.
I think that the right metric here is the proportion of false positives (after all, it’s easy to minimize the absolute number of false positives by not returning any results at all).
Coming back to postgres, out of 377 new results, 289 are on the auto-generated yacc/bison files.
We have seen those reports before, albeit in a lesser quantity.
I have previously investigated them, and basically the root cause is that yacc/bison maintains some complex invariants
(e.g. branch A could only be taken if branch B was taken before due to the global state being set up in a certain way)
the analyzer can not pick up.
This problem is not specific to any exploration strategy, and the mitigation is just disabling analysis on those files (after all, it would not make sense to fix the auto-generated code).
> Often when we dig deeper we find out that either we became too aggressive in reporting what people will view as false positives or we are reporting duplicate reports that have the same underlying cause. It’s possible postgresql is really that buggy, but we really need to be careful here.
Of course, that’s why I’m looking into those files. So far, no problems have been noticed.
I have also evaluated the results on XNU, and uploaded the results to https://reviews.llvm.org/M1
In # of reports I get:
133 reports added
57 reports removed
More interesting is the change in report size, shown at https://reviews.llvm.org/M1/18/
We can see there are about ~8 reports (for the same bug) getting shorter by a factor of ~3, and ~3 reports getting shorter by a factor of ~12.
>
>>
>>> On Jan 31, 2018, at 9:52 PM, Anna Zaks <ganna at apple.com <mailto:ganna at apple.com>> wrote:
>>>
>>>
>>>
>>>> On Jan 31, 2018, at 9:41 PM, George Karpenkov <ekarpenkov at apple.com <mailto:ekarpenkov at apple.com>> wrote:
>>>>
>>>>
>>>>
>>>>> On Jan 31, 2018, at 8:46 PM, Anna Zaks <ganna at apple.com <mailto:ganna at apple.com>> wrote:
>>>>>
>>>>>
>>>>>> On Jan 31, 2018, at 5:00 PM, George Karpenkov <ekarpenkov at apple.com <mailto:ekarpenkov at apple.com>> wrote:
>>>>>>
>>>>>> The list did not like a posting with many images,
>>>>>> so I have posted an evaluation to phabricator: https://reviews.llvm.org/M1 <https://reviews.llvm.org/M1>
>>>>>>
>>>>>> The text part was:
>>>>>>
>>>>>> After fixing a few bugs, another evaluation of the approach shows considerably better results.
>>>>>>
>>>>>> On openssl:
>>>>>>
>>>>>> 9 reports added
>>>>>> 1 report removed
>>>>>> On postgresql:
>>>>>>
>>>>>> 377 reports added
>>>>>> 43 reports removed
>>>>>> On sqlite3 + a few other misc files:
>>>>>>
>>>>>> 239 reports added
>>>>> This is a lot of additional reports! Are there actually that many bugs in that codebase that need to be fixed?
>>>>
>>>> Out of 239 or in general? The 239 reports mostly come from non-sqlite files.
>>>
>>> Can you provide data just for sqlite?
>>>
>>> For postgresql the number of additional reports also seems very high.
>>>
>>>> In general, for most practical purposes C codebases provide an infinite supply of bugs :P
>>>>
>>>> On a more serious note, I don’t think this is very surprising.
>>>> The previous emails in this chain have shown that for loops, the analyzer has a very high chance of entering
>>>> a degenerate behavior where the longest path through the loop will be evaluated first,
>>>> and a large chunk of the analyzer budget is then spent on going around the loop in circles.
>>>>
>>>> Under new exploration strategy, paths which increase coverage are explored first, and then we can actually find
>>>> bugs with the budget which is no longer spent going in circles.
>>>>
>>>>> I think we need to manually evaluate these reports.
>>>>
>>>> Yes, I’ve looked through quite a few of them, the false positive ratio seems to be actually getting lower,
>>>> as the probability of the report being a false positive grows with the path length,
>>>> and path lengths are getting much shorter.
>>>>
>>>>> Also, we have to make sure uniquing works. Do all regression tests pass?
>>>>
>>>> Yes.
>>>>>> 1 report removed
>>>>>> Note on histograms (here and below)
>>>>>>
>>>>>> -> Histograms only show the ratio for same bugs (compared using issue hash),
>>>>>> that is, if the histogram says “decrease by a factor of three”, it means the new approach finds the *same* bug
>>>>>> with a path size 1/3d of the original
>>>>>> -> Histograms omit data points where the path length has remained the same
>>>>>> (as otherwise they completely dominate the histogram)
>>>>>> -> Relative histograms are provided as both ratio and logarithm of the ratio.
>>>>>> Logarithms of the ratio are convenient as they are symmetric in case changes balance out
>>>>>> (e.g. log(1/2) = -log(2/1))
>>>>>>
>>>>>>> On Jan 30, 2018, at 4:23 PM, George Karpenkov <ekarpenkov at apple.com <mailto:ekarpenkov at apple.com>> wrote:
>>>>>>>
>>>>>>> Preliminary evaluation of a patch which prefers exploring nodes associated with statements which weren’t seen before first:
>>>>>>>
>>>>>>> On openssl:
>>>>>>>
>>>>>>> - Adds four reports
>>>>>>> - Removes four reports
>>>>>>> - Path lengths before: 317, 75, 75, 72, 70, 58, 50, 50, 44, 36, 23, 23, 21, 21, 20, 20, 19, 19, 19, 19, 18, 18, 18, 16, 15, 15, 15, 14, 13, 13, 12, 11, 11, 9, 7, 7, 6, 4
>>>>>>> - Path lengths after: 72, 60, 59, 53, 53, 52, 46, 38, 37, 30, 29, 28, 23, 21, 20, 19, 19, 19, 19, 19, 18, 16, 15, 15, 15, 15, 13, 13, 12, 12, 11, 9, 8, 7, 7, 7, 6, 4
>>>>>>>
>>>>>>> The quality of the added reports seems higher, mainly due to the fact that report length is shorter.
>>>>>>>
>>>>>>> On postgresql:
>>>>>>>
>>>>>>> - Added 80 reports
>>>>>>> - Removed 154 reports
>>>>>>> -> Out of those, 72 are reports on the yacc/bison autogenerated files, so whatever the cause is, good thing they are gone
>>>>>>> - The overall number of reports is 1188
>>>>>>> - Path lengths are lower on overall, but not in such a dramatic way
>>>>>>> - For many reports, I am quite confused as to why they got removed
>>>>>>>
>>>>>>> On sqlite:
>>>>>>>
>>>>>>> - 7 inserted, 7 removed
>>>>>>>
>>>>>>>> On Jan 30, 2018, at 1:10 PM, Artem Dergachev <noqnoqneo at gmail.com <mailto:noqnoqneo at gmail.com>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 30/01/2018 12:40 PM, Gábor Horváth via cfe-dev wrote:
>>>>>>>>> Hi George, Artem,
>>>>>>>>>
>>>>>>>>> I am glad that you are looking into this problem!
>>>>>>>>>
>>>>>>>>> On 30 January 2018 at 01:12, George Karpenkov via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> <mailto:cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I was investigating recently bug reports with very long analyzer
>>>>>>>>> paths (more than a few hundred nodes).
>>>>>>>>> In many of such cases the path is long for no good reason: namely,
>>>>>>>>> the analyzer would go 3 times around the loop before
>>>>>>>>> going further.
>>>>>>>>> The issue is surprisingly common, and it was exacerbated with a
>>>>>>>>> recent bump of analyzer thresholds.
>>>>>>>>>
>>>>>>>>> The problem is reproduced on the following file:
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> extern int coin();
>>>>>>>>>
>>>>>>>>> int foo() {
>>>>>>>>> int *x = 0;
>>>>>>>>> while (coin()) {
>>>>>>>>> if (coin())
>>>>>>>>> return *x;
>>>>>>>>> }
>>>>>>>>> return 0;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> void bar() {
>>>>>>>>> while(coin())
>>>>>>>>> if (coin())
>>>>>>>>> foo();
>>>>>>>>> }
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> While a shortest path to the error does not loop around, the
>>>>>>>>> current version of the analyzer
>>>>>>>>> will go around the loop three times before going further.
>>>>>>>>> (and we are quite fortunate that the unrolling limit for loops is
>>>>>>>>> three, otherwise it would keep going
>>>>>>>>> until the unrolling limit is reached).
>>>>>>>>>
>>>>>>>>> Multiple issues were discovered during the investigation.
>>>>>>>>>
>>>>>>>>> 1. Analyzer queue does not have a concept of priority, and
>>>>>>>>> performs a simple DFS by default.
>>>>>>>>> Thus if the successor of the if-branch under the loop in “bar"
>>>>>>>>> containing the desired destination is generated second,
>>>>>>>>> it will never be evaluated until the loop exploration limit is
>>>>>>>>> exhausted.
>>>>>>>>>
>>>>>>>>> 2. The previous issue slows down the exploration, but is not
>>>>>>>>> enough to get a pathological behavior of ultra-long paths.
>>>>>>>>> The second problem is a combination of:
>>>>>>>>> a) Block counter is not a part of a node's identity, and node A
>>>>>>>>> with a small block counter can be merged into a node B with a
>>>>>>>>> large block counter,
>>>>>>>>> and the resulting node will have a block counter associated with B.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sorry for the questions, just wanted to clarify some things. You mean ExplodedNodes? By merge, you mean the same thing as "caching-out"?
>>>>>>>>
>>>>>>>> Yeah, George notices that if we construct the same ExplodedNode on two different paths that have different block counts, we'd cache-out on the latter path, while the worklist element of the first path would still possess the original block count.
>>>>>>>>
>>>>>>>> Which happens a lot when we're evaluating foo() conservatively in this example.
>>>>>>>>
>>>>>>>> This isn't directly related to our problem though, as i noticed in http://lists.llvm.org/pipermail/cfe-dev/2018-January/056719.html <http://lists.llvm.org/pipermail/cfe-dev/2018-January/056719.html> .
>>>>>>>>
>>>>>>>>
>>>>>>>>> b) The issue in (a) is triggered due to our heuristic to abandon
>>>>>>>>> the function’s exploration and switch to conservative evaluation
>>>>>>>>> if we are already *inside* the function and the block limit has
>>>>>>>>> been reached.
>>>>>>>>>
>>>>>>>>> Issue (1) combined with (2-b) causes the problematic behavior: the
>>>>>>>>> issue is discovered on the longest path first,
>>>>>>>>> and by the time the shortest path gets to “bar”, the block limit
>>>>>>>>> is already reached, and the switch to conservative evaluation is
>>>>>>>>> performed.
>>>>>>>>>
>>>>>>>>> Thus there are two mitigation strategies currently being evaluated:
>>>>>>>>>
>>>>>>>>> i) Remove the heuristic in (2-b)
>>>>>>>>> ii) Use a priority queue to hold nodes which should be explored;
>>>>>>>>> prefer nodes which give new source code coverage over others
>>>>>>>>> (or alternatively prefer nodes with least depth of loop stack)
>>>>>>>>>
>>>>>>>>> Me and Artem have evaluated the option (i) and the results were
>>>>>>>>> surprisingly good: some reports disappear, and slightly more
>>>>>>>>> reports reappear.
>>>>>>>>> The quality of the new reports seems to be slightly better, and I
>>>>>>>>> am still trying to figure out exact reasons.
>>>>>>>>> I suspect merges resulting from heuristic (2-b) cause us to lose
>>>>>>>>> some actually valid reports.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I also find the results surprising. If you have more information about the reasons please do not forget to follow up this thread. We are curious :)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Option (ii) has not been evaluated fully yet, but current
>>>>>>>>> experiments show slightly more reports (5-10%), and a radical
>>>>>>>>> decline in report lengths
>>>>>>>>> (e.g. from 400+ to <100 for largest reports)
>>>>>>>>>
>>>>>>>>> Are there any thoughts on the matter?
>>>>>>>>>
>>>>>>>>> Personally I think we should do both (i) and (ii), even if they
>>>>>>>>> would shake up the results.
>>>>>>>>> - The original idea for heuristics (2-b) was to be able to produce
>>>>>>>>> a report even if we are out of budget, but since it actually
>>>>>>>>> results in less reports,
>>>>>>>>> I think the data does not validate the approach.
>>>>>>>>>
>>>>>>>>> - Option (ii) is AFAIK how most similar engines work, and should
>>>>>>>>> get us much larger coverage (and shorter paths) for the same node
>>>>>>>>> budget,
>>>>>>>>> even at the cost of log(N) overhead of the priority queue.
>>>>>>>>> Moreover, not having the priority queue will bite us later if we
>>>>>>>>> ever decide to further
>>>>>>>>> increase the analyzer budget or to increase the unroll limit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I wonder what will the performance implication be. But I also like the idea of having a priority queue. If we find that we get more and better report
>>>>>>>>> but also have worse performance, we can also consider reducing the analysis budget slightly.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Gábor
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> George
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> cfe-dev mailing list
>>>>>>>>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> <mailto:cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>>
>>>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
>>>>>>>>> <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> cfe-dev mailing list
>>>>>>>>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180201/10ad5ec4/attachment.html>
More information about the cfe-dev
mailing list