[cfe-commits] [Patch]LocationContext Support for CheckedReturn

Fri Apr 22 17:26:45 PDT 2011

On Apr 18, 2011, at 2:01 AM, 章磊 wrote:

> Hi Ted,
> 
> Sorry for this late reply, i focused on something else these days.
> 
> I think it's great to keep a statistical checker working fine in IPA mode, so if the statistics for a checker can store in the checker instance, i'm glad to make it happen.
> 
> IMO, statistical checker should use (1) "global" data. When i started this job, as you say the lifetime of a Checker object was bound to the lifetime of an ExprEngine. But my checker want to gather "global" data, so i creat a global DenseMap to store them.
> 
> Later, after your reply to my patch, i thought maybe it should be done in two steps, first step in the LocationContexts, second step in some structure has the lifetime of whole translation unit analysis.
> 
> Now, since the Checkers managed by CheckerManager, cam we store the "global" data rather than the "local" data in the Checker instance?

Yes, precisely.  It's not clear what we will do when we start doing cross-translation unit analysis, but we can cross that bridge when we come to it.

> 
> Then we can focus on how to correctly gather the statistics in specific checker.
> 
> What i am afraid most is to miscount the statistics, this will make our statistical checker meaningless.
> 
> here are two examples, the first one:
> 
> int f1();
> 
> void f2() {
>   f1();
> }
> 
> int main() {
>   f2();
>   f2();
> }
> 
> So there are two f2() calls in main. When we inlining the first f2(), f1()'s unchecked return count +1; but when we inlining the second f2(), f1()'s unchecked return count should not change.
> The unchecked return count should depends on how many times the programmer miswrite the code, not how many times a function was called. So it likes a "textual" problem.

I think this is easily handled by recording in the Checker whether or not the body of f2() was already analyzed (and thus don't record any more statistics for calls within that function body) OR, possibly more precisely, whether or not a callsite was previously counted when f2() was inlined the first time (or any other previous time).  Strategies like these allow the Checker to be the decider on what statistics to collect and when.  The bookkeeping for this could be kept as global data within the Checker object itself (just like the statistics).

> 
> Then comes the second example:
> 
> int f1();
> int f2(int arg);
> int f3(int arg);
> 
> int f4(int arg) {
>   int i = f1();
>   if (arg) {
>     f2(i);
>     return 0;
>   }
> 
>   if (i) {
>     f3(i);
>     return 1;
>   }
> 
>   return 2;
> }
> 
> There are several paths in the Example, in some paths i is checked, while not in others. How do we count the statistics?

That's ultimately a policy decision on how you want to collect statistics.  Here are a couple possibilities:

(1) Take the average.  Count the number of *uses* (not paths) a value was checked/unchecked within a function body, and then average.  That could represent a single "count" for the call site of the function that returned the value.  For example, there are 2 uses of 'i', one where it is never checked ("f2(i)"), and one where it is always checked ("f3(i)").  The average count could be 0.5.

(2) Don't take the average, and count by uses instead.  That skews to large functions with many uses.

There are many options.  Simple statistical counting will probably not taking into account correlated counts; e.g. within a given function body certain functions will consistently be used correctly/incorrectly, and that should be weighed when doing the global tabulation of the trend.

> IMO, unless in all the paths i was checked, we do the checked count +1; otherwise the unchecked count +1;

That's definitely conservative, but it may penalize the amount of information you extract from a given call site.

> 
> After some more thought, i think statistical checkers(unchecked return) need not to deactivate IPA. Instead, we can do something like:
> 
> When enter a function body in the first time, we do the unchecked return analysis and update the statistic.
> When enter a function body what we have analysised before, we deactivate the unchecked return checker.

> 
That's a reasonable first cut, but even that may be overly pessimistic.  Consider the case where we look at the counts for *f3*:

void f1(int x);
void f2() {
  f1(0);
  f1(1);
}

int f3();
void f4(int z);

void f1(int x) {
  int y = f3();
  if (x == 1) {
   f4(y);
   return;
 }
 if (z) {
   f4(z);
 }
}

When we analyze 'f1', if we do it via inlining we may analyze it first with the parameter being known to be '0'.  This means that the return value from f3() is considered checked.  The second time we analyze f1() with the parameter known to be '1', we return value is not checked.  Thus, depending on what assumptions we make on the input values (and the possible paths they imply), we get different counts.

In the end, it really depends on the distribution of the data you want to analyze.  Inlining gives you (possibly) more precise information about feasible paths, which means it could possibly make your statistics more accurate.  Alternatively, it can make your counts skewed.  I think if you employ some kind of averaging, you should be able to reconcile some of these conflicts, merging counts and in the process making individual counts (potentially) more accurate.

At any rate, once you get a clear counting scheme that takes into account potentially multiple contexts that might cause the same thing to be counted multiple times, I think you will have the flexibility of looking at the statistical results when both inlining is enabled/disabled and be able to directly compare the results.  This will allow you to decide whether inlining provides any benefits for the statistical checker, or if it just adds noise or skew to your inference.

Cheers,
Ted