[cfe-dev] Process in UncheckedReturn

Mon Jun 6 20:14:01 PDT 2011

在 2011年6月2日 下午3:36，Ted Kremenek <kremenek at apple.com>写道：

> On Jun 1, 2011, at 7:04 PM, 章磊 wrote:
>
>
> 2011/6/2 Ted Kremenek <kremenek at apple.com>
>
>>   On May 30, 2011, at 10:48 PM, 章磊 wrote:
>>
>> Soon i found i was wrong in step 1, i should not use any structure in
>> checker to record path-sensitive states. So i need other ways to handle
>> this:
>>
>>    1. It seems ok that only to update LocalPathMap when end path(so we
>>    can record in GRState until the path analysis finished), but how should we
>>    take care of the dead symbols? Maybe i could keep that dead symbols instead
>>    of removing them?
>>    2. Keep the <CE, CheckedReturnState> in GRState, so we get it over.
>>    But how to effectively keep these information.
>>
>>
>>
>>
>> Hi Lei,
>>
>> Besides aggregating statistics in checkEndPath, another option is to also
>> doing this in checkDeadSymbols.  You then don't need to worry about symbols
>> that have been removed from GRState by the time you reach checkEndPath,
>> since you've already done the aggregation.  Of course this influences your
>> counting, but in all fairness we aren't enumerating all paths explicitly
>> anyway, so your counts are directly influenced by how we explore paths,
>> which paths are explored, and how paths are coalesced.
>>
>> Ted
>>
>
>  Hi Ted,
>
> It's ok to do aggregating in checkDeadSymbols, and that's how i do in my
> patch.
>
>
> Hi Lei,
>
> I don't see that.  I see a call to updateCRState, and that sets an entry in
> LocalPathMap, but I don't see that as counting the number of times a given
> CallExpr's return value is checked.  Is it meant just to record that the
> value was checked on a given path?
>
> I'm concerned that LocalPathMap isn't the right model of what you want.
>  The analyzer doesn't trace out all paths individually; it will coalesce
> paths together as it sees them having the same state.  It is possible by the
> time you reach a call to 'checkEndPath' that an infinite number of paths
> were coalesced together into one by the time you hit the end of a function.
>
> My problem is if we do so, the deadsymbols are path-related, so we must
> keep the statistics in GRState instead of checker itself.
>
>
> Why?  I'm really confused here.
>
> I think trying to collect statistics in GRState itself is inherently the
> wrong approach.  Statistics should be about aggregating what you have
> observed; the checker itself should only focus on the semantics, which in
> this case is whether or not a return value was checked.
>
> Is it ok to add some GRState like llvm::immutablemap<CE,
> CheckedReturnState>, without related to some symbols?
>
>
> Yes, but what would it record?  What would it represent?
>
> Or we must keep the GRState like llvm::immutablemap<sym, llvm::DenseMap<CE,
> CheckedReturnState>>?
>
>
> No, that would not be a good approach.  DenseMap is not an immutable data
> structure, and should not be a value in ImmutableMap.
>
> Stepping back…  how are you trying to count things?  A few concerns come to
> mind…
>
> 1) If you are trying to count statistics along individual paths, your
> current approach isn't really working as you intend.  You are essentially
> relying not he analyzer exploring the state space in a particular way, and
> (as I've said before) you aren't actually enumerating individual paths.
>  This means the attempt to use LocalPathMap to count such statistics is
> inherently not measuring per-path information.
>
> 2) Even if we could enumerate all paths, is counting them individually
> really the right answer?  There could be an infinite number of paths.
>
> 3) Paths don't measure everything, and they can really over count the same
> cases.
>
> I think you can make counting far more simple, and simplify your checker.
>
> Here's my suggestion:
>
> 1) Have the checker track the CheckedReturnState for each symbol, as it
> does now.
>
> 2) When a symbol is dead, or we reach the end of a path, count it in the
> following way…
>
>  void processSymbol(const GRState *state, SymbolRef Sym) {
>    const CheckedReturnState *CS = state->get<CheckedReturnState>(Sym))
>    if (!CS)
>      return;
>    std::pair<unsigned,unsigned> &stats = CheckedSymStatistics[Sym];
>    if (CS->isChecked())
>      stats.first++;
>    else
>      stats.second++;
>  }
>
> This will cause you to get statistics for each symbol.  The next stage is
> then to aggregate them per call site.
>
>   void aggregateByCallSite() {
>     for (llvm::DenseMap<SymbolRef, std::pair<unsigned, unsigned>
> >::iterator i = CheckedStatistics.begin(); … ) {
>        SymbolRef sym = i->first;
>        std::pair<unsigned, unsigned> stats = i->second;
>
>        // Compute normalized statistics for that symbol.
>        double sum = stats.first + stats.second;
>        std::pair<double, double> normalizedStats(stats.first / sum,
> stats.second / sum);
>
>        // Aggregate that count into the call site count.
>        const CallExpr *CE = getCallSite(sym);
>        std::pair<double, double> &callsiteStat =
> CheckedCallsiteStatistics[CE];
>        callsiteStat.first += normalizedStats.first;
>        callsiteStat.second += normalizedStats.second;
>    }
>    // then loop over CheckedCallSiteStatistics and normalize the statistics
> for each entry
>  }
>
> I've left some details out, but basically there are a variety of ways to go
> from a symbol to the original call site.  Either we can record it in a
> DenseMap (on the side), or if it is a conjured symbol, we can look at the
> symbol itself to figure out where it came from.
>
> How this counting works:
>
> (a) Paths help us figure out the behavior of a given symbol.  Since
> counting paths is problematic, we just normalize the counts for a given
> symbol.  This essentially gives us a probability distribution.
>
> (b) Aggregating normalized counts for symbols into their CallExprs gives us
> "counts" for a specific CallExpr.  This is an arbitrary choice to weigh
> symbols in this way, but it causes spurious paths to not be over counted.
>
> (c) Normalizing the aggregated counts for CallExprs gives us another
> probability distribution.  It is basically the probability, for a given
> CallExpr, whether its value was checked, factoring out some redundancy from
> multiple paths.
>
> (d) Using the normalized counts (probability distribution) for CallExprs,
> each CallExpr then has a single count, which you can then use to count
> whether or not a given function has it's return value checked or not.
>
> This counting strategy obviously biases towards inferring properties of
> functions that are *called in more places* versus *appear in more paths*,
> which I think is a better balance than trying to make sense of the different
> paths explored by the analyzer.  It also allows you to skirt the path
> explosion problem entirely.  As long as you think you have traced a
> "representative set of paths", the statistics will yield a confidence
> measure in how likely a given function's return value should be checked,
> regardless of whether or not you were able to evaluate all paths (which IMO
> isn't terribly relevant, as long as you are able to collect representative
> evidence of behavioral trends in the code).
>
> What do you think?  Sorry for sounding so aggressive; it's late here and
> I'm tired (so my wording isn't as finessed as I would like), but I wanted to
> strongly get across that I think counting needs to not be so biased by how
> we count paths.  Of course you are free to disagree.
>
> Cheers,
> Ted
>

Hi Ted,

Thank you very much for your reply and sorry for misunderstanding the word "
aggregate".

I thought that we can do aggregation only when a path is fully analyzed, so
when you mention the aggregation in checkDeadSymbols, i didn't get what you
mean to do.

I think this counting strategy is great, but i have a little problem:

IMO, it's a trade-off between counting statistics along individual paths and
counting statistics in your way. When there is no path explosion, they give
the same counts. But when the path explodes, they works different.

In the first way, we will fully analyze part of all the paths and we can
give the accurate results of those paths. But we will leave out some paths
because of the step limit and the analyzer should work in a particular way.
In the second way, we do not fully analyze a path, we will call checkEndPath
while we are not meet the actual path end, so we can not guarantee the
states we are counting are totally right(the aggregation in checkDeadSymbols
is ok, since all the dead symbols will not be checked; but in checkEndPath
some symbols may be checked after this "step limited end path"). And could
this results be the "representative evidence of behavioral trends" ?

If the second way is ok, and since we should not force the analyzer to
exploring
the state space in a particular way, i will do it in the second way.

-- 
Best regards!

Lei Zhang
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110607/93ff7fd8/attachment.html>