[cfe-dev] [analyzer][RFC] The future of StdLibraryFunctionsChecker and other checkers responsibility

Thu Jun 25 07:19:40 PDT 2020

+Balázs Kéri

On Thu, 25 Jun 2020 at 10:39, Gábor Márton <martongabesz at gmail.com> wrote:

> One of my goals with StdLibraryFunctionsChecker is to add argument
> constraints to functions in the C, C++ and POSIX standard. It is very easy
> to gather these argument constraints and apply them massively. By doing
> this, the analysis can be more precise, so, I see a lot of gain here.
> However, some of these argument constraints could be handled in more
> specific checkers too, e.g. in CStringModeling, MallocChecker or in
> StreamChecker.
>

For those that didn't follow development closely, argument constraints in
this context mean the following: For select functions, we define a
signature we can use to match a call to this function (for C functions,
that is return value, function name and argument types), and a so called
"summary", which is a list of constraints on the arguments or the return
value. For a function like fread(), the signature is size_t fread(void *,
size_t, size_t, FILE *), and the summary is that the stream argument and
the buffer must be non-null, and that the buffer must be at least the size
of (2nd arg * 3rd arg). We can also constrain the return value, that is is
at least (2nd arg * 3rd arg).

If on the function call any of the preconditions isn't met, we can emit a
warning, otherwise we can increase the precision of the analysis by
constraining the arguments.

This is a form of summary based analysis, and its pretty easy to pull off
on standard functions because they have, well, standardized signatures and
constraints.

(this isn't pulled from the checker, just a theoretical example)

> Let's take for instance `popen`. In the mentioned patch I am adding a
> simple summary with argument constraints. However, sooner or later popen
> could be modeled in StreamChecker once that is mature enough. (1) Should we
> remove the summary from StdLibraryFunctionsChecker when that happens? (2)
> Or should we have a declaration that ranges and nullability are checked
> with summaries while more specific things are checked in their respective
> checkers? (3) Or the specialized checkers could be dependencies to the
> generic StdLibraryFunctionsChecker, so any bug related to e.g. popen is
> prioritized to StreamChecker?
>

Yes, this this the big one. What role should batch modeling checkers play
in the analyzer alongside specific modeling checkers? Lets take an example
that has already presents a problem, and the most glaring of them, to me at
least, is stream modeling functions.

Take for instance the above mentioned fread. The *numerical* pre- and
postconditions can be enforced just as I detailed. However, the most
important effect of this function is of course the stream operation, what
happens depending on the validity of the stream object (which is not a
numerical property), and what happens to it after the call has finished,
successful or not. If the stream hits EOF during the read, the return value
is less then (2nd arg * 3rd arg), and is equal to it if the read was
successful. What this implies to me is that StdLibraryFunctionChecker
cannot really take on the responsibility of establishing a post condition,
**unless** the function is question can be described by numerical
properties only (like isalpha()).

This is something that this checker already does well I think. Its clear
that we don't want to expand it to cover non-numeric cases. So, onto the
topic that brought us here.

(2.) This sounds great. I think precondition checking is something that
introduces a lot of code duplication in checkers, and I don't think that
fread()ing a nullpointer is a stream mismanagement issue necessarily. A
notable challenge to overcome, however, is how can we ask StreamChecker to
add NoteTags from other checkers' reports (see:
https://reviews.llvm.org/D81407?id=270161#inline-750832).

(1.) I think the only thing we could do with this idea is to implement an
incomplete post condition modeling until a checker comes along and does it
properly. However, even this would only make sense if we have a single
post-condition (https://reviews.llvm.org/D79432#inline-757474).

(3.) If we go with option (2), we must introduce strong dependencies so
that numerical preconditions are checked beforehand. However, if we
*duplicate* the same checking in multiple checkers and want to use
dependencies to make the correct code runs first, we would introduce
deadcode, or possibly running the same check twice, and the problem of
inconsistently distributing the responsibility of analysis gets even worse.

> My other goal would be to make it possible to add argument constraints (or
> branches/cases) to any library functions. These libraries may never be
> modeled in the analyzer. In this case the above problem is non-existent.
> Perhaps, we should divide the checker to two different checkers once we
> reach that point, shouldn't we? We are very close to
> rename StdLibraryFunctionsChecker to LibraryFunctionsChecker anyway by now.
>

If we want to implement summary based analysis, it would only be fitting to
bear the name SpeculativeSummaryBasedFunctionModeling and
StdSummaryBasedFunctionModeling or something, right? ;) Anyway, this is the
least of the worries.

I agree, pure functions (like the above mentioned isalpha()) is the real
learning opportunities for generalized summary based analysis for the time
being.

> There is more. In `evalCall` if a checker returns true then that function
> is no longer processed by any subsequent checkers. That is the case with
> pure functions in StdLibraryFunctionsChecker. In this case, however, it
> would be anarchy if another checker also returns true from its own evalCall
> callback. Only one can rule. It would be great if we could enforce that no
> two checkers evaluate the same function this way. But I don't see any clear
> solution to achieve that right now, this is really challenging.
>

Yup, the problem is that checker callbacks from CheckerManager's
perspective (I might be wrong here!) takes a set of ingoing nodes and
returns a set of outgoing ones. So the solution might not be terribly
complicated, just ugly (large #ifs).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20200625/fd324565/attachment.html>