[cfe-dev] [Analyzer][RFC] Function Summaries

Mon Feb 3 09:35:52 PST 2020

Ahaa, great, i'd love to see where it goes!!

When doing a file format for summaries, please try to somehow coordinate 
with the current work on taint checker yaml files. I don't want multiple 
incompatible formats to be circulating around :)

One of the fresher thoughts that i've had recently on automatically 
generated summaries was re-using The Attributor 
(https://llvm.org/devmtg/2019-10/talk-abstracts.html#tech24) to produce 
function summaries that the static analyzer can import and rely upon.

If you go for automatically generated summaries, please keep in mind the 
following potential problems:

(a) We do not want checker API to become significantly more complicated 
than it currently is. It is very likely that generating checker-specific 
summaries of a function would require very non-trivial checker-side 
support that'll be too hard for a typical checker developer to handle. 
Please talk to me once you get there; i'll try to throw in a few 
examples to see if your approach is lightweight and general enough.

(b) Please understand the problem of "inlined defensive checks" that we 
have with inlining-based IPA. Basically, when modeling a function call, 
you need to have a good intuition about whether multiple branches within 
the function imply multiple branches in the caller context. Like, it's 
fine to warn that passing a null pointer into foo() causes a null 
dereference, but it's not fine to emit null dereference warnings upon 
dereferencing a pointer simply because that pointer was previously 
passed into function bar() that checked it for null but has long exited 
since then. Most of the time the ideal solution is to have exactly one 
branch in the summary. Functions like `isalpha` are a rare exception. 
Most of the time the state split is not justified because there's no 
indication on the caller side that a certain callee branch is 
necessarily taken, so we should not keep these assumptions in the state 
after the function has exited, so they shouldn't be part of the summary.

In other news - did you ever consider using ASTImporter for generating 
function body ASTs from hand-written code? Like BodyFarms but without 
constructing the AST by hand. I imagine that the advantage here is that 
the solution is more reliable than ASTImporter-based CTU because you can 
easily make sure that a certain hand-written code under your control is 
always imported correctly (as opposed to arbitrary code written by the 
users). This is probably not that hard to implement and that'd speed up 
our progress on supporting functions that generally benefit from body 
farms, such as atomic intrinsics. I.e.:

   const Stmt *CompareAndSwapBody = ASTImporter.import("""
     bool compare_and_swap(int *p, int old, int new) {
     if (*p != old)
         return false;
     *p = new;
     return true;
   }
   """);

?

On 2/3/20 7:22 PM, Gábor Márton via cfe-dev wrote:
> Hi,
>
> This year we plan to experiment with function summaries in CSA. We 
> have a budge roughly for 2 people to work on this.
>
> Short-term plans:
> ^^^^^^^^^^^^^^^^^
> We'd like to be able to read and use hand-written function summaries. 
> A good example of such hand-written summaries are in Cppcheck 
> <https://github.com/danmar/cppcheck/blob/master/cfg/std.cfg>. The 
> existing StdLibraryFunctionsChecker is the entry point for our 
> upcoming changes. I have created a patch which eliminates the macros 
> in that checker, this way the used classes like `ValueRange` become 
> extensible [https://reviews.llvm.org/D73897]. E.g. we can add an 
> argument constraint that must be true on all branches, if the 
> constraint fails then we warn [https://reviews.llvm.org/D73898]. (Note 
> that these patches are still WIP.)
>
> What could be in a summary?
> - Branches
>   * E.g. for f(x), if x is in ['0', '9'] then ret is not 0.
>   * This is already handled in StdLibraryFunctionsChecker.
> - Argument constraints (a contsraint that must be true on all branches)
>   * Range constraint: E.g. x must be in [0, 255] on all branches. This 
> is the
>     case for instance with isalpha(x) where if x is out of the range 
> then the
>     program is ill-formed (undefined behaviour) according to the C 
> standard.
>   * Size requirements
>     E.g.: asctime_s(char *buf, rsize_t bufsz, const struct tm *time_ptr);
>     `buf` size must be at least `bufsz`.
>   * Not-null
>   * Not-uninitalized
>   * Not-tainted (?)
> - Return value properties (?)
>   * E.g.: the return value must be checked always.
>     void foo() {
>       void *P = aligned_alloc(4, 8);
>       if (P == NULL) { // OK
>       }
>     }
>     void bar() {
>       void *P = aligned_alloc(4, 8);
>       void *X = P + 1; // expected-warning{{Use of return value that 
> was not checked}}
>     }
>     Storing this information in summaries could provide configuration 
> for checkers like this https://reviews.llvm.org/D72705.
>     Also statistics based checkers 
> <https://llvm.org/devmtg/2019-04/slides/Lightning-Balogh-Statistics_based_checkers_in_the_Clang_Static_Analyzer.pdf> 
> could also benefit from this information.
>
> Perhaps we should come up with a summary description file format (e.g. 
> json or
> whatever) that could be parsed by the future variant of the existing
> StdLibraryFunctionsChecker.
>
> After we have a strong foundation of handling existing summaries then 
> we can
> move on to generate summaries automatically. We do believe that 
> summary-based
> analysis could have a strong added value only if we are capable of
> automatically synthesizing the summaries.
>
> Mid-term plans:
> ^^^^^^^^^^^^^^^
> Automatically create summaries based on the methods described in SMART 
> and Microsoft PEX [Godefroid2007, Godefroid2008].
> The basic idea is that we create summaries on-demand during the 
> analysis and then later we try to reuse them if the calling-context 
> allows that.
> This is something like an advanced caching of the effects of function 
> calls.
> A summary is in the form [pre]f(x)[post], where `pre` is the 
> preconditions over function inputs and `post` is postconditions over 
> function outputs.
> There are many open questions and concerns here. E.g. about 
> performance: probably we should use an SMT solver to decide whether we 
> can apply a summary.
> Also, the mentioned publications do runtime execution to drive the 
> symbolic execution towards certain paths.
> Naturally we cannot do that, but we hope that the worklist based 
> driving could have a similar role.
>
>
> Long-term plans:
> ^^^^^^^^^^^^^^^^
> Create summaries based on the ideas in Facebook Infer's publications 
> [FBInfer2005, FBInfer2020].
> The basic idea here is to create summaries the are free from the 
> calling-context.
> This can be done only with a special data-flow computation and with a 
> proofing system.
> This solution would render the analysis to a two-phase analysis.
>
>
> References:
> ^^^^^^^^^^^
> [Godefroid2007] Godefroid, Patrice. "Compositional dynamic test 
> generation." Proceedings of the 34th annual ACM SIGPLAN-SIGACT 
> symposium on Principles of programming languages. 2007.
> [Godefroid2008] Anand, Saswat, Patrice Godefroid, and Nikolai 
> Tillmann. "Demand-driven compositional symbolic execution." 
> International Conference on Tools and Algorithms for the Construction 
> and Analysis of Systems. Springer, Berlin, Heidelberg, 2008.
> [FBInfer2005] Berdine, Josh, Cristiano Calcagno, and Peter W. O’hearn. 
> "Symbolic execution with separation logic." Asian Symposium on 
> Programming Languages and Systems. Springer, Berlin, Heidelberg, 2005.
> [FBInfer2020] O'Hearn, Peter W. "Incorrectness logic." Proceedings of 
> the ACM on Programming Languages 4.POPL (2019): 1-32.
>
> Thanks,
> Gabor
>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev