[cfe-dev] [analyzer][GSoC] Problem Statement: Improving BugReporter with static backward program slicing

Tue Apr 2 17:14:08 PDT 2019

Ahaaaaa, ooooookkkkk, iiiiiiii seeeeeeeeee!

Sounds fascinating, must-have and hard.

My primary question is how much do you think this will put stress on the 
checkers. Like, how much more difficult would it be to write checkers 
when we require them to include the slicing criterion with their bug 
reports? How much of it would we (i.e., BugReporter) be able to infer 
ourselves?

------------------------

Say, let's take Example 1. You describe the slicing criterion as:

     (13, <dynamically allocated int object>)

The value of the dynamically allocated into object remains the same 
regardless of whether the object is stored in global_ptr or not, so the 
slice doesn't need to include line 5 or 6. Therefore i think that the 
slicing criterion you proposed is not what we're looking for, and the 
real slicing criterion we're looking for is:

     (13, <liveness of <dynamically allocated int object>>)

Like, we're mapping every dynamically allocated object to an imaginary 
heap location that represents its current liveness, and include that 
imaginary variable, rather than the object itself, in the slicing 
criterion. Line 6 would affect liveness of the object on line 13 (if 
it's stored into a global, it's alive; otherwise it's not), therefore 
it'll be part of the slice. So, am i understanding correctly that you're 
proposing a checker API that'll look like this:

   BugReport *R = new BugReport(Node, ...);
   R->addLivenessBasedSlicingCriterion(HeapSym);

?

I guess we can infer the statement of the slicing criterion from the 
Node, but i'm not entirely sure; see also below.

I'd actually love it if you elaborate this example further because it's 
fairly interesting. Like, we know that the assignment affects the 
liveness information, but how would the slicing algorithm figure this 
out? Do you have a step-by-step description of how the algorithm behaves 
in this case?

------------------------

Let's take another example i just came up with. Consider 
alpha.unix.PthreadLock - a checker that finds various bugs with mutexes, 
such as double locks or double unlocks. Consider code:

1  pthread_mutex_t mtx1, mtx2;
2
3  void foo(bool x1, bool x2) {
4    if (x1)
5      pthread_mutex_lock(&mtx1);
6    if (x2)
7      pthread_mutex_lock(&mtx1);
8    // ...
9  }

In this example we'll report a double lock bug due to a copy-paste 
error: line 7 should probably lock &mtx2 rather than &mtx1.

The whole program is relevant to the report. Without line 5, there would 
only be a single lock. It transitions the mutex object from state 
"unknown" to state "locked".

I don't think the Analyzer would currently do a bad job at explaining 
the bug; it's pretty straightforward.

What would be the slicing criterion in this case? As far as i 
understand, it'll be

     (7, <locked-ness of mtx1>)

In this case "lockedness" is, again, an "imaginary heap location" that 
contains metadata for the mutex. More specifically, it is a GDM map 
value. How would a checker API look in this case? How would it even 
describe to the BugReporter what to look at, given that currently only 
the checker understands how to read its GDM values?

My random guess is:

   BugReport *R = new BugReport(Node, ...);
   R->addGDMBasedSlicingCriterion<MutexState>(MutexMR);

------------------------

So it sounds to me that your project is, like many other awesome 
projects, brings in, as a dependency, replacing GDM with a better, 
smarter data map that the analyzer core *can* introspect and manipulate 
without the direct help from the checkers. It might be that your project 
actually requires relatively little amounts of such introspection - 
i.e., you might be able to get away with "it's some GDM map and the 
value for this key has changed, therefore this statement is a relevant 
data dependency" for quite a lot of checkers.

That's the subject i was also bringing up as part of 
https://reviews.llvm.org/D59861 - "this was the plan that is part of the 
even-more-long-term plan of completely deprecating the void*-based 
Generic Data Map...". This was also something that became a blocker on 
my old attempt to introduce summary-based analysis. It required stuffing 
fairly large chunks of code into every checker that would teach the 
checker how to collect summary information and apply it when the summary 
is applied, and this new boilerplate was fairly brain-damaging to 
implement and hard to get right or even to explain. This problem also 
blocks other projects, such as state merging / loop widening (on one 
branch the mutex is locked, on the other branch it is unlocked, hey 
checker please teach me how to merge this).

The variety of slicing criteria (that is a consequence of the variety of 
checkers that we have) sounds pretty scary to me. Some of them are 
really complicated, eg. the liveness criterion. Making sure that the 
generic algorithm works correctly with at least a significant chunk of 
them is going to be fun. Making sure it can actually handle arbitrary 
criterions required by the checkers without having every checker come 
with its own boilerplate for slicing also sounds hard. And it'll be very 
bad if we have to tell our checker developers "hey, by the way, you also 
need to know what backward slicing is and write down part of the 
algorithm in order to ever get your checker enabled by default" - i'm 
already feeling pretty bad explaining dead symbols and pointer escapes 
(which are pretty much must-have in most checkers), these are definitely 
examples of a boilerplate that a smarter version of GDM could handle 
automatically. In order to conquer the world, i think we should stick to 
our "writing a checker in 24 hours" utopia: writing a checker should be 
as easy as writing down the transfer functions for relevant statements. 
In my opinion, we should take it as an important constraint.

On 4/2/19 12:21 PM, Kristóf Umann wrote:
> Somehow the images and the attached files were left out, please find 
> them here:
>
> On Tue, 2 Apr 2019 at 21:16, Kristóf Umann <dkszelethus at gmail.com 
> <mailto:dkszelethus at gmail.com>> wrote:
>
>     Hi!
>
>     In this letter, I'd like to describe a particular problem in the
>     Clang StaticAnalyzer's BugReporter class that I'd like to tackle
>     in a Google Summer of Code project this summer. I'll show
>     real-world examples on some of its shortcomings, and propose a
>     potential solution using static backward program slicing. At last,
>     I'll ask some specific questions. I'd love to hear any and all
>     feedback on this!
>
>     This is a /problem statement, /not a /proposal/. I plan to send
>     out a formal proposal by Friday (but not later then Saturday),
>     that will contain more details on both the problem and the
>     solution. I don't introduce myself or my previous works within the
>     project, that will also be detailed in my upcoming letter. I also
>     plan to incorporate the feedback I'll receive to this letter.
>
>     *---=== BugReporter constructs bad reports ===---*
>     *
>     *
>     *What does the BugReporter do?*
>
>     After the Static Analyzer found an error, the BugReporter receives
>     an ExplodedNode, which, accompanied by its predecessors, contains
>     all the information needed to reproduce that error. This /bugpath
>     /is then shortened with a variety of heuristics, such as removing
>     unrelated function calls, unrelated branches and so on.
>     BugReporter by the end of this process will construct a
>     PathDiagnostic object for each report, that is, ideally, minimal.
>     *
>     *
>     *Example 1.*
>
>     Consider the following code example:
>
>     1 // leak.cpp
>     2 int *global_ptr;
>     3
>     4 void save_ext(int storeIt, int *ptr) {
>     5   if (storeIt)
>     6     global_ptr = ptr;
>     7 }
>     8
>     9 void test(int b) {
>     10  int *myptr = new int;
>     11  save_ext(b, myptr);
>     12  delete global_ptr;
>     13 }
>
>     It's clear that if test is invoked with b's value set to true,
>     there is no error in the program. However, should b be false,
>     we'll leak memory.
>
>     $ clang -cc1 -analyze -analyzer-checker=core,cplusplus leak.cpp
>     image.png
>
>     The Static Analyzer is able to catch this error, but fails to
>     mention the call to save_ext entirely, despite the error only
>     occurring because the analyzer assumed that storeIt is false. I've
>     also attached the exploded graph leak.svg that demonstrates this.
>
>     *Example 2.*
>
>     Consider the following code example:
>
>     1  // divbyzero.cpp
>     2  void f() {
>     3    int i = 0;
>     4    (void) (10 / i);
>     5  }
>     6
>     7  void g() { f(); }
>     8  void h() { g(); }
>     9  void j() { h(); }
>     10 void k() { j(); }
>
>     Its clear that a call to f will result in a division by zero error.
>
>     $ clang -cc1 -analyze -analyzer-checker=core divbyzero.cpp
>     image.png
>     Again, the Static Analyzer is plenty smart enough to catch this,
>     but the constructed bug report is littered with a lot of useless
>     information -- it would be enough to only show the body of f, and,
>     optionally where f itself was called. For the sake of
>     completeness, I've attached divbyzero.svg that contains the
>     exploded graph for the above code snippet.
>
>     The above examples demonstrate that BugReporter sometimes reduces
>     the bugpath too much or too little.
>
>     *---=== Solving the above problem with static backward program
>     slicing ===---*
>
>     *What is static backward program slicing?*
>
>     A /program slice/ consists of the parts of a program that
>     (potentially) affect the values computed at some point of
>     interest, called the slicing criterion. /*Program slicing*/ is a
>     decomposition technique that elides program components not
>     relevant to the slicing criterion (which is a pair of (statement,
>     set of variables)), creating a program slice[1][2]. *Static
>     *slicing preserves the meaning of the variable(s) in the slicing
>     criterion for all possible inputs[1]. *Backward *slices answer the
>     question “what program components might affect a selected
>     computation?”[1]
>
>     While statement-minimal slices are not necessarily unique[3],
>     Weisel developed a popular algorithm that constructs one. In
>     essence, his fix-point algorithm constructs sets of /relevant
>     variables///for each edge in between node /i/ and node /j/ in a
>     CFG graph, from which he constructs /relevant statements/. The
>     fix-point of the relevant statements set is the slice itself.
>
>     One of the characteristic of his algorithm is that the resulting
>     program slice will be executable. However, our problem doesn't
>     require the code to be executable, so we could use a more
>     "aggressive" approach that creates a smaller slice. An improvement
>     to his algorithm is presented in [4].
>     *
>     *
>     *How does this relate to BugReporter?*
>
>     We can show that using Weiser's algorithm, issues raised in
>     Example 1. and Example 2. can be improved upon.
>
>     For example 1., the statement-minimal program slice with the
>     criterion (13, <dynamically allocated int object>) will contain
>     the statements 5 and 6, and for example 2., the statement-minimal
>     program slice with the criterion (4,i) won't contain anything but
>     statements 3 and 4. For the latter, we can even improve the
>     algorithm to also contain statement 7, where a call to f is made.
>
>     The analyzer, as stated earlier, gives BugReporter an
>     ExplodedNode, from which the slicing criterion must be
>     constructed. The statement corresponding to this node, coupled
>     with the interesting regions the checker itself marked could be
>     used to construct this slicing criterion.
>
>     *Challenges*
>
>     While the algorithm Weiser developed along with the improvements
>     made by others are interprocedural, I would imagine that in
>     implementation, it would be a challenging step from an
>     intraprocedural prototype.
>
>     Several articles also describe pointers, references, and
>     dynamically allocated regions, as well as gotos and other tricky
>     parts of the language, but I still expect to see some skeletons
>     falling out of the closet when implementing this for C++, not only C.
>
>     *Drawbacks*
>
>     Static slicing, as an algorithm that doesn't know anything about
>     input values, suffers from the same issues that all static
>     techniques do, meaning that without heuristics, it'll have to do
>     very rough guesses, possibly leaving a far too big program slice.
>     However, with the symbolic values the analyzer holds, this could
>     be improved, turning this into /conditioned slicing, /as described
>     in [1]. This however is only planned as a followup work after GSoC.
>
>     For this reason, this project would be developed as an alternative
>     approach to some of the techniques used in BugReporter, as an
>     optional off-by-default analyzer feature.
>
>     *---=== Questions ===---*
>
>     What do you think of this approach?
>     Do you think that implementing this algorithm is achievable, but
>     tough enough task for GSoC?
>     Would you prefer to see a general program slicing library, or an
>     analyzer-specific implementation? Traversing the ExplodedGraph
>     would be far easier in terms of what I want to achieve, but a more
>     general approach that traverses the CFG (like
>     llvm::DominatorTree[5]) could be beneficial to more developers,
>     but possibly at the cost of not being able to improve the
>     prototype with the symbolic value information the analyzer holds.
>
>     */References, links/*
>
>     [1] Gallagher, Keith, and David Binkley. "Program slicing." /2008
>     Frontiers of Software Maintenance/.
>     [2] Tip, Frank. /A survey of program slicing techniques/. Centrum
>     voor Wiskunde en Informatica, 1994.
>     [3] Weiser, Mark. "Program slicing." /Proceedings of the 5th
>     international conference on Software engineering/. IEEE Press, 1981.
>     [4] Binkley, David. "Precise executable interprocedural slices."
>     /ACM Letters on Programming Languages and Systems (LOPLAS)/ 2.1-4
>     (1993): 31-45.
>     [5] http://llvm.org/doxygen/classllvm_1_1DominatorTree.html
>
>     Link to previous GSoC related letter I sent:
>     http://lists.llvm.org/pipermail/cfe-dev/2019-February/061464.html
>
>     Cheers,
>     Kristóf Umann
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190402/c64723e2/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 126147 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190402/c64723e2/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 171712 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190402/c64723e2/attachment-0001.png>