[cfe-dev] [analyzer][GSoC] Problem Statement: Improving BugReporter with static backward program slicing

Tue Apr 2 23:08:00 PDT 2019

Hi,

On Wed, 3 Apr 2019 at 02:14, Artem Dergachev <noqnoqneo at gmail.com> wrote:

> Ahaaaaa, ooooookkkkk, iiiiiiii seeeeeeeeee!
>
> Sounds fascinating, must-have and hard.
>
> My primary question is how much do you think this will put stress on the
> checkers. Like, how much more difficult would it be to write checkers when
> we require them to include the slicing criterion with their bug reports?
> How much of it would we (i.e., BugReporter) be able to infer ourselves?
>
> ------------------------
>
> Say, let's take Example 1. You describe the slicing criterion as:
>
>     (13, <dynamically allocated int object>)
>
> The value of the dynamically allocated into object remains the same
> regardless of whether the object is stored in global_ptr or not, so the
> slice doesn't need to include line 5 or 6. Therefore i think that the
> slicing criterion you proposed is not what we're looking for, and the real
> slicing criterion we're looking for is:
>
>     (13, <liveness of <dynamically allocated int object>>)
>
> Like, we're mapping every dynamically allocated object to an imaginary
> heap location that represents its current liveness, and include that
> imaginary variable, rather than the object itself, in the slicing
> criterion. Line 6 would affect liveness of the object on line 13 (if it's
> stored into a global, it's alive; otherwise it's not), therefore it'll be
> part of the slice. So, am i understanding correctly that you're proposing a
> checker API that'll look like this:
>
>   BugReport *R = new BugReport(Node, ...);
>   R->addLivenessBasedSlicingCriterion(HeapSym);
>
> ?
>
> I guess we can infer the statement of the slicing criterion from the Node,
> but i'm not entirely sure; see also below.
>
> I'd actually love it if you elaborate this example further because it's
> fairly interesting. Like, we know that the assignment affects the liveness
> information, but how would the slicing algorithm figure this out? Do you
> have a step-by-step description of how the algorithm behaves in this case?
>

I think leak like slicing criterion will be on of the hardest part of this
work. But leak related checks are likely to have isSuppressOnSink set to
true. I would suggest to not to use slicing for such checks until we figure
out the proper way. It would be really great if we found a way to manage
this without relying on liveness analysis, e.g. we could have a special
mode that preserves all the escapes of the value and the control
dependencies of those escapes (since the value of the allocated int could
be changed later on if its address was stored to the global ptr). I did not
study the subject in depth yet so let me know if you disagree.

>
> ------------------------
>
> Let's take another example i just came up with. Consider
> alpha.unix.PthreadLock - a checker that finds various bugs with mutexes,
> such as double locks or double unlocks. Consider code:
>
> 1  pthread_mutex_t mtx1, mtx2;
> 2
> 3  void foo(bool x1, bool x2) {
> 4    if (x1)
> 5      pthread_mutex_lock(&mtx1);
> 6    if (x2)
> 7      pthread_mutex_lock(&mtx1);
> 8    // ...
> 9  }
>
> In this example we'll report a double lock bug due to a copy-paste error:
> line 7 should probably lock &mtx2 rather than &mtx1.
>
> The whole program is relevant to the report. Without line 5, there would
> only be a single lock. It transitions the mutex object from state "unknown"
> to state "locked".
>
> I don't think the Analyzer would currently do a bad job at explaining the
> bug; it's pretty straightforward.
>
> What would be the slicing criterion in this case? As far as i understand,
> it'll be
>
>     (7, <locked-ness of mtx1>)
>
> In this case "lockedness" is, again, an "imaginary heap location" that
> contains metadata for the mutex. More specifically, it is a GDM map value.
> How would a checker API look in this case? How would it even describe to
> the BugReporter what to look at, given that currently only the checker
> understands how to read its GDM values?
>
> My random guess is:
>
>   BugReport *R = new BugReport(Node, ...);
>   R->addGDMBasedSlicingCriterion<MutexState>(MutexMR);
>

I am hoping for a very easy to use interface based on interesting
symbols/regions (i.e. some checks with visitors might work out of the box,
checks that did not mark anything interesting will not trigger the
slicing). I might be missing something, but if the user marked the memory
region for the mutex interesting thus we used "value of" the mutex as
slicing criterion, we should include both line 5 as the function might
mutate the value of the mutex.

Of course, it would still be possible for a checker author to shoot herself
in the foot by misusing interestingness. Maybe we could add some sanity
checks based on the statements at the bug report and uniqueing location.
But I think it is not possible to make such a feature bullet proof.

>
> ------------------------
>
> So it sounds to me that your project is, like many other awesome projects,
> brings in, as a dependency, replacing GDM with a better, smarter data map
> that the analyzer core *can* introspect and manipulate without the direct
> help from the checkers. It might be that your project actually requires
> relatively little amounts of such introspection - i.e., you might be able
> to get away with "it's some GDM map and the value for this key has changed,
> therefore this statement is a relevant data dependency" for quite a lot of
> checkers.
>
> That's the subject i was also bringing up as part of
> https://reviews.llvm.org/D59861 - "this was the plan that is part of the
> even-more-long-term plan of completely deprecating the void*-based Generic
> Data Map...". This was also something that became a blocker on my old
> attempt to introduce summary-based analysis. It required stuffing fairly
> large chunks of code into every checker that would teach the checker how to
> collect summary information and apply it when the summary is applied, and
> this new boilerplate was fairly brain-damaging to implement and hard to get
> right or even to explain. This problem also blocks other projects, such as
> state merging / loop widening (on one branch the mutex is locked, on the
> other branch it is unlocked, hey checker please teach me how to merge this).
>
> The variety of slicing criteria (that is a consequence of the variety of
> checkers that we have) sounds pretty scary to me. Some of them are really
> complicated, eg. the liveness criterion. Making sure that the generic
> algorithm works correctly with at least a significant chunk of them is
> going to be fun. Making sure it can actually handle arbitrary criterions
> required by the checkers without having every checker come with its own
> boilerplate for slicing also sounds hard. And it'll be very bad if we have
> to tell our checker developers "hey, by the way, you also need to know what
> backward slicing is and write down part of the algorithm in order to ever
> get your checker enabled by default" - i'm already feeling pretty bad
> explaining dead symbols and pointer escapes (which are pretty much
> must-have in most checkers), these are definitely examples of a boilerplate
> that a smarter version of GDM could handle automatically. In order to
> conquer the world, i think we should stick to our "writing a checker in 24
> hours" utopia: writing a checker should be as easy as writing down the
> transfer functions for relevant statements. In my opinion, we should take
> it as an important constraint.
>
>
>
>
> On 4/2/19 12:21 PM, Kristóf Umann wrote:
>
> Somehow the images and the attached files were left out, please find them
> here:
>
> On Tue, 2 Apr 2019 at 21:16, Kristóf Umann <dkszelethus at gmail.com> wrote:
>
>> Hi!
>>
>> In this letter, I'd like to describe a particular problem in the Clang
>> StaticAnalyzer's BugReporter class that I'd like to tackle in a Google
>> Summer of Code project this summer. I'll show real-world examples on some
>> of its shortcomings, and propose a potential solution using static backward
>> program slicing. At last, I'll ask some specific questions. I'd love to
>> hear any and all feedback on this!
>>
>> This is a *problem statement, *not a *proposal*. I plan to send out a
>> formal proposal by Friday (but not later then Saturday), that will contain
>> more details on both the problem and the solution. I don't introduce myself
>> or my previous works within the project, that will also be detailed in my
>> upcoming letter. I also plan to incorporate the feedback I'll receive to
>> this letter.
>>
>> *---=== BugReporter constructs bad reports ===---*
>>
>> *What does the BugReporter do?*
>>
>> After the Static Analyzer found an error, the BugReporter receives an
>> ExplodedNode, which, accompanied by its predecessors, contains all the
>> information needed to reproduce that error. This *bugpath *is then
>> shortened with a variety of heuristics, such as removing unrelated function
>> calls, unrelated branches and so on. BugReporter by the end of this
>> process will construct a PathDiagnostic object for each report, that is,
>> ideally, minimal.
>>
>> *Example 1.*
>>
>> Consider the following code example:
>>
>> 1  // leak.cpp
>> 2  int *global_ptr;
>> 3
>> 4  void save_ext(int storeIt, int *ptr) {
>> 5    if (storeIt)
>> 6      global_ptr = ptr;
>> 7  }
>> 8
>> 9  void test(int b) {
>> 10   int *myptr = new int;
>> 11   save_ext(b, myptr);
>> 12   delete global_ptr;
>> 13 }
>>
>> It's clear that if test is invoked with b's value set to true, there is
>> no error in the program. However, should b be false, we'll leak memory.
>>
>> $ clang -cc1 -analyze -analyzer-checker=core,cplusplus leak.cpp
>> [image: image.png]
>>
>> The Static Analyzer is able to catch this error, but fails to mention the
>> call to save_ext entirely, despite the error only occurring because the
>> analyzer assumed that storeIt is false. I've also attached the exploded
>> graph leak.svg that demonstrates this.
>>
>> *Example 2.*
>>
>> Consider the following code example:
>>
>> 1  // divbyzero.cpp
>> 2  void f() {
>> 3    int i = 0;
>> 4    (void) (10 / i);
>> 5  }
>> 6
>> 7  void g() { f(); }
>> 8  void h() { g(); }
>> 9  void j() { h(); }
>> 10 void k() { j(); }
>>
>> Its clear that a call to f will result in a division by zero error.
>>
>> $ clang -cc1 -analyze -analyzer-checker=core divbyzero.cpp
>> [image: image.png]
>> Again, the Static Analyzer is plenty smart enough to catch this, but the
>> constructed bug report is littered with a lot of useless information -- it
>> would be enough to only show the body of f, and, optionally where f itself
>> was called. For the sake of completeness, I've attached divbyzero.svg that
>> contains the exploded graph for the above code snippet.
>>
>> The above examples demonstrate that BugReporter sometimes reduces the
>> bugpath too much or too little.
>>
>> *---=== Solving the above problem with static backward program slicing
>> ===---*
>>
>> *What is static backward program slicing?*
>>
>> A *program slice* consists of the parts of a program that (potentially)
>> affect the values computed at some point of interest, called the slicing
>> criterion. *Program slicing* is a decomposition technique that elides
>> program components not relevant to the slicing criterion (which is a pair
>> of (statement, set of variables)), creating a program slice[1][2].
>> *Static *slicing preserves the meaning of the variable(s) in the slicing
>> criterion for all possible inputs[1]. *Backward *slices answer the
>> question “what program components might affect a selected computation?”[1]
>>
>> While statement-minimal slices are not necessarily unique[3], Weisel
>> developed a popular algorithm that constructs one. In essence, his
>> fix-point algorithm constructs sets of *relevant variables* for each
>> edge in between node *i* and node *j* in a CFG graph, from which he
>> constructs *relevant statements*. The fix-point of the relevant
>> statements set is the slice itself.
>>
>> One of the characteristic of his algorithm is that the resulting program
>> slice will be executable. However, our problem doesn't require the code to
>> be executable, so we could use a more "aggressive" approach that creates a
>> smaller slice. An improvement to his algorithm is presented in [4].
>>
>> *How does this relate to BugReporter?*
>>
>> We can show that using Weiser's algorithm, issues raised in Example 1.
>> and Example 2. can be improved upon.
>>
>> For example 1., the statement-minimal program slice with the criterion
>> (13, <dynamically allocated int object>) will contain the statements 5
>> and 6, and for example 2., the statement-minimal program slice with the
>> criterion (4, i) won't contain anything but statements 3 and 4. For the
>> latter, we can even improve the algorithm to also contain statement 7,
>> where a call to f is made.
>>
>> The analyzer, as stated earlier, gives BugReporter an ExplodedNode, from
>> which the slicing criterion must be constructed. The statement
>> corresponding to this node, coupled with the interesting regions the
>> checker itself marked could be used to construct this slicing criterion.
>>
>> *Challenges*
>>
>> While the algorithm Weiser developed along with the improvements made by
>> others are interprocedural, I would imagine that in implementation, it
>> would be a challenging step from an intraprocedural prototype.
>>
>> Several articles also describe pointers, references, and dynamically
>> allocated regions, as well as gotos and other tricky parts of the language,
>> but I still expect to see some skeletons falling out of the closet when
>> implementing this for C++, not only C.
>>
>> *Drawbacks*
>>
>> Static slicing, as an algorithm that doesn't know anything about input
>> values, suffers from the same issues that all static techniques do, meaning
>> that without heuristics, it'll have to do very rough guesses, possibly
>> leaving a far too big program slice. However, with the symbolic values the
>> analyzer holds, this could be improved, turning this into *conditioned
>> slicing, *as described in [1]. This however is only planned as a
>> followup work after GSoC.
>>
>> For this reason, this project would be developed as an alternative
>> approach to some of the techniques used in BugReporter, as an optional
>> off-by-default analyzer feature.
>>
>> *---=== Questions ===---*
>>
>> What do you think of this approach?
>> Do you think that implementing this algorithm is achievable, but tough
>> enough task for GSoC?
>> Would you prefer to see a general program slicing library, or an
>> analyzer-specific implementation? Traversing the ExplodedGraph would be
>> far easier in terms of what I want to achieve, but a more general approach
>> that traverses the CFG (like llvm::DominatorTree[5]) could be beneficial
>> to more developers, but possibly at the cost of not being able to improve
>> the prototype with the symbolic value information the analyzer holds.
>>
>> *References, links*
>>
>> [1] Gallagher, Keith, and David Binkley. "Program slicing." *2008
>> Frontiers of Software Maintenance*.
>> [2] Tip, Frank. *A survey of program slicing techniques*. Centrum voor
>> Wiskunde en Informatica, 1994.
>> [3] Weiser, Mark. "Program slicing." *Proceedings of the 5th
>> international conference on Software engineering*. IEEE Press, 1981.
>> [4] Binkley, David. "Precise executable interprocedural slices." *ACM
>> Letters on Programming Languages and Systems (LOPLAS)* 2.1-4 (1993):
>> 31-45.
>> [5] http://llvm.org/doxygen/classllvm_1_1DominatorTree.html
>>
>> Link to previous GSoC related letter I sent:
>> http://lists.llvm.org/pipermail/cfe-dev/2019-February/061464.html
>>
>> Cheers,
>> Kristóf Umann
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190403/800e8ce9/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 126147 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190403/800e8ce9/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 171712 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190403/800e8ce9/attachment-0001.png>