[cfe-dev] [analyzer][GSoC] Problem Statement: Improving BugReporter with static backward program slicing

Tue Apr 2 23:23:00 PDT 2019

Yeah, this particular example would be fixed, but we do not know how noisy
it would be in general. One way to make it a bit less noisy could be
something like what the NoStoreFuncVisitor already does for IVars. We could
also add some syntactic checks to see if the method/function actually has
the possibility to escape the region.

I am not familiar with Obj-C, but I suspect it would be possible to trick
NoStoreFuncVisitor, or at least the syntactic checks in `
potentiallyWritesIntoIvar `.
I think the main value of slicing here could be to get rid of those fully
syntactic heuristics and replace them by something smarter. Does this make
sense?

On Wed, 3 Apr 2019 at 02:25, Artem Dergachev <noqnoqneo at gmail.com> wrote:

> One more question about Example 1. Do you think this particular example
> could be solved by developing some sort of NoStoreFuncVisitor but for
> MallocChecker? I.e., in addition to emitting notes like "returning without
> initializing variable `x`", it could emit notes like "returning without
> escaping pointer `p`". If such visitor is developed, what would be the
> differences between your proposal and whatever behavior we'll get with such
> visitor? In other words, is it possible to construct an example with the
> "use of uninitialized variable" checker that, like Example 1, has vital
> pieces of information missing, given that this checker already uses
> NoStoreFuncVisitor?
>
> I mean, it's most likely a terrible idea to develop such visitor for
> MallocChecker, because the only reason this visitor works more or less
> reliably is the existence of const-qualifiers. For MallocChecker they don't
> exist, so it's going to be hard to figure out if a function was *supposed*
> to escape the pointer.
>
> On 4/2/19 5:14 PM, Artem Dergachev wrote:
>
> Ahaaaaa, ooooookkkkk, iiiiiiii seeeeeeeeee!
>
> Sounds fascinating, must-have and hard.
>
> My primary question is how much do you think this will put stress on the
> checkers. Like, how much more difficult would it be to write checkers when
> we require them to include the slicing criterion with their bug reports?
> How much of it would we (i.e., BugReporter) be able to infer ourselves?
>
> ------------------------
>
> Say, let's take Example 1. You describe the slicing criterion as:
>
>     (13, <dynamically allocated int object>)
>
> The value of the dynamically allocated into object remains the same
> regardless of whether the object is stored in global_ptr or not, so the
> slice doesn't need to include line 5 or 6. Therefore i think that the
> slicing criterion you proposed is not what we're looking for, and the real
> slicing criterion we're looking for is:
>
>     (13, <liveness of <dynamically allocated int object>>)
>
> Like, we're mapping every dynamically allocated object to an imaginary
> heap location that represents its current liveness, and include that
> imaginary variable, rather than the object itself, in the slicing
> criterion. Line 6 would affect liveness of the object on line 13 (if it's
> stored into a global, it's alive; otherwise it's not), therefore it'll be
> part of the slice. So, am i understanding correctly that you're proposing a
> checker API that'll look like this:
>
>   BugReport *R = new BugReport(Node, ...);
>   R->addLivenessBasedSlicingCriterion(HeapSym);
>
> ?
>
> I guess we can infer the statement of the slicing criterion from the Node,
> but i'm not entirely sure; see also below.
>
> I'd actually love it if you elaborate this example further because it's
> fairly interesting. Like, we know that the assignment affects the liveness
> information, but how would the slicing algorithm figure this out? Do you
> have a step-by-step description of how the algorithm behaves in this case?
>
> ------------------------
>
> Let's take another example i just came up with. Consider
> alpha.unix.PthreadLock - a checker that finds various bugs with mutexes,
> such as double locks or double unlocks. Consider code:
>
> 1  pthread_mutex_t mtx1, mtx2;
> 2
> 3  void foo(bool x1, bool x2) {
> 4    if (x1)
> 5      pthread_mutex_lock(&mtx1);
> 6    if (x2)
> 7      pthread_mutex_lock(&mtx1);
> 8    // ...
> 9  }
>
> In this example we'll report a double lock bug due to a copy-paste error:
> line 7 should probably lock &mtx2 rather than &mtx1.
>
> The whole program is relevant to the report. Without line 5, there would
> only be a single lock. It transitions the mutex object from state "unknown"
> to state "locked".
>
> I don't think the Analyzer would currently do a bad job at explaining the
> bug; it's pretty straightforward.
>
> What would be the slicing criterion in this case? As far as i understand,
> it'll be
>
>     (7, <locked-ness of mtx1>)
>
> In this case "lockedness" is, again, an "imaginary heap location" that
> contains metadata for the mutex. More specifically, it is a GDM map value.
> How would a checker API look in this case? How would it even describe to
> the BugReporter what to look at, given that currently only the checker
> understands how to read its GDM values?
>
> My random guess is:
>
>   BugReport *R = new BugReport(Node, ...);
>   R->addGDMBasedSlicingCriterion<MutexState>(MutexMR);
>
> ------------------------
>
> So it sounds to me that your project is, like many other awesome projects,
> brings in, as a dependency, replacing GDM with a better, smarter data map
> that the analyzer core *can* introspect and manipulate without the direct
> help from the checkers. It might be that your project actually requires
> relatively little amounts of such introspection - i.e., you might be able
> to get away with "it's some GDM map and the value for this key has changed,
> therefore this statement is a relevant data dependency" for quite a lot of
> checkers.
>
> That's the subject i was also bringing up as part of
> https://reviews.llvm.org/D59861 - "this was the plan that is part of the
> even-more-long-term plan of completely deprecating the void*-based Generic
> Data Map...". This was also something that became a blocker on my old
> attempt to introduce summary-based analysis. It required stuffing fairly
> large chunks of code into every checker that would teach the checker how to
> collect summary information and apply it when the summary is applied, and
> this new boilerplate was fairly brain-damaging to implement and hard to get
> right or even to explain. This problem also blocks other projects, such as
> state merging / loop widening (on one branch the mutex is locked, on the
> other branch it is unlocked, hey checker please teach me how to merge this).
>
> The variety of slicing criteria (that is a consequence of the variety of
> checkers that we have) sounds pretty scary to me. Some of them are really
> complicated, eg. the liveness criterion. Making sure that the generic
> algorithm works correctly with at least a significant chunk of them is
> going to be fun. Making sure it can actually handle arbitrary criterions
> required by the checkers without having every checker come with its own
> boilerplate for slicing also sounds hard. And it'll be very bad if we have
> to tell our checker developers "hey, by the way, you also need to know what
> backward slicing is and write down part of the algorithm in order to ever
> get your checker enabled by default" - i'm already feeling pretty bad
> explaining dead symbols and pointer escapes (which are pretty much
> must-have in most checkers), these are definitely examples of a boilerplate
> that a smarter version of GDM could handle automatically. In order to
> conquer the world, i think we should stick to our "writing a checker in 24
> hours" utopia: writing a checker should be as easy as writing down the
> transfer functions for relevant statements. In my opinion, we should take
> it as an important constraint.
>
>
>
>
> On 4/2/19 12:21 PM, Kristóf Umann wrote:
>
> Somehow the images and the attached files were left out, please find them
> here:
>
> On Tue, 2 Apr 2019 at 21:16, Kristóf Umann <dkszelethus at gmail.com> wrote:
>
>> Hi!
>>
>> In this letter, I'd like to describe a particular problem in the Clang
>> StaticAnalyzer's BugReporter class that I'd like to tackle in a Google
>> Summer of Code project this summer. I'll show real-world examples on some
>> of its shortcomings, and propose a potential solution using static backward
>> program slicing. At last, I'll ask some specific questions. I'd love to
>> hear any and all feedback on this!
>>
>> This is a *problem statement, *not a *proposal*. I plan to send out a
>> formal proposal by Friday (but not later then Saturday), that will contain
>> more details on both the problem and the solution. I don't introduce myself
>> or my previous works within the project, that will also be detailed in my
>> upcoming letter. I also plan to incorporate the feedback I'll receive to
>> this letter.
>>
>> *---=== BugReporter constructs bad reports ===---*
>>
>> *What does the BugReporter do?*
>>
>> After the Static Analyzer found an error, the BugReporter receives an
>> ExplodedNode, which, accompanied by its predecessors, contains all the
>> information needed to reproduce that error. This *bugpath *is then
>> shortened with a variety of heuristics, such as removing unrelated function
>> calls, unrelated branches and so on. BugReporter by the end of this
>> process will construct a PathDiagnostic object for each report, that is,
>> ideally, minimal.
>>
>> *Example 1.*
>>
>> Consider the following code example:
>>
>> 1  // leak.cpp
>> 2  int *global_ptr;
>> 3
>> 4  void save_ext(int storeIt, int *ptr) {
>> 5    if (storeIt)
>> 6      global_ptr = ptr;
>> 7  }
>> 8
>> 9  void test(int b) {
>> 10   int *myptr = new int;
>> 11   save_ext(b, myptr);
>> 12   delete global_ptr;
>> 13 }
>>
>> It's clear that if test is invoked with b's value set to true, there is
>> no error in the program. However, should b be false, we'll leak memory.
>>
>> $ clang -cc1 -analyze -analyzer-checker=core,cplusplus leak.cpp
>> [image: image.png]
>>
>> The Static Analyzer is able to catch this error, but fails to mention the
>> call to save_ext entirely, despite the error only occurring because the
>> analyzer assumed that storeIt is false. I've also attached the exploded
>> graph leak.svg that demonstrates this.
>>
>> *Example 2.*
>>
>> Consider the following code example:
>>
>> 1  // divbyzero.cpp
>> 2  void f() {
>> 3    int i = 0;
>> 4    (void) (10 / i);
>> 5  }
>> 6
>> 7  void g() { f(); }
>> 8  void h() { g(); }
>> 9  void j() { h(); }
>> 10 void k() { j(); }
>>
>> Its clear that a call to f will result in a division by zero error.
>>
>> $ clang -cc1 -analyze -analyzer-checker=core divbyzero.cpp
>> [image: image.png]
>> Again, the Static Analyzer is plenty smart enough to catch this, but the
>> constructed bug report is littered with a lot of useless information -- it
>> would be enough to only show the body of f, and, optionally where f itself
>> was called. For the sake of completeness, I've attached divbyzero.svg that
>> contains the exploded graph for the above code snippet.
>>
>> The above examples demonstrate that BugReporter sometimes reduces the
>> bugpath too much or too little.
>>
>> *---=== Solving the above problem with static backward program slicing
>> ===---*
>>
>> *What is static backward program slicing?*
>>
>> A *program slice* consists of the parts of a program that (potentially)
>> affect the values computed at some point of interest, called the slicing
>> criterion. *Program slicing* is a decomposition technique that elides
>> program components not relevant to the slicing criterion (which is a pair
>> of (statement, set of variables)), creating a program slice[1][2].
>> *Static *slicing preserves the meaning of the variable(s) in the slicing
>> criterion for all possible inputs[1]. *Backward *slices answer the
>> question “what program components might affect a selected computation?”[1]
>>
>> While statement-minimal slices are not necessarily unique[3], Weisel
>> developed a popular algorithm that constructs one. In essence, his
>> fix-point algorithm constructs sets of *relevant variables* for each
>> edge in between node *i* and node *j* in a CFG graph, from which he
>> constructs *relevant statements*. The fix-point of the relevant
>> statements set is the slice itself.
>>
>> One of the characteristic of his algorithm is that the resulting program
>> slice will be executable. However, our problem doesn't require the code to
>> be executable, so we could use a more "aggressive" approach that creates a
>> smaller slice. An improvement to his algorithm is presented in [4].
>>
>> *How does this relate to BugReporter?*
>>
>> We can show that using Weiser's algorithm, issues raised in Example 1.
>> and Example 2. can be improved upon.
>>
>> For example 1., the statement-minimal program slice with the criterion
>> (13, <dynamically allocated int object>) will contain the statements 5
>> and 6, and for example 2., the statement-minimal program slice with the
>> criterion (4, i) won't contain anything but statements 3 and 4. For the
>> latter, we can even improve the algorithm to also contain statement 7,
>> where a call to f is made.
>>
>> The analyzer, as stated earlier, gives BugReporter an ExplodedNode, from
>> which the slicing criterion must be constructed. The statement
>> corresponding to this node, coupled with the interesting regions the
>> checker itself marked could be used to construct this slicing criterion.
>>
>> *Challenges*
>>
>> While the algorithm Weiser developed along with the improvements made by
>> others are interprocedural, I would imagine that in implementation, it
>> would be a challenging step from an intraprocedural prototype.
>>
>> Several articles also describe pointers, references, and dynamically
>> allocated regions, as well as gotos and other tricky parts of the language,
>> but I still expect to see some skeletons falling out of the closet when
>> implementing this for C++, not only C.
>>
>> *Drawbacks*
>>
>> Static slicing, as an algorithm that doesn't know anything about input
>> values, suffers from the same issues that all static techniques do, meaning
>> that without heuristics, it'll have to do very rough guesses, possibly
>> leaving a far too big program slice. However, with the symbolic values the
>> analyzer holds, this could be improved, turning this into *conditioned
>> slicing, *as described in [1]. This however is only planned as a
>> followup work after GSoC.
>>
>> For this reason, this project would be developed as an alternative
>> approach to some of the techniques used in BugReporter, as an optional
>> off-by-default analyzer feature.
>>
>> *---=== Questions ===---*
>>
>> What do you think of this approach?
>> Do you think that implementing this algorithm is achievable, but tough
>> enough task for GSoC?
>> Would you prefer to see a general program slicing library, or an
>> analyzer-specific implementation? Traversing the ExplodedGraph would be
>> far easier in terms of what I want to achieve, but a more general approach
>> that traverses the CFG (like llvm::DominatorTree[5]) could be beneficial
>> to more developers, but possibly at the cost of not being able to improve
>> the prototype with the symbolic value information the analyzer holds.
>>
>> *References, links*
>>
>> [1] Gallagher, Keith, and David Binkley. "Program slicing." *2008
>> Frontiers of Software Maintenance*.
>> [2] Tip, Frank. *A survey of program slicing techniques*. Centrum voor
>> Wiskunde en Informatica, 1994.
>> [3] Weiser, Mark. "Program slicing." *Proceedings of the 5th
>> international conference on Software engineering*. IEEE Press, 1981.
>> [4] Binkley, David. "Precise executable interprocedural slices." *ACM
>> Letters on Programming Languages and Systems (LOPLAS)* 2.1-4 (1993):
>> 31-45.
>> [5] http://llvm.org/doxygen/classllvm_1_1DominatorTree.html
>>
>> Link to previous GSoC related letter I sent:
>> http://lists.llvm.org/pipermail/cfe-dev/2019-February/061464.html
>>
>> Cheers,
>> Kristóf Umann
>>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190403/cec62812/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 126147 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190403/cec62812/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 171712 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190403/cec62812/attachment-0001.png>