[cfe-dev] [analyzer][GSoC] Problem Statement: Improving BugReporter with static backward program slicing
Artem Dergachev via cfe-dev
cfe-dev at lists.llvm.org
Tue Apr 2 17:24:59 PDT 2019
One more question about Example 1. Do you think this particular example
could be solved by developing some sort of NoStoreFuncVisitor but for
MallocChecker? I.e., in addition to emitting notes like "returning
without initializing variable `x`", it could emit notes like "returning
without escaping pointer `p`". If such visitor is developed, what would
be the differences between your proposal and whatever behavior we'll get
with such visitor? In other words, is it possible to construct an
example with the "use of uninitialized variable" checker that, like
Example 1, has vital pieces of information missing, given that this
checker already uses NoStoreFuncVisitor?
I mean, it's most likely a terrible idea to develop such visitor for
MallocChecker, because the only reason this visitor works more or less
reliably is the existence of const-qualifiers. For MallocChecker they
don't exist, so it's going to be hard to figure out if a function was
*supposed* to escape the pointer.
On 4/2/19 5:14 PM, Artem Dergachev wrote:
> Ahaaaaa, ooooookkkkk, iiiiiiii seeeeeeeeee!
>
> Sounds fascinating, must-have and hard.
>
> My primary question is how much do you think this will put stress on
> the checkers. Like, how much more difficult would it be to write
> checkers when we require them to include the slicing criterion with
> their bug reports? How much of it would we (i.e., BugReporter) be able
> to infer ourselves?
>
> ------------------------
>
> Say, let's take Example 1. You describe the slicing criterion as:
>
> (13, <dynamically allocated int object>)
>
> The value of the dynamically allocated into object remains the same
> regardless of whether the object is stored in global_ptr or not, so
> the slice doesn't need to include line 5 or 6. Therefore i think that
> the slicing criterion you proposed is not what we're looking for, and
> the real slicing criterion we're looking for is:
>
> (13, <liveness of <dynamically allocated int object>>)
>
> Like, we're mapping every dynamically allocated object to an imaginary
> heap location that represents its current liveness, and include that
> imaginary variable, rather than the object itself, in the slicing
> criterion. Line 6 would affect liveness of the object on line 13 (if
> it's stored into a global, it's alive; otherwise it's not), therefore
> it'll be part of the slice. So, am i understanding correctly that
> you're proposing a checker API that'll look like this:
>
> BugReport *R = new BugReport(Node, ...);
> R->addLivenessBasedSlicingCriterion(HeapSym);
>
> ?
>
> I guess we can infer the statement of the slicing criterion from the
> Node, but i'm not entirely sure; see also below.
>
> I'd actually love it if you elaborate this example further because
> it's fairly interesting. Like, we know that the assignment affects the
> liveness information, but how would the slicing algorithm figure this
> out? Do you have a step-by-step description of how the algorithm
> behaves in this case?
>
> ------------------------
>
> Let's take another example i just came up with. Consider
> alpha.unix.PthreadLock - a checker that finds various bugs with
> mutexes, such as double locks or double unlocks. Consider code:
>
> 1 pthread_mutex_t mtx1, mtx2;
> 2
> 3 void foo(bool x1, bool x2) {
> 4 if (x1)
> 5 pthread_mutex_lock(&mtx1);
> 6 if (x2)
> 7 pthread_mutex_lock(&mtx1);
> 8 // ...
> 9 }
>
> In this example we'll report a double lock bug due to a copy-paste
> error: line 7 should probably lock &mtx2 rather than &mtx1.
>
> The whole program is relevant to the report. Without line 5, there
> would only be a single lock. It transitions the mutex object from
> state "unknown" to state "locked".
>
> I don't think the Analyzer would currently do a bad job at explaining
> the bug; it's pretty straightforward.
>
> What would be the slicing criterion in this case? As far as i
> understand, it'll be
>
> (7, <locked-ness of mtx1>)
>
> In this case "lockedness" is, again, an "imaginary heap location" that
> contains metadata for the mutex. More specifically, it is a GDM map
> value. How would a checker API look in this case? How would it even
> describe to the BugReporter what to look at, given that currently only
> the checker understands how to read its GDM values?
>
> My random guess is:
>
> BugReport *R = new BugReport(Node, ...);
> R->addGDMBasedSlicingCriterion<MutexState>(MutexMR);
>
> ------------------------
>
> So it sounds to me that your project is, like many other awesome
> projects, brings in, as a dependency, replacing GDM with a better,
> smarter data map that the analyzer core *can* introspect and
> manipulate without the direct help from the checkers. It might be that
> your project actually requires relatively little amounts of such
> introspection - i.e., you might be able to get away with "it's some
> GDM map and the value for this key has changed, therefore this
> statement is a relevant data dependency" for quite a lot of checkers.
>
> That's the subject i was also bringing up as part of
> https://reviews.llvm.org/D59861 - "this was the plan that is part of
> the even-more-long-term plan of completely deprecating the void*-based
> Generic Data Map...". This was also something that became a blocker on
> my old attempt to introduce summary-based analysis. It required
> stuffing fairly large chunks of code into every checker that would
> teach the checker how to collect summary information and apply it when
> the summary is applied, and this new boilerplate was fairly
> brain-damaging to implement and hard to get right or even to explain.
> This problem also blocks other projects, such as state merging / loop
> widening (on one branch the mutex is locked, on the other branch it is
> unlocked, hey checker please teach me how to merge this).
>
> The variety of slicing criteria (that is a consequence of the variety
> of checkers that we have) sounds pretty scary to me. Some of them are
> really complicated, eg. the liveness criterion. Making sure that the
> generic algorithm works correctly with at least a significant chunk of
> them is going to be fun. Making sure it can actually handle arbitrary
> criterions required by the checkers without having every checker come
> with its own boilerplate for slicing also sounds hard. And it'll be
> very bad if we have to tell our checker developers "hey, by the way,
> you also need to know what backward slicing is and write down part of
> the algorithm in order to ever get your checker enabled by default" -
> i'm already feeling pretty bad explaining dead symbols and pointer
> escapes (which are pretty much must-have in most checkers), these are
> definitely examples of a boilerplate that a smarter version of GDM
> could handle automatically. In order to conquer the world, i think we
> should stick to our "writing a checker in 24 hours" utopia: writing a
> checker should be as easy as writing down the transfer functions for
> relevant statements. In my opinion, we should take it as an important
> constraint.
>
>
>
>
> On 4/2/19 12:21 PM, Kristóf Umann wrote:
>> Somehow the images and the attached files were left out, please find
>> them here:
>>
>> On Tue, 2 Apr 2019 at 21:16, Kristóf Umann <dkszelethus at gmail.com
>> <mailto:dkszelethus at gmail.com>> wrote:
>>
>> Hi!
>>
>> In this letter, I'd like to describe a particular problem in the
>> Clang StaticAnalyzer's BugReporter class that I'd like to tackle
>> in a Google Summer of Code project this summer. I'll show
>> real-world examples on some of its shortcomings, and propose a
>> potential solution using static backward program slicing. At
>> last, I'll ask some specific questions. I'd love to hear any and
>> all feedback on this!
>>
>> This is a /problem statement, /not a /proposal/. I plan to send
>> out a formal proposal by Friday (but not later then Saturday),
>> that will contain more details on both the problem and the
>> solution. I don't introduce myself or my previous works within
>> the project, that will also be detailed in my upcoming letter. I
>> also plan to incorporate the feedback I'll receive to this letter.
>>
>> *---=== BugReporter constructs bad reports ===---*
>> *
>> *
>> *What does the BugReporter do?*
>>
>> After the Static Analyzer found an error, the BugReporter
>> receives an ExplodedNode, which, accompanied by its predecessors,
>> contains all the information needed to reproduce that error. This
>> /bugpath /is then shortened with a variety of heuristics, such as
>> removing unrelated function calls, unrelated branches and so on.
>> BugReporter by the end of this process will construct a
>> PathDiagnostic object for each report, that is, ideally, minimal.
>> *
>> *
>> *Example 1.*
>>
>> Consider the following code example:
>>
>> 1 // leak.cpp
>> 2 int *global_ptr;
>> 3
>> 4 void save_ext(int storeIt, int *ptr) {
>> 5 if (storeIt)
>> 6 global_ptr = ptr;
>> 7 }
>> 8
>> 9 void test(int b) {
>> 10 int *myptr = new int;
>> 11 save_ext(b, myptr);
>> 12 delete global_ptr;
>> 13 }
>>
>> It's clear that if test is invoked with b's value set to true,
>> there is no error in the program. However, should b be false,
>> we'll leak memory.
>>
>> $ clang -cc1 -analyze -analyzer-checker=core,cplusplus leak.cpp
>> image.png
>>
>> The Static Analyzer is able to catch this error, but fails to
>> mention the call to save_ext entirely, despite the error only
>> occurring because the analyzer assumed that storeIt is false.
>> I've also attached the exploded graph leak.svg that demonstrates
>> this.
>>
>> *Example 2.*
>>
>> Consider the following code example:
>>
>> 1 // divbyzero.cpp
>> 2 void f() {
>> 3 int i = 0;
>> 4 (void) (10 / i);
>> 5 }
>> 6
>> 7 void g() { f(); }
>> 8 void h() { g(); }
>> 9 void j() { h(); }
>> 10 void k() { j(); }
>>
>> Its clear that a call to f will result in a division by zero error.
>>
>> $ clang -cc1 -analyze -analyzer-checker=core divbyzero.cpp
>> image.png
>> Again, the Static Analyzer is plenty smart enough to catch this,
>> but the constructed bug report is littered with a lot of useless
>> information -- it would be enough to only show the body of f,
>> and, optionally where f itself was called. For the sake of
>> completeness, I've attached divbyzero.svg that contains the
>> exploded graph for the above code snippet.
>>
>> The above examples demonstrate that BugReporter sometimes reduces
>> the bugpath too much or too little.
>>
>> *---=== Solving the above problem with static backward program
>> slicing ===---*
>>
>> *What is static backward program slicing?*
>>
>> A /program slice/ consists of the parts of a program that
>> (potentially) affect the values computed at some point of
>> interest, called the slicing criterion. /*Program slicing*/ is a
>> decomposition technique that elides program components not
>> relevant to the slicing criterion (which is a pair of (statement,
>> set of variables)), creating a program slice[1][2]. *Static
>> *slicing preserves the meaning of the variable(s) in the slicing
>> criterion for all possible inputs[1]. *Backward *slices answer
>> the question “what program components might affect a selected
>> computation?”[1]
>>
>> While statement-minimal slices are not necessarily unique[3],
>> Weisel developed a popular algorithm that constructs one. In
>> essence, his fix-point algorithm constructs sets of /relevant
>> variables///for each edge in between node /i/ and node /j/ in a
>> CFG graph, from which he constructs /relevant statements/. The
>> fix-point of the relevant statements set is the slice itself.
>>
>> One of the characteristic of his algorithm is that the resulting
>> program slice will be executable. However, our problem doesn't
>> require the code to be executable, so we could use a more
>> "aggressive" approach that creates a smaller slice. An
>> improvement to his algorithm is presented in [4].
>> *
>> *
>> *How does this relate to BugReporter?*
>>
>> We can show that using Weiser's algorithm, issues raised in
>> Example 1. and Example 2. can be improved upon.
>>
>> For example 1., the statement-minimal program slice with the
>> criterion (13, <dynamically allocated int object>) will contain
>> the statements 5 and 6, and for example 2., the statement-minimal
>> program slice with the criterion (4,i) won't contain anything but
>> statements 3 and 4. For the latter, we can even improve the
>> algorithm to also contain statement 7, where a call to f is made.
>>
>> The analyzer, as stated earlier, gives BugReporter an
>> ExplodedNode, from which the slicing criterion must be
>> constructed. The statement corresponding to this node, coupled
>> with the interesting regions the checker itself marked could be
>> used to construct this slicing criterion.
>>
>> *Challenges*
>>
>> While the algorithm Weiser developed along with the improvements
>> made by others are interprocedural, I would imagine that in
>> implementation, it would be a challenging step from an
>> intraprocedural prototype.
>>
>> Several articles also describe pointers, references, and
>> dynamically allocated regions, as well as gotos and other tricky
>> parts of the language, but I still expect to see some skeletons
>> falling out of the closet when implementing this for C++, not only C.
>>
>> *Drawbacks*
>>
>> Static slicing, as an algorithm that doesn't know anything about
>> input values, suffers from the same issues that all static
>> techniques do, meaning that without heuristics, it'll have to do
>> very rough guesses, possibly leaving a far too big program slice.
>> However, with the symbolic values the analyzer holds, this could
>> be improved, turning this into /conditioned slicing, /as
>> described in [1]. This however is only planned as a followup work
>> after GSoC.
>>
>> For this reason, this project would be developed as an
>> alternative approach to some of the techniques used in
>> BugReporter, as an optional off-by-default analyzer feature.
>>
>> *---=== Questions ===---*
>>
>> What do you think of this approach?
>> Do you think that implementing this algorithm is achievable, but
>> tough enough task for GSoC?
>> Would you prefer to see a general program slicing library, or an
>> analyzer-specific implementation? Traversing the ExplodedGraph
>> would be far easier in terms of what I want to achieve, but a
>> more general approach that traverses the CFG (like
>> llvm::DominatorTree[5]) could be beneficial to more developers,
>> but possibly at the cost of not being able to improve the
>> prototype with the symbolic value information the analyzer holds.
>>
>> */References, links/*
>>
>> [1] Gallagher, Keith, and David Binkley. "Program slicing." /2008
>> Frontiers of Software Maintenance/.
>> [2] Tip, Frank. /A survey of program slicing techniques/. Centrum
>> voor Wiskunde en Informatica, 1994.
>> [3] Weiser, Mark. "Program slicing." /Proceedings of the 5th
>> international conference on Software engineering/. IEEE Press, 1981.
>> [4] Binkley, David. "Precise executable interprocedural slices."
>> /ACM Letters on Programming Languages and Systems (LOPLAS)/ 2.1-4
>> (1993): 31-45.
>> [5] http://llvm.org/doxygen/classllvm_1_1DominatorTree.html
>>
>> Link to previous GSoC related letter I sent:
>> http://lists.llvm.org/pipermail/cfe-dev/2019-February/061464.html
>>
>> Cheers,
>> Kristóf Umann
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190402/9c7bd838/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 126147 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190402/9c7bd838/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 171712 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190402/9c7bd838/attachment-0001.png>
More information about the cfe-dev
mailing list