[cfe-dev] [analyzer] Discovering information not contained in the bugpath

Tue Jun 11 09:38:26 PDT 2019

I enhanced the sketch greatly.

On Tue, 11 Jun 2019 at 04:19, Artem Dergachev <noqnoqneo at gmail.com> wrote:

> Hmm, what do you mean by this line?:
>
>         trackExpressionValue(N, RHS of assignment, Report)
>
> Like, how does the existing visitor infrastructure behave when the
> statement is, by construction, never appears in the bug path? It doesn't
> even have a value to track, as it never appeared in the Environment.
>

Well the thinking was that there's no need for that as
TrackControlDependecy only needs a CFG to track conditions. Now, obviously,
the actual tracking of the right hand side of the assignment wouldn't be
possible, but that could be a plus: I think tracking too much hypothetical
"this could have prevented the bug, but ultimately did not" events could
litter the bug report.

I changed this in the sketch.

> P.S. I just realized that this discussion would probably be messy to read
> later through archives, as old revisions of the google doc are no longer
> available.
>

Good point. Here's a copy-paste from the document:

The following slicing algorithm, which is heavily tailored for the Clang
Static Analyzer, somewhat redefines the goal of slicing: traditionally,
we’d gather a set of relevant variables, which would initially be the
variables in the slicing criterion, and expand it if we find a variable
relevant to any variable within the set. Along this, we’d also calculate a
set of relevant statements, which is a subset of the program, either
executable on it’s own or not.

My approach is vastly different: the reason behind this is the function
called trackExpressionValue. In a nutshell, trackExpressionValue tracks a
variable, and adds notes at where the tracked variable was written,
alongside the path that leads there, essentially gathering the set of
relevant statements. Teaching it to discover variables immediate relevant
to the tracked variable, we could (in theory) calculate the set of relevant
variables transitively: if X is tracked, and A would be added to the set
because of X, and B would be added because of A, tracking X would trigger
tracking A, and that would trigger tracking B.

Our starting point (slicing criterion) is the ExplodedNode and CFGBlock
where report originates from, and the variable that was initially tracked.

function insertIfLastWrite(CFG, SetOfLastWrites : set of CFGBlocks,

                          NewBlock : CFGBlock)

 for each W block in SetOfLastWrites do

   if NewBlock post dominates W then

     SetOfLastWrites -= W

     SetOfLastWrites += NewBlock

   else if W post dominates B then

     break

   fi

 od

 if no elements post dominate B or vise versa then

   SetOfLastWrites += NewBlock

 fi

// Traverses N’s CFG to discover branches that are not a part of the

// bug path. Returns true if it discovers a write to X that is

// inevitable.

function conditionTracker(N : ExplodedNode, X : variable,

                         OriginB : CFGBlock, Report : BugReport)

 SetOfLastWrites = {} : set of CFGBlocks

 for all B blocks in N->getCFG() in order do

   if there is no path from B to OriginB then

     continue

   fi

   for all E elements in B in order do

     if E writes X then

       insertIfLastWrite(N->getCFG(), SetOfLastWrites, B)

       continue

     fi

     if E is a function call to F and the analyzer inlined F then

       ParamSet := set of F's parameters to which

                   X is passed to as a non-const reference

       NC := N's predecessor that is F's CallEnter

       for all NewX variables in ParamSet do

         if conditionTracker(NC, NewX, F’s exit block, Report) then

           insertIfLastWrite(N->getCFG(), SetOfLastWrites, B)

         fi

       od

       // X is global/static, or a field and F is a method, etc...

       if F would have access to X regardless then

         if conditionTracker(NC, NewX, F’s exit block, Report) then

           insertIfLastWrite(N->getCFG(), SetOfLastWrites, B)

         fi

       fi

       continue

     fi

   od

 od

 for all W blocks in SetOfLastWrites do

   if W is not a part of the bug path then

     track control dependencies of W

   fi

   // Otherwise, let regular visitors handle it.

 od

 if any of the blocks in SetOfLastWrites is in the bug path

   return true

 fi

 return false

> On 6/10/19 6:46 PM, Kristóf Umann wrote:
>
> I rewrote my sketch algorithm that showcases how I imagine this to work.
>
>
> https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit
>
> On Mon, 10 Jun 2019 at 20:03, Kristóf Umann <dkszelethus at gmail.com> wrote:
>
>>
>>
>> On Mon, 10 Jun 2019 at 05:03, Artem Dergachev <noqnoqneo at gmail.com>
>> wrote:
>>
>>> The problem we're solving is *entirely* about inter-procedural analysis.
>>> Like, as long as there are no inlined calls on the path (and therefore no
>>> pruning is being done), then there's no problem at all: the user can easily
>>> see that the value is untouched by direct observation.
>>>
>>
>> I have explained myself poorly. I intend to implement an
>> *intraprocedural* slicing, but that would work *interprocedurally* by
>> resolving function calls with the help of the analyzer. This is important,
>> because for the by-the-book interprocedural slicing you'd need a system
>> dependence graph. Now, if a function isn't inlined, we need to do this on
>> our own. I see a couple unknowns here (how deep do we explore such
>> functions? how to we resolve non-trivial parameter passing? are we sure
>> that tracking expressions here is actually meaningful?) but I think
>> tackling simple cases is achievable.
>>
>>
>>> In this sense i find the lack of note in example 2 completely
>>> non-problematic (as long as control dependency tracking lands). It's not a
>>> problem that the user isn't told that bar() could have initialized 'x',
>>> because the user already knows that bar() is not even called in the first
>>> place.
>>>
>>
>> Would you agree that tracking flag however would be good?
>>
>>
>>> However, there may be a combination of category 1 and category 2 that is
>>> much more interesting: what if main() unconditionally calls foo() that
>>> conditionally calls bar() that unconditionally initializes the value? Eg.:
>>>
>>> void bar(int *x) {
>>>   *x = 1;  // the interesting event that doesn't happen
>>> }
>>>
>>> void foo(int *x) {
>>>   if (rand())  // control dependency of the interesting event
>>>     bar(x);
>>> }
>>>
>>> int main() {
>>>   int x;
>>>   foo(&x);
>>>   use(x); // use of uninitialized value
>>> }
>>>
>>> In this case it is essential to highlight the path within foo() so that
>>> to explain how it fails to call bar().
>>>
>> On 09.06.2019 17:50, Kristóf Umann wrote:
>>>
>>> I gave this some thought, and realized that we could divide the
>>> "should-not-have-happened" cases into 2 categories:
>>>
>>> 1. The unexplored block is within a function that the analyzer visited.
>>> 2. The unexplored block lies within a function that the analyzer didn't
>>> even visit.
>>>
>>> For example:
>>>
>>> 01 int flag;
>>>
>>> 02 bool coin();
>>>
>>> 03
>>>
>>> 04 void foo() {
>>>
>>> 05   flag = coin(); // no note, but there should be one
>>>
>>> 06 }
>>>
>>> 07
>>>
>>> 08 void bar(int &i) {
>>>
>>> 09  if (flag) // assumed false
>>>
>>> 10    i = new int;
>>>
>>> 11 }
>>>
>>> 12
>>>
>>> 13 int main() {
>>>
>>> 14   int *x = 0; // x initialized to 0
>>>
>>> 15   flag = 1;
>>>
>>> 16   foo(); // no note, but there should be one
>>>
>>> 17   bar(x);
>>>
>>> 18   foo(); // no note, but there should be one
>>>
>>> 19
>>>
>>> 20   if (flag) // assumed true
>>>
>>> 21     *x = 5; // warn
>>>
>>> 22 }
>>>
>>>
>>> Observe the subtle difference that on line 17, function bar is called,
>>> and the first branch lies in that function. This is definitely a "category
>>> 1" case, since the analyzer inlined and visited bar(), but again,
>>> failed the realize the importance of flag due to not visiting line 10, and
>>> makes no mention of line 16 in the bugreport.
>>>
>>> 01 int flag;
>>>
>>> 02 bool coin();
>>>
>>> 03
>>>
>>> 04 void foo() {
>>>
>>> 05   flag = coin(); // no note, but there should be one
>>>
>>> 06 }
>>>
>>> 07
>>>
>>> 08 void bar(int &i) {
>>>
>>> 09  i = new int;
>>>
>>> 10 }
>>>
>>> 11
>>>
>>> 12 int main() {
>>>
>>> 13   int *x = 0; // x initialized to 0
>>>
>>> 14   flag = 1;
>>>
>>> 15   foo(); // no note, but there should be one
>>>
>>> 16 if (flag) // assumed false
>>>
>>> 17    bar(x);
>>>
>>> 18   foo(); // no note, but there should be one
>>>
>>> 19
>>>
>>> 20   if (flag) // assumed true
>>>
>>> 21     *x = 5; // warn
>>>
>>> 22 }
>>>
>>>
>>> In this example, the if statement stays, but the assignment to x is now
>>> found in bar(). In this case, the analyzer didn't explore bar(), so
>>> it's a "category 2" case.
>>>
>>>
>>> Now, the reason why these categories are important is that if the
>>> analyzer actually explored a function, it solves the problem of parameters:
>>> for the example shown for "category 1", we can simply ask the analyzer
>>> whether x in function main() is the same as i in function bar(). This
>>> is, as stated in my previous letter, far more difficult for "category 2"
>>> cases.
>>>
>>>
>>> I think when I'm finished with the "must-have-happened" cases, I could
>>> start enhancing NoStoreFuncVisitor to gather conditions possibly worth
>>> tracking for "category 1" cases, see whether an intraprocedural slicing
>>> algorithm makes sense, could worry about resolving parameters for the other
>>> case one when I get there.
>>>
>>>
>>> Any thoughts on this approach? :)
>>>
>>> On Sat, 8 Jun 2019 at 01:59, Kristóf Umann <dkszelethus at gmail.com>
>>> wrote:
>>>
>>>> Hi!
>>>>
>>>> I'm currently working on a GSoC project that aims to improve the
>>>> description we provide for bug reports the Static Analyzer emits.
>>>>
>>>> We divided the problem (that the bugreports didn't explain how the bug
>>>> came about very well) into two parts:
>>>>
>>>> 1. The information is available in the bug path (basically parts of the
>>>> code the analyzer actually explored on that particular path of execution).
>>>> We refer to these as the "must-have-happened" case.
>>>> 2. The information isn't available, possibly not even in the
>>>> ExplodedGraph (parts of the program the analyzer explored on all paths of
>>>> execution). We call this the "should-not-have-happened" case.
>>>>
>>>> 01 int flag;
>>>>
>>>> 02 bool coin();
>>>>
>>>> 03
>>>>
>>>> 04 void foo() {
>>>>
>>>> 05   flag = coin(); // no note, but there should be one
>>>>
>>>> 06 }
>>>>
>>>> 07
>>>>
>>>> 08 int main() {
>>>>
>>>> 09   int *x = 0; // x initialized to 0
>>>>
>>>> 10   flag = 1;
>>>>
>>>> 11   foo(); // no note, but there should be one
>>>>
>>>> 12   if (flag) // assumed false
>>>>
>>>> 13     x = new int;
>>>>
>>>> 14   foo(); // no note, but there should be one
>>>>
>>>> 15
>>>>
>>>> 16   if (flag) // assumed true
>>>>
>>>> 17     *x = 5; // warn
>>>>
>>>> 18 }
>>>>
>>>> The first branch is a good example for the "should-not-have-happened"
>>>> case: the bug path doesn't contain line 13, so we don't really know whether
>>>> the condition on line 12 is a big deal, which in this case it is (unlike if
>>>> it only guarded a print of some sort)
>>>>
>>>> The second branch is a good example for the "must-have-happened" case:
>>>> we know that the condition on line 16 is very important.
>>>>
>>>> I like to think that I made a very decent progress on the first part,
>>>> as it's objectively a lot simpler. A great addition to the analyzer is that
>>>> it now understands (at least, when my patches land) control dependencies in
>>>> the CFG. With that, we can figure out that flag's value on line 16 is
>>>> crucial, so we track it, resulting in notes being added on line 14, and on
>>>> line 5, explaining that flag's value changed.
>>>>
>>>> However, we are very unsure about how to tackle the second part on the
>>>> project. I gave this a lot of thought, we had a couple meetings in private,
>>>> and I'd like to share where I am with this.
>>>>
>>>> My project proposed to implement a static backward program slicing
>>>> algorithm which would definitely be amazing, but seems to be an unrealistic
>>>> approach. Basically, any slicing algorithm that is interprocedural would
>>>> require a system dependence graph, something we just simply don't have, not
>>>> even for LLVM.
>>>>
>>>> Whenever pointer semantics are in the picture, we have a really hard
>>>> time: it's something that is super difficult to reason about without
>>>> symbolic execution, and it would probably be a good idea not to tackle such
>>>> cases until we come up with something clever (maybe the pset calculation
>>>> Gábor Horváth is working on?). We should, however, being able to handle
>>>> reference parameters.
>>>>
>>>> So then, what's the goal exactly?
>>>>
>>>> We'd like to teach the analyzer that some code blocks (even in other
>>>> functions) could have written the variable that is responsible for the bug
>>>> (x, in the included example), and is very important to the bug report. We
>>>> could use this information to track variables similarly to the
>>>> "must-have-happened" case: track conditions that prevented the write from
>>>> happening.
>>>>
>>>> To the specifics. I think it's reasonable to implement an
>>>> *intraprocedural* slicing algorithm. For the included example, I think
>>>> we should be able to discover the potential write to x on line 9. We don't
>>>> need anything we already don't have to achieve this.
>>>>
>>>> Reasoning across function boundaries could be achieved by inlining
>>>> functions (as I understand it, simply insert the function's CFGBlocks), but
>>>> there are big unknowns in this.
>>>>
>>>> So my question is, do you think such inlining is possible? If not,
>>>> should I just create a primitive variable-to-parameter mapping? Is there
>>>> something else on the table?
>>>>
>>>> Cheers!
>>>> Kristóf Umann
>>>>
>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190611/d1354bd1/attachment.html>