[cfe-dev] [GSoC 2017] Improve loop execution model of the Clang Static Analyzer

Artem Dergachev via cfe-dev cfe-dev at lists.llvm.org
Mon Mar 27 04:47:40 PDT 2017


Hello Péter!

Thanks for working on the stuff. It already aligns very well with what we would have done.

Let me throw in a few random thoughts that may be helpful.

______

As you might have noticed, the difference between

  // no warning
  int foo(int x, int y) {
    return x / y;
  }

and 

  // division by zero warning
  int bar(int x, int y) {
    if (y == 0) {}
    return x / y;
  }

is that the analyzer enjoys having one extra presumption to base its reasoning upon: namely, it presumes that both branches of the if-statement are viable. In other words, it presumes that there is no dead code in the program, from which it concludes that bar() may actually be called with y equal to 0, without looking at the rest of the program. It cannot say the same for foo(), though, so while it concludes that calling foo() with y equal to 0 is unsafe, it considers that not worth mentioning in the final report.

Essentially, when our program state contains “$a: [1, 2], $b: [3, 4]”, it means “we have proven that there exists at least one context in which the function under analysis is executed at actual runtime, in which the value denoted by symbol $a belongs to [1, 2] and the value denoted by symbol $b belongs to [3, 4]”. There are also “tainted” symbols that are known to possibly take all values in their respective ranges (e.g. user input - a user may input any value; useful for security checks), but that’s another story.
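
For completeness, here is a minimal sketch of the taint story (assuming the optional alpha.security.taint.TaintPropagation checker is enabled on top of the core checkers; I’m quoting the checker name and the warning from memory, so treat them as approximate):

  #include <stdio.h>

  int divide_by_user_input(void) {
    int y;
    scanf("%d", &y);  // y becomes tainted: the user may have typed anything
    return 100 / y;   // flagged as division by a tainted, possibly zero, value,
                      // even though there is no "if (y == 0)" check anywhere
  }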

The idea of presuming that no code is dead is one of the basic ideas behind symbolic execution. It is essentially the only reason why we should be splitting our analysis into different execution paths. This idea, if implemented carefully, would also keep us from reporting bugs whose preconditions never actually occur in the program, so we wouldn’t have to tell our users to insert asserts all over the place to assist us.
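
This is also why asserts currently help us so much. A minimal sketch (assuming assert() expands to a call to a noreturn function such as __assert_fail, as it does with glibc, so that the analyzer can prune the failing path):

  #include <assert.h>

  int bar_checked(int x, int y) {
    assert(y != 0);   // the y == 0 path becomes a sink here
    if (y == 0) {}    // this branch is now infeasible, so no state with y == 0 survives
    return x / y;     // no division by zero warning anymore
  }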

Things get trickier with more complicated control flow though. Even with:

  if (x == 0) {}
  if (y == 0) {}

we cannot presume that all four paths - (x == 0, y == 0), (x == 0, y != 0), (x != 0, y == 0), (x != 0, y != 0) - are actually possible. I think it’s safe to assume that at least two of these paths are possible, without assuming which ones exactly; if x and y are boolean, I’d say at least three paths are possible, because there would be no reason to make two boolean variables that mean the same thing. However, representing such assumptions is already very hard, and the complexity increases rapidly with more and more ifs. So we simply split the state into all four paths and rely on asserts to do the rest of the job.
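
To make the cost of that shortcut concrete, here is a sketch of the kind of report it can produce; the caller contract is hypothetical, of course:

  // Suppose every caller guarantees that x == 0 implies y != 0.
  int correlated(int x, int y) {
    if (x == 0) {}
    if (y == 0) {}
    if (x == 0)
      return 1 / y;   // reported on the (x == 0, y == 0) path, which the contract rules out
    return 0;
  }
  // Today the user would suppress this with assert(x != 0 || y != 0) above the division.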

______

Now, to loops. Loops are an even more complicated kind of control flow. If we analyze a function:

  void loop_foo(int N) {
    for (int i = 0; i < N; ++i) {}
  }

then it would be safe to assume that the loop executes at least once, that is, that the state in which N is greater than 0 is worth considering. Otherwise, the loop would be dead code. At the same time, unlike an if() statement, it wouldn’t be safe to assume that N == 0 is possible. There is no “else branch” in the loop, and it is not common to use a for-loop instead of an if-statement, so there’s no reason for us to believe that N == 0 requires additional analysis, similarly to how we didn’t consider the ‘y == 0’ case specially in the original foo(). So for this case we should probably drop the “N == 0” path entirely once we step into the loop. But not here:

  void loop_bar(int N, int t) {
    if (t == 0) {}
    for (int i = 0; i < N; ++i) {}
  }

In this case, N and t may be connected, so even if the loop body is dead for t == 0, it may be live for t != 0, and we cannot, technically, throw the “N == 0” case away on either of the branches. Moreover, just as in bar(), in loop_bar() we cannot be sure how exactly N and t are connected, so ideally we shouldn’t produce 4 branches. But we probably will.
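
To make the tradeoff concrete, here is a sketch of the kind of report that dropping the “N == 0” path (as proposed for loop_foo()) would suppress; whether it is a true or a false positive depends entirely on how the function is called:

  int loop_skip(int N) {
    int d = 1;
    for (int i = 0; i < N; ++i)
      d = i + 2;
    return 10 / (d - 1);  // currently reported on the N == 0 path, where d stays 1;
                          // under the "loop body is not dead code" presumption that
                          // path would no longer be explored
  }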

______

Another tricky thing about loops is what exactly we do inside them. For example, in loop_foo(), we already believe that N is greater than 0, which means it is safe to enter the loop once and try to execute it. However, because in this example we cannot assume that N is greater than 1, nor that N is equal to 1, it is neither safe to execute the loop a second time nor to drop out of the loop after the first iteration. We allow ourselves to execute the loop 4 times by default, which is unsafe. This is not *very* unsafe though, because normally most loops are executed many times.
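
A sketch of what that unsafety looks like in practice; the report below silently presumes that some caller passes N greater than 1, which nothing in the function justifies:

  int unroll_presumption(int N) {
    int x = 0;
    for (int i = 0; i < N; ++i)
      x = 100 / (i - 1);  // fine on the first iteration (i == 0), division by zero on the
                          // second one (i == 1), which we only reach by presuming N >= 2
    return x;
  }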

Yet another tricky thing about loops is how we get out of them. I believe that after iterating inside the loop even once, it is unsafe to continue the analysis after the loop on the same path: even though the path inside the loop was definitely possible, by leaving the loop after the first iteration we assume that N is equal to 1 for no particular reason. So any concrete state obtained inside the loop should ideally be discarded once we leave the loop.
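
A sketch of the effect, using a division only to make the accidental “N == 1” assumption visible:

  int leave_early(int N) {
    int i = 0;
    for (; i < N; ++i) {}
    // On the path that leaves the loop right after the first iteration the analyzer now
    // "knows" i == 1 and N == 1, so the report below has "N == 1" as its precondition -
    // a precondition we picked up purely by deciding to exit at that particular point.
    return 100 / (N - 1);
  }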

What does it take to discard a state? You may pick one or more of the following:

1. Scrapping the whole path on which we entered the loop and starting a separate path of the analysis that skips the loop entirely, turning it into one big invalidation, similar to calling an external function.
2. Invalidating program state areas that were possibly touched by the loop, by carefully exploring the loop body, seeing which variables were touched, which pointers may potentially point to which variables, and which checker state traits could have been modified (not everything in C is variables, unfortunately).
3. Finding common patterns within the program states produced on paths on which the loop was iterated a different number of times, and generalizing them into a single program state (for instance, understanding that ‘i’ is equal to ‘N’ at the end of the loop; a lot more could be done though) - see the sketch below.
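
A sketch of what option 3 could look like from the outside, using the clang_analyzer_eval() helper from the debug.ExprInspection checker:

  void clang_analyzer_eval(int);  // provided by debug.ExprInspection

  void widen_sketch(int N) {
    int i = 0;
    for (; i < N; ++i) {}
    // Today this prints a separate answer for every unrolled exit path (plus the
    // skip-the-loop path); with option 3 we would ideally reach this point in a single
    // generalized state in which "i == N" (and, say, "N >= 1") is simply known to hold.
    clang_analyzer_eval(i == N);
  }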

______

When I say that something is “unsafe” or that something “should ideally not be done”, I mean that it may cause false positives. Because an ideal implementation is definitely too complex to pick up now, tradeoffs should be considered between squashing false positives and failing to deliver useful positives in the process. In most cases, I do not know how many true positives would be lost or how many false positives would be avoided.

One particular problem that has bothered us for quite a while is the problem of “correlated loops”, which includes consecutive loops that iterate over the same array. One of the examples would be:

  #include <stdlib.h>

  int main(int argc, char **argv) {
    int *a = malloc(atoi(argv[1]) * sizeof(int));
    for (int i = 0; i < atoi(argv[1]); ++i) {
      a[i] = i;
    }
    int s = 0;
    for (int i = 0; i < atoi(argv[1]); ++i) {
      s += a[i]; // <== RHS undefined?!
    }
    return s;
  }

In this example, the analyzer doesn’t realize that atoi(argv[1]) is always the same number (for whatever reason - we could fix this example by modeling the atoi() function, but we can’t model them all); different return values of atoi() are denoted by different symbols, which allows paths on which the two loops take different numbers of iterations. If the first loop takes fewer iterations than the second loop, then we receive a false alarm that the right-hand side of operator += is an uninitialized value.
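
(By the way, the usual user-side workaround is to read the bound once, so that both loops share the same symbol - which also shows that the problem is really about symbol identity rather than about loops as such:)

  #include <stdlib.h>

  int main(int argc, char **argv) {
    int n = atoi(argv[1]);             // one symbol for the bound
    int *a = malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i)
      a[i] = i;
    int s = 0;
    // No false alarm here anymore: with a single symbol for the bound, the two loops
    // can no longer disagree about the number of iterations.
    for (int i = 0; i < n; ++i)
      s += a[i];
    return s;
  }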

To mitigate this problem, we’ve been thinking of something like “detect which loops are correlated, and invalidate arrays that were touched in the first loop once they are read from in the second loop, but not earlier; however, invalidate only the default binding, leaving direct bindings intact, unless there is a third, fourth, etc. correlated loop”. This is an example of an approach that tries to be a non-intrusive hack, but still solves a practical issue.
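
In case the default/direct distinction sounds obscure, here is roughly what it would mean for the example above (use() is made up purely for illustration):

  void use(int);

  void correlated_loops_sketch(int *a, int n) {
    for (int i = 0; i < n; ++i)
      a[i] = i;   // each unrolled iteration adds a direct binding: a[0] -> 0, a[1] -> 1, ...
    // "Invalidate only the default binding" would mean: when the second loop starts reading,
    // give the whole pointee region a fresh unknown default value, so that elements this
    // path never wrote stop reading back as undefined, while the direct bindings
    // a[0] -> 0, a[1] -> 1 that we actually proved on this path are kept intact.
    for (int i = 0; i < n; ++i)
      use(a[i]);
  }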

______

So, again, thanks for diving into this! I’d be glad to consider, evaluate, and accept any improvements you’d be able to come up with. The random thoughts above were meant to give you a certain basis for thinking about the smaller details, not to push you toward entirely different approaches; the way you approach the problem looks really good to me :)

Best regards,
Artem.

> On Mar 26, 2017, at 4:55 PM, Anna Zaks <ganna at apple.com> wrote:
> 
> 
>> On Mar 25, 2017, at 7:33 AM, Péter Szécsi <peterszecsi95 at gmail.com> wrote:
>> 
>> Hello,
>> 
>> 
>> thank you for your replies!
>> 
>> I collected my ideas about it below, and at the end of the email I describe a dependency that I think is important and should perhaps be addressed first. I would gladly take that on for the summer as well, since it was mentioned on the GSoC project page earlier.
>> 
>> 
>> 
>> My initial thoughts in details about handling loops in a better way:
>> 
> Thanks for the writeup!
> It would be very useful if you could prepend each idea with motivation explaining which problem you are trying to address.
> 
> One of the issues we’d like to solve with generalized modeling of loops is that the analyzer loses coverage since it only simulates the first N iterations of the loop. This means that it does not cover paths that represent all other iterations, such as the last iteration of the loop. One problem is that loops that are known by the analyzer to have a constant bound will end exploration of a path. Note that the main reason to have a bound in the first place is that the analyzer usually does not know how many iterations of the loop to “execute” and will keep going forever. It will create all these branches:
>  - simulate that exactly one iteration is executed
>  - simulate that exactly two iterations are executed
>  - simulate that exactly three iterations are executed
>  -….
> 
> 
>>  First, we have a control number N which determines how many steps to unroll during the analysis of a loop. Instead of this, in my opinion it would be better to have a number M which determines the maximum number of times a loop can branch the ExplodedGraph.
>> 
> Could you explain how this number will be computed and give examples of what it will be, at least for the cases below? And more importantly, how will this number be used? (I don’t think the intent here is that M will replace N in all cases, correct?)
>> In this case simulating a simple (non-branching) loop to the end would not be too complex. This means we could analyze the following example without a problem:
>> 
>> 
>> 
> Are you saying M=0 for the case below? Is SIZE assumed to be a constant?
>>   vector<int> v;
>>   for (int i = 1; i < SIZE; ++i) {
>>     v[i] = f(i);
>>   }
>> 
>> But we would not waste much time analyzing a complex loop for “nothing”:
>> 
>> 
>> 
>>   for (auto it = begin(); it != end(); ++it) {
>>     if (it->complexfun1()) …
>>     if (it->complexfun2()) …
>>     else …
>>   }
>> 
>>  
>> And this could be Milestone1. (That is quite optional but I think this would be a great improvement.)
>> 
>> 
>> 
>> The next step is the LoopWidening: whenever we stop the simulation - because the loop would result in a too complex/huge ExplodedGraph - we would continue the analysis and invalidate only the MemRegions whose bound values can be changed in the loop. In order to avoid false positives, however, an important restriction would be that whenever we would invalidate too many MemRegions, it is better to just stop the analysis of that path. It can be tricky to say when exactly it is too many, but determining that would be part of the project.
>> 
>> 
>> 
>> So how should the invalidation work? I believe a good approach would be the following: we iterate over the statements of the loop's body and handle them one by one. Implementing an ASTVisitor could solve this, and any statement that is not handled yet would result in the “conservative analysis”, i.e. it would cut the processing of the given path.
>> 
>> This way it can be an extensible and incremental solution, since new callbacks can be implemented easily to cover new statements. (It could work like in the ASTImporter.)
>> 
>> Milestone2 could be implementing the ASTVisitor described above.
>> 
>> The next step could be implementing the following nodes: function calls (CallExpr) whose parameters are const or passed by value, and the assignment operator in some trivial cases (no pointer/reference/array on the left-hand side).
>> So whenever we stop the analysis of a loop which contains only those nodes, we know which regions to invalidate. This way the analysis can be continued safely.
>> 
>> For example:
>> 
>> 
>> 
>>   int f(int);
>>   int h(int, int);
>>   int a, b, c = 0;
>>   for (int i = 0; i < 100; i++) {
>>     a = f(i);
>>     b = h(a, i);
>>   }
>> 
>> 
>> 
>>   int d = 12 / c; // The analyzer should be able to find this and report the division-by-zero bug, since the for loop contains only those statements which are handled.
>> 
>> 
>> 
>> However, in the next example:
>> 
>> 
>> 
>>   int f(int);
>>   int h(int *, int);
>>   int a, b, c = 0;
>>   for (int i = 0; i < 100; i++) {
>>     a = f(i);
>>     b = h(&a, i);
>>   }
>> 
>>   int d = 12 / c;
>> 
>>  
>> Here we have a function call which is not handled (pointer parameter), so we just end this path’s analysis and do not spot the potential bug after the loop.
>> This would be Milestone3.
>> 
>> After these fundamentals, I would investigate which statements are commonly used in loops by testing this feature on open source projects, and I would aim to implement the visitor callbacks that turn out to be important according to these tests.
>> 
>> 
>> 
>> AAAAnd that is the point where I think we have a problem, because the use of pointers and function calls by reference is quite common in loops. In the naive and conservative way, all regions of typeT variables should be invalidated whenever we visit an assignment whose left-hand side variable is a typeT*. Moreover, it could be cast to another type, so maybe it should result in invalidating all MemRegions. That is why I think that a points-to analysis would be critical in order to make this whole approach work on real projects. Because of that, I am considering applying for the project “Implement points-to analysis”, which I saw on http://llvm.org/OpenProjects.html 2 weeks ago but which has mysteriously disappeared.
>> 
>> 
>> 
>> My questions:
>> 1. What do you think of this whole approach? (Please let me know if the whole idea is broken because I missed something.)
>> 2. Do you agree that reasoning about pointers’ possible values is important for achieving this?
>> 3. If that is the case could I still apply for that project or is it removed because of good reasons?
>> 
>> 
>> 
>> Cheers,
>> 
>> Peter
>> 
>> 2017-03-16 13:25 GMT+01:00 Sean Eveson <eveson.sean at gmail.com>:
>> Hello Péter,
>> 
>> Sounds good to me. I'd also be interested in knowing the specifics of what you are planning to do.
>> 
>> Very happy to help at any point.
>> 
>> Thanks,
>> 
>> Sean Eveson
>> SN Systems - Sony Interactive Entertainment Group
>> 
>> On Thu, Mar 16, 2017 at 12:49 AM, Anna Zaks <ganna at apple.com> wrote:
>> Hi Péter,
>> 
>> Better modeling of loops is definitely an interesting topic. My main concern would be having a specific proposal on what improvements you could complete in the GSoC timeframe. The next step here is to write down more specifics about what you intend to work on. Please, let me, Devin, and Artem know if you’d like to talk to us while figuring this out.
>> 
>> Cheers!
>> Anna.
>> 
>> 
>>> On Mar 13, 2017, at 9:43 AM, Péter Szécsi <peterszecsi95 at gmail.com> wrote:
>>> 
>>> Hello, 
>>> 
>>> I would love to work on a Clang Static Analyzer project during the summer by joining GSoC.
>>> Let me introduce myself:
>>> I'm Peter Szecsi, a third-year BSc student at Eötvös Loránd University, Budapest. It would be my first Summer of Code project but I have already contributed to clang:
>>> During the last summer I have uploaded 2 patches:
>>> - An accepted and merged clang-tidy checker [1] 
>>> - An accepted Clang SA checker [2]
>>> Since then, I have been working on cross translation unit analysis as a university project (and because of that I’ve sent some patches related to the ASTImporter [3]). Furthermore, I participated in the preparation of a talk that was accepted at the EuroLLVM conference. [4]
>>> I find the Static Analyzer interesting since I like working on algorithmic challenges and enjoy participating in programming competitions, and there are a lot of algorithms in the Static Analyzer which could be optimized or made more precise by heuristics.
>>> 
>>> That is the reason why I would be most interested in the project "Implement generalized loop execution modeling" from the Clang SA Open Projects page [5]. Hopefully it is a suitable project for GSoC and it is possible to achieve useful improvements during this time frame.
>>> I have read the discussion at the "loop widening" patch [6], and the most important enhancement (and milestone of the project) would be invalidating only the possibly modified values. In addition to that, I was thinking of having some heuristics which would estimate whether a loop is too complex to even use this kind of strategy, because it would still lead to invalidating most of the values and result in a lot of false positives. Another small heuristic could be: when we have already reached the maximum loop unroll number, and we are sure that there are only a few loop steps ahead, we could unroll it to the end, since this will probably not consume too much time/memory and the resulting state will not be much more complex.
>>> 
>>> What do you think about these ideas?
>>> 
>>> (I have CC'd everyone from the "loop widening" patch discussion.)
>>> 
>>> Best regards,
>>> Peter
>>> 
>>> [1] https://reviews.llvm.org/D22507
>>> [2] https://reviews.llvm.org/D24246
>>> [3] https://reviews.llvm.org/D29612, https://reviews.llvm.org/D30876, https://reviews.llvm.org/D30877
>>> [4] http://llvm.org/devmtg/2017-03//2017/02/20/accepted-sessions.html#7
>>> [5] http://clang-analyzer.llvm.org/open_projects.html
>>> [6] https://reviews.llvm.org/D12358
>> 
>> 
>> 
> 
