[cfe-dev] Newby SA question

Artem Dergachev via cfe-dev cfe-dev at lists.llvm.org
Wed Oct 10 20:53:35 PDT 2018


Well, that's how it has always been and is expected to work, though it 
might be not ideal, so it's kinda a little bit of both. I'll just go on 
explaining how exactly Static Analyzer understands these code snippets, 
and it should become clear.

In foo(), Static Analyzer says: "If the loop is executed 0 times, then 
cc is inevitably equal to 0, from which it follows that in test() the 
third argument is 0". This reasoning is, well, correct. The only problem 
here is in the first "If": the developer might be sure that the loop 
always executes at least once, even though there's no indication of it 
whatsoever in the code under analysis.

It might, of course, turn out that the developer is wrong. But if he's 
right, he doesn't mind writing "assert(cc);" at the beginning of the 
function, which would not only suppress the false positive, but also 
reduce the undefined consequences of calling test() with 0 to a simple 
assertion failure, and, as a bonus, document the contract of the 
function for future generations of developers. If such assertion ever 
fails, Static Analyzer was right. If it never fails, the developer was 
right.

Another way to describe it is that Static Analyzer proposes a unit test 
for foo() that consists in calling it with cc equal to 0, because it 
triggers an interesting corner-case: this is the only case in which the 
loop is executed 0 times.

In bar(), on the other hand, there's no indication that the branch on 
which "xx == 0" is of any interest. We would not explore it separately, 
pretty much like we won't explore "xx == 2018" separately. We'll just 
think of all these possibilities as a single path on which xx (or, more 
specifically, the algebraic symbol "reg_$2<unsigned int xx>" that 
denotes the unknown value of variable xx at the beginning of the 
analysis, i.e. the argument with which the function was called) "can be 
anything". That'd be the case until we encounter a line of code like "if 
(xx == 2018) { ... }". This line of code indicates that the developer 
expects that both branches of if() can potentially occur because 
otherwise it would have been dead code that doesn't make any sense and 
telling the developer to remove it would be considered a good thing.

Such presumption of no dead code is pretty much one of the central ideas 
behind symbolic execution, the static analysis technique used in Static 
Analyzer. The Analyzer interprets the program and "forks" the 
interpreter every time it encounters an "if" in the program, remembering 
that symbol "reg_$2<unsigned int xx>" satisfies certain equations 
("constraints") on each branch (eg., it is equal to 2018 on one branch 
and not-equal to 2018 on another branch). If the system of all equations 
collected on the current execution path so far has at least one 
solution, the path is explored further, otherwise it is discarded as 
impossible.

Now let's go back to foo(). Is the presence of a for() loop an 
indication of the possibility that the loop is executed exactly once? 
Not necessarily. If the loop is known to always be executed at least 
once, code still makes perfect sense, and every statement in the code is 
still exercised. So the assumption that the loop is sometimes executed 
exactly 0 times is definitely too eager, there's little difference 
between that and the assumption that the loop is sometimes executed 
exactly 2018 times.

In fact, there's a certain amount of discussion about this problem (eg., 
https://bugs.llvm.org/show_bug.cgi?id=32911) in which some advocate for 
completely skipping paths on which loops are executed 0 times because 
bugs found on them are too often not interesting. I personally mildly 
tend to believe that this corner-case is often missed by programmers, so 
it's actually pretty interesting, but this opinion is not really that 
strong. I guess nobody went far enough to gather enough data to support 
one of these opinions - and it's veeery hard to gather enough data in 
static analysis because a representative selection of codebases is 
almost impossible to obtain, they're just too different.

Is this a big problem? I think it's quite clear that it isn't, really. 
Because, well, you can always put that assert() and it'd be good in so 
many ways.

  Is this an easy problem to fix? Probably not. Apart from the easy hack 
that destroys coverage for 0-iteration paths, the actual proper solution 
is "loop widening", which is a form of what you described as "flushing" 
state after the analysis. There's an experimental flag "-analyzer-config 
widen-loops=true" which enables such behavior, and some users are using 
it, but i won't recommend using it yet because there are known crashes 
due to how it sometimes flushes information that is impossible to flush 
(eg., C++ references cannot be reassigned). It's probably not that hard 
to fix all such problems and have a working implementation of 
widen-loops=true. But i'm also not quite convinced that such primitive 
behavior is actually good. If you destroy all information gathered 
before entering the loop and within the loop, you may lose a lot of 
important bits of information that are completely obvious by looking at 
the code. Eg., imagine code in which a variable is always set to 0 
before the loop, and the loop definitely doesn't touch it, and then 
after the loop we assume that it is equal to 2018. That's a false 
positive that would be much more annoying to the user because the 
assert() that says "this variable is not touched within the loop" is not 
only hard to write in the general case, but also looks ridiculous in the 
code. So when doing loop widening, it is important to preserve as much 
information as possible, and that is a much more complicated problem 
than simply discarding information.

What's specifically hard about loop widening is how would checkers 
interact with it. If the internal state trait of the program tracked by 
the checker changes within the loop, the core of the Analyzer won't be 
able to figure out what should be the final state of the checker's 
internal trait. In fact, it won't even be able to access the checker's 
private data or make sense out of it. This means that in order to 
implement loop widening, we need to force every checker developer to 
implement additional checker callback, i.e. "checkLoopWidening()", to 
update its private state traits. And one of the biggest strengths of the 
Analyzer, with all its naive understanding of control flow, is a very 
friendly checker API. Well, we could do better, but even in its current 
shape a lot of people find it very pleasant to work with.

One of the areas where our path split behavior shines is "taint 
analysis". Which is still experimental, but very promising. For 
instance, if you knew that cc in foo() or xx in bar() was in fact read 
from the user, the positive would immediately become definitely-true, 
even in bar(), and we'll be able to report it. This happens because 
"tainted" symbols are guaranteed to potentially take any value within 
the current constraints, i.e. symbolic values obtained from the user can 
be forged by a malicious hacker to correspond to any concrete value. 
Which is why, say, division by zero checker drops the check for 
stateNonZero being impossible when the symbol is tainted.

But unless you are dealing with tainted symbols or being really paranoid 
about verifying that your program is correct, you shouldn't report the 
error unless you are also sure that the non-buggy state is impossible. 
That'd be super loud.

So that's where we are. As usual: trade-offs, heuristics, budgets, 
suppressions... Can't really get away from it, and in this case it seems 
that the current behavior is both very easy to understand and not at all 
that problematic.

For more info on our attitude towards suppressions please see 
https://clang-analyzer.llvm.org/faq.html#use_assert and the next few 
questions of the FAQ and the links in it.

On 10/10/18 12:37 PM, Sergei Larin via cfe-dev wrote:
> Hello,
>
>    I have (unexpectedly) found myself developing a specific SA checker, and during the process I run across an interesting observation.
>
> 1) The checker traces zero value passed to a specific function. That is not that important, but I based my implementation on zero arguments passed to memcpy/memset checker. Simple enough.
> 2) I realized that the path from function arguments to the checked function matters. Here is an example:
>
> void test (unsigned short *, unsigned, unsigned);
>
> void foo(const float aa[], unsigned short bb[], unsigned int cc)
> {
>    unsigned char dd;
>    float ee = 0.f;
>    for (dd = 0; dd < cc; dd++)
>      ee += aa[dd];
>
>    if (ee == 0.f)
>      test(bb, 99, cc);
> }
>
>
> void bar(const float aa[], unsigned short bb[], unsigned int cc, unsigned int xx)
> {
>    unsigned char dd;
>    float ee = 0.f;
>    for (dd = 0; dd < cc; dd++)
>      ee += aa[dd];
>
>    if (ee == 0.f)
>      test(bb, 99, xx);
> }
>
> In function foo arg cc passed to test() is SVal (0 U32b) (among other values) so my checker is triggered.
> In function bar arg xx is SVal (reg_$2<unsigned int xx>) which does _not_ match zero check.
>
> The reason for that, as far as I can read from the code, is the for() loop. In foo cc is used in (dd<cc) inequality, and value for it is estimated with a range(reg_$0<unsigned int cc> : { [0, 0] }) which at the point my checker run for test() is re-interpreted as known value of zero.
>
> 3) From #2 I can make a conclusion that presence of comparison (dd<cc above) alters analysis in a substantial way.
>
> Now the question - is this expected behavior or a bug? If this is expected behavior is there a way to affect it with any options already available? If this is a bug, is there a mechanism to "flush" state after analysis of the for loop is over, and I want to run value estimator at the point of test() evaluation from scratch? Is there a better way to achieve determinism?
>
>
> ...or I simply do not know what I am talking about 😊
>
> Any feedback is highly appreciated. Thank you.
>
> Sergei
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev




More information about the cfe-dev mailing list