[cfe-dev] Questions about symbolic execution in the static analyzer.

Artem Dergachev via cfe-dev cfe-dev at lists.llvm.org
Mon Jun 7 02:06:33 PDT 2021


I've got one more thing to add to the first question. One way to combat 
path explosion would be to aggressively merge execution paths. For 
example, we could widen loops instead of unrolling them, or, when we 
have a long sequence of if-statements, we could merge some paths to 
avoid the exponential explosion (a small sketch follows below).
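
To illustrate the exponential part: each independent if-statement 
doubles the number of explored paths, so N of them yield up to 2^N 
paths. A minimal hypothetical example:

  void f(int a, int b, int c) {
    int x = 0, y = 0, z = 0;
    if (a) x = 1;  // 2 paths
    if (b) y = 1;  // 4 paths
    if (c) z = 1;  // 8 paths; N such branches -> up to 2^N paths
    // a merging analysis would join the states after each branch,
    // keeping the path count linear at the cost of precision
  }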

One of the main reasons we aren't doing this is that it directly 
impacts the complexity of developing checkers. A low entry bar for 
developing new checks is something we've always been proud of - the 
original video about "developing your checker in 24 hours" was 
basically about this, and these days we have over 100 checkers 
implemented. One way we achieved this was by limiting the number of 
different operations over the program state, as each such operation 
potentially requires checker-side support. This is why we opted for a 
model of interprocedural analysis that is almost entirely transparent 
to the checkers, and for the same reason we prefer loop unrolling to 
loop widening. The merge operation over our program state structures is 
simply not implemented and was never intended to be.

On 6/6/21 7:28 PM, Artem Dergachev wrote:
>
>
> On 6/4/21 12:09 PM, Kriso, Achim via cfe-dev wrote:
>> Hello,
>>
>> I am researching various static analyzers and want to get a better 
>> understanding of how the symbolic execution in the clang static 
>> analyzer works.
>>
>> Every implementation of symbolic execution has to make certain 
>> tradeoffs to be feasible.
>>
>> I'm interested in the following issues:
>>
>> How does the static analyzer approach the path explosion problem?
>> In this [1] presentation it is mentioned that there are budgets which 
>> limit the exploration after visiting the same node multiple times.
>> Are there any additional techniques to alleviate path explosion?
>
> Yes, basically a number of different deterministic budgets and limits: 
> a budget for the total number of exploded-graph nodes (the closest 
> thing we have to a hard time limit), a budget for visiting the same 
> CFG block too many times (which sounds like what you're describing), a 
> limit on callee function size and a recursion budget in 
> interprocedural analysis, etc. (see the loop example below). One of my 
> "favorite" limits is the rule to not consider a function for 
> interprocedural analysis if it hit a budget the last time we tried it; 
> this one even works across otherwise unrelated symbolic execution 
> entry points (within the same Clang instance, of course). In 
> hindsight, that was probably a bad idea; we're considering replacing 
> it with something more predictable.
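>
> For illustration, here's roughly how the block-visit budget shows up 
> in practice (a hypothetical example; the default unrolling bound is 4 
> and can be adjusted with the -analyzer-max-loop frontend option, 
> though exact spellings and defaults may vary across Clang versions):
>
>   void g(int *p, int n) {
>     for (int i = 0; i < n; ++i)  // with a symbolic bound n, the loop
>       p[i] = i;                  // is unrolled only up to the visit
>                                  // budget; paths that would iterate
>                                  // further are simply abandoned
>     // code after the loop is still analyzed on the paths where the
>     // loop exits within the budget (e.g. n <= 4)
>   }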
>
>> Secondly, I'm wondering how it deals with symbolic values as 
>> indices for arrays (or symbolic pointer arithmetic).
>> I have read the paper about its memory model [2], but it doesn't seem 
>> to address this question.
>> All the approaches to the problem I know rely either on address 
>> concretization, creating a new path for every possible index or using 
>> a more powerful constraint solver (for example using theory of arrays).
>
> An access to an array through a symbolic index produces a symbol that 
> remembers, as part of its identity, that it was obtained by loading 
> from *that* array at *that* symbolic index.
>
> This is where support basically ends :) It guarantees that loading 
> twice with the same symbolic index will produce the same value, but 
> we don't yet properly write down constraints over the new symbol in 
> terms of constraints over the array elements and over the index value.
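>
> To make that guarantee concrete, a hypothetical sketch:
>
>   int f(int a[], int i) {
>     int x = a[i];   // fresh symbol tied to (region of 'a', value of 'i')
>     int y = a[i];   // same region, same index -> the very same symbol
>     // constraints on individual elements a[0], a[1], ... or on 'i'
>     // itself are not propagated into that symbol, though
>     return x == y;  // known true: both loads yield the same value
>   }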
>
> This probably leads to bugs / false positives, but I haven't seen 
> many of them in practice (even though I can easily craft one). If we 
> were to improve upon this, we'd probably go ahead with writing down a 
> constraint like
>
>   a[i] == a[0] || a[i] == a[1] || a[i] == a[2]
>
> (for a three-element array) without splitting the execution path, and 
> then hope that the solver will handle it.
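>
> For instance, here's the kind of false positive one could craft today 
> (a hypothetical example):
>
>   int a[3] = {1, 2, 3};
>   int get(int i) {    // assume the caller guarantees 0 <= i < 3
>     int *p = 0;
>     int x = a[i];     // fresh symbol; the known contents of 'a' are
>                       // not reflected in its constraints
>     if (x == 42)      // infeasible given the initializer, yet the
>       return *p;      // path is explored and a null-dereference
>     return x;         // warning may be reported on it
>   }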
>
> Generally, our experience with powerful constraint solvers is 
> relatively interesting. We have our custom constraint solver that's 
> extremely primitive and barely correct but extremely fast. Replacing 
> it with a full-featured SMT solver causes an unacceptable slowdown 
> (roughly 20x). Our overall future direction is to go for 
> cross-checking instead: use our fast solver for discovering bugs, and 
> then invoke a stronger solver only when a potential bug is discovered. 
> We have this implemented (with Z3 as a motivating example), and it was 
> shown to noticeably improve precision without causing any noticeable 
> slowdowns. One reason this isn't enabled for all users is that nobody 
> is ready to contribute enough manpower to take responsibility for 
> keeping Z3 running in practice - say, if there's a bug in Z3 that 
> causes the static analyzer to crash, it may harm our users far more 
> than the eliminated false positives help them, and we'll most likely 
> be unable to address it quickly enough. There are also licensing 
> concerns and the usual concerns about external dependencies in LLVM.
>
> If we actually had a chance to ship a strong SMT solver, we could 
> totally take advantage of its array theory support as part of the 
> cross-check pass.
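>
> For reference, a hypothetical invocation of the existing cross-check 
> (it requires an LLVM build with Z3 enabled, e.g. via the 
> LLVM_ENABLE_Z3_SOLVER CMake option; exact option spellings may vary 
> across versions):
>
>   clang --analyze -Xclang -analyzer-config \
>         -Xclang crosscheck-with-z3=true test.c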
>
>
>> I hope you can answer my questions.
>>
>> Sincerely,
>> Achim Kriso
>>
>> [1] https://www.youtube.com/watch?v=4n3l-ZcDJNY
>> [2] http://lcs.ios.ac.cn/~xzx/memmodel.pdf
>


