<div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hi Dmitri,</div><div><br></div><div>Thanks for the answers, I have some further questions inline.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 18 Oct 2021 at 10:35, Dmitri Gribenko <<a href="mailto:gribozavr@gmail.com">gribozavr@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi Gábor,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Oct 12, 2021 at 9:01 PM Gábor Horváth <<a href="mailto:xazax.hun@gmail.com" target="_blank">xazax.hun@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hi Yitzhak,</div><div><br></div><div>Yay, finally happening! I 100% support these efforts and feel free to add me as a reviewer to the patches, design documents etc.</div></div></div></div></div></div></div></div></blockquote><div><br></div><div>Thank you so much! We will!</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_quote"><div>Another example where the forward symbolic execution of CSA falls short is backwards reasoning, e.g.:<br><br><font face="monospace">void f(int *p)</font></div><div><font face="monospace">{<br> *p = 5;<br> if (p) // Should warn: redundant null check or null dereference above.<br> {<br> ...</font></div><div><font face="monospace"> }<br>} </font></div><div><br></div><div>CSA will not warn on the code snippet above because it is assumed p is never null at the first dereference (to avoid noise) and it is really hard to revise old assumptions given new evidence. Such problems can be found using dataflow analysis.</div></div></div></div></div></div></div></blockquote><div><br></div><div>Agreed, that is a good example!</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_quote"><div></div><div>What kind of control will the user have over the environment? E.g. could we add checks/extensions that model the effects of certain functions (e.g. std::swap) in the environment?</div></div></div></div></div></div></div></blockquote><div><br></div><div>This is an open question. Already, users can model the effects of particular function calls. They would add a case for such calls in their transfer function, where they would update the environment accordingly. The full API for manipulating the environment is available to the user. But, we expect there's more room for supporting this kind of domain-specific semantics, either a simpler API or, likely, a richer one.<br></div></div></div></blockquote><div><br></div><div>One of the key learnings for the static analyzer was that we want a strict separation between checks that issue warnings and "checks" that model behavior. E.g., when a user sees false positives from a null dereference checker, it is very likely they still want to have the modeling from the same check that can exclude many infeasible paths. So for the best possible user experience it should be possible for the users to only disable the relevant warnings while still getting the benefits of modeling.Also, it can be useful to disable some modelling without affecting the emitted warnings. On the flip side, the users can shoot themselves in the foot. E.g., even though running the Static Analyzer with the core modelling checks off is not supported, we still get crash reports from such scenarios (that we will not fix). </div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_quote"><div>Do you plan to provide building blocks to help users build lattices? E.g., I really like <a href="https://github.com/facebook/SPARTA" target="_blank">SPARTA</a>. They added some helpers to encode finite abstract domains, see <a href="https://github.com/facebook/SPARTA/blob/main/include/FiniteAbstractDomain.h" target="_blank">SPARTA/FiniteAbstractDomain.h at main · facebook/SPARTA (github.com)</a> for details.</div></div></div></div></div></div></div></blockquote><div><br></div><div>We have a goal of making dataflow analysis to become as accessible as AST matchers are, so we do support adding whatever building blocks necessary to free the checks of boilerplate and make them easy to write. Whether it would be similar to SPARTA or not, I think is yet to be seen.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_quote"><div>I am a bit opposed to using the term path-condition, as users might confuse this with similar terms used in path-sensitive analyses. What about flow-condition?<br></div></div></div></div></div></div></div></blockquote><div><br></div><div>Renaming sounds good, let's paint this bikeshed in code review.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_quote"><div>Could you elaborate on the storage model? How do you plan to deal with aliasing? How do you represent arrays (with potentially unknown lengths)? </div></div></div></div></div></div></div></blockquote><div><br></div><div>A limiting factor for the storage model in particular is that we try to keep the framework accessible to non-experts. It is a limiting factor for all design choices, but it hurts the storage model more than anything else.<br><br>At the moment we represent the stack and heap memory at each program point as a couple of maps:<br><br> llvm::DenseMap<const ValueDecl*, StorageLocation*> decl_to_loc_;<br> llvm::DenseMap<const Expr*, StorageLocation*> expr_to_loc_;<br> llvm::DenseMap<const StorageLocation*, Value*> loc_to_value_;<br><br>ValueDecl pointers are the Clang AST nodes for global variables, function parameter variables, and local variables.<br></div></div></div></blockquote><div><br></div><div>The store is likely to change very little in each of the transitions. The static analyzer tries to leverage this fact by using immutable data structures, so we can share the common parts between states. This can help reduce memory consumption. And the analyzer favors significantly smaller memory use to slightly better performance because this lets users leverage parallelism across TUs better with the same amount of memory. I was wondering whether you considered using immutable data structures (like llvm/ADT/ImmutableMap.h) or is this just a first version?</div><div><br></div><div>Also, how will this model handle expressions that evaluate to rvalues? Do you have locations for rvalues?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br>Expr pointers are the Clang AST nodes for the expressions whose values have been already computed and are accessible at this program point.<br><br>StorageLocation represents a memory location. A memory location stores a Value. A Value knows its type, and one can retrieve its sub-objects as Values. Nevertheless, the Value is merely a name of a symbolic variable and does not store the approximation of the runtime value computed by the analysis so far. Value objects participate in the path condition, and that's where they get constrained.<br><br>We create the first map when we initialize the analysis (before running the dataflow fixpoint algorithm). **We are making an unsound assumption that each Value that has a pointer or reference type points to a distinct object.** </div></div></div></blockquote><div><br></div><div>The analyzer is making a very similar assumption and it seems to work reasonably well. There are certain exceptions. E.g., for overloaded operator= implementations, it will simulate the method both with this != &other and this == &other assumptions. This will double the analysis time for these methods, but it can help finding errors where users forgot to handle the latter case. The APIs of the analyzer is flexible enough that this split can be done in a checker.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div>We find this assumption to be useful in practice, however it is not essential for the framework. We would like all unsound behavior, no matter how useful, to be eventually guarded by flags that each checker can set in a way that works best for it. </div></div></div></blockquote><div><br></div><div>I am perfectly comfortable with unsound behavior by default, there is a whole movement behind this approach: <a href="http://soundiness.org/">Soundiness Home Page</a> :)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div>Furthermore, we think that the framework could support different levels of detail in representing storage, and a checker could similarly request the necessary abstraction level.<br><br>A limiting factor for representing aliasing is that we would probably have to rely on an SMT (not just a SAT) solver to make queries to the path condition if we track that two pointers may (but not necessarily must) alias.<br><br>We compute the other two maps for every program point. Assignments, no-op casts, temporary materialization etc. can forward the input Value objects to their destination StorageLocations, allowing us to symbolically track the data flow.<br><br>The environment is of course subject to the join operation. To join distinct Value objects, we have a built-in rule for booleans (we build a new Value that is constrained to be a disjunction of the values being joined), and we allow the user-defined checker to handle joining values of domain-specific types. Expanding the built-in rules to cover more types is desirable, but would likely create a hard dependency on an SMT (not just a SAT) solver.<br></div></div></div></blockquote><div><br></div><div>When a pointer point to two different locations in two different branches do you join them to a new location and add constraints just like you do for booleans?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br>As for arrays, we don't model them right now and we don't immediately have a good idea of how we could do it in a simple way.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>What representation do you use for the path condition? Using the syntactic forms, like `x > 5` might not be sufficient as `x` might have different values at different program points. E.g.,<br><br><font face="monospace">void f(int x)<br>{<br> assert(x > 0);<br> ...<br> x = -1;<br> funcWithPreconditionXIsPositive(x); // Wants to execute the SAT query here.<br>}</font></div></div></div></div></div></div></div></blockquote><div><br></div><div>We incorporate the symbolic value of the expression into the path condition. Since we don't support integer comparisons, let's use a boolean:</div></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br>void f(std::optional<int> opt) {<br> bool is_valid = opt.has_value();<br> if (is_valid) {<br> is_valid = false;<br> use(opt.value());<br> }<br>}<br><br>When we evaluate the conditional branch, we add to the path condition the symbolic value stored in the storage location of is_valid at the program point of the branch. Therefore the subsequent overwrite of is_valid does not interfere with our ability to check the path condition when we evaluate opt.value().<br></div></div></div></blockquote><div><br></div><div>This is awesome, the static analyzer works exactly the same way! :)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br>For those boolean conditions that we don't understand, the expression is associated with a fresh boolean value, and that value is incorporated into the path condition. Practically, this means a very conservative interpretation of the boolean "x > 5": every appearance in the code is handled uniquely, even if x has not changed. So, in the following code:</div><div><br>{ ...<br> if (x > 5) p = new ...<br> if (x > 5) f(*p);<br>}<br><br></div><div>we can't at the moment deduce that in `*p` the pointer is never null. But we can deduce that here:<br><br></div><div>{ ...<br> if (x > 5) {<br> b = true;<br> p = new ...;<br> }<br> if (b) f(*p);<br>}<br><br>It is certainly desirable to improve this.<br></div></div></div></blockquote><div><br></div><div>Agreed. I suspect the proper solution would involve many non-trivial steps including escape analysis.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_quote"><div>Is inter-procedural analysis out of scope (e.g., using summaries)?<br></div></div></div></div></div></div></div></blockquote><div> </div></div>Not at all! It's just well ahead of where we are now. That said, we think it's important to consider sooner rather than later, since design decisions now could block good interprocedural design later. We are happy to discuss proposals, we're just prioritizing getting the initial framework upstreamed.<br><div><br></div><div>Dmitri</div><div><br></div>-- <br><div dir="ltr">main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if<br>(j){printf("%d\n",i);}}} /*Dmitri Gribenko <<a href="mailto:gribozavr@gmail.com" target="_blank">gribozavr@gmail.com</a>>*/</div></div>
</blockquote></div></div></div></div>