<div dir="ltr"><div dir="ltr"><div>Hi!<br><br></div>Thanks for this survey! I am just adding some random notes inline that might or might not be useful. Some of them include questions and musings.<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, 4 Oct 2020 at 21:43, Kristóf Umann <<a href="mailto:dkszelethus@gmail.com">dkszelethus@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br><br><br>---  Reads/writes ---<br><br>A common theme among data flow algorithms is looking for gens (generations, or in other words "reads" or "loads") or kills (in other words "writes" or "stores") of either variables or values.<br></div></blockquote><div><br></div><div>One of the advantages of GEN/KILL based analyses is that they are distributive. This is a class of analyses that are well studied and there is a rich literature on how to efficiently evaluate them (like IFDS).  Also, the sets are often represented using bit-vectors where each position corresponds to a variable/register. Rust also uses this approach for liveness and some other analyses.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br>In literature, most data flow algorithms are defined and showcased on either some simple pseudo code, or an instruction set in a three address form:<br><br>x <- y + z<br></div></blockquote><div><br></div><div>Clang's AST and CFG is a very heavy weight for most dataflow analyses that deal with low-level concepts like liveness, or definition/use. I'd love to see a lightweight overlay on top of the Clang AST that can be used to categorize all nodes into some basic operations. For example, there is a large class of dataflow analyses that only care about function calls, reads, and writes. For those analyses, it is just noise to distinguish an overloaded operator call from a constructor call and so on. Currently, every dataflow analysis is doing an ad-hoc abstraction over the AST/CFG. The question is, is it viable to come up with something that fits most analyses, so people don't have to reinvent those abstractions?</div><div><br></div><div>Would MLIR make it easier to create such an overlay?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br></div><div>In clang however, where transfer functions are implemented with statement visitors, the fact that we need substantial contextual information for any given statement is sometimes a problem.</div></div></blockquote><div><br></div><div>Is this need for context coming from the use of linearized CFGs? (I.e. every element of a basic block is a subexpression, not a whole one).<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br><br>--- Classic data flow issues: Aliasing, function calls, unfeasible execution paths ---<br><br>Aliasing is an often discussed, unavoidable obstacle for C/C++, and intraprocedural of most dataflow algorithms is also problematic.<div><br></div></div></div></blockquote><div><br></div><div>LLVM IR passes can query alias analyses. As dataflow algorithms are ad-hoc in Clang, we do not have an infrastructure for these analyses to depend on each other. In an ideal world, we could create an alias analysis that always returns MayAlias and write other dataflow analyses against its interface and once that analysis becomes smarter, all the other dataflow analyses start to improve in precision. While I do agree, we cannot map back analysis results from LLVM</div><div>IR to the source code in general (especially, when we want to generate meaningful diagnostics), but I wonder whether it is actually possible for some restricted cases like alias analysis. <br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><br><div><br>--- Records, fields, methods ---<br><br></div></div></div></div></blockquote><div><br></div><div>There are some well-known abstractions for field sensitive analyses. Those include access paths and store-based models. Due to the heap and loops, these models are potentially unbounded and there are several methods to summarize those cases including k-limiting. Here is a survey of some methods: <a href="https://arxiv.org/abs/1403.4910">https://arxiv.org/abs/1403.4910</a></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><br>---=== 1. LivenessAnalysis ===---<br></div><div><div><br></div></div></div></div></div></blockquote><div><br></div><div>C++ has many peculiarities with liveness analysis. For example, if we have a local variable with a dtor, it is live for its whole lexical scope as the dtor might read its fields. So it looks like we have both lexical and non-lexical liveness, and this can be very confusing.<br></div><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><div><br></div></div><div><br></div><div>If a DeclRefExpr pops up in any other context, it is seen as a read, even if its passed variables to function call as a non-const reference, etc.</div></div></div></div></blockquote><div><br></div><div>This is a serious problem in the C++ language. Parameter passing describes the HOW (e.g. by reference) instead of the WHAT (e.g. to read).  So seeing a non-const reference we can never be sure what the intention of the developer was. Is this a read and the user forgot the const? Is it a read followed by a write? Is this a write-only?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div>Also, the algorithm doesn't respect aliasing, and doesn't regard the fields of a record as a distinct, writable memory regions.</div></div></div></div></blockquote><div><br></div><div>Depending on the analysis client's need, this might be OK. E.g. do we want under- or overapproximation? What is the consequence of imprecision at the client? Making an analysis more precise only makes sense if the client can benefit more from the additional precision compared to what we lose to achieve this precision in the first place.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><br>---=== 2. UninitializedVariables ===---<br><br>--- Third pass ---<br><br>Though technically this is an implementation detail, a third pass is used to emit diagnostics. </div></div></div></div></blockquote><div><br></div><div>I believe that for every monotone analysis we can emit the warnings during the analysis, so we can spare the last pass (at the cost of making sure to not emit duplicated warnings). I am not sure, however, which method would be more performant or cleaner in this particular case.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><br>---=== 3. Thread Safety analysis ===---<br></div><div><br>--- Thread Safety's own intermediate language ---<br><br>Thread Safety Analysis lowers Clang expressions, implementing its very own IR, a Typed Intermediate Language, or TIL.<br></div></div></div></div></blockquote><div><br></div><div>The existence of TIL just amplifies what I wrote earlier, the AST/CFG is too verbose for dataflow analysis. A lot of the information available there is just noise and having some abstractions over them could make things significantly easier. Having a more principled way to organize those abstractions would be great. Currently every analysis has its own ad-hoc methods.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div></div><div><br><br>---=== 5. Lifetime analysis ===---<br></div><div><br></div><div>Asserts pose a serious problem for lifetime analysis in particular. Suppose you have the following assert:</div><div><br></div></div></div></div></blockquote><div><br></div><div>I imagine that this assert problem could be applicable for many other analyses. I'd love to see a CFG option to specifically build a CFG that does not expose this problem to the clients. Unfortunatly, CFG already has a really large number of options, to the point that I have no confidence that every reasonable combinations of the configurations are tested. Getting rid of the unused config options (probably there are quite a few) would be a huge contribution. But unfortunately, we could never know about potential downstream users.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div></div><div><br></div><div>Lifetime analysis was conceptualized with strong C++ support from the get-go.</div></div></div></div></blockquote><div><br></div><div>Lifetime analysis also suffers from the design problems of C++ language I mentioned earlier. One of the bigger ones, parameter passing describes the HOW rather thant he intention.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><br><div>The domain of this analysis is entirely different than liveness or uninitialized variables, so what was previously mentioned as GEN/KILL sets don't make sense here. <br></div></div></div></div></blockquote><div><br></div><div>Right, bit-vector analyses are distributive. Lifetime analysis is not. <br></div><div> </div><div><br></div><div>Cheers,</div><div>Gabor<br></div></div></div>