[PATCH] D114231: [clang][docs][dataflow] Added an introduction to dataflow analysis

Mon Nov 29 05:51:09 PST 2021

gribozavr2 added inline comments.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:24
+*   [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE)
+*   [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf).
+*   [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf).
----------------
xazax.hun wrote:
> I know, everyone has their favorite set of resources. Let me share mine: `https://cs.au.dk/~amoeller/spa/`
> I think this one might have a bit more content with more examples. Feel free to leave it as is. 
Forgot about this book -- I also like it a lot, added!

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:49
+  int x;
+  // x is {}
+  if (n > 0) {
----------------
xazax.hun wrote:
> Nit: I wonder if representing the value of `x` with an empty set is a good choice. One could argue that `x` has an indeterminate value that could be represented with the set of all integers using top. Others could argue for a bottom value. Those concepts are not yet introduced. Maybe initializing `x` in the source code would side step this problem.
Since there is a discussion of these topics below, I agree that avoiding the question at this point is better. Added initialization.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:71-72
+Abstract algebra provides a nice formalism that models this type of structure,
+namely, a lattice. A lattice is a partially ordered set, in which every two
+elements have a least upper bound (called a *join*).
+
----------------
xazax.hun wrote:
> Technically, I think this would be a `join-semilattice`. To have a `lattice`, we would need both `join` and `meet`.
Corrected.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:86-87
+
+Computing the join in the lattice corresponds to finding the lowest common
+ancestor between two nodes in its Hasse diagram.
+
----------------
xazax.hun wrote:
> If we want to be practical, I wonder if we want to give some guidance how to implement that efficiently. E.g. we could cite a paper like this: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.4911&rep=rep1&type=pdf
Thanks, added.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:111-112
+information that we track. In this case, we can, for example, arbitrarily limit
+the size of sets to 3 elements. If at a certain program point `x` has more than
+3 possible values, we stop tracking specific values at that program point.
+Instead, we denote possible values of `x` with the symbol `⊤` (pronounced "top"
----------------
xazax.hun wrote:
> I wonder if we want to cite widening and narrowing in this context to give people some pointers where to look for more info.
I think it should be a separate section, with a slightly different example. To show the value of widening we need a lattice that can abstract the infinite set `{0; 1; 2; ...}` more precisely than "⊤" (e.g., "non-negative"). There would be also the usual handwaving about figuring out how to make those types of conclusions in the general case. I think adding all that here would break the flow and make the explanation less crisp. Adding a separate section on widening would be good one day, when the dataflow framework gets the feature. I don't think it is necessary right now.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:182
+void Example(int n) {
+  int x;          // x is ⊥
+  if (n > 0) {
----------------
xazax.hun wrote:
> I think it would make sense to include the whole program state in the comments. E.g. `// x is ⊥, n is ⊤`.
> This could showcase how we constrain the value of `n` to `42` in one of the branches.
So far we have been focusing on figuring out the values of `x` the local variable, not `n` the parameter, so I'd rather not deviate. Also, comparing `n` to 42 wouldn't necessarily lead to constraining the set of values here (for example, in the framework we do it in the flow condition).

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:279
+```
+out = transfer(basic_block, join(in_1, in_2, ..., in_n))
+```
----------------
xazax.hun wrote:
> While I agree that this is the general formula, people sometimes do small variations, e.g.:
> ```
> out =  join(transfer(basic_block,in_1), transfer(basic_block,in_2), ..., transfer(basic_block,in_n))
> ```
> 
> This is less efficient as we end up invoking the transfer function more often, but it can be more precise. E.g. with some ad-hoc notation:
> 
> ```
> Given the branches: x: {1}, x: {2}, x: {3}, x: {4}
> join(in...) : x : {⊤}
> transfer("x/2", ...) == x : {⊤}
> 
> vs.
> Given the branches: x: {1}, x: {2}, x: {3}, x: {4}
> transfer("x/2", ...) ==x : {0}, x : {1}, x : {1}, x: {2} == outs
> join(outs) == x: {0, 1, 2}
> ```
This section is only attempting to give the readers an intuition of why dataflow works, hence it must be straightforward, not necessarily exhaustive or rigorous. I tried to add your suggestion here, but it breaks the flow. I tried to add it at the end of the section, but it looks out of place. So, sorry, I couldn't fit it here. If you have a specific wording suggestion that works here, I would be happy to apply it.

I think one would need to add a section like "variations on dataflow equations" to properly introduce the idea. It also seems to me that this specific variation is also a special case of a more general idea of deferring join operations to improve precision; that is, instead of doing a join everywhere where classic dataflow prescribes it, we can instead keep exploring separate paths, and only do a join at some later point in the CFG. Similarly we can also unroll loops in a precise way up to a certain number of iterations, and attempt to join/widen only at after that. These are of course important ideas, but they don't help with what this document is trying to achieve.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:288
+Since the lattice has a finite height, the algorithm is guaranteed to terminate.
+Each iteration of the algorithm can change computed values only to larger values
+from the lattice. In the worst case, all computed values become `⊤`, which is
----------------
xazax.hun wrote:
> Only if our transfer functions are monotonic :)
That's an important condition, added.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:299
+
+## Symbolic execution: a very short informal introduction
+
----------------
xazax.hun wrote:
> I think many people find it confusing, what is the relationship between symbolic execution and abstract interpretation? I found this answer helpful: https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation
> I wonder if it might be worth citing.
So far I have avoided using the term "abstract interpretation" in this document, or going into more deep topics. However I read this answer and I like it too, so I added a link to it here.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:365
+void ExampleOfSymbolicPointers(bool b) {
+  int x = 0;     // x is {0}
+  int* ptr = &x; // x is {0}
----------------
xazax.hun wrote:
> Here, I think we want to include `ptr` in the comments describing the program state. 
Added.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:381-382
+
+Output parameters are function parameters of pointer or reference type whose
+pointee is completely overwritten by the function. They are common in pre-C++11
+code due to the absence of move semantics. In modern C++ output parameters are
----------------
xazax.hun wrote:
> One could argue that you also cannot read from the pointee, otherwise you have an input-output parameter. There is an example clarifying this below, but I wonder if we want to get the definition right from the beginning.
Corrected.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:469
+
+*   Prove that the function only borrows the pointer and does not persist it.
+
----------------
xazax.hun wrote:
> While I do agree we cannot explain everything, like what borrowing or escaping means in this document, I wonder if we want to add some citations.
I reworded to avoid using the term "borrow".

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:478-479
+
+To analyze the function body we can use a lattice which consists of normal
+states and failure states. A normal state describes program points where we are
+sure that no behaviors that block the refactoring have occurred. Normal states
----------------
xazax.hun wrote:
> I wonder if the distinction between normal states and failure states is useful. In general, we can combine arbitrary number of lattices and propagate all the information in a single pass. I.e., we could have multiple "normal" or "failure" states.
> 
> There are multiple ways to combine lattices, we can put them on top of each other, or next to each other introducing new top/bottom values, or we can take their products.
> I wonder if the distinction between normal states and failure states is useful.

I'm not sure I understand -- this distinction is useful in this particular approach to solve this problem, since it helps solve the problem? Or are you objecting to the term "failure"?

Of course, in general, an analysis does not need to have failure/normal states, and like you said, if we track information about multiple candidate output parameters at the same time, each can be in either a normal or failure state at every program point independently of other parameters. However, this document provides an example of a solution for this particular problem; the goal is not to solve the problem, but to give the reader an intuition of how dataflow ideas can be applied to solve real problems.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:570
+
+The implementation of dead store analysis is very similar to output parameter
+analysis: we need to track stores and loads, and find stores that were never
----------------
xazax.hun wrote:
> It might be a good idea to reference liveness analysis which is often used to answer many related questions:
> * Finding dead stores
> * Finding uninitialized variable
> * Finding a good point to deallocate memory
> * Finding out if it would be safe to move an object
> * ... 
Good point, added a reference.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:808
+To implement it using the data flow analysis framework, we can produce a warning
+if any part of the branch condition is implied by the path condition.
+
----------------
xazax.hun wrote:
> Here, I think the term `path condition` becomes ambiguous. Traditionally, path condition is a path sensitive property, in the code snippet below:
> ```
> if (b1)
> { ... }
> 
> if (b1 && b2)
> { ... }
> ```
> 
> We would reach the second branch on two paths (depending on whether we took the first one). On one of the paths, the path condition will have `b1`, on the other, it will have `!b1`.
> 
> I think here you mean the non-path-sensitive property. I'd suggest doing a strict distinction and use path condition for path sensitive properties and flow condition for flow sensitive properties.
Replaced with "flow condition".

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114231/new/

https://reviews.llvm.org/D114231