[PATCH] D114231: [clang][docs][dataflow] Added an introduction to dataflow analysis

Fri Nov 19 11:27:46 PST 2021

xazax.hun added a comment.

Thanks for working on this! Publishing documentation alongside the feature is a great practice. I liked this documentation a lot. I have strong opinion about the ambiguous use of `path condition` in this document, otherwise I was only nitpicking.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:24
+*   [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE)
+*   [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf).
+*   [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf).
----------------
I know, everyone has their favorite set of resources. Let me share mine: `https://cs.au.dk/~amoeller/spa/`
I think this one might have a bit more content with more examples. Feel free to leave it as is. 

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:49
+  int x;
+  // x is {}
+  if (n > 0) {
----------------
Nit: I wonder if representing the value of `x` with an empty set is a good choice. One could argue that `x` has an indeterminate value that could be represented with the set of all integers using top. Others could argue for a bottom value. Those concepts are not yet introduced. Maybe initializing `x` in the source code would side step this problem.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:71-72
+Abstract algebra provides a nice formalism that models this type of structure,
+namely, a lattice. A lattice is a partially ordered set, in which every two
+elements have a least upper bound (called a *join*).
+
----------------
Technically, I think this would be a `join-semilattice`. To have a `lattice`, we would need both `join` and `meet`.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:86-87
+
+Computing the join in the lattice corresponds to finding the lowest common
+ancestor between two nodes in its Hasse diagram.
+
----------------
If we want to be practical, I wonder if we want to give some guidance how to implement that efficiently. E.g. we could cite a paper like this: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.4911&rep=rep1&type=pdf

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:111-112
+information that we track. In this case, we can, for example, arbitrarily limit
+the size of sets to 3 elements. If at a certain program point `x` has more than
+3 possible values, we stop tracking specific values at that program point.
+Instead, we denote possible values of `x` with the symbol `⊤` (pronounced "top"
----------------
I wonder if we want to cite widening and narrowing in this context to give people some pointers where to look for more info.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:182
+void Example(int n) {
+  int x;          // x is ⊥
+  if (n > 0) {
----------------
I think it would make sense to include the whole program state in the comments. E.g. `// x is ⊥, n is ⊤`.
This could showcase how we constrain the value of `n` to `42` in one of the branches.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:279
+```
+out = transfer(basic_block, join(in_1, in_2, ..., in_n))
+```
----------------
While I agree that this is the general formula, people sometimes do small variations, e.g.:
```
out =  join(transfer(basic_block,in_1), transfer(basic_block,in_2), ..., transfer(basic_block,in_n))
```

This is less efficient as we end up invoking the transfer function more often, but it can be more precise. E.g. with some ad-hoc notation:

```
Given the branches: x: {1}, x: {2}, x: {3}, x: {4}
join(in...) : x : {⊤}
transfer("x/2", ...) == x : {⊤}

vs.
Given the branches: x: {1}, x: {2}, x: {3}, x: {4}
transfer("x/2", ...) ==x : {0}, x : {1}, x : {1}, x: {2} == outs
join(outs) == x: {0, 1, 2}
```

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:288
+Since the lattice has a finite height, the algorithm is guaranteed to terminate.
+Each iteration of the algorithm can change computed values only to larger values
+from the lattice. In the worst case, all computed values become `⊤`, which is
----------------
Only if our transfer functions are monotonic :)

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:299
+
+## Symbolic execution: a very short informal introduction
+
----------------
I think many people find it confusing, what is the relationship between symbolic execution and abstract interpretation? I found this answer helpful: https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation
I wonder if it might be worth citing.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:365
+void ExampleOfSymbolicPointers(bool b) {
+  int x = 0;     // x is {0}
+  int* ptr = &x; // x is {0}
----------------
Here, I think we want to include `ptr` in the comments describing the program state. 

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:381-382
+
+Output parameters are function parameters of pointer or reference type whose
+pointee is completely overwritten by the function. They are common in pre-C++11
+code due to the absence of move semantics. In modern C++ output parameters are
----------------
One could argue that you also cannot read from the pointee, otherwise you have an input-output parameter. There is an example clarifying this below, but I wonder if we want to get the definition right from the beginning.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:469
+
+*   Prove that the function only borrows the pointer and does not persist it.
+
----------------
While I do agree we cannot explain everything, like what borrowing or escaping means in this document, I wonder if we want to add some citations.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:478-479
+
+To analyze the function body we can use a lattice which consists of normal
+states and failure states. A normal state describes program points where we are
+sure that no behaviors that block the refactoring have occurred. Normal states
----------------
I wonder if the distinction between normal states and failure states is useful. In general, we can combine arbitrary number of lattices and propagate all the information in a single pass. I.e., we could have multiple "normal" or "failure" states.

There are multiple ways to combine lattices, we can put them on top of each other, or next to each other introducing new top/bottom values, or we can take their products.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:570
+
+The implementation of dead store analysis is very similar to output parameter
+analysis: we need to track stores and loads, and find stores that were never
----------------
It might be a good idea to reference liveness analysis which is often used to answer many related questions:
* Finding dead stores
* Finding uninitialized variable
* Finding a good point to deallocate memory
* Finding out if it would be safe to move an object
* ... 

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:808
+To implement it using the data flow analysis framework, we can produce a warning
+if any part of the branch condition is implied by the path condition.
+
----------------
Here, I think the term `path condition` becomes ambiguous. Traditionally, path condition is a path sensitive property, in the code snippet below:
```
if (b1)
{ ... }

if (b1 && b2)
{ ... }
```

We would reach the second branch on two paths (depending on whether we took the first one). On one of the paths, the path condition will have `b1`, on the other, it will have `!b1`.

I think here you mean the non-path-sensitive property. I'd suggest doing a strict distinction and use path condition for path sensitive properties and flow condition for flow sensitive properties.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114231/new/

https://reviews.llvm.org/D114231