[PATCH] D114231: [clang][docs][dataflow] Added an introduction to dataflow analysis

Mon Dec 6 02:59:40 PST 2021

gribozavr2 added inline comments.

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:279
+```
+out = transfer(basic_block, join(in_1, in_2, ..., in_n))
+```
----------------
xazax.hun wrote:
> gribozavr2 wrote:
> > xazax.hun wrote:
> > > While I agree that this is the general formula, people sometimes do small variations, e.g.:
> > > ```
> > > out = join(transfer(basic_block, in_1), transfer(basic_block, in_2), ..., transfer(basic_block, in_n))
> > > ```
> > > 
> > > This is less efficient as we end up invoking the transfer function more often, but it can be more precise. E.g. with some ad-hoc notation:
> > > 
> > > ```
> > > Given the branches: x: {1}, x: {2}, x: {3}, x: {4}
> > > join(in...) : x : {⊤}
> > > transfer("x/2", ...) == x : {⊤}
> > > 
> > > vs.
> > > Given the branches: x: {1}, x: {2}, x: {3}, x: {4}
> > > transfer("x/2", ...) == x : {0}, x : {1}, x : {1}, x : {2} == outs
> > > join(outs) == x: {0, 1, 2}
> > > ```
> > This section is only attempting to give the readers an intuition of why dataflow works, hence it must be straightforward, not necessarily exhaustive or rigorous. I tried to add your suggestion here, but it breaks the flow. I tried to add it at the end of the section, but it looks out of place. So, sorry, I couldn't fit it here. If you have a specific wording suggestion that works here, I would be happy to apply it.
> > 
> > I think one would need to add a section like "variations on dataflow equations" to properly introduce the idea. It also seems to me that this specific variation is a special case of a more general idea of deferring join operations to improve precision; that is, instead of doing a join everywhere classic dataflow prescribes it, we can keep exploring separate paths, and only do a join at some later point in the CFG. Similarly, we can also unroll loops in a precise way up to a certain number of iterations, and attempt to join/widen only after that. These are of course important ideas, but they don't help with what this document is trying to achieve.
> Agreed, we cannot introduce everything in an introductory document. I mainly wanted to avoid creating the impression that there is one definitive way to do data flow analysis and this is it. Maybe one sentence at the end mentioning that there are several variations that we do not discuss here could be sufficient.
Added a long parenthetical:

(Note that there are other ways to write this equation that produce higher
precision analysis results. The trick is to keep exploring the execution paths
separately and delay joining until later. However, we won't discuss those
variations here.)
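
For anyone following this thread who wants to see the precision difference concretely, here is a minimal Python sketch of the `x/2` example quoted above. It is not part of the patch. The lattice is an assumption made to match the ad-hoc notation: sets of possible values, widened to a `TOP` sentinel once they hold more than `MAX_SET_SIZE` elements. All names (`TOP`, `MAX_SET_SIZE`, `join`, `join_all`, `transfer_halve`) are hypothetical.

```python
from functools import reduce

TOP = "TOP"        # sentinel meaning "x may hold any value" (assumed encoding)
MAX_SET_SIZE = 3   # assumed cap: larger value sets are widened to TOP


def join(a, b):
    """Join two states: set union, widened to TOP past the size cap."""
    if a == TOP or b == TOP:
        return TOP
    merged = a | b
    return merged if len(merged) <= MAX_SET_SIZE else TOP


def join_all(states):
    """Join any number of states, left to right."""
    return reduce(join, states)


def transfer_halve(state):
    """Transfer function for the statement `x = x / 2` (integer division)."""
    if state == TOP:
        return TOP
    return frozenset(v // 2 for v in state)


# One incoming state per predecessor: x: {1}, x: {2}, x: {3}, x: {4}
ins = [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4})]

# Classic form: out = transfer(block, join(in_1, ..., in_n))
join_first = transfer_halve(join_all(ins))

# Variation: out = join(transfer(block, in_1), ..., transfer(block, in_n))
transfer_first = join_all(transfer_halve(s) for s in ins)
```

With these assumptions, `join_first` collapses to `TOP` because the four-element union exceeds the cap before the division runs, while `transfer_first` keeps the set `{0, 1, 2}`, at the cost of calling the transfer function once per predecessor.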

================
Comment at: clang/docs/DataFlowAnalysisIntro.md:478-479
+
+To analyze the function body we can use a lattice which consists of normal
+states and failure states. A normal state describes program points where we are
+sure that no behaviors that block the refactoring have occurred. Normal states
----------------
xazax.hun wrote:
> gribozavr2 wrote:
> > xazax.hun wrote:
> > > I wonder if the distinction between normal states and failure states is useful. In general, we can combine an arbitrary number of lattices and propagate all the information in a single pass. I.e., we could have multiple "normal" or "failure" states.
> > > 
> > > There are multiple ways to combine lattices: we can put them on top of each other, or next to each other (introducing new top/bottom values), or we can take their products.
> > > I wonder if the distinction between normal states and failure states is useful.
> > 
> > I'm not sure I understand -- this distinction is useful in this particular approach to solve this problem, since it helps solve the problem? Or are you objecting to the term "failure"?
> > 
> > Of course, in general, an analysis does not need to have failure/normal states, and like you said, if we track information about multiple candidate output parameters at the same time, each can be in either a normal or failure state at every program point independently of other parameters. However, this document provides an example of a solution for this particular problem; the goal is not to solve the problem, but to give the reader an intuition of how dataflow ideas can be applied to solve real problems.
> Yeah, I found the term `failure` a bit confusing. It is a regular analysis state, the analysis itself did not fail. I think I understand that this is named failure after the fact that it is blocking the transformation. 
> 
> Probably my main problem was that it was not clear whether the document suggested that having normal and failure states is a general design principle for all analyses or specific to the problem we want to solve. I think in this particular example the idea of combining lattices in a certain way to build larger lattices is the general design principle (combining what one could think of as three separate analyses: escape, unsafe reads, and field-sensitive initialized variables, into one) and the classification of states into normal and failure is specific to the problem.
> 
> I think the placement of this section makes it pretty clear that this is just an example and not describing general methods. Probably I took a break before reading this section and read it with the wrong context in mind :)
> 
> Feel free to leave this as it is. 
Sounds good :)
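
For reference, the "take their products" way of combining lattices mentioned above can be sketched in a few lines of Python. This is only an illustration of the general idea, not what the document implements. The component names (`escaped`, `unsafe_read`, `initialized_fields`) and the choice of join for each component are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductState:
    """One element of a product lattice built from three hypothetical analyses."""
    escaped: bool                  # assumed: pointer may have escaped on some path
    unsafe_read: bool              # assumed: an unsafe read may have happened
    initialized_fields: frozenset  # assumed: fields written on every path so far


def join(a, b):
    """Join a product state component by component."""
    return ProductState(
        # "May" facts: true if true on either incoming path.
        escaped=a.escaped or b.escaped,
        unsafe_read=a.unsafe_read or b.unsafe_read,
        # "Must" fact: keep only fields initialized on both incoming paths.
        initialized_fields=a.initialized_fields & b.initialized_fields,
    )


then_branch = ProductState(False, False, frozenset({"x", "y"}))
else_branch = ProductState(True, False, frozenset({"y"}))
merged = join(then_branch, else_branch)
```

Here `merged` records that the pointer may have escaped and that only `y` is initialized on both paths. Whether a given product state counts as "normal" or "failure" is then a problem-specific classification layered on top.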

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114231/new/

https://reviews.llvm.org/D114231