[PATCH] D115235: [clang][dataflow] Implement a basic algorithm for dataflow analysis

Wed Dec 8 07:37:18 PST 2021

sgatev added a comment.

Thanks everyone!

================
Comment at: clang/lib/Analysis/FlowSensitive/TypeErasedDataflowAnalysis.cpp:32
+TypeErasedDataflowAnalysisState computeBlockInputState(
+    std::vector<llvm::Optional<TypeErasedDataflowAnalysisState>> &BlockStates,
+    const CFGBlock &Block, const Environment &InitEnv,
----------------
xazax.hun wrote:
> Alternatively, if we have a bottom, `forall e in lattice join(e, bottom) == e`, so we do not really need optionals and can express nodes that do not contribute to the state using the bottom value. (Or using top, depending on the orientation for our lattices, unfortunately the literature is using both orientations which is really confusing.)
> 
> I'm fine with the current approach, but I'd love to see a function level comment about `nullopt`s representing nodes that were not yet evaluated.
Right. Perhaps we can use `Analysis.typeErasedInitialElement()` and `InitEnv` to express states of blocks that were not evaluated? In practice, these often coincide with bottom. In that case we're loosing the ability to understand which blocks have been evaluated. An alternative, as you mentioned, would be to require an additional bottom element to be provided by the analysis. I added a function level comment for now and will consider this a bit more.

================
Comment at: clang/lib/Analysis/FlowSensitive/TypeErasedDataflowAnalysis.cpp:35
+    TypeErasedDataflowAnalysis &Analysis) {
+  TypeErasedDataflowAnalysisState State = {Analysis.typeErasedInitialElement(),
+                                           InitEnv};
----------------
xazax.hun wrote:
> I did not catch this earlier, but I wonder if we should pass the block in to `typeErasedInitialElement`. There are some analysis where the initial element might be different for certain nodes. E.g. here is the algorithms for computing dominators from wikipedia:
> ```
>  // dominator of the start node is the start itself
>  Dom(n0) = {n0}
>  // for all other nodes, set all nodes as the dominators
>  for each n in N - {n0}
>      Dom(n) = N;
>  // iteratively eliminate nodes that are not dominators
>  while changes in any Dom(n)
>      for each n in N - {n0}:
>          Dom(n) = {n} union with intersection over Dom(p) for all p in pred(n)
> ```
> 
> Here the initial analysis state for the entry node differs from the initial state for the rest of the nodes. 
Good point, though, I'm not sure if this algorithm is the best choice for such an analysis. It seems that in that case we don't need to evaluate the statements in the basic blocks, right? Perhaps we can generalize the algorithm or offer a separate one for more efficient implementation of such analyses. I added a FIXME and will consider this a bit more.

================
Comment at: clang/lib/Analysis/FlowSensitive/TypeErasedDataflowAnalysis.cpp:134
+    Worklist.enqueueSuccessors(Block);
+  }
+
----------------
xazax.hun wrote:
> If a block still has `None` as its state at this point, it is unreachable in the CFG. One option is to ignore them, as we do at the moment and it is completely fine and should not change this patch. But for some analysis in the future, we might want to check dead code as well (it might be dead for the current platform due to some preprocessor defines but alive for some other). E.g., it would also be nice to diagnose a null dereference in a block that was not reached. 
I added a FIXME to consider evaluating unreachable blocks. However, I think it's preferable to run the analysis with the same configuration that will be used to compile the code. Do you think this isn't always possible/practical?

================
Comment at: clang/unittests/Analysis/FlowSensitive/CMakeLists.txt:1
+set(LLVM_LINK_COMPONENTS
+  Support
----------------
ymandel wrote:
> Why create a new sub directory? From what I've seen elsewhere, it seems uncommon. I'm fine either way, but want to be sure we have a good reason for differing.
I created the directory to match the structure in llvm-project/clang/include/clang/Analysis/FlowSensitive and llvm-project/clang/lib/Analysis/FlowSensitive. In general, I think it'd be easier to find related code if it's structured similarly. If this doesn't follow the established practices for the project I'd be happy to change it.

================
Comment at: clang/unittests/Analysis/FlowSensitive/TypeErasedDataflowAnalysisTest.cpp:30
+public:
+  void run(const ast_matchers::MatchFinder::MatchResult &Result) override {
+    const auto *Func = Result.Nodes.getNodeAs<FunctionDecl>("func");
----------------
gribozavr2 wrote:
> Add `assert(BlockStates.empty());` ?
Done, and also changed the other `ASSERT_THAT` to `assert`.

================
Comment at: clang/unittests/Analysis/FlowSensitive/TypeErasedDataflowAnalysisTest.cpp:37-40
+    CFG::BuildOptions Options;
+    Options.AddImplicitDtors = true;
+    Options.AddTemporaryDtors = true;
+    Options.setAllAlwaysAdd();
----------------
xazax.hun wrote:
> I wonder if the dataflow analysis framework should provide a factory that that returns a `CFG::BuildOptions` which is a good default for most clients. Can be in a follow-up patch, or completely ignored. Or maybe even a convenience function running the analysis on a function definition without the user having to know about the CFG.
I added a FIXME. A utility that runs an analysis on a function definition would mean that the CFG will need to be built each time we want to run an analysis which could be wasteful if we want to run several analyses on the same CFG. Perhaps we can consider providing a function that builds the CFG using default options and returns it, possibly wrapped in another type.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D115235/new/

https://reviews.llvm.org/D115235