[PATCH] D86963: [DAGCombine] Skip re-visiting EntryToken to avoid compile time explosion

Tue Sep 1 10:36:47 PDT 2020

uweigand created this revision.
uweigand added reviewers: spatel, RKSimon, niravd.
Herald added subscribers: llvm-commits, ecnelises, steven.zhang, hiraditya.
Herald added a project: LLVM.
uweigand requested review of this revision.

Compiling the SingleSource/UnitTests/matrix-types-spec.cpp test case on SystemZ takes a very long time (many hours, possibly more than a day), which is causing the test to time out and the build bot to be red since that test was added.

Interestingly, on Intel compilation completes after a minute or so.  However, it turns out this is simply because on Intel, LLVM considers SSE vector support to be available by default, while the default SystemZ target has no vector support.  Changing those defaults causes the behavior to match again: compiling on Intel with -mno-sse also takes hours, compiling on SystemZ with -march=z15 also takes just a minute.

The following is a reduced test case (basically the transposeSpec) function of matrix-types-spec.cpp:

  void test(float *X, float *ResSpec) {
    constexpr int R0 = 31;
    constexpr int C0 = 17;
  
    using matrix_trans_t = float __attribute__((matrix_type(C0, R0)));
    auto M = __builtin_matrix_column_major_load(X, R0, C0, R0);
    matrix_trans_t Res;
    for (int c = 0; c < C0; ++c)
      for (int r = 0; r < R0; ++r)
        Res[c][r] = M[r][c];
    __builtin_matrix_column_major_store(Res, ResSpec, C0);
  }

Building this with -O3 -fenable-matrix (and -mno-sse on Intel) demonstrates the problem.

What seems to happen is that DAGCombine optimization exposes many dead stores to the temporary variable which are optimized away.   This results in the store node being replaced by its input chain.  Given that many of the stores have no dependency on each other, that input chain will often be the EntryToken.  Now DAGCombiner::Run will add the result node and all its users back to its worklist.  In case that result node is the EntryToken, all users of the EntryToken are added back -- and those will be a large number of stores ...

So over all, this seems to scale as something like a quadratic or cubic function of the number of stores.  (This also seems to explain the difference when vector support is present, because the number of stores is reduced by a factor of 4 then.)   The details may be more complicated, but it's a bit hard to analyze all the details given the large number of nodes involved.

Now, I'm wondering what the proper fix for this type of problem is.  However, I notice that the fact that an input chain to a node is now the EntryToken basically never allows any significant follow-on optimization to occur.   So I tried simply skipping adding the result node and its users to the worklist in the case where that result node is the EntryToken.  And in fact, I was unable to find **any** difference in code generation due to that change.  However, the above test case now completes compilation in about half a minute ...


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D86963

Files:
  llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp


Index: llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
===================================================================

--- llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -1558,9 +1558,15 @@
       DAG.ReplaceAllUsesWith(N, &RV);
     }
 
-    // Push the new node and any users onto the worklist
-    AddToWorklist(RV.getNode());
-    AddUsersToWorklist(RV.getNode());
+    // Push the new node and any users onto the worklist.  Omit this if the
+    // new node is the EntryToken (e.g. if a store managed to get optimized
+    // out), because re-visiting the EntryToken and its users will not uncover
+    // any additional opportunities, but there may be a large number of such
+    // users, potentially causing compile time explosion.
+    if (RV.getOpcode() != ISD::EntryToken) {
+      AddToWorklist(RV.getNode());
+      AddUsersToWorklist(RV.getNode());
+    }
 
     // Finally, if the node is now dead, remove it from the graph.  The node
     // may not be dead if the replacement process recursively simplified to


-------------- next part --------------
A non-text attachment was scrubbed...
Name: D86963.289211.patch
Type: text/x-patch
Size: 1081 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20200901/aeeed984/attachment.bin>