[llvm] [InstCombine] Optimistically allow multiple shufflevector uses in foldOpPhi (PR #114278)

Fri Nov 8 16:52:52 PST 2024

================
@@ -1773,17 +1773,31 @@ Instruction *InstCombinerImpl::foldOpIntoPhi(Instruction &I, PHINode *PN) {
   if (NumPHIValues == 0)
     return nullptr;
 
-  // We normally only transform phis with a single use.  However, if a PHI has
-  // multiple uses and they are all the same operation, we can fold *all* of the
-  // uses into the PHI.
+  // We normally only transform phis with a single use.
+  bool AllUsesIdentical = false;
+  bool MultipleShuffleVectorUses = false;
   if (!PN->hasOneUse()) {
-    // Walk the use list for the instruction, comparing them to I.
+    // Exceptions:
+    //   - All uses are identical.
+    //   - All uses are shufflevector instructions that fully simplify; this
+    //     helps interleave -> phi -> 2x de-interleave+de patterns.
+    MultipleShuffleVectorUses = isa<ShuffleVectorInst>(I);
+    AllUsesIdentical = true;
+    unsigned NumUses = 0;
     for (User *U : PN->users()) {
+      ++NumUses;
       Instruction *UI = cast<Instruction>(U);
-      if (UI != &I && !I.isIdenticalTo(UI))
+      if (UI == &I)
+        continue;
+
+      if (!I.isIdenticalTo(UI))
+        AllUsesIdentical = false;
+      // Only inspect first 4 uses to avoid quadratic complexity.
+      if (!isa<ShuffleVectorInst>(UI) || NumUses > 4)
----------------
MatzeB wrote:

Generally speaking, you have to be careful when optimizing instcombine patterns in which any node other than the root has multiple uses. It is usually not easy to confirm that everything does indeed simplify, and you can easily turn one live value into several, increasing register pressure and ending up with worse performance. Unfortunately, I don't think we have good ways to predict or model this in LLVM, so you will notice many instcombine folds being conservative and bailing out on multiple uses.
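
To make the multiple-use concern concrete, here is a rough toy model of it, not LLVM code. `PhiShape`, `instructionsBeforeFold`, and `instructionsAfterFold` are hypothetical names made up for this sketch. It assumes plain instruction count is a usable stand-in for register pressure, which is a simplification.

```cpp
#include <cassert>

// Hypothetical toy model, not part of LLVM: describes a phi only by how many
// incoming edges it has and how many instructions use it.
struct PhiShape {
  unsigned NumIncoming; // incoming edges of the phi
  unsigned NumUsers;    // instructions that use the phi
};

// Instructions in the pattern before any fold: the phi plus each user.
unsigned instructionsBeforeFold(const PhiShape &P) { return 1 + P.NumUsers; }

// Instructions after folding `FoldedUsers` of the users into the phi.
// Each folded user becomes a new phi; `UnsimplifiedPerUser` is how many of its
// per-edge copies did NOT simplify away and therefore remain as instructions.
unsigned instructionsAfterFold(const PhiShape &P, unsigned FoldedUsers,
                               unsigned UnsimplifiedPerUser) {
  assert(FoldedUsers <= P.NumUsers && "cannot fold more users than exist");
  assert(UnsimplifiedPerUser <= P.NumIncoming && "at most one copy per edge");
  // The original phi stays alive as long as any user was left unfolded.
  unsigned OldPhi = FoldedUsers < P.NumUsers ? 1 : 0;
  unsigned RemainingUsers = P.NumUsers - FoldedUsers;
  unsigned NewPhis = FoldedUsers;
  unsigned NewOps = FoldedUsers * UnsimplifiedPerUser;
  return OldPhi + RemainingUsers + NewPhis + NewOps;
}
```

Under this model, a two-input phi with two users starts at three instructions. Folding both users shrinks that to two if every per-edge copy simplifies, but grows it to six if none do. Folding only one user saves nothing and leaves two phis alive instead of one.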

My reasoning for making an exception in this particular case is that it hits a very important kernel in our workload, and that "strided vectorization" is an explicit mode in the loop vectorizer that can produce patterns like my example "by design" (after SROA).

https://github.com/llvm/llvm-project/pull/114278