[llvm] [LoopInterchange] Consider forward/backward dependency in vectorize heuristic (PR #133672)

Tue Jul 29 09:21:51 PDT 2025

================
@@ -1334,21 +1405,34 @@ LoopInterchangeProfitability::isProfitablePerInstrOrderCost() {
 static bool canVectorize(const CharMatrix &DepMatrix, unsigned LoopId) {
   for (const auto &Dep : DepMatrix) {
     char Dir = Dep[LoopId];
-    if (Dir != 'I' && Dir != '=')
-      return false;
+    char DepType = Dep.back();
+    assert((DepType == '<' || DepType == '*') &&
+           "Unexpected element in dependency vector");
+
+    // There are no loop-carried dependencies.
+    if (Dir == '=' || Dir == 'I')
+      continue;
+
+    // DepType being '<' means that this direction vector represents a forward
+    // dependency. In principle, a loop with '<' direction can be vectorized in
+    // this case.
+    if (Dir == '<' && DepType == '<')
+      continue;
+
+    // We cannot prove that the loop is vectorizable.
+    return false;
   }
   return true;
 }
 
 std::optional<bool> LoopInterchangeProfitability::isProfitableForVectorization(
     unsigned InnerLoopId, unsigned OuterLoopId, CharMatrix &DepMatrix) {
-  // If the outer loop is not loop independent it is not profitable to move
-  // this to inner position, since doing so would not enable inner loop
-  // parallelism.
+  // If the outer loop cannot be vectorized, it is not profitable to move this
+  // to inner position.
   if (!canVectorize(DepMatrix, OuterLoopId))
     return false;
 
-  // If inner loop has dependence and outer loop is loop independent then it is
+  // If inner loop cannot be vectorized and outer loop can be then it is
----------------
kasuga-fj wrote:

Ultimately, I think the root problem is that each pass has its own cost model, and employing some kind of general representation seems like one way to address it. I was hoping that VPlan might offer a solution, but it seems that’s not the case... Thanks for the explanation!

> Because of this I would not bother trying to apply another pass' analysis on speculative changes. #146383 does not, but it gets complex quickly.

(I'm not sure why you brought up #146383; is it just a typo?) Anyway, I'm not planning to tweak this vectorization heuristic any further for now. I don't have any good ideas, and from what I've tried, prioritizing cache profitability over vectorization profitability tended to produce better results. Rather than moving the outer loop to the innermost position at the cost of memory locality, it seemed more reasonable to leave things as they are and rely on UnrollAndJam and SLPVectorizer to handle outer-loop vectorization.
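
For reference, the check in the hunk above can be tried in isolation with a minimal standalone sketch. It assumes a plain `std::vector<std::vector<char>>` in place of the pass's `CharMatrix`; the names `CharMatrixSketch` and `canVectorizeSketch` are hypothetical and are not part of LoopInterchange.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for CharMatrix. Each row is one direction vector, and
// its last element records the dependency type: '<' for a forward dependency,
// '*' otherwise (as the assert in the hunk implies).
using CharMatrixSketch = std::vector<std::vector<char>>;

// Mirrors the logic of canVectorize() in the hunk: returns true only if every
// dependency is either not carried by loop LoopId or is a forward dependency.
static bool canVectorizeSketch(const CharMatrixSketch &DepMatrix,
                               unsigned LoopId) {
  for (const auto &Dep : DepMatrix) {
    char Dir = Dep[LoopId];
    char DepType = Dep.back();
    assert((DepType == '<' || DepType == '*') &&
           "Unexpected element in dependency vector");

    // No dependency carried by this loop.
    if (Dir == '=' || Dir == 'I')
      continue;

    // A '<' direction is tolerated only for a forward dependency.
    if (Dir == '<' && DepType == '<')
      continue;

    // Anything else: vectorizability cannot be proven.
    return false;
  }
  return true;
}
```

With the two rows `{'=', '<', '<'}` and `{'<', '=', '*'}`, loop 1 passes the check because its only carried dependency is a forward one, while loop 0 is rejected because the second row carries a '<' direction whose type is '*'.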

https://github.com/llvm/llvm-project/pull/133672