[llvm] [LoopInterchange] Consider forward/backward dependency in vectorize heuristic (PR #133672)

Wed Jul 30 01:28:10 PDT 2025

================
@@ -1334,21 +1405,34 @@ LoopInterchangeProfitability::isProfitablePerInstrOrderCost() {
 static bool canVectorize(const CharMatrix &DepMatrix, unsigned LoopId) {
   for (const auto &Dep : DepMatrix) {
     char Dir = Dep[LoopId];
-    if (Dir != 'I' && Dir != '=')
-      return false;
+    char DepType = Dep.back();
+    assert((DepType == '<' || DepType == '*') &&
+           "Unexpected element in dependency vector");
+
+    // There are no loop-carried dependencies.
+    if (Dir == '=' || Dir == 'I')
+      continue;
+
+    // DepType being '<' means that this direction vector represents a forward
+    // dependency. In principle, a loop with '<' direction can be vectorized in
+    // this case.
+    if (Dir == '<' && DepType == '<')
+      continue;
+
+    // We cannot prove that the loop is vectorizable.
+    return false;
   }
   return true;
 }
 
 std::optional<bool> LoopInterchangeProfitability::isProfitableForVectorization(
     unsigned InnerLoopId, unsigned OuterLoopId, CharMatrix &DepMatrix) {
-  // If the outer loop is not loop independent it is not profitable to move
-  // this to inner position, since doing so would not enable inner loop
-  // parallelism.
+  // If the outer loop cannot be vectorized, it is not profitable to move this
+  // to inner position.
   if (!canVectorize(DepMatrix, OuterLoopId))
     return false;
 
-  // If inner loop has dependence and outer loop is loop independent then it is
+  // If inner loop cannot be vectorized and outer loop can be then it is
----------------
Meinersbur wrote:

> Ultimately, I think the root problem is that each pass has its own cost model,

Unless you have an universal cost model that takes everything into account and predicts the execution time, each pass needs its own heuristic for what it is optimizing for. E.g. the vectorizer optmizes cycles but does not consider cache effects.

> (I'm not sure why you bring up https://github.com/llvm/llvm-project/pull/146383, just a typo?)

No typo; the patch tries to teach DependenceAnalysis to determine dependencies after loop fusion has taken place without applying loop fusion. Now also do that for interchange, distribution, vectorization, ....

> Rather than moving the outer loop to the innermost position at the cost of memory locality, it seemed more reasonable to leave things as they are and rely on UnrollAndJam and SLPVectorizer to handle outer-loop vectorization.

UnrollAndJam is disabled by default. Its heuristic also does not take vectorization into account, but tires to maximize L1i cache usage.

Optimal outcome would be if the vectorizer supported outer-loop vectorization.

https://github.com/llvm/llvm-project/pull/133672