[llvm] [LoopInterchange] Consider forward/backward dependency in vectorize heuristic (PR #133672)

Wed Jul 30 07:40:40 PDT 2025

================
@@ -1334,21 +1405,34 @@ LoopInterchangeProfitability::isProfitablePerInstrOrderCost() {
 static bool canVectorize(const CharMatrix &DepMatrix, unsigned LoopId) {
   for (const auto &Dep : DepMatrix) {
     char Dir = Dep[LoopId];
-    if (Dir != 'I' && Dir != '=')
-      return false;
+    char DepType = Dep.back();
+    assert((DepType == '<' || DepType == '*') &&
+           "Unexpected element in dependency vector");
+
+    // There are no loop-carried dependencies.
+    if (Dir == '=' || Dir == 'I')
+      continue;
+
+    // DepType being '<' means that this direction vector represents a forward
+    // dependency. In principle, a loop with '<' direction can be vectorized in
+    // this case.
+    if (Dir == '<' && DepType == '<')
+      continue;
+
+    // We cannot prove that the loop is vectorizable.
+    return false;
   }
   return true;
 }
 
 std::optional<bool> LoopInterchangeProfitability::isProfitableForVectorization(
     unsigned InnerLoopId, unsigned OuterLoopId, CharMatrix &DepMatrix) {
-  // If the outer loop is not loop independent it is not profitable to move
-  // this to inner position, since doing so would not enable inner loop
-  // parallelism.
+  // If the outer loop cannot be vectorized, it is not profitable to move this
+  // to inner position.
   if (!canVectorize(DepMatrix, OuterLoopId))
     return false;
 
-  // If inner loop has dependence and outer loop is loop independent then it is
+  // If inner loop cannot be vectorized and outer loop can be then it is
----------------
kasuga-fj wrote:

> Unless you have a universal cost model that takes everything into account and predicts the execution time, each pass needs its own heuristic for what it is optimizing for. E.g. the vectorizer optimizes cycles but does not consider cache effects.

When you put it that way, it hardly seems feasible (well, if it were feasible, it would probably have been done already).

> No typo; the patch tries to teach DependenceAnalysis to determine dependencies after loop fusion has taken place without applying loop fusion. Now also do that for interchange, distribution, vectorization, ....

After reading this comment, I noticed that the patch introduces additional analysis for loop fusion even though the client doesn't require it. I initially expected an argument to be added (such as `depends(Src, Dst, /*ForFusion=*/true)`), but that doesn't seem to be the case. That said, controlling the analysis behavior via flags could complicate caching and reusing results across different passes.
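To make the caching concern concrete, here is a minimal sketch. Everything in it is hypothetical: `DepCache`, `InstId`, `DepResult`, and the placeholder `compute` logic are stand-ins invented for illustration and are not part of DependenceAnalysis. It only shows that once a flag changes the answer, the flag has to become part of the cache key, so a result computed for one client can no longer be reused by another.

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Hypothetical stand-ins for illustration only; these are not LLVM types.
using InstId = int;
struct DepResult {
  bool HasDep;
};

class DepCache {
  // The flag is part of the key. Keying on (Src, Dst) alone would hand a
  // fusion-specific answer to a client that never asked for it.
  std::map<std::tuple<InstId, InstId, bool>, DepResult> Cache;
  int Computations = 0;

  // Placeholder logic (an assumption): pretend the fusion query is more
  // conservative than the plain query.
  static bool compute(InstId Src, InstId Dst, bool ForFusion) {
    return ForFusion ? true : Src == Dst;
  }

public:
  DepResult depends(InstId Src, InstId Dst, bool ForFusion) {
    assert(Src >= 0 && Dst >= 0 && "ids are non-negative in this sketch");
    auto Key = std::make_tuple(Src, Dst, ForFusion);
    auto It = Cache.find(Key);
    if (It != Cache.end())
      return It->second; // Reuse only when the flag matches as well.
    ++Computations;
    DepResult R{compute(Src, Dst, ForFusion)};
    Cache.emplace(Key, R);
    return R;
  }

  int computations() const { return Computations; }
};
```

In this sketch, two passes that query the same pair with different flag values each trigger their own computation, which is the reuse problem described above.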

By the way, I've recently been reading DependenceAnalysis.cpp and noticed that it's already quite complex and potentially buggy. I'm fairly certain it should be refactored before adding any new features.

> UnrollAndJam is disabled by default. Its heuristic also does not take vectorization into account, but tries to maximize L1i cache usage.
>
> Optimal outcome would be if the vectorizer supported outer-loop vectorization.

I don't know much about the details of the UnrollAndJam pass, but in some cases it appears to work (unintentionally?) as if outer-loop vectorization had been applied, especially when combined with the SLPVectorizer (of course, I needed to specify the unroll count explicitly with a pragma). So I just thought that it might make more sense to enhance UnrollAndJam instead of interchange for cases where the outer loop is vectorizable but the inner loop is not. And, as you said, supporting outer-loop vectorization in the vectorizer would be the best solution.
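As a rough illustration of that case, the sketch below writes the unroll-and-jam transformation out by hand for a nest whose inner loop carries a dependency while the outer loop does not. The array shapes, the unroll factor of 4, and all function names are assumptions chosen for the example. After jamming, the inner loop body holds four independent statements on adjacent elements, which is the shape an SLP-style vectorizer could in principle pack. Whether a compiler actually does so depends on its cost model and is not claimed here.

```cpp
#include <array>
#include <cassert> // for assert() in callers
#include <cstddef>

constexpr std::size_t N = 8;  // outer trip count (assumed multiple of 4)
constexpr std::size_t M = 16; // inner trip count
using Grid = std::array<std::array<int, N>, M>; // indexed as G[j][i]

// Original nest: the inner j-loop carries a dependency, because A[j][i]
// reads A[j-1][i]. The outer i-loop has no loop-carried dependency.
void original(Grid &A, const Grid &B) {
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 1; j < M; ++j)
      A[j][i] = A[j - 1][i] + B[j][i];
}

// Hand-written unroll-and-jam by 4: four outer iterations are fused into
// one inner loop body. The four statements are independent of each other
// and touch adjacent elements A[j][i..i+3].
void unrolledAndJammed(Grid &A, const Grid &B) {
  static_assert(N % 4 == 0, "sketch assumes no remainder loop");
  for (std::size_t i = 0; i < N; i += 4)
    for (std::size_t j = 1; j < M; ++j) {
      A[j][i + 0] = A[j - 1][i + 0] + B[j][i + 0];
      A[j][i + 1] = A[j - 1][i + 1] + B[j][i + 1];
      A[j][i + 2] = A[j - 1][i + 2] + B[j][i + 2];
      A[j][i + 3] = A[j - 1][i + 3] + B[j][i + 3];
    }
}

// Deterministic test data.
Grid makeGrid(int Seed) {
  Grid G{};
  for (std::size_t j = 0; j < M; ++j)
    for (std::size_t i = 0; i < N; ++i)
      G[j][i] = static_cast<int>(j * N + i) + Seed;
  return G;
}

// Returns true if both versions produce the same array contents.
bool sameResult() {
  const Grid B = makeGrid(1);
  Grid A1 = makeGrid(0), A2 = makeGrid(0);
  original(A1, B);
  unrolledAndJammed(A2, B);
  return A1 == A2;
}
```

Calling `sameResult()` checks that the jammed version preserves the semantics of the original nest. The transformation is legal here only because the outer loop carries no dependency.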


https://github.com/llvm/llvm-project/pull/133672