[PATCH] D135808: [LoopInterchange] Correcting the profitability checking for vectorization

Wed Oct 19 00:20:54 PDT 2022

Meinersbur added a comment.

Could you add a test case that is not considered profitable where before it was not?

In D135808#3865187 <https://reviews.llvm.org/D135808#3865187>, @bmahjour wrote:

> I think profitability determination solely based on dependency matrix is fundamentally flawed.

I understand the idea is that when the locality model says that interchanged and non-interchanged have the same profitability according to cache locality analysis, only then we fall back to a heuristic that determines whether maybe the innermost loop of the interchange loop nest can be vectorized when the one before vectorization could not.

Unfortunately `LoopInterchangeProfitability::isProfitable` is not structured like that. It falls back to `isProfitableForVectorization` whenever cache analysis does not want to interchange the loops. This includes the case where the old loop nest has better locality than the interchanged would have. This could result in a endless application of LoopInterchange if we would run it until no more change is to be made:

1. Cache analysis thinks non-interchanged nest has more locality => fall back to `isProfitableForVectorization` which returns true => do the interchange
2. Cache analysis thinks the interchange has more locality => do the interchange (with two interchanges we arrive at the original loop)
3. Cache analysis thinks non-interchanged nest has more locality => fall back to `isProfitableForVectorization` which returns true => do the interchange again
4. ...

In D135808#3862874 <https://reviews.llvm.org/D135808#3862874>, @congzhe wrote:

> I thought what you meant is that after this patch, `[<, =]` and `[>, =]` (not [=,<] and [=,>]) will be interchanged? Because after interchange the dependency vector would become `[=, <]` and `[=, >]` respectively, which could improve potential parallelization and enable finer-grained parallelism, i.e., outer loop parallelism instead of inner loop parallelism. I think this is what `isProfitableForVectorization()` is supposed to be.

The current LoopVectorize pass only supports innermost loops, hence you would want the dependencies carried by the outer loops so the LoopVectorize pass does not have to consider them.

Btw, why does LopInterchange use a `DepMatrix` when the dependencies for all instructions in the loop could be just summarized in a single vector?

================
Comment at: llvm/lib/Transforms/Scalar/LoopInterchange.cpp:1116-1118
   // TODO: Improve this heuristic to catch more cases.
   // If the inner loop is loop independent or doesn't carry any dependency it is
   // profitable to move this to outer position.
----------------
Does you patch cover this TODO?

================
Comment at: llvm/lib/Transforms/Scalar/LoopInterchange.cpp:1120-1123
+    if (Row[InnerLoopId] != 'I' && Row[InnerLoopId] != '=')
       return false;
     // TODO: We need to improve this heuristic.
+    if (Row[OuterLoopId] == '=')
----------------
Shouldn't the condition on `Row[InnerLoopId]` and `Row[OuterLoopId]` be exact opposite? That is, it is profitable if the innermost loop has loop-carried dependencies while the outer has not?

================
Comment at: llvm/lib/Transforms/Scalar/LoopInterchange.cpp:1162-1163
   // we can move this loop outside to improve parallelism.
   if (isProfitableForVectorization(InnerLoopId, OuterLoopId, DepMatrix))
     return true;

----------------
Should this only be considered if `InnerLoopId` is actually an innermost loop (The only kind LoopVectorize can currently process)?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D135808/new/

https://reviews.llvm.org/D135808