[PATCH] D73129: [LoopUnrollAndJam] Correctly update LoopInfo when unroll and jam more than 2-levels loop nests.

Tue Feb 4 16:41:39 PST 2020

Meinersbur added a comment.

Sorry for the break, I got the flu.

In D73129#1844313 <https://reviews.llvm.org/D73129#1844313>, @dmgreen wrote:

> Put more generally, I was expecting this:
>  ...
>  You are saying that we should also fuse the inner loops?

Yes.

The purpose of unroll-and-jam is to improve instruction-level parallelism (ILP) and reduce the overhead of hot loops. For performance optimization, we should only consider the innermost body to be relevant (Statement C in your example). IMHO not jamming the innermost loop improves neither ILP nor loop overhead, so it would be quite useless.

One way to define unroll-and-jam is to first tile by (unroll-factor,1,1) (all tile factors except the outermost are 1, so those levels don't really need a tile loop) and then (fully) unroll the tile. As a side effect, unroll-and-jam on a single loop would be identical to partial unrolling. Tiling is usually only defined for perfect loop nests, so I would not necessarily assume that unroll-and-jam over non-perfectly nested loops is even defined. If we do define it, I'd expect something like:

  for i += 2
    A(i)
    A(i+1)
    for j
      B(i, j)
      B(i+1, j)
      for k 
        C(i, j, k)
        C(i+1, j, k)
      D(i, j)
      D(i+1, j)
    E(i)
    E(i+1)
  for i remainder:
    A(i)
    for j
      B(i, j)
      for k 
        C(i, j, k)
      D(i, j)
    E(i)
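
To make the expected result concrete, here is a minimal sketch (not code from the patch) that runs both the original nest and the factor-2 unroll-and-jam form above and records which statement instances execute. All names (Instance, Trace, runOriginal, runUnrollAndJam2) are made up for illustration, and the statements A..E are assumed to have no side effects other than being recorded.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// One executed statement instance: (statement, i, j, k); unused indices are -1.
using Instance = std::tuple<char, int, int, int>;
using Trace = std::vector<Instance>;

// One iteration of the original outer loop body for a given i.
static void runOuterIteration(Trace &T, int i, int NJ, int NK) {
  T.emplace_back('A', i, -1, -1);
  for (int j = 0; j < NJ; ++j) {
    T.emplace_back('B', i, j, -1);
    for (int k = 0; k < NK; ++k)
      T.emplace_back('C', i, j, k);
    T.emplace_back('D', i, j, -1);
  }
  T.emplace_back('E', i, -1, -1);
}

// The original, imperfectly nested 3-level loop.
Trace runOriginal(int NI, int NJ, int NK) {
  Trace T;
  for (int i = 0; i < NI; ++i)
    runOuterIteration(T, i, NJ, NK);
  return T;
}

// Unroll-and-jam of the i loop by 2, jamming both the j and the k loop.
Trace runUnrollAndJam2(int NI, int NJ, int NK) {
  Trace T;
  int i = 0;
  for (; i + 1 < NI; i += 2) {
    T.emplace_back('A', i, -1, -1);
    T.emplace_back('A', i + 1, -1, -1);
    for (int j = 0; j < NJ; ++j) {
      T.emplace_back('B', i, j, -1);
      T.emplace_back('B', i + 1, j, -1);
      for (int k = 0; k < NK; ++k) {
        T.emplace_back('C', i, j, k);
        T.emplace_back('C', i + 1, j, k);
      }
      T.emplace_back('D', i, j, -1);
      T.emplace_back('D', i + 1, j, -1);
    }
    T.emplace_back('E', i, -1, -1);
    T.emplace_back('E', i + 1, -1, -1);
  }
  // Remainder loop: at most one leftover iteration for factor 2.
  for (; i < NI; ++i)
    runOuterIteration(T, i, NJ, NK);
  return T;
}

// Sorted copy, so two traces can be compared as multisets of instances.
Trace sorted(Trace T) {
  std::sort(T.begin(), T.end());
  return T;
}
```

Comparing sorted(runOriginal(...)) with sorted(runUnrollAndJam2(...)) checks that every instance still executes exactly once. The unsorted traces differ in order, and that reordering is exactly what the legality check has to justify.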

Caveat: What if A, B, D or E contain loops themselves? I'd just not allow it.

> When Unroll And Jam was written we did not have general loop fusion. We now do. Can we make use of it here to fuse any sub-loops together? I believe that is how gcc writes their algorithm, but last I looked they only supported perfectly nested loops which would be a big regression over what is here now. We might just be able to attempt sub-loop fusing, using the loop fusion infrastructure we have?

I think using loop fusion here would make the implementation more complicated.

> The alternative like you said would be trying to prove it is valid beforehand, which would mean checking that more blocks inside subloops can be moved past each other and all the extra memory dependencies are safe.

I think generalizing the legality check from "does the dependency violate jumping over one loop" to "does it violate jumping over n loops" would be hard.
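
For intuition only, here is a sketch of what such a check looks like in the easy case: a perfect nest with known constant dependence distances. This is not what LoopUnrollAndJam.cpp does, and the function name and the distance-vector representation are assumptions for illustration. The idea is that after jamming, iterations are ordered by (i / Factor, j, k, ..., i % Factor), so a dependence between two outer iterations in the same tile is only preserved if its inner distances are not lexicographically negative. The imperfect-nest case (A, B, D, E above) needs more than this.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dist[0] is the dependence distance in the unrolled (outermost) loop and
// Dist[1..] are the distances in the jammed inner loops, outermost first.
// Assumes Dist is lexicographically non-negative (a valid dependence) and
// that the nest is perfect; both are simplifying assumptions of this sketch.
bool preservedByUnrollAndJam(const std::vector<int> &Dist, int Factor) {
  if (Dist.empty())
    return true;
  int Outer = Dist[0];
  // Outer <= 0: source and sink are in the same outer iteration, whose
  // relative order is unchanged. Outer >= Factor: they can never share a
  // tile, and tiles still execute in order.
  if (Outer <= 0 || Outer >= Factor)
    return true;
  // Source and sink may land in the same tile. The jammed inner loops now
  // run before the unrolled copies, so the inner distances decide the order.
  for (std::size_t L = 1; L < Dist.size(); ++L) {
    if (Dist[L] > 0)
      return true;  // Sink is in a later inner iteration: still after source.
    if (Dist[L] < 0)
      return false; // Sink would now run before its source.
  }
  // All inner distances are zero: the unrolled copies keep their order.
  return true;
}
```

With Factor 2, the distance vector (1, -1, 0) is the "jump over one loop" violation, while (1, 0, -1) is only caught because the check also looks at the second inner level.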

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D73129/new/

https://reviews.llvm.org/D73129