[Mlir-commits] [mlir] [OpenMP Dialect] Add omp.canonical_loop operation. (PR #65380)

Wed Sep 6 17:53:21 PDT 2023

jsjodin wrote:

> Unfortunately, the problem with nesting instead of using the %cli object, like this:
> 
> ```
> omp.unroll loop(1) {
>   omp.tile loop (0) { tile_sizes=[4] } {
>     omp.canonical_loop for (%c0) to (%c64) step (%c1) {
>       ..
>     }
>   }
> }
> ```
> 
> is that it does not work with the `apply` clause. For instance:
> 
> ```
> #pragma omp tile sizes(4) apply(intratile:unroll full)
> for (int i =0; i < n; ++i) {
>    ...
> }
> ```
> 
> where after tiling, the _inner_ loop is unrolled.
> The MLIR representation would be like this:
> 
> ```
> %cli = omp.canonical_loop %iv = [0, %tc) {
>       ..
> }
> %grid, %intratile = omp.tile sizes(4) (%cli)
> omp.unroll_full (%intratile)
> ```
> 

We could encode the input (and maybe output) loop structures inside the omp ops, to make it easier to see what loop structure each op operates on: e.g. loop(loop(loop())) for three nested loops, and seq(loop(), loop()) for two sequential loops, as in the example later on. The nested version would be:
```
omp.unroll_full {in_loop_structure = loop(loop()), transform_loop = 1} { // 1 means inner loop, and just an index since there are no sequences yet.
  omp.tile sizes(4) {in_loop_structure = loop(), transform_loop = 0} {
    omp.canonical_loop {in_loop_structure = () } {
     ....
    }
  }
}
```
So this works fine even if the inner loop is unrolled, since the omp.tile would transform the structure of the loop nest.
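
To make the bookkeeping concrete, here is a rough sketch in Python. It is only a model of the idea: `loop`, `render`, `tile` and `unroll_full` are hypothetical helpers, not anything that exists in MLIR, and `unroll_full` is assumed to be applied to an innermost loop only, as in the example.

```python
# Illustrative model only; these helpers are hypothetical, not MLIR APIs.
# A nest is either None (no loop) or ("loop", inner_nest).

def loop(inner=None):
    return ("loop", inner)

def render(nest):
    """Print a nest in the loop(loop()) notation used above; () means no loop."""
    if nest is None:
        return "()"
    inner = nest[1]
    return "loop(" + ("" if inner is None else render(inner)) + ")"

def tile(nest, index):
    """Tile the loop at depth `index`: it becomes a grid loop wrapping an intratile loop."""
    if nest is None:
        raise ValueError("transform_loop refers to a loop that does not exist")
    if index == 0:
        return loop(loop(nest[1]))
    return loop(tile(nest[1], index - 1))

def unroll_full(nest, index):
    """Fully unroll the loop at depth `index`, so that loop level disappears.
    Assumption: the unrolled loop is innermost, as in the example."""
    if nest is None:
        raise ValueError("transform_loop refers to a loop that does not exist")
    if index == 0:
        if nest[1] is not None:
            raise ValueError("this sketch only unrolls innermost loops")
        return None
    return loop(unroll_full(nest[1], index - 1))

canonical = loop()                  # what omp.canonical_loop produces
tiled = tile(canonical, 0)          # input structure seen by omp.unroll_full
result = unroll_full(tiled, 1)      # index 1 is the intratile loop
print(render(canonical), "->", render(tiled), "->", render(result))
```

The point is that `transform_loop = 1` on the unroll is resolved against the structure produced by the tile, not against the original single loop.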

> A more useful example than the above (which is just equivalent to partial unrolling by 4), would be a 2d-tiling followed with a `simd`-ization of one (or both) of the inner loops. Since they are constant-sized, it makes them an ideal target for vectorization.
> 
> Without the %cli reference, one could need to introduce a language that identifies the loop to be transformed a la [XPath](https://en.wikipedia.org/wiki/XPath), e.g.
> 
> ```mlir
> omp.unroll_full { something that describes the second loop in the loop nest } // `loop(1)` in https://reviews.llvm.org/D155765#4638411
> omp.tile sizes(4) { 
>   omp.canonical_loop %iv = [0, %tc) {
>         ..
>   }
> }
> ```
> 
> This scheme is fragile, as the nested code could be transformed itself, e.g. one loop is empty or consists of only one iteration and is optimized away. An existing reference to a `%cli` would make such a thing a hard error.

I don't think it is any more fragile. Recomputing the loop structure in the nesting case should catch this.
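
A minimal sketch of what I mean by recomputing, again in Python and purely illustrative: ops are modeled as dicts, `recompute` and `make_nest` are made-up names, and nest depth stands in for the full structure since there are no sequences here.

```python
# Illustrative model only; not an MLIR API. Structures are nest depths here.

def render(depth):
    """Depth 0 is "()", depth 2 is "loop(loop())", and so on."""
    return "()" if depth == 0 else "loop(" * depth + ")" * depth

def recompute(op):
    """Recompute the loop structure an op produces, checking recorded attributes on the way."""
    name = op["name"]
    if name == "omp.canonical_loop":
        return 1
    if name in ("omp.tile", "omp.unroll_full"):
        depth = recompute(op["body"])
        if render(depth) != op["in_loop_structure"]:
            raise ValueError(
                f"{name}: expected {op['in_loop_structure']}, found {render(depth)}")
        if op["transform_loop"] >= depth:
            raise ValueError(f"{name}: transform_loop is out of range")
        return depth + 1 if name == "omp.tile" else depth - 1
    return 0  # any other op contributes no loop

def make_nest(innermost):
    """Build the unroll_full(tile(...)) nest from the example around `innermost`."""
    tile = {"name": "omp.tile", "in_loop_structure": "loop()",
            "transform_loop": 0, "body": innermost}
    return {"name": "omp.unroll_full", "in_loop_structure": "loop(loop())",
            "transform_loop": 1, "body": tile}

print(render(recompute(make_nest({"name": "omp.canonical_loop"}))))

# If the canonical loop was optimized away, the recorded structure no longer matches.
try:
    recompute(make_nest({"name": "arith.constant"}))
except ValueError as err:
    print("error:", err)
```

So a loop that disappears underneath a transformation shows up as a mismatch, which can be reported as a hard error just like a dangling `%cli`.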

> With the addition of loop sequences (see below), it is not just the nest depth, but also which of the loops in a loop sequence. E.g. "the second nested loop in the third loop of the loop sequence which is nested inside another loop".
>

Yes, if sequences are allowed, then it becomes more complicated: instead of a list it will be a tree. This tree will still have to be reconstructed by following the use-def chains for code generation. There are error conditions that are possible with names that are not possible with nesting, e.g. using the same name in two different ops, missing yield ops, ordering, etc.
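
As a sketch of the extra checks the named scheme would need (Python, illustrative only): ops are modeled as `(name, operands, results)` triples in program order, `check_cli_uses` is a made-up helper, and I am assuming that a loop handle may be consumed by at most one transformation.

```python
# Illustrative model only; not an MLIR API.
# Assumption: each loop handle may be consumed by at most one transformation.

def check_cli_uses(ops):
    """Return a list of error messages for a sequence of (name, operands, results)."""
    defined, consumed, errors = set(), set(), []
    for name, operands, results in ops:
        for value in operands:
            if value not in defined:
                errors.append(f"{name}: {value} used before it is defined")
            elif value in consumed:
                errors.append(f"{name}: {value} is already consumed by another op")
            consumed.add(value)
        defined.update(results)
    return errors

# The %cli example quoted earlier: no errors.
good = [
    ("omp.canonical_loop", [], ["%cli"]),
    ("omp.tile", ["%cli"], ["%grid", "%intratile"]),
    ("omp.unroll_full", ["%intratile"], []),
]
# The same handle used by two different ops.
reused = [
    ("omp.canonical_loop", [], ["%cli"]),
    ("omp.tile", ["%cli"], ["%grid", "%intratile"]),
    ("omp.unroll_full", ["%cli"], []),
]
# Wrong ordering: the unroll appears before the tile that defines its operand.
misordered = [good[0], good[2], good[1]]

for ops in (good, reused, misordered):
    print(check_cli_uses(ops))
```

With nesting, the reuse and ordering cases cannot be written down in the first place.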

> In the OpenMP spec, my original proposal was to allow the user to give those generated loops names (idea stolen from the xlc compiler) to avoid extensive nesting with chains of transformations. E.g.:
> 
> ```
> #pragma omp unroll on(mytiledloop)
> #pragma omp tile on(myoriginalloop) sizes(4) generates(intratile:mytiledloop)
> 
> #pragma omp loopid(myoriginalloop)
> for (int i =0; i < n; ++i) {
>    ...
> }
> ```
> 
> After big discussions on what the namespace of the loopids would be, we settled on the `apply` clause. They are equally powerful, but I think the loopids would make it easier to use, e.g. apply different transformations for each target architecture.
> 

> For loop fission, we are going to add a new notion of "canonical loop sequence" to the specification, analogous to "canonical loop nest". The loop nests built out of `omp.canonical_loop` are implicit, determined by which ones are combined in transformations such as collapse. We can do the same with sequences:
> 
> ```
> %cli1 = omp.canonical_loop %iv1 = [0, %tc) {
>       ..
> }
> %cli2 = omp.canonical_loop %iv2 = [0, %tc) {
>       ..
> }
> %fused = omp.fuse (%cli1, %cli2)
> ```
> 
> Like with loop nests, you should be allowed to apply another transformation on only a subset of the generated loops. For instance with loop fission:
> 
> ```
> %cli1, %cli2 = omp.canonical_loop %iv = [0, %tc) {
>       ...
>       omp.fissure()
>       ...
> }
> omp.unroll (%cli2)
> ```
> 

If there were loops before or after the omp.fissure (or both), would the implicit ones be listed first and then the ones with omp.yield? It seems a bit complicated to keep track of the various ids and where they come from. FWIW, it is still somewhat hard to see with nesting. If there are two inner loops:
```
omp.canonical_loop { in_loop_structure = seq(loop(), loop()), out_loop_structure = seq(loop(loop()), loop(loop())) } {
  omp.canonical_loop { in_loop_structure =(), out_loop_structure = loop() } {
  }
  omp.fissure()
  omp.canonical_loop {in_loop_structure = (), out_loop_structure = loop() } {
  }
}
``` 
I think both schemes would work if we are limited to trees of loop nests. It seems more a matter of what is more convenient and practical.
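
For the fission case, the in/out structures in the example above could be derived mechanically. Here is a rough Python sketch under my own assumptions: `in_structure`, `out_structure` and the `FISSURE` marker are hypothetical, and each part of the split body simply becomes the body of its own copy of the outer loop.

```python
# Illustrative model only; not an MLIR API.
# A structure node is (kind, children) with kind "loop" or "seq".

FISSURE = "omp.fissure"  # marker standing in for the omp.fissure() op

def loop(*children):
    return ("loop", list(children))

def render(node):
    kind, children = node
    return kind + "(" + ", ".join(render(child) for child in children) + ")"

def in_structure(body):
    """The loops found in the outer loop's body, as a sequence."""
    return ("seq", [item for item in body if item != FISSURE])

def out_structure(body):
    """Split the body at each fissure; every part gets its own copy of the outer loop."""
    parts, current = [], []
    for item in body:
        if item == FISSURE:
            parts.append(current)
            current = []
        else:
            current.append(item)
    parts.append(current)
    return ("seq", [("loop", part) for part in parts])

body = [loop(), FISSURE, loop()]   # two inner loops separated by a fissure
print(render(in_structure(body)))
print(render(out_structure(body)))
```

This reproduces the two attributes written on the outer `omp.canonical_loop` above; with the named scheme the same tree would have to be assembled from the `%cli` results instead.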

> Syntactically, this would be represented as
> 
> ```
> #pragma omp fission apply(nothing,unroll)
> for (int i = 0; i < n; ++i) {
>   ...
>   #pragma omp fissure
>   ...
> }
> ```

https://github.com/llvm/llvm-project/pull/65380