[Mlir-commits] [mlir] [MLIR][OpenMP] Add omp.simd operation (PR #79843)
Sergio Afonso
llvmlistbot at llvm.org
Fri Feb 9 05:04:08 PST 2024
skatrak wrote:
> > While I don't currently get the full picture, the dependencies you mention between distribute and parallel/wsloop seem to stem from the fact that `distribute parallel do` is a composite construct. So maybe the solution is not to share lowering state between these but rather to recognize this is another construct separate from `parallel do` and to handle it independently, possibly sharing a good amount of code with that other combined construct.
>
> Below is one example that hopefully illustrates the various issues.
>
> ```
> #pragma omp target teams distribute parallel for
> #pragma omp tile sizes(2)
> for (int i = 0; i < N; ++i) {
>   code(i);
> }
> ```
>
> is equivalent to:
>
> ```
> #pragma omp target teams distribute parallel for
> for (int ii = 0; ii < N; ii += 2) {
>   for (int i = ii; i < ii + 2; ++i) {
>     code(i);
>   }
> }
> ```
>
> If the lowering of `#pragma omp tile sizes(2)` happens in the OMPIRBuilder, we would have to determine at lowering time that the resulting outer loop should be `distribute parallel for` while the inner loop is sequential. If we are also bound by the current lowering mechanism, i.e. a single pass over the MLIR module, then I think that implies there is no way to create combined constructs ahead of time: the ops have to be individual wrapper ops, and there has to be some information sharing between them during lowering. Also consider the case where `collapse(2)` is present, which would affect codegen for the inner loop differently.
Thanks for the clarification. However, I still can't quite see how, with regard to this loop transformation problem, representing distribute-parallel-for in that example with composite operations differs from using wrapper ops. If we're able to represent the loop transformations in MLIR, even if we only materialize these loops in the OMPIRBuilder (since that is where loop transformations are eventually applied), we should then be able to apply the corresponding parallelization/worksharing constructs to the resulting top-level loop. Using your example above, and making assumptions about how loops and transformations might be represented in MLIR:
```mlir
// Define the loop body.
%loop = omp.canonical_loop %i : i32 context(...) {
  // Use additional block args to access context values defined at the
  // execution site.
  llvm.call @code(%i) : (i32) -> ()
  omp.terminator
} (...) -> !omp.loop

// Represent transformations.
%tloop = omp.tile %loop { sizes = [2] } : (!omp.loop) -> (!omp.loop)

// OPTION 1: Composite ops
omp.target {
  %n = ... : i32
  %c0 = arith.constant 0 : i32
  %c1 = arith.constant 1 : i32
  omp.teams {
    // Execute the loop, specifying its pre-transformation range.
    omp.distparwsloop %tloop from(%c0) to(%n) step(%c1) context(...) : (!omp.loop, i32, i32, i32, ...) -> ()
    omp.terminator
  }
  omp.terminator
}

// OPTION 2: Wrapper ops
omp.target {
  %n = ... : i32
  %c0 = arith.constant 0 : i32
  %c1 = arith.constant 1 : i32
  omp.teams {
    omp.distribute {
      omp.parallel {
        // Execute the loop, specifying its pre-transformation range.
        omp.wsloop %tloop from(%c0) to(%n) step(%c1) context(...) : (!omp.loop, i32, i32, i32, ...) -> ()
        omp.terminator
      }
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}
```
In the example above, there are probably some challenges in lowering the loop body and its transformations, since the body is defined outside of the place where it is eventually executed. Maybe it should be ignored at first and only processed when lowering the ops that run it (`omp.distparwsloop` or `omp.wsloop` in this case), or maybe it should be generated as a temporary outlined function which is then called or inlined at the point where it runs, passing loop bounds, step and context as arguments. But I think this is no more or less complicated regardless of how we represent the parallelism/worksharing constructs that execute the loop.
From the MLIR to LLVM IR translation perspective, the wrapper approach would make the following calls:
- `convertOmpTarget`
- `convertOmpTeams`
- `convertOmpDistribute`
- `convertOmpParallel`
- `convertOmpWsLoop`
Whereas the composite approach would make the following calls:
- `convertOmpTarget`
- `convertOmpTeams`
- `convertOmpDistParWsLoop`
It seems to me that `convertOmpDistParWsLoop` could be split into calls to the equivalent of `convertOmpDistribute` + `convertOmpParallel` + `convertOmpWsLoop`, if that's the best way to deal with that composite construct. If anything, it looks like it would give us some flexibility as to how to share information (e.g. reductions) across these partial translations of the composite construct, rather than having to add more state to the OMPIRBuilder.
https://github.com/llvm/llvm-project/pull/79843