[llvm-branch-commits] [flang] [flang] Introduce custom loop nest generation for loops in workshare construct (PR #101445)

Thu Aug 22 08:32:19 PDT 2024

skatrak wrote:

> wsloop expects its parent block to be a parallel block which all threads will execute and all of those threads will share the work of the nested loop nest.

Yes, the `omp.wsloop` binds to the current team (usually the innermost `omp.parallel` parent). It doesn't have to be the direct parent, though; there can be other constructs in between.

> Whereas the workshare.loop_nest op is semantically executed by a single-thread (because the workshare directive acts like it preserves the semantics of single-threaded fortran execution.).

My understanding is that `omp.workshare` would be the operation defining a region with sequential execution semantics (as if it were executed by a single thread of the enclosing parallel region). Inside it, though, there can be worksharing loops where all threads split the loop iterations, which is why I proposed using `omp.wsloop`. Thinking about this, maybe it could be implemented based on existing OpenMP operations:

```f90
subroutine workshare(A, B, C)
  integer, parameter :: N = 10
  integer, intent(in) :: A(N), B(N)
  integer, intent(out) :: C(N)
  integer :: tmp

  !$omp parallel workshare
  C = A + B
  tmp = N
  C = C + A
  !$omp end parallel workshare
end subroutine workshare
```
```mlir
func.func @workshare(%A : ..., %B : ..., %C : ...) {
  %N = arith.constant 10 : i32
  %tmp = fir.alloca i32
  omp.parallel {
    omp.wsloop {
      omp.loop_nest (%i) : i32 = (%...) to (%...) inclusive step (%...) {
        // C(%i) = A(%i) + B(%i)
        omp.yield
      }
      omp.terminator
    }
    omp.single {
      fir.store %N to %tmp
      omp.terminator
    }
    omp.wsloop {
      omp.loop_nest (%i) : i32 = (%...) to (%...) inclusive step (%...) {
        // C(%i) = C(%i) + A(%i)
        omp.yield
      }
      omp.terminator
    }
    omp.terminator
  }
}
```
Maybe support for this operation could be based just on changes to how the MLIR representation is built in the first place. What do you think? Otherwise, something more similar to your proposal for `workdistribute` would be possible too: introducing only the `omp.workshare` operation, keeping `fir.do_loop` inside, and having some sort of pass translate this to `omp.wsloop` and `omp.single` (or split the parent `omp.parallel`). I just think that seems a bit too complex.
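
For reference, the `omp.wsloop` / `omp.single` structure in the MLIR sketch above corresponds roughly to the following hand-expanded Fortran. This is only an illustration of the intended semantics, not what the frontend emits: the name `workshare_expanded`, the explicit `do` loops, and the `demo` driver are all hypothetical, and the ordering between the three steps relies on the implicit barriers at the end of each `do` and `single` construct.

```f90
! Hypothetical hand-expanded equivalent of the workshare example above.
subroutine workshare_expanded(A, B, C)
  implicit none
  integer, parameter :: N = 10
  integer, intent(in) :: A(N), B(N)
  integer, intent(out) :: C(N)
  integer :: tmp, i

  !$omp parallel
  ! All threads of the team split the iterations (omp.wsloop).
  !$omp do
  do i = 1, N
    C(i) = A(i) + B(i)
  end do
  !$omp end do

  ! Exactly one thread performs the scalar assignment (omp.single).
  !$omp single
  tmp = N
  !$omp end single

  ! Second worksharing loop; it starts only after the barriers above.
  !$omp do
  do i = 1, N
    C(i) = C(i) + A(i)
  end do
  !$omp end do
  !$omp end parallel
end subroutine workshare_expanded

! Hypothetical driver, only here to make the sketch runnable.
program demo
  implicit none
  integer :: A(10), B(10), C(10), i
  A = [(i, i = 1, 10)]
  B = 1
  call workshare_expanded(A, B, C)
  print *, C   ! each element is 2*A(i) + B(i)
end program demo
```

Built with or without OpenMP enabled (e.g. `-fopenmp`), the result in `C` should be the same, since the directives only change how the work is distributed across threads.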

https://github.com/llvm/llvm-project/pull/101445