[Mlir-commits] [mlir] [mlir][OpenMP] inscan reduction modifier and scan op mlir support (PR #114737)

Thu Dec 12 03:57:16 PST 2024

================
@@ -1560,6 +1560,26 @@ def CancellationPointOp : OpenMP_Op<"cancellation_point", clauses = [
   let hasVerifier = 1;
 }
 
+def ScanOp : OpenMP_Op<"scan", [
----------------
skatrak wrote:

Thank you @tblah and @kiranchandramohan for sharing your thoughts. @anchuraj and I had a call about this the other day, and at the time the 2-region approach made the most sense. But if we can somehow ensure that no MLIR transformations are able to move operations across the `scan` split point in a way that breaks semantics, it is my understanding that the translation to LLVM IR wouldn't be much more difficult than having explicit regions, and it would potentially make the representation and Flang lowering simpler in general.

With regards to both approaches introducing regions, they have the disadvantage that the current to the PFT to MLIR lowering system doesn't seem well-suited to lower only parts of a block of code, so we thought that, if we followed that approach, then we'd probably want to first produce an `omp.scan` with both regions empty at first, while lowering the rest of the loop normally. After that, we'd introduce another Flang-specific MLIR pass (similar to function filtering, map info finalization, etc.) that would sink operations before the `omp.scan` into its `pre` region and operations after it into its `post` region. If there were values defined in `pre` used in `post`, that pass would be able to detect them and produce the corresponding `omp.yield` and `post` entry block arguments.

If a common subexpression doesn't depend on the loop index, I agree that this would result in no problems for any of the approaches, since it should be hoisted out of the loop body. Same thing with constants, for example. However, one case I had in mind, which I think could potentially result in MLIR optimizations moving operations across the split point would be the `i + 1` expression below:
```f90
subroutine foo(v1, v2, size)
  implicit none
  integer, intent(in) :: size, v1(size)
  integer, intent(out) :: v2(size)
  integer :: sum, i
  sum = 0
  !$omp parallel do reduction(+:sum)
  do i = 0, size - 1
    sum = sum + v1(i + 1)
    !$omp scan inclusive(sum)
    v2(i + 1) = sum
  end do
  !$omp end parallel do
end subroutine foo
```

At the moment that results in separate `fir.load`s for `i` and separate `arith.constant`s for `1`, so the only values that would be used both inside of `pre` and `post` would be the `hlfir.declare` for the `sum` reduction variable (which we may want to leave out of both regions or put it in `pre`). But I can't see any reason why, if we don't split the loop body, MLIR passes wouldn't be able to realize there are 2 loads to a same variable that is not modified and calculate `i + 1` only once at the beginning. That kind of transformation doesn't seem like it would break the region-less approach, but it would be one case introducing the need for passing values from `pre` to `post`.

So I'm thinking that perhaps it would indeed be enough to mark `omp.scan` as reading from and writing to its `inclusive / exclusive` arguments. That should ensure no ops related to these values will be reordered across the split point, and that `omp.scan` won't be itself moved in a way that results in such a reordering. Maybe that's enough to ensure valid passes won't break this operation.

https://github.com/llvm/llvm-project/pull/114737