[flang-commits] [clang] [flang] [flang][OpenMP] Upstream first part of `do concurrent` mapping (PR #126026)
Kareem Ergawy via flang-commits
flang-commits at lists.llvm.org
Tue Feb 18 01:46:33 PST 2025
================
@@ -0,0 +1,380 @@
+<!--===- docs/DoConcurrentMappingToOpenMP.md
+
+ Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+ See https://llvm.org/LICENSE.txt for license information.
+ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+-->
+
+# `DO CONCURRENT` mapping to OpenMP
+
+```{contents}
+---
+local:
+---
+```
+
+This document seeks to describe the effort to parallelize `do concurrent` loops
+by mapping them to OpenMP worksharing constructs. The goals of this document
+are:
+* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
+ constructs.
+* Tracking the current status of such mapping.
+* Describing the limitations of the current implementation.
+* Describing next steps.
+* Tracking the current upstreaming status (from the AMD ROCm fork).
+
+## Usage
+
+To enable `do concurrent` to OpenMP mapping, `flang` adds a new compiler
+flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
+1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
+   This maps such loops to the equivalent of `omp parallel do`.
+2. `device`: this maps `do concurrent` loops to run in parallel on a target
+   device. This maps such loops to the equivalent of
+   `omp target teams distribute parallel do`.
+3. `none`: this disables `do concurrent` mapping altogether. In that case, such
+   loops are emitted as sequential loops.
+
+The above compiler switch is currently available only when OpenMP is also
+enabled, so you need to pass the following options to `flang` to enable it:
+```
+flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
+```
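+
+For example, the following small program (the file name and variable names are
+illustrative, not taken from the patch) can be compiled with the command above
+to run its loop in parallel on the host:
+
+```fortran
+! example.f90, compiled with:
+!   flang -fopenmp -fdo-concurrent-to-openmp=host example.f90
+program example
+  implicit none
+  integer, parameter :: n = 1024
+  real :: a(n)
+  integer :: i
+
+  ! This loop is the candidate for OpenMP mapping.
+  do concurrent (i = 1:n)
+    a(i) = 2.0 * i
+  end do
+
+  print *, a(n)
+end program example
+```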
+
+## Current status
+
+Under the hood, `do concurrent` mapping is implemented in the
+`DoConcurrentConversionPass`. This is still an experimental pass, which means
+that:
+* It has been tested in a very limited way so far.
+* It has been tested mostly on simple synthetic inputs.
+
+To describe the current status in more detail, the following sub-sections
+describe how the pass currently behaves for single-range loops and then for
+multi-range loops. They document the status of the downstream implementation
+in AMD's ROCm fork[^1]. We are gradually upstreaming the downstream
+implementation, and this document will be updated to reflect that process.
+Example LIT tests referenced below might only be available in the ROCm fork
+and will be upstreamed with the relevant parts of the code.
+
+[^1]: https://github.com/ROCm/llvm-project/blob/amd-staging/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
+
+### Single-range loops
+
+Given the following loop:
+```fortran
+ do concurrent(i=1:n)
+ a(i) = i * i
+ end do
+```
+
+#### Mapping to `host`
+
+Mapping this loop to the `host` generates MLIR operations of the following
+structure:
+
+```
+%4 = fir.address_of(@_QFEa) ...
+%6:2 = hlfir.declare %4 ...
+
+omp.parallel {
+ // Allocate private copy for `i`.
+ // TODO Use delayed privatization.
+ %19 = fir.alloca i32 {bindc_name = "i"}
+ %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...
+
+ omp.wsloop {
+ omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
+ %23 = fir.convert %arg0 : (index) -> i32
+ // Use the privatized version of `i`.
+ fir.store %23 to %20#1 : !fir.ref<i32>
+ ...
+
+ // Use "shared" SSA value of `a`.
+ %42 = hlfir.designate %6#0
+ hlfir.assign %35 to %42
+ ...
+ omp.yield
+ }
+ omp.terminator
+ }
+ omp.terminator
+}
+```
+
+#### Mapping to `device`
+
+Mapping the same loop to the `device` generates MLIR operations of the
+following structure:
+
+```
+// Map `a` to the `target` region. The pass automatically detects memory blocks
+// and maps them to the device. Currently, the detection logic is still limited,
+// and a lot of work is going into making it more capable.
+%29 = omp.map.info ... {name = "_QFEa"}
+omp.target ... map_entries(..., %29 -> %arg4 ...) {
+ ...
+ %51:2 = hlfir.declare %arg4
+ ...
+ omp.teams {
+ // Allocate private copy for `i`.
+ // TODO Use delayed privatization.
+ %52 = fir.alloca i32 {bindc_name = "i"}
+ %53:2 = hlfir.declare %52
+ ...
+
+ omp.parallel {
+ omp.distribute {
+ omp.wsloop {
+ omp.loop_nest (%arg5) : index = (%54) to (%55) inclusive step (%c1_9) {
+ // Use the privatized version of `i`.
+ %56 = fir.convert %arg5 : (index) -> i32
+ fir.store %56 to %53#1
+ ...
+ // Use the mapped version of `a`.
+ ... = hlfir.designate %51#0
+ ...
+ }
+ omp.terminator
+ }
+ omp.terminator
+ }
+ omp.terminator
+ }
+ omp.terminator
+ }
+ omp.terminator
+}
+```
+
+### Multi-range loops
+
+The pass currently supports multi-range loops as well. Given the following
+example:
+
+```fortran
+ do concurrent(i=1:n, j=1:m)
+ a(i,j) = i * j
+ end do
+```
+
+The generated `omp.loop_nest` operation looks like:
+
+```
+omp.loop_nest (%arg0, %arg1)
+ : index = (%17, %19) to (%18, %20)
+ inclusive step (%c1_2, %c1_4) {
+ fir.store %arg0 to %private_i#1 : !fir.ref<i32>
+ fir.store %arg1 to %private_j#1 : !fir.ref<i32>
+ ...
+ omp.yield
+}
+```
+
+It is worth noting that we have privatized versions of both iteration
+variables: `i` and `j`. These are locally allocated inside the parallel/target
+OpenMP region, similar to what the single-range example in the previous
+section shows.
+
+#### Multi-range and perfectly-nested loops
+
+Currently, on the `FIR` dialect level, the following loop:
+```fortran
+do concurrent(i=1:n, j=1:m)
+ a(i,j) = i * j
+end do
+```
+is modelled as a nest of `fir.do_loop` ops such that the outer loop's region
+contains:
+ 1. The operations needed to assign/update the outer loop's induction variable.
+ 1. The inner loop itself.
+
+So the MLIR structure looks similar to the following:
+```
+fir.do_loop %arg0 = %11 to %12 step %c1 unordered {
+ ...
+ fir.do_loop %arg1 = %14 to %15 step %c1_1 unordered {
+ ...
+ }
+}
+```
+This applies to multi-range loops in general; they are represented in the IR as
+a nest of `fir.do_loop` ops with the above nesting structure.
+
+Therefore, the pass detects such "perfectly" nested loop ops to identify
+multi-range loops and maps them as "collapsed" loops in OpenMP; see the
+directive sketch below.
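+
+In directive form, mapping the multi-range example above on the host is
+roughly equivalent to the following sketch (illustrative only, not compiler
+output):
+
+```fortran
+! Both ranges of the `do concurrent` header are combined into a single
+! parallel iteration space via `collapse(2)`.
+!$omp parallel do collapse(2)
+do i = 1, n
+  do j = 1, m
+    a(i,j) = i * j
+  end do
+end do
+!$omp end parallel do
+```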
+
+#### Further info regarding loop nest detection
+
+Loop-nest detection is currently limited to the scenario described in the
+previous section. However, it can be extended in the future to cover more
+cases. For example, in the following loop nest, even though both loops are
+perfectly nested, only the outer loop is parallelized at the moment:
+```fortran
+do concurrent(i=1:n)
+ do concurrent(j=1:m)
+ a(i,j) = i * j
+ end do
+end do
+```
+
+Similarly, for the following loop nest, even though the intervening statement
+`x = 41` does not have any memory effects that would affect parallelization,
+this nest is not fully parallelized either (only the outer loop is).
+
+```fortran
+do concurrent(i=1:n)
+ x = 41
+ do concurrent(j=1:m)
+ a(i,j) = i * j
+ end do
+end do
+```
----------------
ergawy wrote:
We can re-open the discussion when the relevant part is upstreamed. This part of the doc was removed until later in any case.
https://github.com/llvm/llvm-project/pull/126026