[flang-commits] [clang] [flang] [flang][OpenMP] Upstream first part of `do concurrent` mapping (PR #126026)

Kareem Ergawy via flang-commits flang-commits at lists.llvm.org
Sun Feb 16 23:36:58 PST 2025


https://github.com/ergawy updated https://github.com/llvm/llvm-project/pull/126026

>From 2a54270a2ad7f42ddf6787afd81a8b98641f8082 Mon Sep 17 00:00:00 2001
From: ergawy <kareem.ergawy at amd.com>
Date: Wed, 5 Feb 2025 23:31:15 -0600
Subject: [PATCH 01/10] [flang][OpenMP] Upstream first part of `do concurrent`
 mapping

This PR starts the effort to upstream AMD's internal implementation of
`do concurrent` to OpenMP mapping. This replaces #77285 since we
extended this WIP quite a bit on our fork over the past year.

An important part of this PR is a document that describes the current
status downstream, the upstreaming status, and next steps to make this
pass much more useful.

In addition to this document, this PR also contains the skeleton of the
pass (no useful transformations are done yet) and some testing for the
added command line options.
---
 clang/include/clang/Driver/Options.td         |   4 +
 clang/lib/Driver/ToolChains/Flang.cpp         |   3 +-
 flang/docs/DoConcurrentConversionToOpenMP.md  | 380 ++++++++++++++++++
 flang/docs/index.md                           |   1 +
 .../include/flang/Frontend/CodeGenOptions.def |   2 +
 flang/include/flang/Frontend/CodeGenOptions.h |   5 +
 flang/include/flang/Optimizer/OpenMP/Passes.h |   2 +
 .../include/flang/Optimizer/OpenMP/Passes.td  |  30 ++
 flang/include/flang/Optimizer/OpenMP/Utils.h  |  26 ++
 .../flang/Optimizer/Passes/Pipelines.h        |  11 +-
 flang/lib/Frontend/CompilerInvocation.cpp     |  30 ++
 flang/lib/Frontend/FrontendActions.cpp        |  31 +-
 flang/lib/Optimizer/OpenMP/CMakeLists.txt     |   1 +
 .../OpenMP/DoConcurrentConversion.cpp         | 104 +++++
 flang/lib/Optimizer/Passes/Pipelines.cpp      |   9 +-
 .../Transforms/DoConcurrent/basic_host.f90    |  53 +++
 .../DoConcurrent/command_line_options.f90     |  18 +
 flang/tools/bbc/bbc.cpp                       |  20 +-
 18 files changed, 720 insertions(+), 10 deletions(-)
 create mode 100644 flang/docs/DoConcurrentConversionToOpenMP.md
 create mode 100644 flang/include/flang/Optimizer/OpenMP/Utils.h
 create mode 100644 flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
 create mode 100644 flang/test/Transforms/DoConcurrent/basic_host.f90
 create mode 100644 flang/test/Transforms/DoConcurrent/command_line_options.f90

diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 5ad187926e710..fedf2cdad3d49 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -6927,6 +6927,10 @@ defm loop_versioning : BoolOptionWithoutMarshalling<"f", "version-loops-for-stri
 
 def fhermetic_module_files : Flag<["-"], "fhermetic-module-files">, Group<f_Group>,
   HelpText<"Emit hermetic module files (no nested USE association)">;
+
+def do_concurrent_parallel_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
+  HelpText<"Try to map `do concurrent` loops to OpenMP [none|host|device]">,
+      Values<"none,host,device">;
 } // let Visibility = [FC1Option, FlangOption]
 
 def J : JoinedOrSeparate<["-"], "J">,
diff --git a/clang/lib/Driver/ToolChains/Flang.cpp b/clang/lib/Driver/ToolChains/Flang.cpp
index 9ad795edd724d..bf0bfacd03742 100644
--- a/clang/lib/Driver/ToolChains/Flang.cpp
+++ b/clang/lib/Driver/ToolChains/Flang.cpp
@@ -153,7 +153,8 @@ void Flang::addCodegenOptions(const ArgList &Args,
     CmdArgs.push_back("-fversion-loops-for-stride");
 
   Args.addAllArgs(CmdArgs,
-                  {options::OPT_flang_experimental_hlfir,
+                  {options::OPT_do_concurrent_parallel_EQ,
+                   options::OPT_flang_experimental_hlfir,
                    options::OPT_flang_deprecated_no_hlfir,
                    options::OPT_fno_ppc_native_vec_elem_order,
                    options::OPT_fppc_native_vec_elem_order,
diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
new file mode 100644
index 0000000000000..6807e402ce081
--- /dev/null
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -0,0 +1,380 @@
+<!--===- docs/DoConcurrentConversionToOpenMP.md
+
+   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+   See https://llvm.org/LICENSE.txt for license information.
+   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+-->
+
+# `DO CONCURRENT` mapping to OpenMP
+
+```{contents}
+---
+local:
+---
+```
+
+This document describes the effort to parallelize `do concurrent` loops by
+mapping them to OpenMP worksharing constructs. The goals of this document
+are:
+* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
+  constructs.
+* Tracking the current status of such mapping.
+* Describing the limitations of the current implementation.
+* Describing next steps.
+* Tracking the current upstreaming status (from the AMD ROCm fork).
+
+## Usage
+
+In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
+compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
+1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
+   This maps such loops to the equivalent of `omp parallel do`.
+2. `device`: this maps `do concurrent` loops to run in parallel on a device
+   (GPU). This maps such loops to the equivalent of `omp target teams
+   distribute parallel do`.
+3. `none`: this disables `do concurrent` mapping altogether. In that case, such
+   loops are emitted as sequential loops.
+
+The above compiler switch is currently available only when OpenMP is also
+enabled, so you need to provide the following options to `flang` in order to
+enable it:
+```
+flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
+```
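+
+For example, to compile a file with host mapping enabled (an illustrative
+invocation; `example.f90` is a placeholder file name):
+```
+flang -fopenmp -fdo-concurrent-to-openmp=host example.f90
+```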
+
+## Current status
+
+Under the hood, `do concurrent` mapping is implemented in the
+`DoConcurrentConversionPass`. This is still an experimental pass which means
+that:
+* It has been tested in a very limited way so far.
+* It has been tested mostly on simple synthetic inputs.
+
+To describe the current status in more detail, the following sections first
+describe how the pass currently behaves for single-range loops and then for
+multi-range loops. These sub-sections describe the status of the downstream
+implementation on AMD's ROCm fork(*). We are gradually upstreaming the
+downstream implementation, and this document will be updated to reflect that
+process. Example LIT tests referenced below might only be available in the
+ROCm fork and will be upstreamed with the relevant parts of the code.
+
+(*) https://github.com/ROCm/llvm-project/blob/amd-staging/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
+
+### Single-range loops
+
+Given the following loop:
+```fortran
+  do concurrent(i=1:n)
+    a(i) = i * i
+  end do
+```
+
+#### Mapping to `host`
+
+Mapping this loop to the `host` generates MLIR operations of the following
+structure:
+
+```
+%4 = fir.address_of(@_QFEa) ...
+%6:2 = hlfir.declare %4 ...
+
+omp.parallel {
+  // Allocate private copy for `i`.
+  // TODO Use delayed privatization.
+  %19 = fir.alloca i32 {bindc_name = "i"}
+  %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...
+
+  omp.wsloop {
+    omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
+      %23 = fir.convert %arg0 : (index) -> i32
+      // Use the privatized version of `i`.
+      fir.store %23 to %20#1 : !fir.ref<i32>
+      ...
+
+      // Use "shared" SSA value of `a`.
+      %42 = hlfir.designate %6#0
+      hlfir.assign %35 to %42
+      ...
+      omp.yield
+    }
+    omp.terminator
+  }
+  omp.terminator
+}
+```
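+
+At the source level, this mapping corresponds roughly to the following OpenMP
+version of the loop (a conceptual sketch for illustration only; the pass
+operates on MLIR, not on Fortran source):
+
+```fortran
+! Conceptual equivalent of the `do concurrent` loop under
+! `-fdo-concurrent-to-openmp=host`; not actual compiler output.
+!$omp parallel do
+do i = 1, n
+  a(i) = i * i
+end do
+!$omp end parallel do
+```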
+
+#### Mapping to `device`
+
+Mapping the same loop to the `device` generates MLIR operations of the
+following structure:
+
+```
+// Map `a` to the `target` region. The pass automatically detects memory blocks
+// and maps them to device. Currently detection logic is still limited and a lot
+// of work is going into making it more capable.
+%29 = omp.map.info ... {name = "_QFEa"}
+omp.target ... map_entries(..., %29 -> %arg4 ...) {
+  ...
+  %51:2 = hlfir.declare %arg4
+  ...
+  omp.teams {
+    // Allocate private copy for `i`.
+    // TODO Use delayed privatization.
+    %52 = fir.alloca i32 {bindc_name = "i"}
+    %53:2 = hlfir.declare %52
+    ...
+
+    omp.parallel {
+      omp.distribute {
+        omp.wsloop {
+          omp.loop_nest (%arg5) : index = (%54) to (%55) inclusive step (%c1_9) {
+            // Use the privatized version of `i`.
+            %56 = fir.convert %arg5 : (index) -> i32
+            fir.store %56 to %53#1
+            ...
+            // Use the mapped version of `a`.
+            ... = hlfir.designate %51#0
+            ...
+          }
+          omp.terminator
+        }
+        omp.terminator
+      }
+      omp.terminator
+    }
+    omp.terminator
+  }
+  omp.terminator
+}
+```
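+
+At the source level, this mapping corresponds roughly to the following OpenMP
+version of the loop (a conceptual sketch for illustration only; the actual
+`map` clauses are derived by the pass's detection logic described above):
+
+```fortran
+! Conceptual equivalent under `-fdo-concurrent-to-openmp=device`;
+! not actual compiler output.
+!$omp target teams distribute parallel do map(tofrom: a)
+do i = 1, n
+  a(i) = i * i
+end do
+!$omp end target teams distribute parallel do
+```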
+
+### Multi-range loops
+
+The pass currently supports multi-range loops as well. Given the following
+example:
+
+```fortran
+   do concurrent(i=1:n, j=1:m)
+       a(i,j) = i * j
+   end do
+```
+
+The generated `omp.loop_nest` operation looks like:
+
+```
+omp.loop_nest (%arg0, %arg1)
+    : index = (%17, %19) to (%18, %20)
+    inclusive step (%c1_2, %c1_4) {
+  fir.store %arg0 to %private_i#1 : !fir.ref<i32>
+  fir.store %arg1 to %private_j#1 : !fir.ref<i32>
+  ...
+  omp.yield
+}
+```
+
+It is worth noting that we have privatized versions of both iteration
+variables: `i` and `j`. These are locally allocated inside the parallel/target
+OpenMP region, similar to the single-range example in the previous section.
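+
+Conceptually, this multi-range loop is mapped as if it had been written as a
+collapsed OpenMP loop nest (an illustrative sketch, not compiler output):
+
+```fortran
+!$omp parallel do collapse(2)
+do i = 1, n
+  do j = 1, m
+    a(i,j) = i * j
+  end do
+end do
+!$omp end parallel do
+```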
+
+#### Multi-range and perfectly-nested loops
+
+Currently, on the `FIR` dialect level, the following loop:
+```fortran
+do concurrent(i=1:n, j=1:m)
+  a(i,j) = i * j
+end do
+```
+is modelled as a nest of `fir.do_loop` ops such that the outer loop's region
+contains:
+  1. The operations needed to assign/update the outer loop's induction variable.
+  1. The inner loop itself.
+
+So the MLIR structure looks similar to the following:
+```
+fir.do_loop %arg0 = %11 to %12 step %c1 unordered {
+  ...
+  fir.do_loop %arg1 = %14 to %15 step %c1_1 unordered {
+    ...
+  }
+}
+```
+This applies to multi-range loops in general; they are represented in the IR as
+a nest of `fir.do_loop` ops with the above nesting structure.
+
+Therefore, the pass detects such "perfectly" nested loop ops to identify
+multi-range loops and maps them as "collapsed" loops in OpenMP.
+
+#### Further info regarding loop nest detection
+
+Loop-nest detection is currently limited to the scenario described in the
+previous section. This is quite restrictive and can be extended in the future
+to cover more cases. For example, in the following loop nest, even though both
+loops are perfectly nested, only the outer loop is currently parallelized:
+```fortran
+do concurrent(i=1:n)
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+```
+
+Similarly, for the following loop nest, even though the intervening statement
+`x = 41` does not have any memory effects that would affect parallelization,
+only the outer loop is parallelized:
+
+```fortran
+do concurrent(i=1:n)
+  x = 41
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+```
+
+The above also has the consequence that the `j` variable will **not** be
+privatized in the OpenMP parallel/target region. In other words, it will be
+treated as if it were a `shared` variable. For more details about privatization,
+see the "Data environment" section below.
+
+See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
+of what is and is not detected as a perfect loop nest.
+
+### Data environment
+
+By default, variables that are used inside a `do concurrent` loop nest are
+either treated as `shared` in case of mapping to `host`, or mapped into the
+`target` region using a `map` clause in case of mapping to `device`. The only
+exceptions to this are:
+  1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that
+     case, for each IV, we allocate a local copy as shown by the mapping
+     examples above.
+  1. any values that result from allocations outside the loop nest and are used
+     exclusively inside of it. In such cases, a local privatized
+     value is created in the OpenMP region to prevent multiple teams of threads
+     from accessing and destroying the same memory block which causes runtime
+     issues. For an example of such cases, see
+     `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.
+
+Implicit mapping detection (for mapping to the GPU) is still quite limited and
+work to make it smarter is underway for both OpenMP in general and `do concurrent`
+mapping.
+
+#### Non-perfectly-nested loops' IVs
+
+For non-perfectly-nested loops, the IVs are still treated as `shared` or
+`map` entries as pointed out above. This **might not** be consistent with what
+the Fortran specification tells us. In particular, take the following
+snippets from the spec (version 2023) into account:
+
+> § 3.35
+> ------
+> construct entity
+> entity whose identifier has the scope of a construct
+
+> § 19.4
+> ------
+>  A variable that appears as an index-name in a FORALL or DO CONCURRENT
+>  construct, or ... is a construct entity. A variable that has LOCAL or
+>  LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
+> ...
+> The name of a variable that appears as an index-name in a DO CONCURRENT
+> construct, FORALL statement, or FORALL construct has a scope of the statement
+> or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
+> CONCURRENT construct has the scope of that construct.
+
+From the above quotes, it seems there is an equivalence between the IV of a `do
+concurrent` loop and a variable with a `LOCAL` locality specifier (equivalent
+to OpenMP's `private` clause). This means that we should probably
+localize/privatize a `do concurrent` loop's IV even if it is not perfectly
+nested in the nest we are parallelizing. For now, however, we **do not** do
+that, as pointed out previously. In the near future, we propose a middle-ground
+solution (see the "Next steps" section for more details).
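+
+For illustration, the `LOCAL` locality specifier mentioned above gives each
+iteration its own copy of a variable, analogous to OpenMP's `private` clause
+(a hypothetical example, not taken from the test suite):
+
+```fortran
+! Each iteration gets its own `x`; without `local(x)`, `x` would be shared.
+do concurrent(i=1:n) local(x)
+  x = i * i
+  a(i) = x
+end do
+```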
+
+## Next steps
+
+### Delayed privatization
+
+So far, we emit the privatization logic for IVs inline in the parallel/target
+region. This is enough for our purposes right now since we don't
+localize/privatize any sophisticated types of variables yet. Once we need
+more advanced localization through `do concurrent`'s locality specifiers
+(see below), delayed privatization will enable us to emit much cleaner IR.
+Once the upstream implementation of delayed privatization supports the
+constructs required by this pass, we will switch to it instead of
+inlined/early privatization.
+
+### Locality specifiers for `do concurrent`
+
+Locality specifiers will enable the user to control the data environment of the
+loop nest in a more fine-grained way. Implementing these specifiers on the
+`FIR` dialect level is needed in order to support this in the
+`DoConcurrentConversionPass`.
+
+Such specifiers will also unlock a potential solution to the
+non-perfectly-nested loops' IVs issue described above. In particular, for a
+non-perfectly nested loop, one middle-ground proposal/solution would be to:
+* Emit the loop's IV as shared/mapped just like we do currently.
+* Emit a warning that the IV of the loop is emitted as shared/mapped.
+* Given support for `LOCAL`, we can recommend that the user explicitly
+  localize/privatize the loop's IV if they choose to.
+
+#### Sharing TableGen clause records from the OpenMP dialect
+
+At the moment, the FIR dialect does not have a way to model locality specifiers
+on the IR level. Instead, something similar to OpenMP's early/eager
+privatization is done for the locality specifiers in `fir.do_loop` ops. Having
+locality specifiers modelled in a way similar to delayed privatization (i.e.
+the `omp.private` op) and reductions (i.e. the `omp.declare_reduction` op) can
+make mapping `do concurrent` to OpenMP (and other parallelization models) much
+easier.
+
+Therefore, one way to approach this problem is to extract the TableGen records
+for relevant OpenMP clauses in a shared dialect for "data environment management"
+and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
+as well.
+
+### More advanced detection of loop nests
+
+As pointed out earlier, any intervening code between the headers of 2 nested
+`do concurrent` loops currently prevents us from detecting this as a loop nest.
+In some cases this is overly conservative. Therefore, more flexible loop-nest
+detection logic needs to be implemented.
+
+### Data-dependence analysis
+
+Right now, we map loop nests without analyzing whether such mapping is safe to
+do or not. We probably need to at least warn the user about unsafe loop nests
+due to loop-carried dependencies.
+
+### Non-rectangular loop nests
+
+So far, we have not needed to use the pass for non-rectangular loop nests. For
+example:
+```fortran
+do concurrent(i=1:n)
+  do concurrent(j=i:n)
+    ...
+  end do
+end do
+```
+We defer this to the (hopefully) near future when we get the conversion in
+good shape for the samples/projects at hand.
+
+### Generalizing the pass to other parallelization models
+
+Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
+this in a more generalized direction and allow the pass to target other models;
+e.g. OpenACC. This goal should be kept in mind from the get-go even while only
+targeting OpenMP.
+
+
+## Upstreaming status
+
+- [x] Command line options for `flang` and `bbc`.
+- [x] Conversion pass skeleton (no transformations happen yet).
+- [x] Status description and tracking document (this document).
+- [ ] Basic host/CPU mapping support.
+- [ ] Basic device/GPU mapping support.
+- [ ] More advanced host and device support (expanded into multiple items as needed).
diff --git a/flang/docs/index.md b/flang/docs/index.md
index c35f634746e68..913e53d4cfed9 100644
--- a/flang/docs/index.md
+++ b/flang/docs/index.md
@@ -50,6 +50,7 @@ on how to get in touch with us and to learn more about the current status.
    DebugGeneration
    Directives
    DoConcurrent
+   DoConcurrentConversionToOpenMP
    Extensions
    F202X
    FIRArrayOperations
diff --git a/flang/include/flang/Frontend/CodeGenOptions.def b/flang/include/flang/Frontend/CodeGenOptions.def
index deb8d1aede518..13cda965600b5 100644
--- a/flang/include/flang/Frontend/CodeGenOptions.def
+++ b/flang/include/flang/Frontend/CodeGenOptions.def
@@ -41,5 +41,7 @@ ENUM_CODEGENOPT(DebugInfo,  llvm::codegenoptions::DebugInfoKind, 4,  llvm::codeg
 ENUM_CODEGENOPT(VecLib, llvm::driver::VectorLibrary, 3, llvm::driver::VectorLibrary::NoLibrary) ///< Vector functions library to use
 ENUM_CODEGENOPT(FramePointer, llvm::FramePointerKind, 2, llvm::FramePointerKind::None) ///< Enable the usage of frame pointers
 
+ENUM_CODEGENOPT(DoConcurrentMapping, DoConcurrentMappingKind, 2, DoConcurrentMappingKind::DCMK_None) ///< Map `do concurrent` to OpenMP
+
 #undef CODEGENOPT
 #undef ENUM_CODEGENOPT
diff --git a/flang/include/flang/Frontend/CodeGenOptions.h b/flang/include/flang/Frontend/CodeGenOptions.h
index f19943335737b..23d99e1f0897a 100644
--- a/flang/include/flang/Frontend/CodeGenOptions.h
+++ b/flang/include/flang/Frontend/CodeGenOptions.h
@@ -15,6 +15,7 @@
 #ifndef FORTRAN_FRONTEND_CODEGENOPTIONS_H
 #define FORTRAN_FRONTEND_CODEGENOPTIONS_H
 
+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "llvm/Frontend/Debug/Options.h"
 #include "llvm/Frontend/Driver/CodeGenOptions.h"
 #include "llvm/Support/CodeGen.h"
@@ -143,6 +144,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// (-mlarge-data-threshold).
   uint64_t LargeDataThreshold;
 
+  /// Optionally map `do concurrent` loops to OpenMP. This is only valid if
+  /// OpenMP is enabled.
+  using DoConcurrentMappingKind = flangomp::DoConcurrentMappingKind;
+
   // Define accessors/mutators for code generation options of enumeration type.
 #define CODEGENOPT(Name, Bits, Default)
 #define ENUM_CODEGENOPT(Name, Type, Bits, Default)                             \
diff --git a/flang/include/flang/Optimizer/OpenMP/Passes.h b/flang/include/flang/Optimizer/OpenMP/Passes.h
index feb395f1a12db..c67bddbcd2704 100644
--- a/flang/include/flang/Optimizer/OpenMP/Passes.h
+++ b/flang/include/flang/Optimizer/OpenMP/Passes.h
@@ -13,6 +13,7 @@
 #ifndef FORTRAN_OPTIMIZER_OPENMP_PASSES_H
 #define FORTRAN_OPTIMIZER_OPENMP_PASSES_H
 
+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "mlir/Dialect/Func/IR/FuncOps.h"
 #include "mlir/IR/BuiltinOps.h"
 #include "mlir/Pass/Pass.h"
@@ -30,6 +31,7 @@ namespace flangomp {
 /// divided into units of work.
 bool shouldUseWorkshareLowering(mlir::Operation *op);
 
+std::unique_ptr<mlir::Pass> createDoConcurrentConversionPass(bool mapToDevice);
 } // namespace flangomp
 
 #endif // FORTRAN_OPTIMIZER_OPENMP_PASSES_H
diff --git a/flang/include/flang/Optimizer/OpenMP/Passes.td b/flang/include/flang/Optimizer/OpenMP/Passes.td
index 3add0c560f88d..b2152f8c674ea 100644
--- a/flang/include/flang/Optimizer/OpenMP/Passes.td
+++ b/flang/include/flang/Optimizer/OpenMP/Passes.td
@@ -50,6 +50,36 @@ def FunctionFilteringPass : Pass<"omp-function-filtering"> {
   ];
 }
 
+def DoConcurrentConversionPass : Pass<"fopenmp-do-concurrent-conversion", "mlir::func::FuncOp"> {
+  let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";
+
+  let description = [{ This is an experimental pass to map `DO CONCURRENT` loops
+     to their corresponding OpenMP worksharing constructs.
+
+     For now the following is supported:
+       - Mapping simple loops to `parallel do`.
+
+     Still TODO:
+       - More extensive testing.
+  }];
+
+  let dependentDialects = ["mlir::omp::OpenMPDialect"];
+
+  let options = [
+    Option<"mapTo", "map-to",
+           "flangomp::DoConcurrentMappingKind",
+           /*default=*/"flangomp::DoConcurrentMappingKind::DCMK_None",
+           "Try to map `do concurrent` loops to OpenMP [none|host|device]",
+           [{::llvm::cl::values(
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_None,
+                          "none", "Do not lower `do concurrent` to OpenMP"),
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Host,
+                          "host", "Lower to run in parallel on the CPU"),
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Device,
+                          "device", "Lower to run in parallel on the GPU")
+           )}]>,
+  ];
+}
 
 // Needs to be scheduled on Module as we create functions in it
 def LowerWorkshare : Pass<"lower-workshare", "::mlir::ModuleOp"> {
diff --git a/flang/include/flang/Optimizer/OpenMP/Utils.h b/flang/include/flang/Optimizer/OpenMP/Utils.h
new file mode 100644
index 0000000000000..636c768b016b7
--- /dev/null
+++ b/flang/include/flang/Optimizer/OpenMP/Utils.h
@@ -0,0 +1,26 @@
+//===-- Optimizer/OpenMP/Utils.h --------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Coding style: https://mlir.llvm.org/getting_started/DeveloperGuide/
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef FORTRAN_OPTIMIZER_OPENMP_UTILS_H
+#define FORTRAN_OPTIMIZER_OPENMP_UTILS_H
+
+namespace flangomp {
+
+enum class DoConcurrentMappingKind {
+  DCMK_None,  ///< Do not lower `do concurrent` to OpenMP.
+  DCMK_Host,  ///< Lower to run in parallel on the CPU.
+  DCMK_Device ///< Lower to run in parallel on the GPU.
+};
+
+} // namespace flangomp
+
+#endif // FORTRAN_OPTIMIZER_OPENMP_UTILS_H
diff --git a/flang/include/flang/Optimizer/Passes/Pipelines.h b/flang/include/flang/Optimizer/Passes/Pipelines.h
index ef5d44ded706c..2a34cd94809ad 100644
--- a/flang/include/flang/Optimizer/Passes/Pipelines.h
+++ b/flang/include/flang/Optimizer/Passes/Pipelines.h
@@ -128,6 +128,14 @@ void createHLFIRToFIRPassPipeline(
     mlir::PassManager &pm, bool enableOpenMP,
     llvm::OptimizationLevel optLevel = defaultOptLevel);
 
+using DoConcurrentMappingKind =
+    Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
+
+struct OpenMPFIRPassPipelineOpts {
+  bool isTargetDevice;
+  DoConcurrentMappingKind doConcurrentMappingKind;
+};
+
 /// Create a pass pipeline for handling certain OpenMP transformations needed
 /// prior to FIR lowering.
 ///
@@ -137,7 +145,8 @@ void createHLFIRToFIRPassPipeline(
 /// \param pm - MLIR pass manager that will hold the pipeline definition.
 /// \param isTargetDevice - Whether code is being generated for a target device
 /// rather than the host device.
-void createOpenMPFIRPassPipeline(mlir::PassManager &pm, bool isTargetDevice);
+void createOpenMPFIRPassPipeline(mlir::PassManager &pm,
+                                 OpenMPFIRPassPipelineOpts opts);
 
 #if !defined(FLANG_EXCLUDE_CODEGEN)
 void createDebugPasses(mlir::PassManager &pm,
diff --git a/flang/lib/Frontend/CompilerInvocation.cpp b/flang/lib/Frontend/CompilerInvocation.cpp
index f3d9432c62d3b..232f383fc99ce 100644
--- a/flang/lib/Frontend/CompilerInvocation.cpp
+++ b/flang/lib/Frontend/CompilerInvocation.cpp
@@ -157,6 +157,34 @@ static bool parseDebugArgs(Fortran::frontend::CodeGenOptions &opts,
   return true;
 }
 
+static bool parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
+                                     llvm::opt::ArgList &args,
+                                     clang::DiagnosticsEngine &diags) {
+  llvm::opt::Arg *arg =
+      args.getLastArg(clang::driver::options::OPT_do_concurrent_parallel_EQ);
+  if (!arg)
+    return true;
+
+  using DoConcurrentMappingKind =
+      Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
+  std::optional<DoConcurrentMappingKind> val =
+      llvm::StringSwitch<std::optional<DoConcurrentMappingKind>>(
+          arg->getValue())
+          .Case("none", DoConcurrentMappingKind::DCMK_None)
+          .Case("host", DoConcurrentMappingKind::DCMK_Host)
+          .Case("device", DoConcurrentMappingKind::DCMK_Device)
+          .Default(std::nullopt);
+
+  if (!val.has_value()) {
+    diags.Report(clang::diag::err_drv_invalid_value)
+        << arg->getAsString(args) << arg->getValue();
+    return false;
+  }
+
+  opts.setDoConcurrentMapping(val.value());
+  return true;
+}
+
 static bool parseVectorLibArg(Fortran::frontend::CodeGenOptions &opts,
                               llvm::opt::ArgList &args,
                               clang::DiagnosticsEngine &diags) {
@@ -426,6 +454,8 @@ static void parseCodeGenArgs(Fortran::frontend::CodeGenOptions &opts,
                    clang::driver::options::OPT_funderscoring, false)) {
     opts.Underscoring = 0;
   }
+
+  parseDoConcurrentMapping(opts, args, diags);
 }
 
 /// Parses all target input arguments and populates the target
diff --git a/flang/lib/Frontend/FrontendActions.cpp b/flang/lib/Frontend/FrontendActions.cpp
index 763c810ace0eb..0809e4a0e2773 100644
--- a/flang/lib/Frontend/FrontendActions.cpp
+++ b/flang/lib/Frontend/FrontendActions.cpp
@@ -352,16 +352,37 @@ bool CodeGenAction::beginSourceFileAction() {
   // Add OpenMP-related passes
   // WARNING: These passes must be run immediately after the lowering to ensure
   // that the FIR is correct with respect to OpenMP operations/attributes.
-  if (ci.getInvocation().getFrontendOpts().features.IsEnabled(
-          Fortran::common::LanguageFeature::OpenMP)) {
-    bool isDevice = false;
+  bool isOpenMPEnabled =
+      ci.getInvocation().getFrontendOpts().features.IsEnabled(
+          Fortran::common::LanguageFeature::OpenMP);
+
+  fir::OpenMPFIRPassPipelineOpts opts;
+
+  using DoConcurrentMappingKind =
+      Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
+  opts.doConcurrentMappingKind =
+      ci.getInvocation().getCodeGenOpts().getDoConcurrentMapping();
+
+  if (opts.doConcurrentMappingKind != DoConcurrentMappingKind::DCMK_None &&
+      !isOpenMPEnabled) {
+    unsigned diagID = ci.getDiagnostics().getCustomDiagID(
+        clang::DiagnosticsEngine::Error,
+        "lowering `do concurrent` loops to OpenMP is only supported if "
+        "OpenMP is enabled. Enable OpenMP using `-fopenmp`.");
+    ci.getDiagnostics().Report(diagID);
+    return false;
+  }
+
+  if (isOpenMPEnabled) {
+    opts.isTargetDevice = false;
     if (auto offloadMod = llvm::dyn_cast<mlir::omp::OffloadModuleInterface>(
             mlirModule->getOperation()))
-      isDevice = offloadMod.getIsTargetDevice();
+      opts.isTargetDevice = offloadMod.getIsTargetDevice();
+
     // WARNING: This pipeline must be run immediately after the lowering to
     // ensure that the FIR is correct with respect to OpenMP operations/
     // attributes.
-    fir::createOpenMPFIRPassPipeline(pm, isDevice);
+    fir::createOpenMPFIRPassPipeline(pm, opts);
   }
 
   pm.enableVerifier(/*verifyPasses=*/true);
diff --git a/flang/lib/Optimizer/OpenMP/CMakeLists.txt b/flang/lib/Optimizer/OpenMP/CMakeLists.txt
index 4a48d6e0936db..3acf143594356 100644
--- a/flang/lib/Optimizer/OpenMP/CMakeLists.txt
+++ b/flang/lib/Optimizer/OpenMP/CMakeLists.txt
@@ -1,6 +1,7 @@
 get_property(dialect_libs GLOBAL PROPERTY MLIR_DIALECT_LIBS)
 
 add_flang_library(FlangOpenMPTransforms
+  DoConcurrentConversion.cpp
   FunctionFiltering.cpp
   GenericLoopConversion.cpp
   MapsForPrivatizedSymbols.cpp
diff --git a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
new file mode 100644
index 0000000000000..55c60c1f339e3
--- /dev/null
+++ b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
@@ -0,0 +1,104 @@
+//===- DoConcurrentConversion.cpp -- map `DO CONCURRENT` to OpenMP loops --===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "flang/Optimizer/Dialect/FIROps.h"
+#include "flang/Optimizer/OpenMP/Passes.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
+#include "mlir/IR/Diagnostics.h"
+#include "mlir/Pass/Pass.h"
+#include "mlir/Transforms/DialectConversion.h"
+
+#include <memory>
+#include <utility>
+
+namespace flangomp {
+#define GEN_PASS_DEF_DOCONCURRENTCONVERSIONPASS
+#include "flang/Optimizer/OpenMP/Passes.h.inc"
+} // namespace flangomp
+
+#define DEBUG_TYPE "do-concurrent-conversion"
+#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE << "]: ")
+
+namespace {
+class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
+public:
+  using mlir::OpConversionPattern<fir::DoLoopOp>::OpConversionPattern;
+
+  DoConcurrentConversion(mlir::MLIRContext *context, bool mapToDevice,
+                         llvm::DenseSet<fir::DoLoopOp> &concurrentLoopsToSkip)
+      : OpConversionPattern(context), mapToDevice(mapToDevice),
+        concurrentLoopsToSkip(concurrentLoopsToSkip) {}
+
+  mlir::LogicalResult
+  matchAndRewrite(fir::DoLoopOp doLoop, OpAdaptor adaptor,
+                  mlir::ConversionPatternRewriter &rewriter) const override {
+    return mlir::success();
+  }
+
+  bool mapToDevice;
+  llvm::DenseSet<fir::DoLoopOp> &concurrentLoopsToSkip;
+};
+
+class DoConcurrentConversionPass
+    : public flangomp::impl::DoConcurrentConversionPassBase<
+          DoConcurrentConversionPass> {
+public:
+  DoConcurrentConversionPass() = default;
+
+  DoConcurrentConversionPass(
+      const flangomp::DoConcurrentConversionPassOptions &options)
+      : DoConcurrentConversionPassBase(options) {}
+
+  void runOnOperation() override {
+    mlir::func::FuncOp func = getOperation();
+
+    if (func.isDeclaration()) {
+      return;
+    }
+
+    auto *context = &getContext();
+
+    if (mapTo != flangomp::DoConcurrentMappingKind::DCMK_Host &&
+        mapTo != flangomp::DoConcurrentMappingKind::DCMK_Device) {
+      mlir::emitWarning(mlir::UnknownLoc::get(context),
+                        "DoConcurrentConversionPass: invalid `map-to` value. "
+                        "Valid values are: `host` or `device`");
+      return;
+    }
+
+    llvm::DenseSet<fir::DoLoopOp> concurrentLoopsToSkip;
+    mlir::RewritePatternSet patterns(context);
+    patterns.insert<DoConcurrentConversion>(
+        context, mapTo == flangomp::DoConcurrentMappingKind::DCMK_Device,
+        concurrentLoopsToSkip);
+    mlir::ConversionTarget target(*context);
+    target.addDynamicallyLegalOp<fir::DoLoopOp>([&](fir::DoLoopOp op) {
+      return !op.getUnordered() || concurrentLoopsToSkip.contains(op);
+    });
+    target.markUnknownOpDynamicallyLegal(
+        [](mlir::Operation *) { return true; });
+
+    if (mlir::failed(mlir::applyFullConversion(getOperation(), target,
+                                               std::move(patterns)))) {
+      mlir::emitError(mlir::UnknownLoc::get(context),
+                      "error in converting do-concurrent op");
+      signalPassFailure();
+    }
+  }
+};
+} // namespace
+
+std::unique_ptr<mlir::Pass>
+flangomp::createDoConcurrentConversionPass(bool mapToDevice) {
+  DoConcurrentConversionPassOptions options;
+  options.mapTo = mapToDevice ? flangomp::DoConcurrentMappingKind::DCMK_Device
+                              : flangomp::DoConcurrentMappingKind::DCMK_Host;
+
+  return std::make_unique<DoConcurrentConversionPass>(options);
+}
diff --git a/flang/lib/Optimizer/Passes/Pipelines.cpp b/flang/lib/Optimizer/Passes/Pipelines.cpp
index a5cda3b7cb875..dfc6d64e7cc2c 100644
--- a/flang/lib/Optimizer/Passes/Pipelines.cpp
+++ b/flang/lib/Optimizer/Passes/Pipelines.cpp
@@ -278,12 +278,17 @@ void createHLFIRToFIRPassPipeline(mlir::PassManager &pm, bool enableOpenMP,
 /// \param pm - MLIR pass manager that will hold the pipeline definition.
 /// \param isTargetDevice - Whether code is being generated for a target device
 /// rather than the host device.
-void createOpenMPFIRPassPipeline(mlir::PassManager &pm, bool isTargetDevice) {
+void createOpenMPFIRPassPipeline(mlir::PassManager &pm,
+                                 OpenMPFIRPassPipelineOpts opts) {
+  if (opts.doConcurrentMappingKind != DoConcurrentMappingKind::DCMK_None)
+    pm.addPass(flangomp::createDoConcurrentConversionPass(
+        opts.doConcurrentMappingKind == DoConcurrentMappingKind::DCMK_Device));
+
   pm.addPass(flangomp::createMapInfoFinalizationPass());
   pm.addPass(flangomp::createMapsForPrivatizedSymbolsPass());
   pm.addPass(flangomp::createMarkDeclareTargetPass());
   pm.addPass(flangomp::createGenericLoopConversionPass());
-  if (isTargetDevice)
+  if (opts.isTargetDevice)
     pm.addPass(flangomp::createFunctionFilteringPass());
 }
 
diff --git a/flang/test/Transforms/DoConcurrent/basic_host.f90 b/flang/test/Transforms/DoConcurrent/basic_host.f90
new file mode 100644
index 0000000000000..b569668ab0f0e
--- /dev/null
+++ b/flang/test/Transforms/DoConcurrent/basic_host.f90
@@ -0,0 +1,53 @@
+! Mark as xfail for now until we upstream the relevant part. This is just for
+! demo purposes at this point. Upstreaming this is the next step.
+! XFAIL: *
+
+! Tests mapping of a basic `do concurrent` loop to `!$omp parallel do`.
+
+! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-to-openmp=host %s -o - \
+! RUN:   | FileCheck %s
+! RUN: bbc -emit-hlfir -fopenmp -fdo-concurrent-to-openmp=host %s -o - \
+! RUN:   | FileCheck %s
+ 
+! CHECK-LABEL: do_concurrent_basic
+program do_concurrent_basic
+    ! CHECK: %[[ARR:.*]]:2 = hlfir.declare %{{.*}}(%{{.*}}) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+
+    implicit none
+    integer :: a(10)
+    integer :: i
+
+    ! CHECK-NOT: fir.do_loop
+
+    ! CHECK: omp.parallel {
+
+    ! CHECK-NEXT: %[[ITER_VAR:.*]] = fir.alloca i32 {bindc_name = "i"}
+    ! CHECK-NEXT: %[[BINDING:.*]]:2 = hlfir.declare %[[ITER_VAR]] {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+
+    ! CHECK: %[[C1:.*]] = arith.constant 1 : i32
+    ! CHECK: %[[LB:.*]] = fir.convert %[[C1]] : (i32) -> index
+    ! CHECK: %[[C10:.*]] = arith.constant 10 : i32
+    ! CHECK: %[[UB:.*]] = fir.convert %[[C10]] : (i32) -> index
+    ! CHECK: %[[STEP:.*]] = arith.constant 1 : index
+
+    ! CHECK: omp.wsloop {
+    ! CHECK-NEXT: omp.loop_nest (%[[ARG0:.*]]) : index = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) {
+    ! CHECK-NEXT: %[[IV_IDX:.*]] = fir.convert %[[ARG0]] : (index) -> i32
+    ! CHECK-NEXT: fir.store %[[IV_IDX]] to %[[BINDING]]#1 : !fir.ref<i32>
+    ! CHECK-NEXT: %[[IV_VAL1:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+    ! CHECK-NEXT: %[[IV_VAL2:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+    ! CHECK-NEXT: %[[IV_VAL_I64:.*]] = fir.convert %[[IV_VAL2]] : (i32) -> i64
+    ! CHECK-NEXT: %[[ARR_ACCESS:.*]] = hlfir.designate %[[ARR]]#0 (%[[IV_VAL_I64]])  : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+    ! CHECK-NEXT: hlfir.assign %[[IV_VAL1]] to %[[ARR_ACCESS]] : i32, !fir.ref<i32>
+    ! CHECK-NEXT: omp.yield
+    ! CHECK-NEXT: }
+    ! CHECK-NEXT: }
+
+    ! CHECK-NEXT: omp.terminator
+    ! CHECK-NEXT: }
+    do concurrent (i=1:10)
+        a(i) = i
+    end do
+
+    ! CHECK-NOT: fir.do_loop
+end program do_concurrent_basic
diff --git a/flang/test/Transforms/DoConcurrent/command_line_options.f90 b/flang/test/Transforms/DoConcurrent/command_line_options.f90
new file mode 100644
index 0000000000000..1c5853ab2628f
--- /dev/null
+++ b/flang/test/Transforms/DoConcurrent/command_line_options.f90
@@ -0,0 +1,18 @@
+! RUN: %flang --help | FileCheck %s --check-prefix=FLANG
+
+! FLANG:      -fdo-concurrent-to-openmp=<value>
+! FLANG-NEXT:   Try to map `do concurrent` loops to OpenMP [none|host|device] 
+
+! RUN: bbc --help | FileCheck %s --check-prefix=BBC
+
+! BBC:      -fdo-concurrent-to-openmp=<string>
+! BBC-SAME:   Try to map `do concurrent` loops to OpenMP [none|host|device] 
+
+! RUN: not %flang -fdo-concurrent-to-openmp=host %s 2>&1 \
+! RUN: | FileCheck %s --check-prefix=OPT
+
+! OPT: error: lowering `do concurrent` loops to OpenMP is only supported if OpenMP is enabled.
+! OPT-SAME: Enable OpenMP using `-fopenmp`.
+
+program test_cli
+end program
diff --git a/flang/tools/bbc/bbc.cpp b/flang/tools/bbc/bbc.cpp
index 3b19a1c2a78d9..ce122d78f10fd 100644
--- a/flang/tools/bbc/bbc.cpp
+++ b/flang/tools/bbc/bbc.cpp
@@ -142,6 +142,12 @@ static llvm::cl::opt<bool>
                        llvm::cl::desc("enable openmp device compilation"),
                        llvm::cl::init(false));
 
+static llvm::cl::opt<std::string> enableDoConcurrentToOpenMPConversion(
+    "fdo-concurrent-to-openmp",
+    llvm::cl::desc(
+        "Try to map `do concurrent` loops to OpenMP [none|host|device]"),
+    llvm::cl::init("none"));
+
 static llvm::cl::opt<bool>
     enableOpenMPGPU("fopenmp-is-gpu",
                     llvm::cl::desc("enable openmp GPU target codegen"),
@@ -292,7 +298,19 @@ createTargetMachine(llvm::StringRef targetTriple, std::string &error) {
 static llvm::LogicalResult runOpenMPPasses(mlir::ModuleOp mlirModule) {
   mlir::PassManager pm(mlirModule->getName(),
                        mlir::OpPassManager::Nesting::Implicit);
-  fir::createOpenMPFIRPassPipeline(pm, enableOpenMPDevice);
+  using DoConcurrentMappingKind =
+      Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
+
+  fir::OpenMPFIRPassPipelineOpts opts;
+  opts.isTargetDevice = enableOpenMPDevice;
+  opts.doConcurrentMappingKind =
+      llvm::StringSwitch<DoConcurrentMappingKind>(
+          enableDoConcurrentToOpenMPConversion)
+          .Case("host", DoConcurrentMappingKind::DCMK_Host)
+          .Case("device", DoConcurrentMappingKind::DCMK_Device)
+          .Default(DoConcurrentMappingKind::DCMK_None);
+
+  fir::createOpenMPFIRPassPipeline(pm, opts);
   (void)mlir::applyPassManagerCLOptions(pm);
   if (mlir::failed(pm.run(mlirModule))) {
     llvm::errs() << "FATAL: failed to correctly apply OpenMP pass pipeline";

>From 5771a61ad25359a0bbb7943fa736941d67a2daa5 Mon Sep 17 00:00:00 2001
From: ergawy <kareem.ergawy at amd.com>
Date: Tue, 11 Feb 2025 23:37:39 -0600
Subject: [PATCH 02/10] Handle some review comments

---
 clang/include/clang/Driver/Options.td         |  2 +-
 clang/lib/Driver/ToolChains/Flang.cpp         |  2 +-
 flang/docs/DoConcurrentConversionToOpenMP.md  | 54 +++++++++----------
 .../include/flang/Optimizer/OpenMP/Passes.td  |  4 +-
 .../flang/Optimizer/Passes/Pipelines.h        | 15 +++---
 flang/lib/Frontend/CompilerInvocation.cpp     |  2 +-
 flang/lib/Optimizer/Passes/Pipelines.cpp      |  3 ++
 7 files changed, 44 insertions(+), 38 deletions(-)

diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index fedf2cdad3d49..98a13dc594685 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -6928,7 +6928,7 @@ defm loop_versioning : BoolOptionWithoutMarshalling<"f", "version-loops-for-stri
 def fhermetic_module_files : Flag<["-"], "fhermetic-module-files">, Group<f_Group>,
   HelpText<"Emit hermetic module files (no nested USE association)">;
 
-def do_concurrent_parallel_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
+def do_concurrent_to_openmp_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
   HelpText<"Try to map `do concurrent` loops to OpenMP [none|host|device]">,
       Values<"none,host,device">;
 } // let Visibility = [FC1Option, FlangOption]
diff --git a/clang/lib/Driver/ToolChains/Flang.cpp b/clang/lib/Driver/ToolChains/Flang.cpp
index bf0bfacd03742..ff29630ee4e84 100644
--- a/clang/lib/Driver/ToolChains/Flang.cpp
+++ b/clang/lib/Driver/ToolChains/Flang.cpp
@@ -153,7 +153,7 @@ void Flang::addCodegenOptions(const ArgList &Args,
     CmdArgs.push_back("-fversion-loops-for-stride");
 
   Args.addAllArgs(CmdArgs,
-                  {options::OPT_do_concurrent_parallel_EQ,
+                  {options::OPT_do_concurrent_to_openmp_EQ,
                    options::OPT_flang_experimental_hlfir,
                    options::OPT_flang_deprecated_no_hlfir,
                    options::OPT_fno_ppc_native_vec_elem_order,
diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
index 6807e402ce081..ae1a85bd71e15 100644
--- a/flang/docs/DoConcurrentConversionToOpenMP.md
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -27,16 +27,16 @@ are:
 ## Usage
 
 In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
-compiler flag: `-fdo-concurrent-to-openmp`. This flags has 3 possible values:
+compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
 1. `host`: this maps `do concurent` loops to run in parallel on the host CPU.
    This maps such loops to the equivalent of `omp parallel do`.
-2. `device`: this maps `do concurent` loops to run in parallel on a device
-   (GPU). This maps such loops to the equivalent of `omp target teams
-   distribute parallel do`.
-3. `none`: this disables `do concurrent` mapping altogether. In such case, such
+2. `device`: this maps `do concurent` loops to run in parallel on a target device.
+   This maps such loops to the equivalent of
+   `omp target teams distribute parallel do`.
+3. `none`: this disables `do concurrent` mapping altogether. In that case, such
    loops are emitted as sequential loops.
 
-The above compiler switch is currently avaialble only when OpenMP is also
+The above compiler switch is currently available only when OpenMP is also
 enabled. So you need to provide the following options to flang in order to
 enable it:
 ```
@@ -54,13 +54,13 @@ that:
 To describe the current status in more detail, the following is a description of
 how the pass currently behaves for single-range loops and then for multi-range
 loops. The following sub-sections describe the status of the downstream 
-implementation on the AMD's ROCm fork(*). We are working on upstreaming the
+implementation on AMD's ROCm fork[^1]. We are working on upstreaming the
 downstream implementation gradually and this document will be updated to reflect
 such upstreaming process. Example LIT tests referenced below might only be
 available in the ROCm fork and will be upstreamed with the relevant parts of
 the code.
 
-(*) https://github.com/ROCm/llvm-project/blob/amd-staging/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
+[^1]: https://github.com/ROCm/llvm-project/blob/amd-staging/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
 
 ### Single-range loops
 
@@ -211,8 +211,8 @@ loops and map them as "collapsed" loops in OpenMP.
 
 Loop-nest detection is currently limited to the scenario described in the previous
 section. However, this is quite limited and can be extended in the future to cover
-more cases. For example, for the following loop nest, even thought, both loops are
-perfectly nested; at the moment, only the outer loop is parallized:
+more cases. For example, for the following loop nest, even though both loops are
+perfectly nested, at the moment only the outer loop is parallelized:
 ```fortran
 do concurrent(i=1:n)
   do concurrent(j=1:m)
@@ -221,9 +221,9 @@ do concurrent(i=1:n)
 end do
 ```
 
-Similary for the following loop nest, even though the intervening statement `x = 41`
-does not have any memory effects that would affect parallization, this nest is
-not parallized as well (only the outer loop is).
+Similarly, for the following loop nest, even though the intervening statement `x = 41`
+does not have any memory effects that would affect parallelization, this nest is
+likewise not fully parallelized (only the outer loop is).
 
 ```fortran
 do concurrent(i=1:n)
@@ -244,7 +244,7 @@ of what is and is not detected as a perfect loop nest.
 
 ### Data environment
 
-By default, variables that are used inside a `do concurernt` loop nest are
+By default, variables that are used inside a `do concurrent` loop nest are
 either treated as `shared` in case of mapping to `host`, or mapped into the
 `target` region using a `map` clause in case of mapping to `device`. The only
 exceptions to this are:
@@ -253,20 +253,20 @@ exceptions to this are:
      examples above.
   1. any values that are from allocations outside the loop nest and used
      exclusively inside of it. In such cases, a local privatized
-     value is created in the OpenMP region to prevent multiple teams of threads
-     from accessing and destroying the same memory block which causes runtime
+     copy is created in the OpenMP region to prevent multiple teams of threads
+     from accessing and destroying the same memory block, which causes runtime
      issues. For an example of such cases, see
      `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.
 
-Implicit mapping detection (for mapping to the GPU) is still quite limited and
-work to make it smarter is underway for both OpenMP in general and `do concurrent`
-mapping.
+Implicit mapping detection (for mapping to the target device) is still quite
+limited and work to make it smarter is underway for both OpenMP in general 
+and `do concurrent` mapping.
 
 #### Non-perfectly-nested loops' IVs
 
 For non-perfectly-nested loops, the IVs are still treated as `shared` or
 `map` entries as pointed out above. This **might not** be consistent with what
-the Fortran specficiation tells us. In particular, taking the following
+the Fortran specification tells us. In particular, taking the following
 snippets from the spec (version 2023) into account:
 
 > § 3.35
@@ -277,9 +277,9 @@ snippets from the spec (version 2023) into account:
 > § 19.4
 > ------
 >  A variable that appears as an index-name in a FORALL or DO CONCURRENT
->  construct, or ... is a construct entity. A variable that has LOCAL or
+>  construct [...] is a construct entity. A variable that has LOCAL or
 >  LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
-> ...
+> [...]
 > The name of a variable that appears as an index-name in a DO CONCURRENT
 > construct, FORALL statement, or FORALL construct has a scope of the statement
 > or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
@@ -288,7 +288,7 @@ snippets from the spec (version 2023) into account:
 From the above quotes, it seems there is an equivalence between the IV of a `do
 concurrent` loop and a variable with a `LOCAL` locality specifier (equivalent
 to OpenMP's `private` clause). Which means that we should probably
-localize/privatize a `do concurernt` loop's IV even if it is not perfectly
+localize/privatize a `do concurrent` loop's IV even if it is not perfectly
 nested in the nest we are parallelizing. For now, however, we **do not** do
 that as pointed out previously. In the near future, we propose a middle-ground
 solution (see the Next steps section for more details).
@@ -327,8 +327,8 @@ At the moment, the FIR dialect does not have a way to model locality specifiers
 on the IR level. Instead, something similar to early/eager privatization in OpenMP
 is done for the locality specifiers in `fir.do_loop` ops. Having locality specifier
 modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
-reductions (i.e. the `omp.delcare_reduction` op) can make mapping `do concurrent`
-to OpenMP (and other parallization models) much easier.
+reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent`
+to OpenMP (and other parallel programming models) much easier.
 
 Therefore, one way to approach this problem is to extract the TableGen records
 for relevant OpenMP clauses in a shared dialect for "data environment management"
@@ -345,7 +345,7 @@ logic of loop nests needs to be implemented.
 ### Data-dependence analysis
 
 Right now, we map loop nests without analysing whether such mapping is safe to
-do or not. We probalby need to at least warn the use of unsafe loop nests due
+do or not. We probably need to at least warn the user about unsafe loop nests due
 to loop-carried dependencies.
 
 ### Non-rectangular loop nests
@@ -362,7 +362,7 @@ end do
 We defer this to the (hopefully) near future when we get the conversion in a
 good shape for the samples/projects at hand.
 
-### Generalizing the pass to other parallization models
+### Generalizing the pass to other parallel programming models
 
 Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
 this in a more generalized direction and allow the pass to target other models;
diff --git a/flang/include/flang/Optimizer/OpenMP/Passes.td b/flang/include/flang/Optimizer/OpenMP/Passes.td
index b2152f8c674ea..fcc7a4ca31fef 100644
--- a/flang/include/flang/Optimizer/OpenMP/Passes.td
+++ b/flang/include/flang/Optimizer/OpenMP/Passes.td
@@ -50,7 +50,7 @@ def FunctionFilteringPass : Pass<"omp-function-filtering"> {
   ];
 }
 
-def DoConcurrentConversionPass : Pass<"fopenmp-do-concurrent-conversion", "mlir::func::FuncOp"> {
+def DoConcurrentConversionPass : Pass<"omp-do-concurrent-conversion", "mlir::func::FuncOp"> {
   let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";
 
   let description = [{ This is an experimental pass to map `DO CONCURRENT` loops
@@ -59,7 +59,7 @@ def DoConcurrentConversionPass : Pass<"fopenmp-do-concurrent-conversion", "mlir:
      For now the following is supported:
        - Mapping simple loops to `parallel do`.
 
-     Still to TODO:
+     Still TODO:
        - More extensive testing.
   }];
 
diff --git a/flang/include/flang/Optimizer/Passes/Pipelines.h b/flang/include/flang/Optimizer/Passes/Pipelines.h
index 2a34cd94809ad..a3f59ee8dd013 100644
--- a/flang/include/flang/Optimizer/Passes/Pipelines.h
+++ b/flang/include/flang/Optimizer/Passes/Pipelines.h
@@ -128,12 +128,15 @@ void createHLFIRToFIRPassPipeline(
     mlir::PassManager &pm, bool enableOpenMP,
     llvm::OptimizationLevel optLevel = defaultOptLevel);
 
-using DoConcurrentMappingKind =
-    Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
-
 struct OpenMPFIRPassPipelineOpts {
+  /// Whether code is being generated for a target device rather than the host
+  /// device.
   bool isTargetDevice;
-  DoConcurrentMappingKind doConcurrentMappingKind;
+
+  /// Controls how to map `do concurrent` loops; to device, host, or none at
+  /// all.
+  Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind
+      doConcurrentMappingKind;
 };
 
 /// Create a pass pipeline for handling certain OpenMP transformations needed
@@ -143,8 +146,8 @@ struct OpenMPFIRPassPipelineOpts {
 /// that the FIR is correct with respect to OpenMP operations/attributes.
 ///
 /// \param pm - MLIR pass manager that will hold the pipeline definition.
-/// \param isTargetDevice - Whether code is being generated for a target device
-/// rather than the host device.
+/// \param opts - options to control OpenMP code-gen; see struct docs for more
+/// details.
 void createOpenMPFIRPassPipeline(mlir::PassManager &pm,
                                  OpenMPFIRPassPipelineOpts opts);
 
diff --git a/flang/lib/Frontend/CompilerInvocation.cpp b/flang/lib/Frontend/CompilerInvocation.cpp
index 232f383fc99ce..01b4d299b8c60 100644
--- a/flang/lib/Frontend/CompilerInvocation.cpp
+++ b/flang/lib/Frontend/CompilerInvocation.cpp
@@ -161,7 +161,7 @@ static bool parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
                                      llvm::opt::ArgList &args,
                                      clang::DiagnosticsEngine &diags) {
   llvm::opt::Arg *arg =
-      args.getLastArg(clang::driver::options::OPT_do_concurrent_parallel_EQ);
+      args.getLastArg(clang::driver::options::OPT_do_concurrent_to_openmp_EQ);
   if (!arg)
     return true;
 
diff --git a/flang/lib/Optimizer/Passes/Pipelines.cpp b/flang/lib/Optimizer/Passes/Pipelines.cpp
index dfc6d64e7cc2c..e901dd23fd94d 100644
--- a/flang/lib/Optimizer/Passes/Pipelines.cpp
+++ b/flang/lib/Optimizer/Passes/Pipelines.cpp
@@ -280,6 +280,9 @@ void createHLFIRToFIRPassPipeline(mlir::PassManager &pm, bool enableOpenMP,
 /// rather than the host device.
 void createOpenMPFIRPassPipeline(mlir::PassManager &pm,
                                  OpenMPFIRPassPipelineOpts opts) {
+  using DoConcurrentMappingKind =
+      Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
+
   if (opts.doConcurrentMappingKind != DoConcurrentMappingKind::DCMK_None)
     pm.addPass(flangomp::createDoConcurrentConversionPass(
         opts.doConcurrentMappingKind == DoConcurrentMappingKind::DCMK_Device));

>From e38adee0f848309de18d45efb947d68208040243 Mon Sep 17 00:00:00 2001
From: ergawy <kareem.ergawy at amd.com>
Date: Tue, 11 Feb 2025 23:53:20 -0600
Subject: [PATCH 03/10] Handle some more review comments

---
 .../OpenMP/DoConcurrentConversion.cpp         | 21 +++++++------------
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
index 55c60c1f339e3..f4bd2851897c7 100644
--- a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
+++ b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
@@ -30,19 +30,18 @@ class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
 public:
   using mlir::OpConversionPattern<fir::DoLoopOp>::OpConversionPattern;
 
-  DoConcurrentConversion(mlir::MLIRContext *context, bool mapToDevice,
-                         llvm::DenseSet<fir::DoLoopOp> &concurrentLoopsToSkip)
-      : OpConversionPattern(context), mapToDevice(mapToDevice),
-        concurrentLoopsToSkip(concurrentLoopsToSkip) {}
+  DoConcurrentConversion(mlir::MLIRContext *context, bool mapToDevice)
+      : OpConversionPattern(context), mapToDevice(mapToDevice) {}
 
   mlir::LogicalResult
   matchAndRewrite(fir::DoLoopOp doLoop, OpAdaptor adaptor,
                   mlir::ConversionPatternRewriter &rewriter) const override {
+    // TODO: This will be filled in by the next PRs that upstream the rest of
+    // the ROCm implementation.
     return mlir::success();
   }
 
   bool mapToDevice;
-  llvm::DenseSet<fir::DoLoopOp> &concurrentLoopsToSkip;
 };
 
 class DoConcurrentConversionPass
@@ -58,9 +57,8 @@ class DoConcurrentConversionPass
   void runOnOperation() override {
     mlir::func::FuncOp func = getOperation();
 
-    if (func.isDeclaration()) {
+    if (func.isDeclaration())
       return;
-    }
 
     auto *context = &getContext();
 
@@ -72,15 +70,12 @@ class DoConcurrentConversionPass
       return;
     }
 
-    llvm::DenseSet<fir::DoLoopOp> concurrentLoopsToSkip;
     mlir::RewritePatternSet patterns(context);
     patterns.insert<DoConcurrentConversion>(
-        context, mapTo == flangomp::DoConcurrentMappingKind::DCMK_Device,
-        concurrentLoopsToSkip);
+        context, mapTo == flangomp::DoConcurrentMappingKind::DCMK_Device);
     mlir::ConversionTarget target(*context);
-    target.addDynamicallyLegalOp<fir::DoLoopOp>([&](fir::DoLoopOp op) {
-      return !op.getUnordered() || concurrentLoopsToSkip.contains(op);
-    });
+    target.addDynamicallyLegalOp<fir::DoLoopOp>(
+        [&](fir::DoLoopOp op) { return !op.getUnordered(); });
     target.markUnknownOpDynamicallyLegal(
         [](mlir::Operation *) { return true; });
 

>From 63499df4a43563db983b60a1f4a7d315a5912627 Mon Sep 17 00:00:00 2001
From: ergawy <kareem.ergawy at amd.com>
Date: Wed, 12 Feb 2025 00:01:47 -0600
Subject: [PATCH 04/10] Convert error to warning

---
 flang/lib/Frontend/FrontendActions.cpp                     | 7 ++++---
 .../test/Transforms/DoConcurrent/command_line_options.f90  | 4 ++--
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/flang/lib/Frontend/FrontendActions.cpp b/flang/lib/Frontend/FrontendActions.cpp
index 0809e4a0e2773..0a5c0fb05c79c 100644
--- a/flang/lib/Frontend/FrontendActions.cpp
+++ b/flang/lib/Frontend/FrontendActions.cpp
@@ -366,11 +366,12 @@ bool CodeGenAction::beginSourceFileAction() {
   if (opts.doConcurrentMappingKind != DoConcurrentMappingKind::DCMK_None &&
       !isOpenMPEnabled) {
     unsigned diagID = ci.getDiagnostics().getCustomDiagID(
-        clang::DiagnosticsEngine::Error,
+        clang::DiagnosticsEngine::Warning,
         "lowering `do concurrent` loops to OpenMP is only supported if "
-        "OpenMP is enabled. Enable OpenMP using `-fopenmp`.");
+        "OpenMP is enabled. Enable OpenMP using `-fopenmp`. `do concurrent` "
+        "loops will be serialized.");
     ci.getDiagnostics().Report(diagID);
-    return false;
+    opts.doConcurrentMappingKind = DoConcurrentMappingKind::DCMK_None;
   }
 
   if (isOpenMPEnabled) {
diff --git a/flang/test/Transforms/DoConcurrent/command_line_options.f90 b/flang/test/Transforms/DoConcurrent/command_line_options.f90
index 1c5853ab2628f..da987434b4b5f 100644
--- a/flang/test/Transforms/DoConcurrent/command_line_options.f90
+++ b/flang/test/Transforms/DoConcurrent/command_line_options.f90
@@ -8,10 +8,10 @@
 ! BBC:      -fdo-concurrent-to-openmp=<string>
 ! BBC-SAME:   Try to map `do concurrent` loops to OpenMP [none|host|device] 
 
-! RUN: not %flang -fdo-concurrent-to-openmp=host %s 2>&1 \
+! RUN: %flang -fdo-concurrent-to-openmp=host %s 2>&1 \
 ! RUN: | FileCheck %s --check-prefix=OPT
 
-! OPT: error: lowering `do concurrent` loops to OpenMP is only supported if OpenMP is enabled.
+! OPT: warning: lowering `do concurrent` loops to OpenMP is only supported if OpenMP is enabled.
 ! OPT-SAME: Enable OpenMP using `-fopenmp`.
 
 program test_cli

>From 4219ef454f15c88056624bbdf2bc46f9b6e34335 Mon Sep 17 00:00:00 2001
From: ergawy <kareem.ergawy at amd.com>
Date: Wed, 12 Feb 2025 09:44:06 -0600
Subject: [PATCH 05/10] disable test on windows

---
 flang/test/Transforms/DoConcurrent/command_line_options.f90 | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/flang/test/Transforms/DoConcurrent/command_line_options.f90 b/flang/test/Transforms/DoConcurrent/command_line_options.f90
index da987434b4b5f..158d256bdc4b2 100644
--- a/flang/test/Transforms/DoConcurrent/command_line_options.f90
+++ b/flang/test/Transforms/DoConcurrent/command_line_options.f90
@@ -1,3 +1,5 @@
+! UNSUPPORTED: system-windows
+
 ! RUN: %flang --help | FileCheck %s --check-prefix=FLANG
 
 ! FLANG:      -fdo-concurrent-to-openmp=<value>

>From 961b887878af1fd2f5462e382be2a28a92a8dbdf Mon Sep 17 00:00:00 2001
From: ergawy <kareem.ergawy at amd.com>
Date: Wed, 12 Feb 2025 22:38:11 -0600
Subject: [PATCH 06/10] handle some more review comments

---
 clang/include/clang/Driver/Options.td        |  4 ++--
 clang/lib/Driver/ToolChains/Flang.cpp        |  2 +-
 flang/docs/DoConcurrentConversionToOpenMP.md | 14 ++++++++------
 flang/lib/Frontend/CompilerInvocation.cpp    |  2 +-
 4 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 98a13dc594685..0cd3dfd3fb29d 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -6928,9 +6928,9 @@ defm loop_versioning : BoolOptionWithoutMarshalling<"f", "version-loops-for-stri
 def fhermetic_module_files : Flag<["-"], "fhermetic-module-files">, Group<f_Group>,
   HelpText<"Emit hermetic module files (no nested USE association)">;
 
-def do_concurrent_to_openmp_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
+def fdo_concurrent_to_openmp_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
   HelpText<"Try to map `do concurrent` loops to OpenMP [none|host|device]">,
-      Values<"none,host,device">;
+      Values<"none, host, device">;
 } // let Visibility = [FC1Option, FlangOption]
 
 def J : JoinedOrSeparate<["-"], "J">,
diff --git a/clang/lib/Driver/ToolChains/Flang.cpp b/clang/lib/Driver/ToolChains/Flang.cpp
index ff29630ee4e84..cb0b00a2fd699 100644
--- a/clang/lib/Driver/ToolChains/Flang.cpp
+++ b/clang/lib/Driver/ToolChains/Flang.cpp
@@ -153,7 +153,7 @@ void Flang::addCodegenOptions(const ArgList &Args,
     CmdArgs.push_back("-fversion-loops-for-stride");
 
   Args.addAllArgs(CmdArgs,
-                  {options::OPT_do_concurrent_to_openmp_EQ,
+                  {options::OPT_fdo_concurrent_to_openmp_EQ,
                    options::OPT_flang_experimental_hlfir,
                    options::OPT_flang_deprecated_no_hlfir,
                    options::OPT_fno_ppc_native_vec_elem_order,
diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
index ae1a85bd71e15..8f3b93858090e 100644
--- a/flang/docs/DoConcurrentConversionToOpenMP.md
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -6,7 +6,7 @@
 
 -->
 
-# `DO CONCURENT` mapping to OpenMP
+# `DO CONCURRENT` mapping to OpenMP
 
 ```{contents}
 ---
@@ -17,10 +17,10 @@ local:
 This document seeks to describe the effort to parallelize `do concurrent` loops
 by mapping them to OpenMP worksharing constructs. The goals of this document
 are:
-* Describing how to instruct `flang` to map `DO CONCURENT` loops to OpenMP
+* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
   constructs.
 * Tracking the current status of such mapping.
-* Describing the limitations of the current implmenentation.
+* Describing the limitations of the current implementation.
 * Describing next steps.
 * Tracking the current upstreaming status (from the AMD ROCm fork).
 
@@ -28,9 +28,9 @@ are:
 
 In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
 compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
-1. `host`: this maps `do concurent` loops to run in parallel on the host CPU.
+1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
    This maps such loops to the equivalent of `omp parallel do`.
-2. `device`: this maps `do concurent` loops to run in parallel on a target device.
+2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
    This maps such loops to the equivalent of
    `omp target teams distribute parallel do`.
 3. `none`: this disables `do concurrent` mapping altogether. In that case, such
@@ -42,6 +42,8 @@ enable it:
 ```
 flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
 ```
+For mapping to device, the target device architecture must be specified as well.
+See `-fopenmp-targets` and `-foffload-arch` for more info.
 
 ## Current status
 
@@ -249,7 +251,7 @@ either treated as `shared` in case of mapping to `host`, or mapped into the
 `target` region using a `map` clause in case of mapping to `device`. The only
 exceptions to this are:
   1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that
-     case, for each IV, we allocate a local copy as shown the by the mapping
+     case, for each IV, we allocate a local copy as shown by the mapping
      examples above.
   1. any values that are from allocations outside the loop nest and used
      exclusively inside of it. In such cases, a local privatized
diff --git a/flang/lib/Frontend/CompilerInvocation.cpp b/flang/lib/Frontend/CompilerInvocation.cpp
index 01b4d299b8c60..354ee29f314d3 100644
--- a/flang/lib/Frontend/CompilerInvocation.cpp
+++ b/flang/lib/Frontend/CompilerInvocation.cpp
@@ -161,7 +161,7 @@ static bool parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
                                      llvm::opt::ArgList &args,
                                      clang::DiagnosticsEngine &diags) {
   llvm::opt::Arg *arg =
-      args.getLastArg(clang::driver::options::OPT_do_concurrent_to_openmp_EQ);
+      args.getLastArg(clang::driver::options::OPT_fdo_concurrent_to_openmp_EQ);
   if (!arg)
     return true;
 

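As a side note on the driver flags touched by this patch, the documented usage can be illustrated with concrete invocations (a sketch only; the source file name and the offload triple below are examples, not taken from the patch):

```shell
# Map `do concurrent` loops to parallel execution on the host CPU
# (equivalent of `omp parallel do`).
flang -fopenmp -fdo-concurrent-to-openmp=host test.f90

# Map to an offload device; a target architecture must also be given,
# e.g. via -fopenmp-targets (the triple here is only an example).
flang -fopenmp -fdo-concurrent-to-openmp=device \
      -fopenmp-targets=amdgcn-amd-amdhsa test.f90

# Disable the mapping: `do concurrent` loops are emitted as sequential loops.
flang -fopenmp -fdo-concurrent-to-openmp=none test.f90
```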
>From d123a833f877d59fa8393dd3b44316db729b5818 Mon Sep 17 00:00:00 2001
From: ergawy <kareem.ergawy at amd.com>
Date: Wed, 12 Feb 2025 22:48:53 -0600
Subject: [PATCH 07/10] remove some parts of the doc

---
 flang/docs/DoConcurrentConversionToOpenMP.md | 248 +------------------
 1 file changed, 7 insertions(+), 241 deletions(-)

diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
index 8f3b93858090e..404c0587c578f 100644
--- a/flang/docs/DoConcurrentConversionToOpenMP.md
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -53,250 +53,16 @@ that:
 * It has been tested in a very limited way so far.
 * It has been tested mostly on simple synthetic inputs.
 
-To describe current status in more detail, following is a description of how
-the pass currently behaves for single-range loops and then for multi-range
-loops. The following sub-sections describe the status of the downstream 
-implementation on the AMD's ROCm fork[^1]. We are working on upstreaming the
-downstream implementation gradually and this document will be updated to reflect
-such upstreaming process. Example LIT tests referenced below might also be only
-be available in the ROCm fork and will upstream with the relevant parts of the
-code.
-
-[^1]: https://github.com/ROCm/llvm-project/blob/amd-staging/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
-
-### Single-range loops
-
-Given the following loop:
-```fortran
-  do concurrent(i=1:n)
-    a(i) = i * i
-  end do
-```
-
-#### Mapping to `host`
-
-Mapping this loop to the `host`, generates MLIR operations of the following
-structure:
-
-```
-%4 = fir.address_of(@_QFEa) ...
-%6:2 = hlfir.declare %4 ...
-
-omp.parallel {
-  // Allocate private copy for `i`.
-  // TODO Use delayed privatization.
-  %19 = fir.alloca i32 {bindc_name = "i"}
-  %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...
-
-  omp.wsloop {
-    omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
-      %23 = fir.convert %arg0 : (index) -> i32
-      // Use the privatized version of `i`.
-      fir.store %23 to %20#1 : !fir.ref<i32>
-      ...
-
-      // Use "shared" SSA value of `a`.
-      %42 = hlfir.designate %6#0
-      hlfir.assign %35 to %42
-      ...
-      omp.yield
-    }
-    omp.terminator
-  }
-  omp.terminator
-}
-```
-
-#### Mapping to `device`
-
-Mapping the same loop to the `device`, generates MLIR operations of the
-following structure:
-
-```
-// Map `a` to the `target` region. The pass automatically detects memory blocks
-// and maps them to device. Currently detection logic is still limited and a lot
-// of work is going into making it more capable.
-%29 = omp.map.info ... {name = "_QFEa"}
-omp.target ... map_entries(..., %29 -> %arg4 ...) {
-  ...
-  %51:2 = hlfir.declare %arg4
-  ...
-  omp.teams {
-    // Allocate private copy for `i`.
-    // TODO Use delayed privatization.
-    %52 = fir.alloca i32 {bindc_name = "i"}
-    %53:2 = hlfir.declare %52
-    ...
-
-    omp.parallel {
-      omp.distribute {
-        omp.wsloop {
-          omp.loop_nest (%arg5) : index = (%54) to (%55) inclusive step (%c1_9) {
-            // Use the privatized version of `i`.
-            %56 = fir.convert %arg5 : (index) -> i32
-            fir.store %56 to %53#1
-            ...
-            // Use the mapped version of `a`.
-            ... = hlfir.designate %51#0
-            ...
-          }
-          omp.terminator
-        }
-        omp.terminator
-      }
-      omp.terminator
-    }
-    omp.terminator
-  }
-  omp.terminator
-}
-```
-
-### Multi-range loops
-
-The pass currently supports multi-range loops as well. Given the following
-example:
-
-```fortran
-   do concurrent(i=1:n, j=1:m)
-       a(i,j) = i * j
-   end do
-```
-
-The generated `omp.loop_nest` operation look like:
-
-```
-omp.loop_nest (%arg0, %arg1)
-    : index = (%17, %19) to (%18, %20)
-    inclusive step (%c1_2, %c1_4) {
-  fir.store %arg0 to %private_i#1 : !fir.ref<i32>
-  fir.store %arg1 to %private_j#1 : !fir.ref<i32>
-  ...
-  omp.yield
-}
-```
-
-It is worth noting that we have privatized versions for both iteration
-variables: `i` and `j`. These are locally allocated inside the parallel/target
-OpenMP region similar to what the single-range example in previous section
-shows.
-
-#### Multi-range and perfectly-nested loops
-
-Currently, on the `FIR` dialect level, the following loop:
-```fortran
-do concurrent(i=1:n, j=1:m)
-  a(i,j) = i * j
-end do
-```
-is modelled as a nest of `fir.do_loop` ops such that the outer loop's region
-contains:
-  1. The operations needed to assign/update the outer loop's induction variable.
-  1. The inner loop itself.
-
-So the MLIR structure looks similar to the following:
-```
-fir.do_loop %arg0 = %11 to %12 step %c1 unordered {
-  ...
-  fir.do_loop %arg1 = %14 to %15 step %c1_1 unordered {
-    ...
-  }
-}
-```
-This applies to multi-range loops in general; they are represented in the IR as
-a nest of `fir.do_loop` ops with the above nesting structure.
-
-Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
-loops and map them as "collapsed" loops in OpenMP.
-
-#### Further info regarding loop nest detection
-
-Loop-nest detection is currently limited to the scenario described in the previous
-section. However, this is quite limited and can be extended in the future to cover
-more cases. For example, for the following loop nest, even though, both loops are
-perfectly nested; at the moment, only the outer loop is parallelized:
-```fortran
-do concurrent(i=1:n)
-  do concurrent(j=1:m)
-    a(i,j) = i * j
-  end do
-end do
-```
-
-Similarly, for the following loop nest, even though the intervening statement `x = 41`
-does not have any memory effects that would affect parallelization, this nest is
-not parallelized as well (only the outer loop is).
-
-```fortran
-do concurrent(i=1:n)
-  x = 41
-  do concurrent(j=1:m)
-    a(i,j) = i * j
-  end do
-end do
-```
-
-The above also has the consequence that the `j` variable will **not** be
-privatized in the OpenMP parallel/target region. In other words, it will be
-treated as if it was a `shared` variable. For more details about privatization,
-see the "Data environment" section below.
-
-See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
-of what is and is not detected as a perfect loop nest.
-
-### Data environment
-
-By default, variables that are used inside a `do concurrent` loop nest are
-either treated as `shared` in case of mapping to `host`, or mapped into the
-`target` region using a `map` clause in case of mapping to `device`. The only
-exceptions to this are:
-  1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that
-     case, for each IV, we allocate a local copy as shown by the mapping
-     examples above.
-  1. any values that are from allocations outside the loop nest and used
-     exclusively inside of it. In such cases, a local privatized
-     copy is created in the OpenMP region to prevent multiple teams of threads
-     from accessing and destroying the same memory block, which causes runtime
-     issues. For an example of such cases, see
-     `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.
-
-Implicit mapping detection (for mapping to the target device) is still quite
-limited and work to make it smarter is underway for both OpenMP in general 
-and `do concurrent` mapping.
-
-#### Non-perfectly-nested loops' IVs
-
-For non-perfectly-nested loops, the IVs are still treated as `shared` or
-`map` entries as pointed out above. This **might not** be consistent with what
-the Fortran specification tells us. In particular, taking the following
-snippets from the spec (version 2023) into account:
-
-> § 3.35
-> ------
-> construct entity
-> entity whose identifier has the scope of a construct
-
-> § 19.4
-> ------
->  A variable that appears as an index-name in a FORALL or DO CONCURRENT
->  construct [...] is a construct entity. A variable that has LOCAL or
->  LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
-> [...]
-> The name of a variable that appears as an index-name in a DO CONCURRENT
-> construct, FORALL statement, or FORALL construct has a scope of the statement
-> or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
-> CONCURRENT construct has the scope of that construct.
-
-From the above quotes, it seems there is an equivalence between the IV of a `do
-concurrent` loop and a variable with a `LOCAL` locality specifier (equivalent
-to OpenMP's `private` clause). Which means that we should probably
-localize/privatize a `do concurrent` loop's IV even if it is not perfectly
-nested in the nest we are parallelizing. For now, however, we **do not** do
-that as pointed out previously. In the near future, we propose a middle-ground
-solution (see the Next steps section for more details).
+<!--
+More details about current status will be added along with relevant parts of the
+implementation in later upstreaming patches.
+-->
 
 ## Next steps
 
+This section describes some of the open questions and issues that have not
+been tackled yet, even in the downstream implementation.
+
 ### Delayed privatization
 
 So far, we emit the privatization logic for IVs inline in the parallel/target

>From ffc4cee59b279c268e3b4d539667de10cbb56e40 Mon Sep 17 00:00:00 2001
From: ergawy <kareem.ergawy at amd.com>
Date: Thu, 13 Feb 2025 05:31:16 -0600
Subject: [PATCH 08/10] handle some more review comments

---
 flang/docs/DoConcurrentConversionToOpenMP.md   |  7 +++++++
 .../OpenMP/DoConcurrentConversion.cpp          | 18 +++++++++---------
 .../do_concurrent_to_omp_cli.f90}              |  0
 3 files changed, 16 insertions(+), 9 deletions(-)
 rename flang/test/{Transforms/DoConcurrent/command_line_options.f90 => Driver/do_concurrent_to_omp_cli.f90} (100%)

diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
index 404c0587c578f..26215345b64a3 100644
--- a/flang/docs/DoConcurrentConversionToOpenMP.md
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -103,6 +103,13 @@ for relevant OpenMP clauses in a shared dialect for "data environment management
 and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
 as well.
 
+#### Supporting reductions
+
+Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
+is also still an open TODO. We can potentially extend the MLIR infrastructure
+proposed in the previous section to share reduction records among the different 
+relevant dialects as well.
+
 ### More advanced detection of loop nests
 
 As pointed out earlier, any intervening code between the headers of 2 nested
diff --git a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
index f4bd2851897c7..cebf6cd8ed0df 100644
--- a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
+++ b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
@@ -8,15 +8,10 @@
 
 #include "flang/Optimizer/Dialect/FIROps.h"
 #include "flang/Optimizer/OpenMP/Passes.h"
-#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "mlir/Dialect/OpenMP/OpenMPDialect.h"
-#include "mlir/IR/Diagnostics.h"
-#include "mlir/Pass/Pass.h"
 #include "mlir/Transforms/DialectConversion.h"
 
-#include <memory>
-#include <utility>
-
 namespace flangomp {
 #define GEN_PASS_DEF_DOCONCURRENTCONVERSIONPASS
 #include "flang/Optimizer/OpenMP/Passes.h.inc"
@@ -60,7 +55,7 @@ class DoConcurrentConversionPass
     if (func.isDeclaration())
       return;
 
-    auto *context = &getContext();
+    mlir::MLIRContext *context = &getContext();
 
     if (mapTo != flangomp::DoConcurrentMappingKind::DCMK_Host &&
         mapTo != flangomp::DoConcurrentMappingKind::DCMK_Device) {
@@ -74,8 +69,13 @@ class DoConcurrentConversionPass
     patterns.insert<DoConcurrentConversion>(
         context, mapTo == flangomp::DoConcurrentMappingKind::DCMK_Device);
     mlir::ConversionTarget target(*context);
-    target.addDynamicallyLegalOp<fir::DoLoopOp>(
-        [&](fir::DoLoopOp op) { return !op.getUnordered(); });
+    target.addDynamicallyLegalOp<fir::DoLoopOp>([&](fir::DoLoopOp op) {
+      // The goal is to handle constructs that eventually get lowered to
+      // `fir.do_loop` with the `unordered` attribute (e.g. array expressions).
+      // Currently, this is only enabled for the `do concurrent` construct since
+      // the pass runs early in the pipeline.
+      return !op.getUnordered();
+    });
     target.markUnknownOpDynamicallyLegal(
         [](mlir::Operation *) { return true; });
 
diff --git a/flang/test/Transforms/DoConcurrent/command_line_options.f90 b/flang/test/Driver/do_concurrent_to_omp_cli.f90
similarity index 100%
rename from flang/test/Transforms/DoConcurrent/command_line_options.f90
rename to flang/test/Driver/do_concurrent_to_omp_cli.f90

>From 33e6bf634265cd5fb7ce23d96bda008696b096b8 Mon Sep 17 00:00:00 2001
From: ergawy <kareem.ergawy at amd.com>
Date: Thu, 13 Feb 2025 07:28:40 -0600
Subject: [PATCH 09/10] handle some more review comments

---
 flang/lib/Frontend/CompilerInvocation.cpp      | 6 ++----
 flang/lib/Frontend/FrontendActions.cpp         | 6 +++---
 flang/test/Driver/do_concurrent_to_omp_cli.f90 | 4 ++--
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/flang/lib/Frontend/CompilerInvocation.cpp b/flang/lib/Frontend/CompilerInvocation.cpp
index 354ee29f314d3..809e423f5aae9 100644
--- a/flang/lib/Frontend/CompilerInvocation.cpp
+++ b/flang/lib/Frontend/CompilerInvocation.cpp
@@ -157,13 +157,13 @@ static bool parseDebugArgs(Fortran::frontend::CodeGenOptions &opts,
   return true;
 }
 
-static bool parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
+static void parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
                                      llvm::opt::ArgList &args,
                                      clang::DiagnosticsEngine &diags) {
   llvm::opt::Arg *arg =
       args.getLastArg(clang::driver::options::OPT_fdo_concurrent_to_openmp_EQ);
   if (!arg)
-    return true;
+    return;
 
   using DoConcurrentMappingKind =
       Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
@@ -178,11 +178,10 @@ static bool parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
   if (!val.has_value()) {
     diags.Report(clang::diag::err_drv_invalid_value)
         << arg->getAsString(args) << arg->getValue();
-    return false;
+    return;
  }
 
   opts.setDoConcurrentMapping(val.value());
-  return true;
 }
 
 static bool parseVectorLibArg(Fortran::frontend::CodeGenOptions &opts,
diff --git a/flang/lib/Frontend/FrontendActions.cpp b/flang/lib/Frontend/FrontendActions.cpp
index 0a5c0fb05c79c..ccc8c7d96135f 100644
--- a/flang/lib/Frontend/FrontendActions.cpp
+++ b/flang/lib/Frontend/FrontendActions.cpp
@@ -367,9 +367,9 @@ bool CodeGenAction::beginSourceFileAction() {
       !isOpenMPEnabled) {
     unsigned diagID = ci.getDiagnostics().getCustomDiagID(
         clang::DiagnosticsEngine::Warning,
-        "lowering `do concurrent` loops to OpenMP is only supported if "
-        "OpenMP is enabled. Enable OpenMP using `-fopenmp`. `do concurrent` "
-        "loops will be serialized.");
+        "OpenMP is required for lowering `do concurrent` loops to OpenMP."
+        "Enable OpenMP using `-fopenmp`."
+        "`do concurrent` loops will be serialized.");
     ci.getDiagnostics().Report(diagID);
     opts.doConcurrentMappingKind = DoConcurrentMappingKind::DCMK_None;
   }
diff --git a/flang/test/Driver/do_concurrent_to_omp_cli.f90 b/flang/test/Driver/do_concurrent_to_omp_cli.f90
index 158d256bdc4b2..41b7575e206af 100644
--- a/flang/test/Driver/do_concurrent_to_omp_cli.f90
+++ b/flang/test/Driver/do_concurrent_to_omp_cli.f90
@@ -13,8 +13,8 @@
 ! RUN: %flang -fdo-concurrent-to-openmp=host %s 2>&1 \
 ! RUN: | FileCheck %s --check-prefix=OPT
 
-! OPT: warning: lowering `do concurrent` loops to OpenMP is only supported if OpenMP is enabled.
-! OPT-SAME: Enable OpenMP using `-fopenmp`.
+! OPT: warning: OpenMP is required for lowering `do concurrent` loops to OpenMP.
+! OPT-SAME:     Enable OpenMP using `-fopenmp`.
 
 program test_cli
 end program

>From 6febde358460d685cd68045ba69ac539c3acb81b Mon Sep 17 00:00:00 2001
From: ergawy <kareem.ergawy at amd.com>
Date: Mon, 17 Feb 2025 01:36:36 -0600
Subject: [PATCH 10/10] handle some more review comments

---
 flang/docs/DoConcurrentConversionToOpenMP.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
index 26215345b64a3..43a8ff47161de 100644
--- a/flang/docs/DoConcurrentConversionToOpenMP.md
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -36,9 +36,9 @@ compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
 3. `none`: this disables `do concurrent` mapping altogether. In that case, such
    loops are emitted as sequential loops.
 
-The above compiler switch is currently available only when OpenMP is also
-enabled. So you need to provide the following options to flang in order to
-enable it:
+The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
+OpenMP is also enabled. So you need to provide the following options to flang in
+order to enable it:
 ```
 flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
 ```
@@ -113,14 +113,14 @@ relevant dialects as well.
 ### More advanced detection of loop nests
 
 As pointed out earlier, any intervening code between the headers of 2 nested
-`do concurrent` loops prevents us currently from detecting this as a loop nest.
-In some cases this is overly conservative. Therefore, a more flexible detection
-logic of loop nests needs to be implemented.
+`do concurrent` loops prevents us from detecting this as a loop nest. In some
+cases this is overly conservative. Therefore, more flexible loop-nest
+detection logic needs to be implemented.
 
 ### Data-dependence analysis
 
 Right now, we map loop nests without analysing whether such mapping is safe to
-do or not. We probably need to at least warn the use of unsafe loop nests due
+do or not. We probably need to at least warn the user of unsafe loop nests due
 to loop-carried dependencies.
 
 ### Non-rectangular loop nests

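To make the data-dependence concern in the final hunk concrete, the following is a sketch of a loop that would be unsafe to map (a hypothetical example; by the Fortran rules for `do concurrent`, this dependence already makes the program non-conforming, which is exactly the kind of user error a warning could surface):

```fortran
! Iteration i reads a(i-1), which iteration i-1 writes: a loop-carried
! dependence. Mapping this loop to a parallel OpenMP construct is unsafe.
do concurrent (i = 2:n)
  a(i) = a(i-1) + 1
end do
```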


More information about the flang-commits mailing list