[flang-commits] [flang] [flang] Inline scalar-to-array hlfir.assign at -O0 (PR #197092)

Mon May 11 20:49:47 PDT 2026

https://github.com/Saieiei created https://github.com/llvm/llvm-project/pull/197092

At `-O0`, Flang can lower trivial scalar-to-array broadcasts such as `c = a(1) + 1.0` through `_FortranAAssign`. That runtime path can call `free()`, which is not valid in OpenMP GPU device code.

This patch teaches `InlineHLFIRAssign` to handle trivial scalar RHS values and runs the pass before ordered assignments are lowered at all optimization levels, including `-O0`.
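
The resulting pass order can be sketched with a simplified C++ model. This is an illustration only, not the code in `Pipelines.cpp`: `hlfirPipelineTail` and the integer `optLevel` encoding are hypothetical, only passes named in the diff are listed, and the guard on the first block is an assumption, since the hunk does not show the enclosing condition.

```cpp
#include <string>
#include <vector>

// Simplified model of the tail of the HLFIR-to-FIR pipeline after this patch.
// Hypothetical helper: optLevel is 0..3, standing in for -O0..-O3.
std::vector<std::string> hlfirPipelineTail(int optLevel) {
  std::vector<std::string> passes;
  if (optLevel > 0) { // assumption: the enclosing block runs only when optimizing
    passes.push_back("PropagateFortranVariableAttributes");
    passes.push_back("OptimizedBufferization");
  }
  // Moved out of the optimizing-only block: now runs at every level.
  passes.push_back("InlineHLFIRAssign");
  if (optLevel == 3)
    passes.push_back("InlineHLFIRCopyIn");
  // Ordered assignments are still lowered after the inlining pass.
  passes.push_back("LowerHLFIROrderedAssignments");
  return passes;
}
```

In this model, `-O0` yields `InlineHLFIRAssign` followed by `LowerHLFIROrderedAssignments`, which matches the order checked in the updated `mlir-debug-pass-pipeline.f90` test.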

Scalar RHS values are materialized before the generated loop with `loadTrivialScalar`, preserving intrinsic assignment ordering for cases like `a = a(1)`. Array-to-array alias handling is unchanged.
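
As a rough illustration of the shape of the inlined code, here is a plain C++ stand-in for the generated FIR. `broadcastAssign` is a hypothetical name and not part of Flang. The scalar is read once into a value, and the loop then stores that value to every element, so the RHS is fully evaluated before any part of the LHS is written, even when the RHS refers to an element of the LHS:

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the loop generated for `a = <scalar>`.
// `rhs` may point into `lhs`, as in the Fortran statement `a = a(1)`.
template <typename T, std::size_t N>
void broadcastAssign(std::array<T, N> &lhs, const T *rhs) {
  // Materialize the scalar before the loop (the role loadTrivialScalar
  // plays in the patch): one read, ahead of any store to lhs.
  const T value = *rhs;
  for (std::size_t i = 0; i < N; ++i)
    lhs[i] = value; // the loop body only stores the already-loaded value
}
```

For example, calling `broadcastAssign(a, &a[0])` on an array holding `{3, 1, 2, 5}` leaves every element equal to 3.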

The functional change is limited to `InlineHLFIRAssign.cpp` and the HLFIR-to-FIR pipeline in `Pipelines.cpp`. The remaining files are test updates: scalar-to-array assignments are now inlined at `-O0` instead of being lowered through `_FortranAAssign`, so the expected output changes.

Other device-runtime call paths are left out of scope.

Fixes #197091.

From 127ba14d5617663ff00103be71affa8547937daa Mon Sep 17 00:00:00 2001
From: Sairudra More <moresair at pe31.hpc.amslabs.hpecorp.net>
Date: Mon, 11 May 2026 22:08:16 -0500
Subject: [PATCH] [flang] Inline scalar-to-array hlfir.assign at O0

InlineHLFIRAssign previously rejected scalar RHS operands and only ran at optimizing levels. As a result, simple scalar-to-array broadcasts such as 'c = a(1) + 1.0' were lowered through _FortranAAssign at O0, which can introduce malloc/free-based runtime paths that are not valid in OpenMP device code.

Allow trivial scalar RHS assignments to be inlined by materializing the scalar value before generating the element loop. This preserves intrinsic assignment ordering for cases such as 'a = a(1)', where the RHS must be evaluated before the LHS is defined.

Run InlineHLFIRAssign at all optimization levels so the same lowering is used at O0. Array-to-array assignments continue to use the existing alias checks before reusing genNoAliasArrayAssignment.
---
 .../HLFIR/Transforms/InlineHLFIRAssign.cpp    |  29 ++--
 flang/lib/Optimizer/Passes/Pipelines.cpp      |  15 +-
 .../test/Driver/mlir-debug-pass-pipeline.f90  |   5 +
 flang/test/Driver/mlir-pass-pipeline.f90      |   9 +-
 flang/test/HLFIR/inline-hlfir-assign.fir      |  52 +++++++
 .../parallel-private-reduction-worstcase.f90  | 128 ++++++++++++------
 .../Integration/OpenMP/private-global.f90     |  39 ++----
 .../OpenMP/workshare-scalar-array-mul.f90     |   6 +-
 flang/test/Integration/prefetch.f90           |   1 -
 ...workdistribute-saxpy-and-scalar-assign.f90 |   8 +-
 .../OpenMP/workdistribute-scalar-assign.f90   |   8 +-
 flang/test/Lower/array-derived.f90            |   6 +-
 12 files changed, 210 insertions(+), 96 deletions(-)

diff --git a/flang/lib/Optimizer/HLFIR/Transforms/InlineHLFIRAssign.cpp b/flang/lib/Optimizer/HLFIR/Transforms/InlineHLFIRAssign.cpp
index 160efede12bd5..5554f23eb1fc4 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/InlineHLFIRAssign.cpp
+++ b/flang/lib/Optimizer/HLFIR/Transforms/InlineHLFIRAssign.cpp
@@ -42,7 +42,9 @@ static llvm::cl::opt<bool> inlineAllocatableExprAssignFlag(
 
 namespace {
 /// Expand hlfir.assign of array RHS to array LHS into a loop nest
-/// of element-by-element assignments:
+/// of element-by-element assignments. Also handles scalar RHS broadcast
+/// to an array LHS; scalar RHS values are evaluated before the loop.
+///
 ///   hlfir.assign %4 to %5 : !fir.ref<!fir.array<3x3xf32>>,
 ///                           !fir.ref<!fir.array<3x3xf32>>
 /// into:
@@ -57,8 +59,8 @@ namespace {
 ///     }
 ///   }
 ///
-/// The transformation is correct only when LHS and RHS do not alias.
-/// When RHS is an array expression, then there is no aliasing.
+/// For array RHS, the transformation is correct only when LHS and RHS
+/// do not alias. When RHS is an array expression, there is no aliasing.
 /// This transformation does not support runtime checking for
 /// non-conforming LHS/RHS arrays' shapes currently.
 class InlineHLFIRAssignConversion
@@ -74,21 +76,17 @@ class InlineHLFIRAssignConversion
                                          "AssignOp may imply allocation");
 
     hlfir::Entity rhs{assign.getRhs()};
+    hlfir::Entity lhs{assign.getLhs()};
 
-    if (!rhs.isArray())
+    if (!lhs.isArray())
       return rewriter.notifyMatchFailure(assign,
-                                         "AssignOp's RHS is not an array");
+                                         "AssignOp's LHS is not an array");
 
     mlir::Type rhsEleTy = rhs.getFortranElementType();
     if (!fir::isa_trivial(rhsEleTy))
       return rewriter.notifyMatchFailure(
           assign, "AssignOp's RHS data type is not trivial");
 
-    hlfir::Entity lhs{assign.getLhs()};
-    if (!lhs.isArray())
-      return rewriter.notifyMatchFailure(assign,
-                                         "AssignOp's LHS is not an array");
-
     mlir::Type lhsEleTy = lhs.getFortranElementType();
     if (!fir::isa_trivial(lhsEleTy))
       return rewriter.notifyMatchFailure(
@@ -98,7 +96,7 @@ class InlineHLFIRAssignConversion
       return rewriter.notifyMatchFailure(assign,
                                          "RHS/LHS element types mismatch");
 
-    if (!mlir::isa<hlfir::ExprType>(rhs.getType())) {
+    if (rhs.isArray() && !mlir::isa<hlfir::ExprType>(rhs.getType())) {
       // If RHS is not an hlfir.expr, then we should prove that
       // LHS and RHS do not alias.
       // TODO: if they may alias, we can insert hlfir.as_expr for RHS,
@@ -124,6 +122,15 @@ class InlineHLFIRAssignConversion
     mlir::Location loc = assign->getLoc();
     fir::FirOpBuilder builder(rewriter, assign.getOperation());
     builder.setInsertionPoint(assign);
+
+    // Materialize scalar RHS before the assignment loop. Fortran 10.2.1.2
+    // requires that the RHS expression is fully evaluated before any part
+    // of the LHS variable is defined. When the scalar RHS is a reference
+    // into the LHS array (e.g. a = a(1)), loading it inside the loop
+    // would read a potentially modified value.
+    if (!rhs.isArray())
+      rhs = hlfir::loadTrivialScalar(loc, builder, rhs);
+
     mlir::ArrayAttr accessGroups;
     if (auto attrs = assign.getOperation()->getAttrOfType<mlir::ArrayAttr>(
             fir::getAccessGroupsAttrName()))
diff --git a/flang/lib/Optimizer/Passes/Pipelines.cpp b/flang/lib/Optimizer/Passes/Pipelines.cpp
index 920d6f86a355e..a3a26d63d693a 100644
--- a/flang/lib/Optimizer/Passes/Pipelines.cpp
+++ b/flang/lib/Optimizer/Passes/Pipelines.cpp
@@ -304,13 +304,16 @@ void createHLFIRToFIRPassPipeline(mlir::PassManager &pm,
         pm, hlfir::createPropagateFortranVariableAttributes);
     addNestedPassToAllTopLevelOperations<PassConstructor>(
         pm, hlfir::createOptimizedBufferization);
+  }
+  // Inline trivial array assignments at all optimization levels.
+  // At O0, this avoids emitting Fortran runtime calls (e.g. _FortranAAssign)
+  // that use malloc/free in device code generated by OpenMP target offloading,
+  // where free() is not available.
+  addNestedPassToAllTopLevelOperations<PassConstructor>(
+      pm, hlfir::createInlineHLFIRAssign);
+  if (optLevel == llvm::OptimizationLevel::O3) {
     addNestedPassToAllTopLevelOperations<PassConstructor>(
-        pm, hlfir::createInlineHLFIRAssign);
-
-    if (optLevel == llvm::OptimizationLevel::O3) {
-      addNestedPassToAllTopLevelOperations<PassConstructor>(
-          pm, hlfir::createInlineHLFIRCopyIn);
-    }
+        pm, hlfir::createInlineHLFIRCopyIn);
   }
   pm.addPass(hlfir::createLowerHLFIROrderedAssignments(
       {/*tryFusingAssignments=*/optLevel.isOptimizingForSpeed()}));
diff --git a/flang/test/Driver/mlir-debug-pass-pipeline.f90 b/flang/test/Driver/mlir-debug-pass-pipeline.f90
index 3f6bde2ded67b..6df4ead0fbf19 100644
--- a/flang/test/Driver/mlir-debug-pass-pipeline.f90
+++ b/flang/test/Driver/mlir-debug-pass-pipeline.f90
@@ -31,14 +31,19 @@
 ! ALL-NEXT: Pipeline Collection : ['fir.global', 'func.func', 'omp.declare_mapper', 'omp.declare_reduction', 'omp.private']
 ! ALL-NEXT: 'fir.global' Pipeline
 ! ALL-NEXT:   InlineElementals
+! ALL-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT: 'func.func' Pipeline
 ! ALL-NEXT:   InlineElementals
+! ALL-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT: 'omp.declare_mapper' Pipeline
 ! ALL-NEXT:   InlineElementals
+! ALL-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT: 'omp.declare_reduction' Pipeline
 ! ALL-NEXT:   InlineElementals
+! ALL-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT: 'omp.private' Pipeline
 ! ALL-NEXT:   InlineElementals
+! ALL-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT: LowerHLFIROrderedAssignments
 ! ALL-NEXT: LowerHLFIRIntrinsics
 ! ALL-NEXT: BufferizeHLFIR
diff --git a/flang/test/Driver/mlir-pass-pipeline.f90 b/flang/test/Driver/mlir-pass-pipeline.f90
index 630076a7947ff..a1e0143859bba 100644
--- a/flang/test/Driver/mlir-pass-pipeline.f90
+++ b/flang/test/Driver/mlir-pass-pipeline.f90
@@ -1,8 +1,8 @@
 ! Test the MLIR pass pipeline
 
-! RUN: %flang_fc1 -S -mmlir --mlir-pass-statistics -mmlir --mlir-pass-statistics-display=pipeline -o /dev/null %s 2>&1 | FileCheck --check-prefixes=ALL %s
+! RUN: %flang_fc1 -S -mmlir --mlir-pass-statistics -mmlir --mlir-pass-statistics-display=pipeline -o /dev/null %s 2>&1 | FileCheck --check-prefixes=ALL,O0 %s
 ! -O0 is the default:
-! RUN: %flang_fc1 -S -mmlir --mlir-pass-statistics -mmlir --mlir-pass-statistics-display=pipeline %s -O0 -o /dev/null 2>&1 | FileCheck --check-prefixes=ALL %s
+! RUN: %flang_fc1 -S -mmlir --mlir-pass-statistics -mmlir --mlir-pass-statistics-display=pipeline %s -O0 -o /dev/null 2>&1 | FileCheck --check-prefixes=ALL,O0 %s
 ! RUN: %flang_fc1 -S -mmlir --mlir-pass-statistics -mmlir --mlir-pass-statistics-display=pipeline %s -O2 -o /dev/null 2>&1 | FileCheck --check-prefixes=ALL,O2 %s
 
 ! REQUIRES: asserts
@@ -31,18 +31,23 @@
 ! ALL-NEXT:'fir.global' Pipeline
 ! O2-NEXT:   SimplifyHLFIRIntrinsics
 ! ALL:       InlineElementals
+! O0-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT:'func.func' Pipeline
 ! O2-NEXT:   SimplifyHLFIRIntrinsics
 ! ALL:       InlineElementals
+! O0-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT:'omp.declare_mapper' Pipeline
 ! O2-NEXT:   SimplifyHLFIRIntrinsics
 ! ALL:       InlineElementals
+! O0-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT:'omp.declare_reduction' Pipeline
 ! O2-NEXT:   SimplifyHLFIRIntrinsics
 ! ALL:       InlineElementals
+! O0-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT:'omp.private' Pipeline
 ! O2-NEXT:   SimplifyHLFIRIntrinsics
 ! ALL:       InlineElementals
+! O0-NEXT:   InlineHLFIRAssign
 ! O2-NEXT: Canonicalizer
 ! O2-NEXT: CSE
 ! O2-NEXT: (S) {{.*}} num-cse'd
diff --git a/flang/test/HLFIR/inline-hlfir-assign.fir b/flang/test/HLFIR/inline-hlfir-assign.fir
index 797ef6e81946a..e3e74721127aa 100644
--- a/flang/test/HLFIR/inline-hlfir-assign.fir
+++ b/flang/test/HLFIR/inline-hlfir-assign.fir
@@ -444,3 +444,55 @@ func.func @_QPtest_disjoint(%arg0: !fir.ref<!fir.array<10x10xf32>>) {
 // CHECK:           }
 // CHECK:           return
 // CHECK:         }
+
+// Test scalar-to-array broadcast: c = scalar_val
+// This should be inlined into a loop that stores the scalar to each element.
+func.func @_QPtest_scalar_to_array(%arg0: !fir.ref<f32>, %arg1: !fir.ref<!fir.array<10xf32>>) {
+  %c10 = arith.constant 10 : index
+  %0 = fir.shape %c10 : (index) -> !fir.shape<1>
+  %1:2 = hlfir.declare %arg1(%0) {uniq_name = "_QFtestEc"} : (!fir.ref<!fir.array<10xf32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xf32>>, !fir.ref<!fir.array<10xf32>>)
+  %2 = fir.load %arg0 : !fir.ref<f32>
+  hlfir.assign %2 to %1#0 : f32, !fir.ref<!fir.array<10xf32>>
+  return
+}
+// CHECK-LABEL:   func.func @_QPtest_scalar_to_array(
+// CHECK-SAME:                                       %[[VAL_0:.*]]: !fir.ref<f32>,
+// CHECK-SAME:                                       %[[VAL_1:.*]]: !fir.ref<!fir.array<10xf32>>) {
+// CHECK:           %[[VAL_2:.*]] = arith.constant 1 : index
+// CHECK:           %[[VAL_3:.*]] = arith.constant 10 : index
+// CHECK:           %[[VAL_4:.*]] = fir.shape %[[VAL_3]] : (index) -> !fir.shape<1>
+// CHECK:           %[[VAL_5:.*]]:2 = hlfir.declare %[[VAL_1]](%[[VAL_4]]) {uniq_name = "_QFtestEc"} : (!fir.ref<!fir.array<10xf32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xf32>>, !fir.ref<!fir.array<10xf32>>)
+// CHECK:           %[[VAL_6:.*]] = fir.load %[[VAL_0]] : !fir.ref<f32>
+// CHECK:           fir.do_loop %[[VAL_7:.*]] = %[[VAL_2]] to %[[VAL_3]] step %[[VAL_2]] unordered {
+// CHECK:             %[[VAL_8:.*]] = hlfir.designate %[[VAL_5]]#0 (%[[VAL_7]])  : (!fir.ref<!fir.array<10xf32>>, index) -> !fir.ref<f32>
+// CHECK:             hlfir.assign %[[VAL_6]] to %[[VAL_8]] : f32, !fir.ref<f32>
+// CHECK:           }
+// CHECK:           return
+// CHECK:         }
+
+// Test scalar RHS that is a reference into the LHS array: a = a(1)
+// The scalar must be loaded BEFORE the loop to satisfy Fortran 10.2.1.2
+// (expression evaluation precedes variable definition).
+func.func @_QPtest_scalar_ref_from_lhs(%arg0: !fir.ref<!fir.array<10xf32>>) {
+  %c10 = arith.constant 10 : index
+  %c1 = arith.constant 1 : index
+  %0 = fir.shape %c10 : (index) -> !fir.shape<1>
+  %1:2 = hlfir.declare %arg0(%0) {uniq_name = "_QFtestEa"} : (!fir.ref<!fir.array<10xf32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xf32>>, !fir.ref<!fir.array<10xf32>>)
+  %2 = hlfir.designate %1#0 (%c1) : (!fir.ref<!fir.array<10xf32>>, index) -> !fir.ref<f32>
+  hlfir.assign %2 to %1#0 : !fir.ref<f32>, !fir.ref<!fir.array<10xf32>>
+  return
+}
+// CHECK-LABEL:   func.func @_QPtest_scalar_ref_from_lhs(
+// CHECK-SAME:                                           %[[VAL_0:.*]]: !fir.ref<!fir.array<10xf32>>) {
+// CHECK:           %[[C10:.*]] = arith.constant 10 : index
+// CHECK:           %[[C1:.*]] = arith.constant 1 : index
+// CHECK:           %[[SHAPE:.*]] = fir.shape %[[C10]] : (index) -> !fir.shape<1>
+// CHECK:           %[[DECL:.*]]:2 = hlfir.declare %[[VAL_0]](%[[SHAPE]]) {uniq_name = "_QFtestEa"} : (!fir.ref<!fir.array<10xf32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xf32>>, !fir.ref<!fir.array<10xf32>>)
+// CHECK:           %[[A1_REF:.*]] = hlfir.designate %[[DECL]]#0 (%[[C1]]) : (!fir.ref<!fir.array<10xf32>>, index) -> !fir.ref<f32>
+// CHECK:           %[[A1_VAL:.*]] = fir.load %[[A1_REF]] : !fir.ref<f32>
+// CHECK:           fir.do_loop %[[IV:.*]] = %[[C1]] to %[[C10]] step %[[C1]] unordered {
+// CHECK:             %[[ELEM:.*]] = hlfir.designate %[[DECL]]#0 (%[[IV]]) : (!fir.ref<!fir.array<10xf32>>, index) -> !fir.ref<f32>
+// CHECK:             hlfir.assign %[[A1_VAL]] to %[[ELEM]] : f32, !fir.ref<f32>
+// CHECK:           }
+// CHECK:           return
+// CHECK:         }
diff --git a/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90 b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
index c6a46691d58f5..c4688a6e8a192 100644
--- a/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
+++ b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
@@ -50,7 +50,7 @@ subroutine worst_case(a, b, c, d)
 ! CHECK:         br i1 %{{.*}}, label %omp.private.init3, label %omp.private.init4
 
 ! CHECK:       omp.private.init4:                               ; preds = %omp.private.init2
-!                [finish private alloc for second var with zero extent]
+!                [finish private alloc for first var with zero extent]
 ! CHECK:         br label %omp.private.init5
 
 ! CHECK:       omp.private.init5:                               ; preds = %omp.private.init3, %omp.private.init4
@@ -61,13 +61,13 @@ subroutine worst_case(a, b, c, d)
 ! CHECK-NEXT:    br label %omp.private.init7
 
 ! CHECK:       omp.private.init7:
-!                [begin private alloc for first var]
+!                [begin private alloc for second var]
 !                [read the length from the mold argument]
 !                [if it is non-zero...]
 ! CHECK:         br i1 {{.*}}, label %omp.private.init8, label %omp.private.init9
 
 ! CHECK:       omp.private.init9:                               ; preds = %omp.private.init7
-!                [finish private alloc for first var with zero extent]
+!                [finish private alloc for second var with zero extent]
 ! CHECK:         br label %omp.private.init10
 
 ! CHECK:       omp.private.init10:                               ; preds = %omp.private.init8, %omp.private.init9
@@ -105,50 +105,64 @@ subroutine worst_case(a, b, c, d)
 ! CHECK-NEXT:    br label %omp.reduction.init
 
 ! CHECK:       omp.reduction.init:                               ; preds = %omp.region.cont15
-!                [deffered stores for results of reduction alloc regions]
+!                [deferred stores for results of reduction alloc regions]
 ! CHECK:         br label %[[VAL_96:.*]]
 
 ! CHECK:       omp.reduction.neutral:                            ; preds = %omp.reduction.init
-!                [start of reduction initialization region]
+!                [start of reduction initialization region for first var]
 !                [null check:]
 ! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral20, label %omp.reduction.neutral21
 
 ! CHECK:       omp.reduction.neutral21:                          ; preds = %omp.reduction.neutral
-!                [malloc and assign the default value to the reduction variable]
+!                [malloc the reduction variable]
 ! CHECK:         br label %omp.reduction.neutral22
 
-! CHECK:       omp.reduction.neutral22:                          ; preds = %omp.reduction.neutral20, %omp.reduction.neutral21
+! CHECK:       omp.reduction.neutral22:                          ; preds = %omp.reduction.neutral23, %omp.reduction.neutral21
+!                [inlined scalar-to-array init loop header]
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral23, label %omp.reduction.neutral24
+
+! CHECK:       omp.reduction.neutral24:                          ; preds = %omp.reduction.neutral22
+! CHECK:         br label %omp.reduction.neutral25
+
+! CHECK:       omp.reduction.neutral25:                          ; preds = %omp.reduction.neutral20, %omp.reduction.neutral24
 ! CHECK-NEXT:    br label %omp.region.cont19
 
-! CHECK:       omp.region.cont19:                                ; preds = %omp.reduction.neutral22
+! CHECK:       omp.region.cont19:                                ; preds = %omp.reduction.neutral25
 ! CHECK-NEXT:    %{{.*}} = phi ptr
-! CHECK-NEXT:    br label %omp.reduction.neutral24
+! CHECK-NEXT:    br label %omp.reduction.neutral27
 
-! CHECK:       omp.reduction.neutral24:                          ; preds = %omp.region.cont19
-!                [start of reduction initialization region]
+! CHECK:       omp.reduction.neutral27:                          ; preds = %omp.region.cont19
+!                [start of reduction initialization region for second var]
 !                [null check:]
-! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral25, label %omp.reduction.neutral26
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral28, label %omp.reduction.neutral29
+
+! CHECK:       omp.reduction.neutral29:                          ; preds = %omp.reduction.neutral27
+!                [malloc the reduction variable]
+! CHECK:         br label %omp.reduction.neutral30
+
+! CHECK:       omp.reduction.neutral30:                          ; preds = %omp.reduction.neutral31, %omp.reduction.neutral29
+!                [inlined scalar-to-array init loop header]
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral31, label %omp.reduction.neutral32
 
-! CHECK:       omp.reduction.neutral26:                          ; preds = %omp.reduction.neutral24
-!                [malloc and assign the default value to the reduction variable]
-! CHECK:         br label %omp.reduction.neutral27
+! CHECK:       omp.reduction.neutral32:                          ; preds = %omp.reduction.neutral30
+! CHECK:         br label %omp.reduction.neutral33
 
-! CHECK:       omp.reduction.neutral27:                          ; preds = %omp.reduction.neutral25, %omp.reduction.neutral26
-! CHECK-NEXT:    br label %omp.region.cont23
+! CHECK:       omp.reduction.neutral33:                          ; preds = %omp.reduction.neutral28, %omp.reduction.neutral32
+! CHECK-NEXT:    br label %omp.region.cont26
 
-! CHECK:       omp.region.cont23:                                ; preds = %omp.reduction.neutral27
+! CHECK:       omp.region.cont26:                                ; preds = %omp.reduction.neutral33
 ! CHECK-NEXT:    %{{.*}} = phi ptr
-! CHECK-NEXT:    br label %omp.par.region29
+! CHECK-NEXT:    br label %omp.par.region35
 
-! CHECK:       omp.par.region29:                                 ; preds = %omp.region.cont23
+! CHECK:       omp.par.region35:                                 ; preds = %omp.region.cont26
 !                [call SUM runtime function]
 !                [if (sum(a) == 1)]
-! CHECK:         br i1 %{{.*}}, label %omp.par.region30, label %omp.par.region31
+! CHECK:         br i1 %{{.*}}, label %omp.par.region36, label %omp.par.region37
 
-! CHECK:       omp.par.region31:                                 ; preds = %omp.par.region29
-! CHECK-NEXT:    br label %omp.region.cont28
+! CHECK:       omp.par.region37:                                 ; preds = %omp.par.region35
+! CHECK-NEXT:    br label %omp.region.cont34
 
-! CHECK:       omp.region.cont28:                                ; preds = %omp.par.region30, %omp.par.region31
+! CHECK:       omp.region.cont34:                                ; preds = %omp.par.region36, %omp.par.region37
 !                [omp parallel region done, call into the runtime to complete reduction]
 ! CHECK:         %[[VAL_233:.*]] = call i32 @__kmpc_reduce(
 ! CHECK:         switch i32 %[[VAL_233]], label %reduce.finalize [
@@ -156,16 +170,16 @@ subroutine worst_case(a, b, c, d)
 ! CHECK-NEXT:      i32 2, label %reduce.switch.atomic
 ! CHECK-NEXT:    ]
 
-! CHECK:       reduce.switch.atomic:                             ; preds = %omp.region.cont28
+! CHECK:       reduce.switch.atomic:                             ; preds = %omp.region.cont34
 ! CHECK-NEXT:    unreachable
 
-! CHECK:       reduce.switch.nonatomic:                          ; preds = %omp.region.cont28
+! CHECK:       reduce.switch.nonatomic:                          ; preds = %omp.region.cont34
 ! CHECK-NEXT:    %[[red_private_value_0:.*]] = load ptr, ptr %{{.*}}, align 8
 ! CHECK-NEXT:    br label %omp.reduction.nonatomic.body
 
 !              [various blocks implementing the reduction]
 
-! CHECK:       omp.region.cont36:                                ; preds =
+! CHECK:       omp.region.cont42:                                ; preds =
 ! CHECK-NEXT:    %{{.*}} = phi ptr
 ! CHECK-NEXT:    call void @__kmpc_end_reduce(
 ! CHECK-NEXT:    br label %reduce.finalize
@@ -182,30 +196,60 @@ subroutine worst_case(a, b, c, d)
 
 ! CHECK:       omp.reduction.cleanup:                            ; preds = %.fini
 !                [null check]
-! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup42, label %omp.reduction.cleanup43
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup48, label %omp.reduction.cleanup49
 
-! CHECK:       omp.reduction.cleanup43:                          ; preds = %omp.reduction.cleanup42, %omp.reduction.cleanup
-! CHECK-NEXT:    br label %omp.region.cont41
+! CHECK:       omp.reduction.cleanup49:                          ; preds = %omp.reduction.cleanup48, %omp.reduction.cleanup
+! CHECK-NEXT:    br label %omp.region.cont47
 
-! CHECK:       omp.region.cont41:                                ; preds = %omp.reduction.cleanup43
-! CHECK-NEXT:    %{{.*}} = load ptr, ptr
-! CHECK-NEXT:    br label %omp.reduction.cleanup45
+! CHECK:       omp.region.cont47:                                ; preds = %omp.reduction.cleanup49
+! CHECK:         br label %omp.reduction.cleanup51
 
-! CHECK:       omp.reduction.cleanup45:                          ; preds = %omp.region.cont41
+! CHECK:       omp.reduction.cleanup51:                          ; preds = %omp.region.cont47
 !                [null check]
-! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup46, label %omp.reduction.cleanup47
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup52, label %omp.reduction.cleanup53
+
+! CHECK:       omp.reduction.cleanup53:                          ; preds = %omp.reduction.cleanup52, %omp.reduction.cleanup51
+! CHECK-NEXT:    br label %omp.region.cont50
 
-! CHECK:       omp.par.region30:                                 ; preds = %omp.par.region29
+! CHECK:       omp.region.cont50:                                ; preds = %omp.reduction.cleanup53
+! CHECK-NEXT:    br label %omp.private.dealloc
+
+! CHECK:       omp.private.dealloc:                              ; preds = %omp.region.cont50
+!                [null check for first private var dealloc]
+! CHECK:         br i1 %{{.*}}, label %omp.private.dealloc55, label %omp.private.dealloc56
+
+! CHECK:       omp.private.dealloc56:                            ; preds = %omp.private.dealloc55, %omp.private.dealloc
+! CHECK-NEXT:    br label %omp.region.cont54
+
+! CHECK:       omp.region.cont54:                                ; preds = %omp.private.dealloc56
+! CHECK-NEXT:    br label %omp.private.dealloc58
+
+! CHECK:       omp.private.dealloc58:                            ; preds = %omp.region.cont54
+!                [null check for second private var dealloc]
+! CHECK:         br i1 %{{.*}}, label %omp.private.dealloc59, label %omp.private.dealloc60
+
+! CHECK:       omp.private.dealloc60:                            ; preds = %omp.private.dealloc59, %omp.private.dealloc58
+! CHECK-NEXT:    br label %omp.region.cont57
+
+! CHECK:       omp.par.region36:                                 ; preds = %omp.par.region35
 ! CHECK-NEXT:    call void @_FortranAStopStatement
 
-! CHECK:       omp.reduction.neutral25:                          ; preds = %omp.reduction.neutral24
-!                [source length was zero: finish initializing array]
-! CHECK:         br label %omp.reduction.neutral27
+! CHECK:       omp.reduction.neutral31:                          ; preds = %omp.reduction.neutral30
+!                [inlined init loop body for second var]
+! CHECK:         br label %omp.reduction.neutral30
 
-! CHECK:       omp.reduction.neutral20:                          ; preds = %omp.reduction.neutral
-!                [source length was zero: finish initializing array]
+! CHECK:       omp.reduction.neutral28:                          ; preds = %omp.reduction.neutral27
+!                [source length was zero: finish initializing second var]
+! CHECK:         br label %omp.reduction.neutral33
+
+! CHECK:       omp.reduction.neutral23:                          ; preds = %omp.reduction.neutral22
+!                [inlined init loop body for first var]
 ! CHECK:         br label %omp.reduction.neutral22
 
+! CHECK:       omp.reduction.neutral20:                          ; preds = %omp.reduction.neutral
+!                [source length was zero: finish initializing first var]
+! CHECK:         br label %omp.reduction.neutral25
+
 ! CHECK:       omp.private.copy17:                               ; preds = %omp.private.copy16
 !                [source length was non-zero: call assign runtime]
 ! CHECK:         br label %omp.private.copy18
@@ -222,5 +266,5 @@ subroutine worst_case(a, b, c, d)
 !                [var extent was non-zero: malloc a private array]
 ! CHECK:         br label %omp.private.init5
 
-! CHECK:       omp.par.exit.exitStub:                           ; preds = %omp.region.cont51
+! CHECK:       omp.par.exit.exitStub:                           ; preds = %omp.region.cont57
 ! CHECK-NEXT:    ret void
diff --git a/flang/test/Integration/OpenMP/private-global.f90 b/flang/test/Integration/OpenMP/private-global.f90
index ed11a95c4aeb1..4b27e6ddc79a4 100644
--- a/flang/test/Integration/OpenMP/private-global.f90
+++ b/flang/test/Integration/OpenMP/private-global.f90
@@ -17,34 +17,21 @@ program bug
 
 ! CHECK-LABEL: define internal void {{.*}}..omp_par(
 ! CHECK:       omp.par.entry:
-! CHECK:         %[[VAL_9:.*]] = alloca i32, align 4
-! CHECK:         %[[VAL_10:.*]] = load i32, ptr %[[VAL_11:.*]], align 4
-! CHECK:         store i32 %[[VAL_10]], ptr %[[VAL_9]], align 4
-! CHECK:         %[[VAL_12:.*]] = load i32, ptr %[[VAL_9]], align 4
 ! CHECK:         %[[PRIV_BOX_ALLOC:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
-! CHECK:         %[[ELEMENTAL_TMP:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
-! CHECK:         %[[ELEMENTAL_TMP_2:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
-! CHECK:         %[[TABLE_BOX_ADDR:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
-! CHECK:         %[[BOXED_FIFTY:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8 }, align 8
-! CHECK:         %[[FIFTY:.*]] = alloca i32, i64 1, align 4
-! CHECK:         %[[INTERMEDIATE:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
-! CHECK:         %[[TABLE_BOX_ADDR2:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, i64 1, align 8
 ! ...
-! check that we use the private copy of table for the assignment
+! check that the private copy is allocated via malloc
+! CHECK:       omp.private.init:
+! CHECK:         %[[PRIV_TABLE:.*]] = call ptr @malloc(i64 40)
+! ...
+! check that we use the private copy of table for the assignment (table = 50)
+! The assignment is now inlined as a loop instead of calling _FortranAAssign.
 ! CHECK:       omp.par.region1:
-! CHECK:         call void @llvm.memcpy.p0.p0.i32(ptr{{.*}}%[[INTERMEDIATE]], ptr{{.*}}%[[PRIV_BOX_ALLOC]], i32 {{4[48]}}, i1 false)
-! CHECK:         store i32 50, ptr %[[FIFTY]], align 4
-! CHECK:         %[[FIFTY_BOX_VAL:.*]] = insertvalue { ptr, i64, i32, i8, i8, i8, i8 } { ptr undef, i64 4, i32 20240719, i8 0, i8 9, i8 0, i8 0 }, ptr %[[FIFTY]], 0
-! CHECK:         store { ptr, i64, i32, i8, i8, i8, i8 } %[[FIFTY_BOX_VAL]], ptr %[[BOXED_FIFTY]], align {{[48]}}
-! CHECK:         call void @llvm.memcpy.p0.p0.i32(ptr %[[TABLE_BOX_ADDR2]], ptr %[[INTERMEDIATE]], i32 {{4[48]}}, i1 false)
-! CHECK:         call void @_FortranAAssign(ptr %[[TABLE_BOX_ADDR2]], ptr %[[BOXED_FIFTY]], ptr @{{.*}}, i32 9)
-! CHECK:         call void @llvm.memcpy.p0.p0.i32(ptr{{.*}}%[[TABLE_BOX_ADDR]], ptr{{.*}}%[[PRIV_BOX_ALLOC]], i32 {{4[48]}}, i1 false)
-! CHECK:         %[[PRIV_TABLE:.*]] = call ptr @malloc(i{{(32)|(64)}} 40)
+! CHECK:         call void @llvm.memcpy.p0.p0.i32(ptr{{.*}}%[[BOX_COPY:.*]], ptr{{.*}}%[[PRIV_BOX_ALLOC]], i32 48, i1 false)
+! ...
+! check that we use the private copy of table for table/=50 (inlined loop body)
+! CHECK:       omp.par.region6:
+! CHECK:         %[[VAL_44:.*]] = sub {{.*}} i64 %{{.*}}, 1
 ! ...
-! check that we use the private copy of table for table/=50
+! check that we store 50 into the private table's elements (inlined loop body)
 ! CHECK:       omp.par.region3:
-! CHECK:         %[[VAL_44:.*]] = sub nuw nsw i64 %{{.*}}, 1
-! CHECK:         %[[VAL_45:.*]] = mul nuw nsw i64 %[[VAL_44]], 1
-! CHECK:         %[[VAL_46:.*]] = mul nuw nsw i64 %[[VAL_45]], 1
-! CHECK:         %[[VAL_47:.*]] = add nuw nsw i64 %[[VAL_46]], 0
-! CHECK:         %[[VAL_48:.*]] = getelementptr nusw nuw i32, ptr %[[PRIV_TABLE]], i64 %[[VAL_47]]
+! CHECK:         store i32 50, ptr %{{.*}}, align 4
diff --git a/flang/test/Integration/OpenMP/workshare-scalar-array-mul.f90 b/flang/test/Integration/OpenMP/workshare-scalar-array-mul.f90
index 9b8ef66b48f47..24a82c4145be9 100644
--- a/flang/test/Integration/OpenMP/workshare-scalar-array-mul.f90
+++ b/flang/test/Integration/OpenMP/workshare-scalar-array-mul.f90
@@ -57,8 +57,12 @@ program test
 ! FIR-O0:      omp.wsloop {
 ! FIR-O0:        omp.loop_nest
 ! FIR-O0:          omp.yield
+! The scalar-to-array assignment (arr_01 = arr_01*2) is now inlined as a
+! second workshared loop instead of calling _FortranAAssign in omp.single.
+! FIR-O0:      omp.wsloop {
+! FIR-O0:        omp.loop_nest
+! FIR-O0:          omp.yield
 ! FIR-O0:      omp.single nowait {
-! FIR-O0:        fir.call @_FortranAAssign
 ! FIR-O0:        fir.freemem
 ! FIR-O0:        omp.terminator
 ! FIR-O0:      omp.barrier
diff --git a/flang/test/Integration/prefetch.f90 b/flang/test/Integration/prefetch.f90
index c015b6736972a..76227caf02b43 100644
--- a/flang/test/Integration/prefetch.f90
+++ b/flang/test/Integration/prefetch.f90
@@ -13,7 +13,6 @@
 !===============================================================================
 
 subroutine test_prefetch_01()
-    ! LLVM: {{.*}} = alloca i32, i64 1, align 4
     ! LLVM: %[[VAR_J:.*]] = alloca i32, i64 1, align 4
     ! LLVM: %[[VAR_I:.*]] = alloca i32, i64 1, align 4
     ! LLVM: %[[VAR_A:.*]] = alloca [256 x i32], i64 1, align 4
diff --git a/flang/test/Lower/OpenMP/workdistribute-saxpy-and-scalar-assign.f90 b/flang/test/Lower/OpenMP/workdistribute-saxpy-and-scalar-assign.f90
index 516c4603bd5da..c569831ff86df 100644
--- a/flang/test/Lower/OpenMP/workdistribute-saxpy-and-scalar-assign.f90
+++ b/flang/test/Lower/OpenMP/workdistribute-saxpy-and-scalar-assign.f90
@@ -1,4 +1,4 @@
-! RUN: %flang_fc1 -emit-fir -fopenmp -fopenmp-version=60 %s -o - | FileCheck %s
+! RUN: %flang_fc1 -emit-fir -fopenmp -fopenmp-version=60 %s -o - | FileCheck %s --implicit-check-not="fir.call @_FortranAAssign"
 
 ! CHECK-LABEL: func @_QPtarget_teams_workdistribute
 subroutine target_teams_workdistribute()
@@ -46,7 +46,11 @@ subroutine teams_workdistribute()
 
   y = a * x + y
 
-  ! CHECK: fir.call @_FortranAAssign
+  ! CHECK: omp.teams
+  ! CHECK: omp.parallel
+  ! CHECK: omp.distribute
+  ! CHECK: omp.wsloop
+  ! CHECK: omp.loop_nest
   y = 2.0_real32
 
   !$omp end teams workdistribute
diff --git a/flang/test/Lower/OpenMP/workdistribute-scalar-assign.f90 b/flang/test/Lower/OpenMP/workdistribute-scalar-assign.f90
index e0f773380d10a..217df8fb05176 100644
--- a/flang/test/Lower/OpenMP/workdistribute-scalar-assign.f90
+++ b/flang/test/Lower/OpenMP/workdistribute-scalar-assign.f90
@@ -1,4 +1,4 @@
-! RUN: %flang_fc1 -emit-fir -fopenmp -fopenmp-version=60 %s -o - | FileCheck %s
+! RUN: %flang_fc1 -emit-fir -fopenmp -fopenmp-version=60 %s -o - | FileCheck %s --implicit-check-not="fir.call @_FortranAAssign"
 
 ! CHECK-LABEL: func @_QPtarget_teams_workdistribute_scalar_assign
 subroutine target_teams_workdistribute_scalar_assign()
@@ -21,7 +21,11 @@ end subroutine target_teams_workdistribute_scalar_assign
 ! CHECK-LABEL: func @_QPteams_workdistribute_scalar_assign
 subroutine teams_workdistribute_scalar_assign()
   integer :: aa(10)
-  ! CHECK: fir.call @_FortranAAssign
+  ! CHECK: omp.teams
+  ! CHECK: omp.parallel
+  ! CHECK: omp.distribute
+  ! CHECK: omp.wsloop
+  ! CHECK: omp.loop_nest
   !$omp teams workdistribute
   aa = 20
   !$omp end teams workdistribute
diff --git a/flang/test/Lower/array-derived.f90 b/flang/test/Lower/array-derived.f90
index 4236fe6326741..d731874f8b1c8 100644
--- a/flang/test/Lower/array-derived.f90
+++ b/flang/test/Lower/array-derived.f90
@@ -91,7 +91,7 @@ subroutine test2(a1, a2)
     ! CHECK: %[[a1_f2_d_rebox:.*]] = fir.rebox %[[a1_f2_rebox]] [%[[slice_d]]]
 
     ! Assignment
-    ! CHECK: fir.call @_FortranAAssign
+    ! CHECK: fir.do_loop {{.*}} unordered
     a1%f2%d = a2%f1(1) + a2%f1(5) / a2%f1(3)
   end subroutine test2
 
@@ -128,7 +128,7 @@ subroutine test3(a3, a4)
     ! CHECK: %[[slice_a3_n:.*]] = fir.slice {{.*}} path %{{.*}}
     ! CHECK: %[[a3_n_rebox:.*]] = fir.rebox %[[a3_f2_rebox]] [%[[slice_a3_n]]]
 
-    ! CHECK: fir.call @_FortranAAssign
+    ! CHECK: fir.do_loop {{.*}} unordered
     a3%f(1,1)%f2%n = a4%f(2,2)%f1(4) - 4
 
     ! Assignment 2
@@ -140,7 +140,7 @@ subroutine test3(a3, a4)
     ! CHECK:   arith.addi
     ! CHECK:   fir.store
     ! CHECK: }
-    ! CHECK: fir.call @_FortranAAssign
+    ! CHECK: fir.do_loop {{.*}} unordered
     a4%f(3,3)%f1(2) = a3%f(1,2)%f2%d + 4
   end subroutine test3
 end module cs