[flang-commits] [flang] [flang] Restrict O0 hlfir.assign scalar-to-array inlining to OpenMP target device (PR #201774)

Fri Jun 5 01:06:10 PDT 2026

https://github.com/Saieiei created https://github.com/llvm/llvm-project/pull/201774

PR #197092 enabled `InlineHLFIRAssign{onlyScalarRHS=true}` at `-O0` to prevent `_FortranAAssign` (which uses `malloc`/`free`) from appearing in OpenMP target device code generated at `-O0`. However, running the pass for all `-O0` host compilations caused a debug regression: a line breakpoint on a scalar-to-array broadcast such as `arr = 11` now hits once per array element instead of once, because the assignment is expanded into an inline loop.
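The regression can be pictured with a small stand-alone C++ model. This is only a sketch of the two lowering shapes, not flang code: `LineHits`, `assignViaRuntimeCall`, `assignViaInlineLoop`, and `modelledBreakpointHits` are hypothetical names, and the counter merely stands in for how often a debugger would stop on the `arr = 11` line.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Counts how often execution reaches code attributed to the `arr = 11` line.
struct LineHits {
  int count = 0;
};

// Runtime-call shape: one call site carries the statement's line; the
// element-wise work happens inside the callee (a stand-in for the runtime).
template <std::size_t N>
void assignViaRuntimeCall(std::array<int, N> &arr, int value, LineHits &hits) {
  ++hits.count;    // the single call is the only stop on this line
  arr.fill(value); // models the work done inside the runtime routine
}

// Inlined shape: every loop iteration carries the statement's line.
template <std::size_t N>
void assignViaInlineLoop(std::array<int, N> &arr, int value, LineHits &hits) {
  for (std::size_t i = 0; i < N; ++i) {
    ++hits.count; // one stop per element
    arr[i] = value;
  }
}

// Runs `arr = 11` on a 4-element array and returns the modelled hit count.
int modelledBreakpointHits(bool inlined) {
  std::array<int, 4> arr{};
  LineHits hits;
  if (inlined)
    assignViaInlineLoop(arr, 11, hits);
  else
    assignViaRuntimeCall(arr, 11, hits);
  for (int element : arr)
    assert(element == 11 && "both shapes must produce the same array");
  return hits.count;
}
```

In this model both shapes fill the array identically; only the number of stops attributed to the source line differs, which is the behavior change seen in the debugger.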

This patch restricts the `-O0` scheduling of `InlineHLFIRAssign{onlyScalarRHS=true}` to OpenMP target-device compilations only. Host `-O0` falls back to `_FortranAAssign` and the single-breakpoint-hit debug behavior is restored.

`MLIRToLLVMPassPipelineConfig` gains an `EnableOpenMPIsTargetDevice` bool (alongside the existing `EnableOpenMP` / `EnableOpenMPSimd` flags). The flang frontend sets it from `LangOpts.OpenMPIsTargetDevice` in both `lowerHLFIRToFIR()` and `generateLLVMIR()`, and `bbc` sets it from `enableOpenMPDevice`.
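The resulting scheduling decision can be summarized in a minimal, self-contained sketch. It deliberately does not use the real flang types: `PipelineConfigSketch`, its `optimizing` field, `ScalarBroadcastLowering`, `selectLowering`, and `loweringFor` are assumptions made for illustration, and the optimized branch is reduced to a placeholder because the patch does not touch it.

```cpp
#include <cassert>

// Hypothetical stand-in for the relevant part of MLIRToLLVMPassPipelineConfig.
struct PipelineConfigSketch {
  bool optimizing = false;                 // false models -O0 (assumed field)
  bool EnableOpenMPIsTargetDevice = false; // mirrors the field added here
};

// How a scalar-to-array broadcast such as `arr = 11` ends up being lowered.
enum class ScalarBroadcastLowering {
  OptimizedPipeline, // optimized levels: unchanged by this patch, not modelled
  InlineLoop,        // InlineHLFIRAssign{onlyScalarRHS=true} is scheduled
  RuntimeCall        // no inlining pass: the runtime assign call remains
};

// Mirrors the shape of the branch structure in createHLFIRToFIRPassPipeline.
ScalarBroadcastLowering selectLowering(const PipelineConfigSketch &config) {
  if (config.optimizing)
    return ScalarBroadcastLowering::OptimizedPipeline;
  if (config.EnableOpenMPIsTargetDevice)
    return ScalarBroadcastLowering::InlineLoop;
  return ScalarBroadcastLowering::RuntimeCall;
}

// Convenience wrapper that builds the sketch config from two flags.
ScalarBroadcastLowering loweringFor(bool optimizing, bool isTargetDevice) {
  PipelineConfigSketch config;
  config.optimizing = optimizing;
  config.EnableOpenMPIsTargetDevice = isTargetDevice;
  return selectLowering(config);
}

// Usage example: host -O0 keeps the runtime call, device -O0 inlines.
void selfCheck() {
  assert(loweringFor(false, false) == ScalarBroadcastLowering::RuntimeCall);
  assert(loweringFor(false, true) == ScalarBroadcastLowering::InlineLoop);
}
```

The point of the sketch is that the new flag only matters on the non-optimizing path; optimized compilations take the first branch regardless of its value.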

Two new regression tests are added:
- `flang/test/Lower/HLFIR/scalar-to-array-assign-host-O0.f90`: verifies host `-O0` still emits `_FortranAAssign`.
- `flang/test/Lower/OpenMP/scalar-to-array-assign-target-device-O0.f90`: verifies device `-O0` inside `omp.target` still inlines the broadcast loop.

Fixing the per-element debug locations on the device-side inlined loop is a separate concern and is left as follow-up work.

From 17910682d4b140b4a49daa2628a7b65500e4b110 Mon Sep 17 00:00:00 2001
From: Sairudra More <sairudra60 at gmail.com>
Date: Thu, 4 Jun 2026 23:23:16 -0500
Subject: [PATCH] [flang] Restrict O0 scalar assign inlining to device code

PR #197092 added O0 scalar-to-array assignment inlining to avoid _FortranAAssign in OpenMP target device code.

That also changed normal host -g -O0 debugging: a breakpoint on a scalar broadcast such as 'arr = 11' could be hit once per array element.

Restrict the O0 scalar-RHS-only path to OpenMP target-device compilation, keeping host O0 on the existing runtime-call path.
---
 flang/include/flang/Tools/CrossToolHelpers.h  |   2 +
 flang/lib/Frontend/FrontendActions.cpp        |   5 +
 flang/lib/Optimizer/Passes/Pipelines.cpp      |  12 +-
 .../test/Driver/mlir-debug-pass-pipeline.f90  |   5 -
 flang/test/Driver/mlir-pass-pipeline.f90      |  10 +-
 .../parallel-private-reduction-worstcase.f90  | 128 ++++++------------
 .../Integration/OpenMP/private-global.f90     |  39 ++++--
 flang/test/Integration/prefetch.f90           |   1 +
 .../HLFIR/scalar-to-array-assign-host-O0.f90  |  17 +++
 ...calar-to-array-assign-target-device-O0.f90 |  18 +++
 ...workdistribute-saxpy-and-scalar-assign.f90 |   2 +-
 .../OpenMP/workdistribute-scalar-assign.f90   |   2 +-
 flang/tools/bbc/bbc.cpp                       |   2 +
 13 files changed, 128 insertions(+), 115 deletions(-)
 create mode 100644 flang/test/Lower/HLFIR/scalar-to-array-assign-host-O0.f90
 create mode 100644 flang/test/Lower/OpenMP/scalar-to-array-assign-target-device-O0.f90

diff --git a/flang/include/flang/Tools/CrossToolHelpers.h b/flang/include/flang/Tools/CrossToolHelpers.h
index 6240354bd899a..90e159cc157bf 100644
--- a/flang/include/flang/Tools/CrossToolHelpers.h
+++ b/flang/include/flang/Tools/CrossToolHelpers.h
@@ -141,6 +141,8 @@ struct MLIRToLLVMPassPipelineConfig : public FlangEPCallBacks {
                                       ///< functions.
   bool NSWOnLoopVarInc = true; ///< Add nsw flag to loop variable increments.
   bool EnableOpenMP = false; ///< Enable OpenMP lowering.
+  bool EnableOpenMPIsTargetDevice =
+      false; ///< Compiling for an OpenMP target device.
   bool UseSampleProfile = false; ///< Enable sample based profiling
   bool DebugInfoForProfiling = false; ///< Enable extra debugging info
   bool EnableOpenMPSimd = false; ///< Enable OpenMP simd-only mode.
diff --git a/flang/lib/Frontend/FrontendActions.cpp b/flang/lib/Frontend/FrontendActions.cpp
index 0d154a7157867..66602ed52f6cd 100644
--- a/flang/lib/Frontend/FrontendActions.cpp
+++ b/flang/lib/Frontend/FrontendActions.cpp
@@ -633,6 +633,8 @@ void CodeGenAction::lowerHLFIRToFIR() {
   MLIRToLLVMPassPipelineConfig config(level);
   config.fpMaxminBehavior =
       ci.getInvocation().getLoweringOpts().getFPMaxminBehavior();
+  if (ci.getInvocation().getLangOpts().OpenMPIsTargetDevice)
+    config.EnableOpenMPIsTargetDevice = true;
   // Create the pass pipeline
   fir::createHLFIRToFIRPassPipeline(pm, enableOpenMP, config);
   (void)mlir::applyPassManagerCLOptions(pm);
@@ -763,6 +765,9 @@ void CodeGenAction::generateLLVMIR() {
           Fortran::common::LanguageFeature::OpenMP))
     config.EnableOpenMP = true;
 
+  if (ci.getInvocation().getLangOpts().OpenMPIsTargetDevice)
+    config.EnableOpenMPIsTargetDevice = true;
+
   if (ci.getInvocation().getLangOpts().OpenMPSimd)
     config.EnableOpenMPSimd = true;
 
diff --git a/flang/lib/Optimizer/Passes/Pipelines.cpp b/flang/lib/Optimizer/Passes/Pipelines.cpp
index 682e3e48e0a22..8e8521391885e 100644
--- a/flang/lib/Optimizer/Passes/Pipelines.cpp
+++ b/flang/lib/Optimizer/Passes/Pipelines.cpp
@@ -313,10 +313,14 @@ void createHLFIRToFIRPassPipeline(mlir::PassManager &pm,
       addNestedPassToAllTopLevelOperations<PassConstructor>(
           pm, hlfir::createInlineHLFIRCopyIn);
     }
-  } else {
-    // At O0, only inline scalar-to-array broadcasts. This avoids emitting
-    // Fortran runtime calls (e.g. _FortranAAssign) that use malloc/free in
-    // device code generated by OpenMP target offloading.
+  } else if (config.EnableOpenMPIsTargetDevice) {
+    // At O0, only inline scalar-to-array broadcasts when compiling for an
+    // OpenMP target device. This avoids emitting Fortran runtime calls
+    // (e.g. _FortranAAssign) that use malloc/free in device code generated
+    // by OpenMP target offloading. Restricting this to target-device
+    // compilation preserves the runtime call on the host at -O0 so that a
+    // line breakpoint on a scalar-to-array assignment hits once instead of
+    // once per element.
     addNestedPassToAllTopLevelOperations(pm, [&]() {
       return hlfir::createInlineHLFIRAssign({/*onlyScalarRHS=*/true});
     });
diff --git a/flang/test/Driver/mlir-debug-pass-pipeline.f90 b/flang/test/Driver/mlir-debug-pass-pipeline.f90
index c5e63fdbd9d2b..d5126012b6957 100644
--- a/flang/test/Driver/mlir-debug-pass-pipeline.f90
+++ b/flang/test/Driver/mlir-debug-pass-pipeline.f90
@@ -32,23 +32,18 @@
 ! ALL-NEXT: 'fir.global' Pipeline
 ! ALL-NEXT:   InlineElementals
 ! ALL-NEXT:   SeparateAllocatableAssign
-! ALL-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT: 'func.func' Pipeline
 ! ALL-NEXT:   InlineElementals
 ! ALL-NEXT:   SeparateAllocatableAssign
-! ALL-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT: 'omp.declare_mapper' Pipeline
 ! ALL-NEXT:   InlineElementals
 ! ALL-NEXT:   SeparateAllocatableAssign
-! ALL-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT: 'omp.declare_reduction' Pipeline
 ! ALL-NEXT:   InlineElementals
 ! ALL-NEXT:   SeparateAllocatableAssign
-! ALL-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT: 'omp.private' Pipeline
 ! ALL-NEXT:   InlineElementals
 ! ALL-NEXT:   SeparateAllocatableAssign
-! ALL-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT: LowerHLFIROrderedAssignments
 ! ALL-NEXT: LowerHLFIRIntrinsics
 ! ALL-NEXT: BufferizeHLFIR
diff --git a/flang/test/Driver/mlir-pass-pipeline.f90 b/flang/test/Driver/mlir-pass-pipeline.f90
index a7ea0a9de4867..b679564adff10 100644
--- a/flang/test/Driver/mlir-pass-pipeline.f90
+++ b/flang/test/Driver/mlir-pass-pipeline.f90
@@ -9,6 +9,11 @@
 
 end program
 
+! At -O0 on the host (no OpenMP target-device compilation), InlineHLFIRAssign
+! is no longer scheduled. See PR #197092 follow-up restricting the -O0 pass
+! to OpenMP target-device compilation.
+! O0-NOT: InlineHLFIRAssign
+
 ! ALL: Pass statistics report
 ! ALL: Fortran::lower::VerifierPass
 
@@ -32,27 +37,22 @@
 ! O2-NEXT:   SimplifyHLFIRIntrinsics
 ! ALL:       InlineElementals
 ! ALL-NEXT:  SeparateAllocatableAssign
-! O0-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT:'func.func' Pipeline
 ! O2-NEXT:   SimplifyHLFIRIntrinsics
 ! ALL:       InlineElementals
 ! ALL-NEXT:  SeparateAllocatableAssign
-! O0-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT:'omp.declare_mapper' Pipeline
 ! O2-NEXT:   SimplifyHLFIRIntrinsics
 ! ALL:       InlineElementals
 ! ALL-NEXT:  SeparateAllocatableAssign
-! O0-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT:'omp.declare_reduction' Pipeline
 ! O2-NEXT:   SimplifyHLFIRIntrinsics
 ! ALL:       InlineElementals
 ! ALL-NEXT:  SeparateAllocatableAssign
-! O0-NEXT:   InlineHLFIRAssign
 ! ALL-NEXT:'omp.private' Pipeline
 ! O2-NEXT:   SimplifyHLFIRIntrinsics
 ! ALL:       InlineElementals
 ! ALL-NEXT:  SeparateAllocatableAssign
-! O0-NEXT:   InlineHLFIRAssign
 ! O2-NEXT: Canonicalizer
 ! O2-NEXT: CSE
 ! O2-NEXT: (S) {{.*}} num-cse'd
diff --git a/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90 b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
index c4688a6e8a192..c6a46691d58f5 100644
--- a/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
+++ b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
@@ -50,7 +50,7 @@ subroutine worst_case(a, b, c, d)
 ! CHECK:         br i1 %{{.*}}, label %omp.private.init3, label %omp.private.init4
 
 ! CHECK:       omp.private.init4:                               ; preds = %omp.private.init2
-!                [finish private alloc for first var with zero extent]
+!                [finish private alloc for second var with zero extent]
 ! CHECK:         br label %omp.private.init5
 
 ! CHECK:       omp.private.init5:                               ; preds = %omp.private.init3, %omp.private.init4
@@ -61,13 +61,13 @@ subroutine worst_case(a, b, c, d)
 ! CHECK-NEXT:    br label %omp.private.init7
 
 ! CHECK:       omp.private.init7:
-!                [begin private alloc for second var]
+!                [begin private alloc for first var]
 !                [read the length from the mold argument]
 !                [if it is non-zero...]
 ! CHECK:         br i1 {{.*}}, label %omp.private.init8, label %omp.private.init9
 
 ! CHECK:       omp.private.init9:                               ; preds = %omp.private.init7
-!                [finish private alloc for second var with zero extent]
+!                [finish private alloc for first var with zero extent]
 ! CHECK:         br label %omp.private.init10
 
 ! CHECK:       omp.private.init10:                               ; preds = %omp.private.init8, %omp.private.init9
@@ -105,64 +105,50 @@ subroutine worst_case(a, b, c, d)
 ! CHECK-NEXT:    br label %omp.reduction.init
 
 ! CHECK:       omp.reduction.init:                               ; preds = %omp.region.cont15
-!                [deferred stores for results of reduction alloc regions]
+!                [deffered stores for results of reduction alloc regions]
 ! CHECK:         br label %[[VAL_96:.*]]
 
 ! CHECK:       omp.reduction.neutral:                            ; preds = %omp.reduction.init
-!                [start of reduction initialization region for first var]
+!                [start of reduction initialization region]
 !                [null check:]
 ! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral20, label %omp.reduction.neutral21
 
 ! CHECK:       omp.reduction.neutral21:                          ; preds = %omp.reduction.neutral
-!                [malloc the reduction variable]
+!                [malloc and assign the default value to the reduction variable]
 ! CHECK:         br label %omp.reduction.neutral22
 
-! CHECK:       omp.reduction.neutral22:                          ; preds = %omp.reduction.neutral23, %omp.reduction.neutral21
-!                [inlined scalar-to-array init loop header]
-! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral23, label %omp.reduction.neutral24
-
-! CHECK:       omp.reduction.neutral24:                          ; preds = %omp.reduction.neutral22
-! CHECK:         br label %omp.reduction.neutral25
-
-! CHECK:       omp.reduction.neutral25:                          ; preds = %omp.reduction.neutral20, %omp.reduction.neutral24
+! CHECK:       omp.reduction.neutral22:                          ; preds = %omp.reduction.neutral20, %omp.reduction.neutral21
 ! CHECK-NEXT:    br label %omp.region.cont19
 
-! CHECK:       omp.region.cont19:                                ; preds = %omp.reduction.neutral25
+! CHECK:       omp.region.cont19:                                ; preds = %omp.reduction.neutral22
 ! CHECK-NEXT:    %{{.*}} = phi ptr
-! CHECK-NEXT:    br label %omp.reduction.neutral27
+! CHECK-NEXT:    br label %omp.reduction.neutral24
 
-! CHECK:       omp.reduction.neutral27:                          ; preds = %omp.region.cont19
-!                [start of reduction initialization region for second var]
+! CHECK:       omp.reduction.neutral24:                          ; preds = %omp.region.cont19
+!                [start of reduction initialization region]
 !                [null check:]
-! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral28, label %omp.reduction.neutral29
-
-! CHECK:       omp.reduction.neutral29:                          ; preds = %omp.reduction.neutral27
-!                [malloc the reduction variable]
-! CHECK:         br label %omp.reduction.neutral30
-
-! CHECK:       omp.reduction.neutral30:                          ; preds = %omp.reduction.neutral31, %omp.reduction.neutral29
-!                [inlined scalar-to-array init loop header]
-! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral31, label %omp.reduction.neutral32
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral25, label %omp.reduction.neutral26
 
-! CHECK:       omp.reduction.neutral32:                          ; preds = %omp.reduction.neutral30
-! CHECK:         br label %omp.reduction.neutral33
+! CHECK:       omp.reduction.neutral26:                          ; preds = %omp.reduction.neutral24
+!                [malloc and assign the default value to the reduction variable]
+! CHECK:         br label %omp.reduction.neutral27
 
-! CHECK:       omp.reduction.neutral33:                          ; preds = %omp.reduction.neutral28, %omp.reduction.neutral32
-! CHECK-NEXT:    br label %omp.region.cont26
+! CHECK:       omp.reduction.neutral27:                          ; preds = %omp.reduction.neutral25, %omp.reduction.neutral26
+! CHECK-NEXT:    br label %omp.region.cont23
 
-! CHECK:       omp.region.cont26:                                ; preds = %omp.reduction.neutral33
+! CHECK:       omp.region.cont23:                                ; preds = %omp.reduction.neutral27
 ! CHECK-NEXT:    %{{.*}} = phi ptr
-! CHECK-NEXT:    br label %omp.par.region35
+! CHECK-NEXT:    br label %omp.par.region29
 
-! CHECK:       omp.par.region35:                                 ; preds = %omp.region.cont26
+! CHECK:       omp.par.region29:                                 ; preds = %omp.region.cont23
 !                [call SUM runtime function]
 !                [if (sum(a) == 1)]
-! CHECK:         br i1 %{{.*}}, label %omp.par.region36, label %omp.par.region37
+! CHECK:         br i1 %{{.*}}, label %omp.par.region30, label %omp.par.region31
 
-! CHECK:       omp.par.region37:                                 ; preds = %omp.par.region35
-! CHECK-NEXT:    br label %omp.region.cont34
+! CHECK:       omp.par.region31:                                 ; preds = %omp.par.region29
+! CHECK-NEXT:    br label %omp.region.cont28
 
-! CHECK:       omp.region.cont34:                                ; preds = %omp.par.region36, %omp.par.region37
+! CHECK:       omp.region.cont28:                                ; preds = %omp.par.region30, %omp.par.region31
 !                [omp parallel region done, call into the runtime to complete reduction]
 ! CHECK:         %[[VAL_233:.*]] = call i32 @__kmpc_reduce(
 ! CHECK:         switch i32 %[[VAL_233]], label %reduce.finalize [
@@ -170,16 +156,16 @@ subroutine worst_case(a, b, c, d)
 ! CHECK-NEXT:      i32 2, label %reduce.switch.atomic
 ! CHECK-NEXT:    ]
 
-! CHECK:       reduce.switch.atomic:                             ; preds = %omp.region.cont34
+! CHECK:       reduce.switch.atomic:                             ; preds = %omp.region.cont28
 ! CHECK-NEXT:    unreachable
 
-! CHECK:       reduce.switch.nonatomic:                          ; preds = %omp.region.cont34
+! CHECK:       reduce.switch.nonatomic:                          ; preds = %omp.region.cont28
 ! CHECK-NEXT:    %[[red_private_value_0:.*]] = load ptr, ptr %{{.*}}, align 8
 ! CHECK-NEXT:    br label %omp.reduction.nonatomic.body
 
 !              [various blocks implementing the reduction]
 
-! CHECK:       omp.region.cont42:                                ; preds =
+! CHECK:       omp.region.cont36:                                ; preds =
 ! CHECK-NEXT:    %{{.*}} = phi ptr
 ! CHECK-NEXT:    call void @__kmpc_end_reduce(
 ! CHECK-NEXT:    br label %reduce.finalize
@@ -196,59 +182,29 @@ subroutine worst_case(a, b, c, d)
 
 ! CHECK:       omp.reduction.cleanup:                            ; preds = %.fini
 !                [null check]
-! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup48, label %omp.reduction.cleanup49
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup42, label %omp.reduction.cleanup43
 
-! CHECK:       omp.reduction.cleanup49:                          ; preds = %omp.reduction.cleanup48, %omp.reduction.cleanup
-! CHECK-NEXT:    br label %omp.region.cont47
+! CHECK:       omp.reduction.cleanup43:                          ; preds = %omp.reduction.cleanup42, %omp.reduction.cleanup
+! CHECK-NEXT:    br label %omp.region.cont41
 
-! CHECK:       omp.region.cont47:                                ; preds = %omp.reduction.cleanup49
-! CHECK:         br label %omp.reduction.cleanup51
+! CHECK:       omp.region.cont41:                                ; preds = %omp.reduction.cleanup43
+! CHECK-NEXT:    %{{.*}} = load ptr, ptr
+! CHECK-NEXT:    br label %omp.reduction.cleanup45
 
-! CHECK:       omp.reduction.cleanup51:                          ; preds = %omp.region.cont47
+! CHECK:       omp.reduction.cleanup45:                          ; preds = %omp.region.cont41
 !                [null check]
-! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup52, label %omp.reduction.cleanup53
-
-! CHECK:       omp.reduction.cleanup53:                          ; preds = %omp.reduction.cleanup52, %omp.reduction.cleanup51
-! CHECK-NEXT:    br label %omp.region.cont50
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup46, label %omp.reduction.cleanup47
 
-! CHECK:       omp.region.cont50:                                ; preds = %omp.reduction.cleanup53
-! CHECK-NEXT:    br label %omp.private.dealloc
-
-! CHECK:       omp.private.dealloc:                              ; preds = %omp.region.cont50
-!                [null check for first private var dealloc]
-! CHECK:         br i1 %{{.*}}, label %omp.private.dealloc55, label %omp.private.dealloc56
-
-! CHECK:       omp.private.dealloc56:                            ; preds = %omp.private.dealloc55, %omp.private.dealloc
-! CHECK-NEXT:    br label %omp.region.cont54
-
-! CHECK:       omp.region.cont54:                                ; preds = %omp.private.dealloc56
-! CHECK-NEXT:    br label %omp.private.dealloc58
-
-! CHECK:       omp.private.dealloc58:                            ; preds = %omp.region.cont54
-!                [null check for second private var dealloc]
-! CHECK:         br i1 %{{.*}}, label %omp.private.dealloc59, label %omp.private.dealloc60
-
-! CHECK:       omp.private.dealloc60:                            ; preds = %omp.private.dealloc59, %omp.private.dealloc58
-! CHECK-NEXT:    br label %omp.region.cont57
-
-! CHECK:       omp.par.region36:                                 ; preds = %omp.par.region35
+! CHECK:       omp.par.region30:                                 ; preds = %omp.par.region29
 ! CHECK-NEXT:    call void @_FortranAStopStatement
 
-! CHECK:       omp.reduction.neutral31:                          ; preds = %omp.reduction.neutral30
-!                [inlined init loop body for second var]
-! CHECK:         br label %omp.reduction.neutral30
-
-! CHECK:       omp.reduction.neutral28:                          ; preds = %omp.reduction.neutral27
-!                [source length was zero: finish initializing second var]
-! CHECK:         br label %omp.reduction.neutral33
-
-! CHECK:       omp.reduction.neutral23:                          ; preds = %omp.reduction.neutral22
-!                [inlined init loop body for first var]
-! CHECK:         br label %omp.reduction.neutral22
+! CHECK:       omp.reduction.neutral25:                          ; preds = %omp.reduction.neutral24
+!                [source length was zero: finish initializing array]
+! CHECK:         br label %omp.reduction.neutral27
 
 ! CHECK:       omp.reduction.neutral20:                          ; preds = %omp.reduction.neutral
-!                [source length was zero: finish initializing first var]
-! CHECK:         br label %omp.reduction.neutral25
+!                [source length was zero: finish initializing array]
+! CHECK:         br label %omp.reduction.neutral22
 
 ! CHECK:       omp.private.copy17:                               ; preds = %omp.private.copy16
 !                [source length was non-zero: call assign runtime]
@@ -266,5 +222,5 @@ subroutine worst_case(a, b, c, d)
 !                [var extent was non-zero: malloc a private array]
 ! CHECK:         br label %omp.private.init5
 
-! CHECK:       omp.par.exit.exitStub:                           ; preds = %omp.region.cont57
+! CHECK:       omp.par.exit.exitStub:                           ; preds = %omp.region.cont51
 ! CHECK-NEXT:    ret void
diff --git a/flang/test/Integration/OpenMP/private-global.f90 b/flang/test/Integration/OpenMP/private-global.f90
index 4b27e6ddc79a4..ed11a95c4aeb1 100644
--- a/flang/test/Integration/OpenMP/private-global.f90
+++ b/flang/test/Integration/OpenMP/private-global.f90
@@ -17,21 +17,34 @@ program bug
 
 ! CHECK-LABEL: define internal void {{.*}}..omp_par(
 ! CHECK:       omp.par.entry:
+! CHECK:         %[[VAL_9:.*]] = alloca i32, align 4
+! CHECK:         %[[VAL_10:.*]] = load i32, ptr %[[VAL_11:.*]], align 4
+! CHECK:         store i32 %[[VAL_10]], ptr %[[VAL_9]], align 4
+! CHECK:         %[[VAL_12:.*]] = load i32, ptr %[[VAL_9]], align 4
 ! CHECK:         %[[PRIV_BOX_ALLOC:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK:         %[[ELEMENTAL_TMP:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK:         %[[ELEMENTAL_TMP_2:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK:         %[[TABLE_BOX_ADDR:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK:         %[[BOXED_FIFTY:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8 }, align 8
+! CHECK:         %[[FIFTY:.*]] = alloca i32, i64 1, align 4
+! CHECK:         %[[INTERMEDIATE:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK:         %[[TABLE_BOX_ADDR2:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, i64 1, align 8
 ! ...
-! check that the private copy is allocated via malloc
-! CHECK:       omp.private.init:
-! CHECK:         %[[PRIV_TABLE:.*]] = call ptr @malloc(i64 40)
-! ...
-! check that we use the private copy of table for the assignment (table = 50)
-! The assignment is now inlined as a loop instead of calling _FortranAAssign.
+! check that we use the private copy of table for the assignment
 ! CHECK:       omp.par.region1:
-! CHECK:         call void @llvm.memcpy.p0.p0.i32(ptr{{.*}}%[[BOX_COPY:.*]], ptr{{.*}}%[[PRIV_BOX_ALLOC]], i32 48, i1 false)
-! ...
-! check that we use the private copy of table for table/=50 (inlined loop body)
-! CHECK:       omp.par.region6:
-! CHECK:         %[[VAL_44:.*]] = sub {{.*}} i64 %{{.*}}, 1
+! CHECK:         call void @llvm.memcpy.p0.p0.i32(ptr{{.*}}%[[INTERMEDIATE]], ptr{{.*}}%[[PRIV_BOX_ALLOC]], i32 {{4[48]}}, i1 false)
+! CHECK:         store i32 50, ptr %[[FIFTY]], align 4
+! CHECK:         %[[FIFTY_BOX_VAL:.*]] = insertvalue { ptr, i64, i32, i8, i8, i8, i8 } { ptr undef, i64 4, i32 20240719, i8 0, i8 9, i8 0, i8 0 }, ptr %[[FIFTY]], 0
+! CHECK:         store { ptr, i64, i32, i8, i8, i8, i8 } %[[FIFTY_BOX_VAL]], ptr %[[BOXED_FIFTY]], align {{[48]}}
+! CHECK:         call void @llvm.memcpy.p0.p0.i32(ptr %[[TABLE_BOX_ADDR2]], ptr %[[INTERMEDIATE]], i32 {{4[48]}}, i1 false)
+! CHECK:         call void @_FortranAAssign(ptr %[[TABLE_BOX_ADDR2]], ptr %[[BOXED_FIFTY]], ptr @{{.*}}, i32 9)
+! CHECK:         call void @llvm.memcpy.p0.p0.i32(ptr{{.*}}%[[TABLE_BOX_ADDR]], ptr{{.*}}%[[PRIV_BOX_ALLOC]], i32 {{4[48]}}, i1 false)
+! CHECK:         %[[PRIV_TABLE:.*]] = call ptr @malloc(i{{(32)|(64)}} 40)
 ! ...
-! check that we store 50 into the private table's elements (inlined loop body)
+! check that we use the private copy of table for table/=50
 ! CHECK:       omp.par.region3:
-! CHECK:         store i32 50, ptr %{{.*}}, align 4
+! CHECK:         %[[VAL_44:.*]] = sub nuw nsw i64 %{{.*}}, 1
+! CHECK:         %[[VAL_45:.*]] = mul nuw nsw i64 %[[VAL_44]], 1
+! CHECK:         %[[VAL_46:.*]] = mul nuw nsw i64 %[[VAL_45]], 1
+! CHECK:         %[[VAL_47:.*]] = add nuw nsw i64 %[[VAL_46]], 0
+! CHECK:         %[[VAL_48:.*]] = getelementptr nusw nuw i32, ptr %[[PRIV_TABLE]], i64 %[[VAL_47]]
diff --git a/flang/test/Integration/prefetch.f90 b/flang/test/Integration/prefetch.f90
index 76227caf02b43..c015b6736972a 100644
--- a/flang/test/Integration/prefetch.f90
+++ b/flang/test/Integration/prefetch.f90
@@ -13,6 +13,7 @@
 !===============================================================================
 
 subroutine test_prefetch_01()
+    ! LLVM: {{.*}} = alloca i32, i64 1, align 4
     ! LLVM: %[[VAR_J:.*]] = alloca i32, i64 1, align 4
     ! LLVM: %[[VAR_I:.*]] = alloca i32, i64 1, align 4
     ! LLVM: %[[VAR_A:.*]] = alloca [256 x i32], i64 1, align 4
diff --git a/flang/test/Lower/HLFIR/scalar-to-array-assign-host-O0.f90 b/flang/test/Lower/HLFIR/scalar-to-array-assign-host-O0.f90
new file mode 100644
index 0000000000000..88d4344da6f2b
--- /dev/null
+++ b/flang/test/Lower/HLFIR/scalar-to-array-assign-host-O0.f90
@@ -0,0 +1,17 @@
+! Regression test for the follow-up to PR llvm/llvm-project#197092.
+!
+! At -O0 on the host (no OpenMP target-device compilation), a scalar-to-array
+! broadcast assignment must lower to a Fortran runtime call
+! (_FortranAAssign), not to an inline assignment loop. Lowering it inline
+! at -O0 caused -g line breakpoints to hit once per array element instead
+! of once.
+
+! RUN: %flang_fc1 -emit-fir -O0 %s -o - | FileCheck %s
+
+! CHECK-LABEL: func @_QPhost_scalar_broadcast
+subroutine host_scalar_broadcast(arr)
+  integer :: arr(4)
+  ! CHECK: fir.call @_FortranAAssign
+  ! CHECK-NOT: fir.do_loop
+  arr = 11
+end subroutine
diff --git a/flang/test/Lower/OpenMP/scalar-to-array-assign-target-device-O0.f90 b/flang/test/Lower/OpenMP/scalar-to-array-assign-target-device-O0.f90
new file mode 100644
index 0000000000000..db019a6a15ab1
--- /dev/null
+++ b/flang/test/Lower/OpenMP/scalar-to-array-assign-target-device-O0.f90
@@ -0,0 +1,18 @@
+! Regression test for PR llvm/llvm-project#197092 and its follow-up.
+!
+! When compiling for an OpenMP target device at -O0, a scalar-to-array
+! broadcast assignment inside a target region must still be inlined to
+! avoid emitting a _FortranAAssign runtime call (which internally uses
+! malloc/free) into GPU device code.
+
+! RUN: %flang_fc1 -emit-fir -O0 -fopenmp -fopenmp-is-target-device %s -o - \
+! RUN:   | FileCheck %s --implicit-check-not="fir.call @_FortranAAssign"
+
+subroutine device_scalar_broadcast()
+  integer :: arr(4)
+  !$omp target map(tofrom: arr)
+  ! CHECK: omp.target
+  ! CHECK: fir.do_loop
+  arr = 11
+  !$omp end target
+end subroutine
diff --git a/flang/test/Lower/OpenMP/workdistribute-saxpy-and-scalar-assign.f90 b/flang/test/Lower/OpenMP/workdistribute-saxpy-and-scalar-assign.f90
index cbb4dfc3cdadc..3824847a7bcda 100644
--- a/flang/test/Lower/OpenMP/workdistribute-saxpy-and-scalar-assign.f90
+++ b/flang/test/Lower/OpenMP/workdistribute-saxpy-and-scalar-assign.f90
@@ -1,4 +1,4 @@
-! RUN: %flang_fc1 -emit-fir -fopenmp -fopenmp-version=60 %s -o - | FileCheck %s
+! RUN: %flang_fc1 -emit-fir -O1 -fopenmp -fopenmp-version=60 %s -o - | FileCheck %s
 
 ! CHECK-LABEL: func @_QPtarget_teams_workdistribute
 subroutine target_teams_workdistribute()
diff --git a/flang/test/Lower/OpenMP/workdistribute-scalar-assign.f90 b/flang/test/Lower/OpenMP/workdistribute-scalar-assign.f90
index 217df8fb05176..3d7ef7abf6816 100644
--- a/flang/test/Lower/OpenMP/workdistribute-scalar-assign.f90
+++ b/flang/test/Lower/OpenMP/workdistribute-scalar-assign.f90
@@ -1,4 +1,4 @@
-! RUN: %flang_fc1 -emit-fir -fopenmp -fopenmp-version=60 %s -o - | FileCheck %s --implicit-check-not="fir.call @_FortranAAssign"
+! RUN: %flang_fc1 -emit-fir -O1 -fopenmp -fopenmp-version=60 %s -o - | FileCheck %s --implicit-check-not="fir.call @_FortranAAssign"
 
 ! CHECK-LABEL: func @_QPtarget_teams_workdistribute_scalar_assign
 subroutine target_teams_workdistribute_scalar_assign()
diff --git a/flang/tools/bbc/bbc.cpp b/flang/tools/bbc/bbc.cpp
index 30b4a99c8f2d5..23e7af238198f 100644
--- a/flang/tools/bbc/bbc.cpp
+++ b/flang/tools/bbc/bbc.cpp
@@ -576,6 +576,8 @@ static llvm::LogicalResult convertFortranSourceToMLIR(
     config.SkipConvertComplexPow = targetMachine.getTargetTriple().isAMDGCN();
     if (enableOpenMP)
       config.EnableOpenMP = true;
+    if (enableOpenMPDevice)
+      config.EnableOpenMPIsTargetDevice = true;
     config.NSWOnLoopVarInc = !integerWrapAround;
     fir::registerDefaultInlinerPass(config);
     fir::createDefaultFIROptimizerPassPipeline(pm, config);