[Mlir-commits] [clang] [llvm] [mlir] [Flang][OpenMP] Enable no-loop kernels (PR #155818)
Dominik Adamski
llvmlistbot at llvm.org
Fri Sep 26 03:30:22 PDT 2025
https://github.com/DominikAdamski updated https://github.com/llvm/llvm-project/pull/155818
>From efdd979d6189b986ea09dec7c0c0da26725ab1a1 Mon Sep 17 00:00:00 2001
From: Dominik Adamski <dominik.adamski at amd.com>
Date: Wed, 27 Aug 2025 10:24:51 -0500
Subject: [PATCH 1/8] [Flang][OpenMP] Enable no-loop kernels
Enable the generation of no-loop kernels for Fortran OpenMP code.
"target teams distribute parallel do" directives can be promoted to no-loop
kernels if the user passes both the -fopenmp-assume-teams-oversubscription
and -fopenmp-assume-threads-oversubscription flags.
If the OpenMP kernel contains a reduction or num_teams clause,
it is not promoted to no-loop mode.
The global OpenMP device RTL oversubscription flags
no longer force no-loop code generation for Fortran.
---
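For readers unfamiliar with the term, here is a minimal host-side C++ sketch of what "no-loop" means in this context. It is not the DeviceRTL implementation: `Launch`, `runStrided`, `runNoLoop`, and `allOnce` are hypothetical names introduced only for illustration, and the flat thread index `team * numThreads + thread` is an assumption. The sketch contrasts a strided mapping, where a thread may execute several iterations, with a no-loop mapping, where each thread executes at most one.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a kernel launch (illustration only).
struct Launch {
  std::uint32_t numTeams;
  std::uint32_t numThreads; // threads per team
};

// Strided mapping: every thread keeps looping until the iteration space
// is exhausted, so any launch shape covers all iterations.
std::vector<int> runStrided(Launch l, std::uint32_t numIters) {
  const std::uint32_t total = l.numTeams * l.numThreads;
  assert(total > 0 && "launch needs at least one thread");
  std::vector<int> hits(numIters, 0);
  for (std::uint32_t team = 0; team < l.numTeams; ++team)
    for (std::uint32_t thread = 0; thread < l.numThreads; ++thread)
      for (std::uint32_t iv = team * l.numThreads + thread; iv < numIters;
           iv += total)
        ++hits[iv];
  return hits;
}

// No-loop mapping: each thread executes at most one iteration. This only
// covers the iteration space when numTeams * numThreads >= numIters.
std::vector<int> runNoLoop(Launch l, std::uint32_t numIters) {
  std::vector<int> hits(numIters, 0);
  for (std::uint32_t team = 0; team < l.numTeams; ++team)
    for (std::uint32_t thread = 0; thread < l.numThreads; ++thread) {
      const std::uint32_t iv = team * l.numThreads + thread;
      if (iv < numIters)
        ++hits[iv];
    }
  return hits;
}

// True if every iteration was executed exactly once.
bool allOnce(const std::vector<int> &hits) {
  for (int h : hits)
    if (h != 1)
      return false;
  return true;
}
```

With 64 teams of 16 threads and 1024 iterations, `allOnce(runNoLoop({64, 16}, 1024))` holds. With only 3 teams of 16 threads the no-loop mapping leaves iterations unexecuted, while `runStrided` still covers them all. This is the intuition behind refusing promotion when a num_teams clause is present.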
clang/include/clang/Driver/Options.td | 21 +++++++--
.../llvm/Frontend/OpenMP/OMPIRBuilder.h | 8 +++-
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 21 +++++----
.../mlir/Dialect/OpenMP/OpenMPEnums.td | 4 +-
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | 41 +++++++++++++++---
.../OpenMP/OpenMPToLLVMIRTranslation.cpp | 22 +++++++++-
offload/DeviceRTL/src/Workshare.cpp | 13 ------
.../offloading/fortran/target-no-loop.f90 | 43 +++++++++++++++++++
8 files changed, 136 insertions(+), 37 deletions(-)
create mode 100644 offload/test/offloading/fortran/target-no-loop.f90
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index b1ae3cf6525b8..4c078a45628a9 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -3824,10 +3824,23 @@ let Visibility = [ClangOption, CC1Option, FC1Option, FlangOption] in {
let Group = f_Group in {
def fopenmp_target_debug_EQ : Joined<["-"], "fopenmp-target-debug=">;
-def fopenmp_assume_teams_oversubscription : Flag<["-"], "fopenmp-assume-teams-oversubscription">;
-def fopenmp_assume_threads_oversubscription : Flag<["-"], "fopenmp-assume-threads-oversubscription">;
-def fno_openmp_assume_teams_oversubscription : Flag<["-"], "fno-openmp-assume-teams-oversubscription">;
-def fno_openmp_assume_threads_oversubscription : Flag<["-"], "fno-openmp-assume-threads-oversubscription">;
+def fopenmp_assume_teams_oversubscription : Flag<["-"], "fopenmp-assume-teams-oversubscription">,
+ HelpText<"Allow enforcement to ensure there are enough teams to cover the "
+ "loop iteration space. It may ignore environment variables. "
+ "If the fopenmp-assume-teams-oversubscription and "
+ "fopenmp-assume-threads-oversubscription flags are set, Flang may "
+ "generate more optimized OpenMP kernels for target teams distribute "
+ "parallel do pragmas.">;
+def fopenmp_assume_threads_oversubscription : Flag<["-"], "fopenmp-assume-threads-oversubscription">,
+ HelpText<"Assume threads oversubscription. If the "
+ "fopenmp-assume-teams-oversubscription and "
+ "fopenmp-assume-threads-oversubscription flags are set, Flang may "
+ "generate more optimized OpenMP kernels for target teams distribute "
+ "parallel do pragmas.">;
+def fno_openmp_assume_teams_oversubscription : Flag<["-"], "fno-openmp-assume-teams-oversubscription">,
+ HelpText<"Do not assume teams oversubscription.">;
+def fno_openmp_assume_threads_oversubscription : Flag<["-"], "fno-openmp-assume-threads-oversubscription">,
+ HelpText<"Do not assume threads oversubscription.">;
def fopenmp_assume_no_thread_state : Flag<["-"], "fopenmp-assume-no-thread-state">,
HelpText<"Assert no thread in a parallel region modifies an ICV">,
MarshallingInfoFlag<LangOpts<"OpenMPNoThreadState">>;
diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index 1050e3d8b08dd..49078e4162ebc 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -1075,11 +1075,13 @@ class OpenMPIRBuilder {
/// preheader of the loop.
/// \param LoopType Information about type of loop worksharing.
/// It corresponds to type of loop workshare OpenMP pragma.
+ /// \param NoLoop If true, no-loop code is generated.
///
/// \returns Point where to insert code after the workshare construct.
InsertPointTy applyWorkshareLoopTarget(DebugLoc DL, CanonicalLoopInfo *CLI,
InsertPointTy AllocaIP,
- omp::WorksharingLoopType LoopType);
+ omp::WorksharingLoopType LoopType,
+ bool NoLoop);
/// Modifies the canonical loop to be a statically-scheduled workshare loop.
///
@@ -1199,6 +1201,7 @@ class OpenMPIRBuilder {
/// present.
/// \param LoopType Information about type of loop worksharing.
/// It corresponds to type of loop workshare OpenMP pragma.
+ /// \param NoLoop If true, no-loop code is generated.
///
/// \returns Point where to insert code after the workshare construct.
LLVM_ABI InsertPointOrErrorTy applyWorkshareLoop(
@@ -1209,7 +1212,8 @@ class OpenMPIRBuilder {
bool HasMonotonicModifier = false, bool HasNonmonotonicModifier = false,
bool HasOrderedClause = false,
omp::WorksharingLoopType LoopType =
- omp::WorksharingLoopType::ForStaticLoop);
+ omp::WorksharingLoopType::ForStaticLoop,
+ bool NoLoop = false);
/// Tile a loop nest.
///
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index c0e956840f989..bec5abb45041f 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -4955,7 +4955,7 @@ static void createTargetLoopWorkshareCall(OpenMPIRBuilder *OMPBuilder,
WorksharingLoopType LoopType,
BasicBlock *InsertBlock, Value *Ident,
Value *LoopBodyArg, Value *TripCount,
- Function &LoopBodyFn) {
+ Function &LoopBodyFn, bool NoLoop) {
Type *TripCountTy = TripCount->getType();
Module &M = OMPBuilder->M;
IRBuilder<> &Builder = OMPBuilder->Builder;
@@ -4984,7 +4984,7 @@ static void createTargetLoopWorkshareCall(OpenMPIRBuilder *OMPBuilder,
if (LoopType == WorksharingLoopType::DistributeForStaticLoop) {
RealArgs.push_back(ConstantInt::get(TripCountTy, 0));
}
- RealArgs.push_back(ConstantInt::get(Builder.getInt8Ty(), 0));
+ RealArgs.push_back(ConstantInt::get(Builder.getInt8Ty(), NoLoop));
Builder.CreateCall(RTLFn, RealArgs);
}
@@ -4992,7 +4992,7 @@ static void createTargetLoopWorkshareCall(OpenMPIRBuilder *OMPBuilder,
static void workshareLoopTargetCallback(
OpenMPIRBuilder *OMPIRBuilder, CanonicalLoopInfo *CLI, Value *Ident,
Function &OutlinedFn, const SmallVector<Instruction *, 4> &ToBeDeleted,
- WorksharingLoopType LoopType) {
+ WorksharingLoopType LoopType, bool NoLoop) {
IRBuilder<> &Builder = OMPIRBuilder->Builder;
BasicBlock *Preheader = CLI->getPreheader();
Value *TripCount = CLI->getTripCount();
@@ -5039,17 +5039,16 @@ static void workshareLoopTargetCallback(
OutlinedFnCallInstruction->eraseFromParent();
createTargetLoopWorkshareCall(OMPIRBuilder, LoopType, Preheader, Ident,
- LoopBodyArg, TripCount, OutlinedFn);
+ LoopBodyArg, TripCount, OutlinedFn, NoLoop);
for (auto &ToBeDeletedItem : ToBeDeleted)
ToBeDeletedItem->eraseFromParent();
CLI->invalidate();
}
-OpenMPIRBuilder::InsertPointTy
-OpenMPIRBuilder::applyWorkshareLoopTarget(DebugLoc DL, CanonicalLoopInfo *CLI,
- InsertPointTy AllocaIP,
- WorksharingLoopType LoopType) {
+OpenMPIRBuilder::InsertPointTy OpenMPIRBuilder::applyWorkshareLoopTarget(
+ DebugLoc DL, CanonicalLoopInfo *CLI, InsertPointTy AllocaIP,
+ WorksharingLoopType LoopType, bool NoLoop) {
uint32_t SrcLocStrSize;
Constant *SrcLocStr = getOrCreateSrcLocStr(DL, SrcLocStrSize);
Value *Ident = getOrCreateIdent(SrcLocStr, SrcLocStrSize);
@@ -5132,7 +5131,7 @@ OpenMPIRBuilder::applyWorkshareLoopTarget(DebugLoc DL, CanonicalLoopInfo *CLI,
OI.PostOutlineCB = [=, ToBeDeletedVec =
std::move(ToBeDeleted)](Function &OutlinedFn) {
workshareLoopTargetCallback(this, CLI, Ident, OutlinedFn, ToBeDeletedVec,
- LoopType);
+ LoopType, NoLoop);
};
addOutlineInfo(std::move(OI));
return CLI->getAfterIP();
@@ -5143,9 +5142,9 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::applyWorkshareLoop(
bool NeedsBarrier, omp::ScheduleKind SchedKind, Value *ChunkSize,
bool HasSimdModifier, bool HasMonotonicModifier,
bool HasNonmonotonicModifier, bool HasOrderedClause,
- WorksharingLoopType LoopType) {
+ WorksharingLoopType LoopType, bool NoLoop) {
if (Config.isTargetDevice())
- return applyWorkshareLoopTarget(DL, CLI, AllocaIP, LoopType);
+ return applyWorkshareLoopTarget(DL, CLI, AllocaIP, LoopType, NoLoop);
OMPScheduleType EffectiveScheduleType = computeOpenMPScheduleType(
SchedKind, ChunkSize, HasSimdModifier, HasMonotonicModifier,
HasNonmonotonicModifier, HasOrderedClause);
diff --git a/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td b/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
index c080c3fac87d4..e0cd06805ab40 100644
--- a/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
+++ b/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
@@ -230,6 +230,7 @@ def TargetRegionFlagsNone : I32BitEnumAttrCaseNone<"none">;
def TargetRegionFlagsGeneric : I32BitEnumAttrCaseBit<"generic", 0>;
def TargetRegionFlagsSpmd : I32BitEnumAttrCaseBit<"spmd", 1>;
def TargetRegionFlagsTripCount : I32BitEnumAttrCaseBit<"trip_count", 2>;
+def TargetRegionFlagsNoLoop : I32BitEnumAttrCaseBit<"no_loop", 3>;
def TargetRegionFlags : OpenMP_BitEnumAttr<
"TargetRegionFlags",
@@ -237,7 +238,8 @@ def TargetRegionFlags : OpenMP_BitEnumAttr<
TargetRegionFlagsNone,
TargetRegionFlagsGeneric,
TargetRegionFlagsSpmd,
- TargetRegionFlagsTripCount
+ TargetRegionFlagsTripCount,
+ TargetRegionFlagsNoLoop
]>;
//===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
index 6e43f28e8d93d..1e10dd114b30a 100644
--- a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
+++ b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
@@ -2106,6 +2106,29 @@ Operation *TargetOp::getInnermostCapturedOmpOp() {
});
}
+/// Check if we can promote SPMD kernel to No-Loop kernel
+static bool canPromoteToNoLoop(Operation *capturedOp, TeamsOp teamsOp,
+ WsloopOp *wsLoopOp) {
+ // num_teams clause can break no-loop teams/threads assumption
+ if (teamsOp.getNumTeamsUpper())
+ return false;
+ // reduction kernels are slower in no-loop mode
+ if (teamsOp.getNumReductionVars())
+ return false;
+ if (wsLoopOp->getNumReductionVars())
+ return false;
+ // check if the user allows the promotion of kernels to no-loop mode
+ OffloadModuleInterface offloadMod =
+ capturedOp->getParentOfType<omp::OffloadModuleInterface>();
+ if (!offloadMod)
+ return false;
+ auto ompFlags = offloadMod.getFlags();
+ if (!ompFlags)
+ return false;
+ return ompFlags.getAssumeTeamsOversubscription() &&
+ ompFlags.getAssumeThreadsOversubscription();
+}
+
TargetRegionFlags TargetOp::getKernelExecFlags(Operation *capturedOp) {
// A non-null captured op is only valid if it resides inside of a TargetOp
// and is the result of calling getInnermostCapturedOmpOp() on it.
@@ -2134,7 +2157,8 @@ TargetRegionFlags TargetOp::getKernelExecFlags(Operation *capturedOp) {
// Detect target-teams-distribute-parallel-wsloop[-simd].
if (numWrappers == 2) {
- if (!isa<WsloopOp>(innermostWrapper))
+ WsloopOp *wsloopOp = dyn_cast<WsloopOp>(innermostWrapper);
+ if (!wsloopOp)
return TargetRegionFlags::generic;
innermostWrapper = std::next(innermostWrapper);
@@ -2145,12 +2169,19 @@ TargetRegionFlags TargetOp::getKernelExecFlags(Operation *capturedOp) {
if (!isa_and_present<ParallelOp>(parallelOp))
return TargetRegionFlags::generic;
- Operation *teamsOp = parallelOp->getParentOp();
- if (!isa_and_present<TeamsOp>(teamsOp))
+ TeamsOp teamsOp = dyn_cast<TeamsOp>(parallelOp->getParentOp());
+ if (!teamsOp)
return TargetRegionFlags::generic;
- if (teamsOp->getParentOp() == targetOp.getOperation())
- return TargetRegionFlags::spmd | TargetRegionFlags::trip_count;
+ TargetRegionFlags result;
+
+ if (teamsOp->getParentOp() == targetOp.getOperation()) {
+ TargetRegionFlags result =
+ TargetRegionFlags::spmd | TargetRegionFlags::trip_count;
+ if (canPromoteToNoLoop(capturedOp, teamsOp, wsloopOp))
+ result = result | TargetRegionFlags::no_loop;
+ return result;
+ }
}
// Detect target-teams-distribute[-simd] and target-teams-loop.
else if (isa<DistributeOp, LoopOp>(innermostWrapper)) {
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 6694de8383534..d67d5eb741543 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -2590,13 +2590,27 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder,
}
builder.SetInsertPoint(*regionBlock, (*regionBlock)->begin());
+
+ bool noLoopMode = false;
+ omp::TargetOp targetOp = wsloopOp->getParentOfType<mlir::omp::TargetOp>();
+ if (targetOp) {
+ Operation *targetCapturedOp = targetOp.getInnermostCapturedOmpOp();
+ omp::TargetRegionFlags kernelFlags =
+ targetOp.getKernelExecFlags(targetCapturedOp);
+ if (omp::bitEnumContainsAll(kernelFlags,
+ omp::TargetRegionFlags::spmd |
+ omp::TargetRegionFlags::no_loop) &&
+ !omp::bitEnumContainsAny(kernelFlags, omp::TargetRegionFlags::generic))
+ noLoopMode = true;
+ }
+
llvm::OpenMPIRBuilder::InsertPointOrErrorTy wsloopIP =
ompBuilder->applyWorkshareLoop(
ompLoc.DL, loopInfo, allocaIP, loopNeedsBarrier,
convertToScheduleKind(schedule), chunk, isSimd,
scheduleMod == omp::ScheduleModifier::monotonic,
scheduleMod == omp::ScheduleModifier::nonmonotonic, isOrdered,
- workshareLoopType);
+ workshareLoopType, noLoopMode);
if (failed(handleError(wsloopIP, opInst)))
return failure();
@@ -5365,6 +5379,12 @@ initTargetDefaultAttrs(omp::TargetOp targetOp, Operation *capturedOp,
? llvm::omp::OMP_TGT_EXEC_MODE_GENERIC_SPMD
: llvm::omp::OMP_TGT_EXEC_MODE_GENERIC
: llvm::omp::OMP_TGT_EXEC_MODE_SPMD;
+ if (omp::bitEnumContainsAll(kernelFlags,
+ omp::TargetRegionFlags::spmd |
+ omp::TargetRegionFlags::no_loop) &&
+ !omp::bitEnumContainsAny(kernelFlags, omp::TargetRegionFlags::generic))
+ attrs.ExecFlags = llvm::omp::OMP_TGT_EXEC_MODE_SPMD_NO_LOOP;
+
attrs.MinTeams = minTeamsVal;
attrs.MaxTeams.front() = maxTeamsVal;
attrs.MinThreads = 1;
diff --git a/offload/DeviceRTL/src/Workshare.cpp b/offload/DeviceRTL/src/Workshare.cpp
index 59a2cc3f27aca..653104ce883d1 100644
--- a/offload/DeviceRTL/src/Workshare.cpp
+++ b/offload/DeviceRTL/src/Workshare.cpp
@@ -800,10 +800,6 @@ template <typename Ty> class StaticLoopChunker {
// If we know we have more threads than iterations we can indicate that to
// avoid an outer loop.
- if (config::getAssumeThreadsOversubscription()) {
- OneIterationPerThread = true;
- }
-
if (OneIterationPerThread)
ASSERT(NumThreads >= NumIters, "Broken assumption");
@@ -851,10 +847,6 @@ template <typename Ty> class StaticLoopChunker {
// If we know we have more blocks than iterations we can indicate that to
// avoid an outer loop.
- if (config::getAssumeTeamsOversubscription()) {
- OneIterationPerThread = true;
- }
-
if (OneIterationPerThread)
ASSERT(NumBlocks >= NumIters, "Broken assumption");
@@ -914,11 +906,6 @@ template <typename Ty> class StaticLoopChunker {
// If we know we have more threads (across all blocks) than iterations we
// can indicate that to avoid an outer loop.
- if (config::getAssumeTeamsOversubscription() &
- config::getAssumeThreadsOversubscription()) {
- OneIterationPerThread = true;
- }
-
if (OneIterationPerThread)
ASSERT(NumBlocks * NumThreads >= NumIters, "Broken assumption");
diff --git a/offload/test/offloading/fortran/target-no-loop.f90 b/offload/test/offloading/fortran/target-no-loop.f90
new file mode 100644
index 0000000000000..dd2bf7c2196b6
--- /dev/null
+++ b/offload/test/offloading/fortran/target-no-loop.f90
@@ -0,0 +1,43 @@
+! Check if the first OpenMP GPU kernel is promoted to no-loop mode.
+! The second cannot be promoted due to the limit on the number of teams.
+! REQUIRES: flang, amdgpu
+
+! RUN: %libomptarget-compile-fortran-generic -O3 -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription
+! RUN: env LIBOMPTARGET_INFO=16 %libomptarget-run-generic 2>&1 | %fcheck-generic
+program main
+ use omp_lib
+ implicit none
+ integer :: i
+ integer :: array(1024), errors = 0
+ array = 1
+
+ !$omp target teams distribute parallel do
+ do i = 1, 1024
+ array(i) = i
+ end do
+
+ do i = 1, 1024
+ if ( array( i) .ne. (i) ) then
+ errors = errors + 1
+ end if
+ end do
+
+ !$omp target teams distribute parallel do num_teams(3)
+ do i = 1, 1024
+ array(i) = i
+ end do
+
+ do i = 1, 1024
+ if ( array( i) .ne. (i) ) then
+ errors = errors + 1
+ end if
+ end do
+
+ print *,"number of errors: ", errors
+
+end program main
+
+! CHECK: "PluginInterface" device {{[0-9]+}} info: Launching kernel {{.*}} SPMD-No-Loop mode
+! CHECK: "PluginInterface" device {{[0-9]+}} info: Launching kernel {{.*}} SPMD mode
+! CHECK: number of errors: 0
+
>From 564410d9930b9b838d4dadfbea566c222c265e87 Mon Sep 17 00:00:00 2001
From: Dominik Adamski <dominik.adamski at amd.com>
Date: Mon, 1 Sep 2025 04:50:10 -0500
Subject: [PATCH 2/8] Applied remarks
---
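The promotion rules that the extended test exercises can be summarized in a small standalone C++ sketch. This is a simplified model of the logic in canPromoteToNoLoop, not the MLIR code itself: `KernelTraits`, `canPromoteSketch`, and `demoPromotion` are hypothetical names, and the struct fields stand in for queries made on the real operations and module flags.

```cpp
#include <cassert>

// Hypothetical summary of the properties the predicate inspects.
struct KernelTraits {
  bool hasNumTeamsClause = false;
  unsigned teamsReductionVars = 0;
  unsigned wsloopReductionVars = 0;
  bool assumeTeamsOversubscription = false;
  bool assumeThreadsOversubscription = false;
};

// Sketch of the decision: a num_teams clause or any reduction blocks
// promotion; otherwise both oversubscription flags must be set.
bool canPromoteSketch(const KernelTraits &k) {
  if (k.hasNumTeamsClause)
    return false;
  if (k.teamsReductionVars != 0 || k.wsloopReductionVars != 0)
    return false;
  return k.assumeTeamsOversubscription && k.assumeThreadsOversubscription;
}

// Usage: mirrors two of the cases in target-no-loop.f90.
void demoPromotion() {
  KernelTraits plain;
  plain.assumeTeamsOversubscription = true;
  plain.assumeThreadsOversubscription = true;
  assert(canPromoteSketch(plain)); // plain kernel is promoted

  KernelTraits withNumTeams = plain;
  withNumTeams.hasNumTeamsClause = true;
  assert(!canPromoteSketch(withNumTeams)); // num_teams(3) stays SPMD
}
```

Note that a num_threads clause does not appear in the predicate, which is consistent with the test expecting the num_threads(64) kernel to launch in SPMD-No-Loop mode.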
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 4 +-
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | 8 +-
.../offloading/fortran/target-no-loop.f90 | 77 ++++++++++++++++---
3 files changed, 72 insertions(+), 17 deletions(-)
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index bec5abb45041f..9ec2057bbee13 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -4983,8 +4983,10 @@ static void createTargetLoopWorkshareCall(OpenMPIRBuilder *OMPBuilder,
RealArgs.push_back(ConstantInt::get(TripCountTy, 0));
if (LoopType == WorksharingLoopType::DistributeForStaticLoop) {
RealArgs.push_back(ConstantInt::get(TripCountTy, 0));
+ RealArgs.push_back(ConstantInt::get(Builder.getInt8Ty(), NoLoop));
}
- RealArgs.push_back(ConstantInt::get(Builder.getInt8Ty(), NoLoop));
+ else
+ RealArgs.push_back(ConstantInt::get(Builder.getInt8Ty(), 0));
Builder.CreateCall(RTLFn, RealArgs);
}
diff --git a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
index 1e10dd114b30a..0371bc2e449f2 100644
--- a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
+++ b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
@@ -2106,18 +2106,18 @@ Operation *TargetOp::getInnermostCapturedOmpOp() {
});
}
-/// Check if we can promote SPMD kernel to No-Loop kernel
+/// Check if we can promote SPMD kernel to No-Loop kernel.
static bool canPromoteToNoLoop(Operation *capturedOp, TeamsOp teamsOp,
WsloopOp *wsLoopOp) {
- // num_teams clause can break no-loop teams/threads assumption
+ // num_teams clause can break no-loop teams/threads assumption.
if (teamsOp.getNumTeamsUpper())
return false;
- // reduction kernels are slower in no-loop mode
+ // Reduction kernels are slower in no-loop mode.
if (teamsOp.getNumReductionVars())
return false;
if (wsLoopOp->getNumReductionVars())
return false;
- // check if the user allows the promotion of kernels to no-loop mode
+ // Check if the user allows the promotion of kernels to no-loop mode.
OffloadModuleInterface offloadMod =
capturedOp->getParentOfType<omp::OffloadModuleInterface>();
if (!offloadMod)
diff --git a/offload/test/offloading/fortran/target-no-loop.f90 b/offload/test/offloading/fortran/target-no-loop.f90
index dd2bf7c2196b6..6251aca471561 100644
--- a/offload/test/offloading/fortran/target-no-loop.f90
+++ b/offload/test/offloading/fortran/target-no-loop.f90
@@ -1,43 +1,96 @@
-! Check if the first OpenMP GPU kernel is promoted to no-loop mode.
-! The second cannot be promoted due to the limit on the number of teams.
! REQUIRES: flang, amdgpu
! RUN: %libomptarget-compile-fortran-generic -O3 -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription
-! RUN: env LIBOMPTARGET_INFO=16 %libomptarget-run-generic 2>&1 | %fcheck-generic
+! RUN: env LIBOMPTARGET_INFO=16 OMP_NUM_TEAMS=16 OMP_TEAMS_THREAD_LIMIT=16 %libomptarget-run-generic 2>&1 | %fcheck-generic
+function check_errors(array) result (errors)
+ integer, intent(in) :: array(1024)
+ integer :: errors
+ integer :: i
+ errors = 0
+ do i = 1, 1024
+ if ( array( i) .ne. (i) ) then
+ errors = errors + 1
+ end if
+ end do
+end function
+
program main
use omp_lib
implicit none
- integer :: i
+ integer :: i,j,red
integer :: array(1024), errors = 0
array = 1
+ ! No-loop kernel
!$omp target teams distribute parallel do
do i = 1, 1024
array(i) = i
- end do
+ end do
+ errors = errors + check_errors(array)
+ ! SPMD kernel (num_teams clause blocks promotion to no-loop)
+ array = 1
+ !$omp target teams distribute parallel do num_teams(3)
do i = 1, 1024
- if ( array( i) .ne. (i) ) then
- errors = errors + 1
- end if
+ array(i) = i
end do
- !$omp target teams distribute parallel do num_teams(3)
+ errors = errors + check_errors(array)
+
+ ! No-loop kernel
+ array = 1
+ !$omp target teams distribute parallel do num_threads(64)
do i = 1, 1024
array(i) = i
end do
+ errors = errors + check_errors(array)
+
+ ! SPMD kernel
+ array = 1
+ !$omp target parallel do
do i = 1, 1024
- if ( array( i) .ne. (i) ) then
- errors = errors + 1
- end if
+ array(i) = i
+ end do
+
+ errors = errors + check_errors(array)
+
+ ! Generic kernel
+ array = 1
+ !$omp target teams distribute
+ do i = 1, 1024
+ array(i) = i
+ end do
+
+ errors = errors + check_errors(array)
+
+ ! SPMD kernel (reduction clause blocks promotion to no-loop)
+ array = 1
+ red =0
+ !$omp target teams distribute parallel do reduction(+:red)
+ do i = 1, 1024
+ red = red + array(i)
end do
+ if (red .ne. 1024) then
+ errors = errors + 1
+ end if
+
print *,"number of errors: ", errors
end program main
! CHECK: "PluginInterface" device {{[0-9]+}} info: Launching kernel {{.*}} SPMD-No-Loop mode
+! CHECK: info: #Args: 3 Teams x Thrds: 64x 16
+! CHECK: "PluginInterface" device {{[0-9]+}} info: Launching kernel {{.*}} SPMD mode
+! CHECK: info: #Args: 3 Teams x Thrds: 3x 16 {{.*}}
+! CHECK: "PluginInterface" device {{[0-9]+}} info: Launching kernel {{.*}} SPMD-No-Loop mode
+! CHECK: info: #Args: 3 Teams x Thrds: 64x 16 {{.*}}
+! CHECK: "PluginInterface" device {{[0-9]+}} info: Launching kernel {{.*}} SPMD mode
+! CHECK: info: #Args: 3 Teams x Thrds: 1x 16
+! CHECK: "PluginInterface" device {{[0-9]+}} info: Launching kernel {{.*}} Generic mode
+! CHECK: info: #Args: 3 Teams x Thrds: 16x 16 {{.*}}
! CHECK: "PluginInterface" device {{[0-9]+}} info: Launching kernel {{.*}} SPMD mode
+! CHECK: info: #Args: 4 Teams x Thrds: 16x 16 {{.*}}
! CHECK: number of errors: 0
>From 8efb5e061af1a16d636d9ed62600bd12b117c5dd Mon Sep 17 00:00:00 2001
From: Dominik Adamski <dominik.adamski at amd.com>
Date: Tue, 2 Sep 2025 05:15:53 -0500
Subject: [PATCH 3/8] Fix format
---
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 9ec2057bbee13..aabac59c29b37 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -4984,8 +4984,7 @@ static void createTargetLoopWorkshareCall(OpenMPIRBuilder *OMPBuilder,
if (LoopType == WorksharingLoopType::DistributeForStaticLoop) {
RealArgs.push_back(ConstantInt::get(TripCountTy, 0));
RealArgs.push_back(ConstantInt::get(Builder.getInt8Ty(), NoLoop));
- }
- else
+ } else
RealArgs.push_back(ConstantInt::get(Builder.getInt8Ty(), 0));
Builder.CreateCall(RTLFn, RealArgs);
>From 6176c9547e0fe8454caf0292f03c63646a86385c Mon Sep 17 00:00:00 2001
From: Dominik Adamski <dominik.adamski at amd.com>
Date: Thu, 4 Sep 2025 02:03:02 -0500
Subject: [PATCH 4/8] Remove unused variable
---
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | 2 --
1 file changed, 2 deletions(-)
diff --git a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
index 0371bc2e449f2..6d19acd20dd9c 100644
--- a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
+++ b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
@@ -2173,8 +2173,6 @@ TargetRegionFlags TargetOp::getKernelExecFlags(Operation *capturedOp) {
if (!teamsOp)
return TargetRegionFlags::generic;
- TargetRegionFlags result;
-
if (teamsOp->getParentOp() == targetOp.getOperation()) {
TargetRegionFlags result =
TargetRegionFlags::spmd | TargetRegionFlags::trip_count;
>From d2e88db4bfb5e76adbc09fb2e38457703941a683 Mon Sep 17 00:00:00 2001
From: Dominik Adamski <dominik.adamski at amd.com>
Date: Tue, 9 Sep 2025 07:16:02 -0500
Subject: [PATCH 5/8] Applied remarks
---
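The flag test that this change wraps in the `loopOp == targetCapturedOp` check can be modelled with a plain C++ bit enum. The sketch below is an approximation for illustration: the bit positions follow the cases in OpenMPEnums.td from this series, but `RegionFlags`, `containsAll`, `containsAny`, `isNoLoopKernel`, and `demoFlags` are hypothetical stand-ins for the generated MLIR helpers.

```cpp
#include <cassert>
#include <cstdint>

// Bit positions follow the TargetRegionFlags cases in this series;
// the C++ names here are hypothetical.
enum class RegionFlags : std::uint32_t {
  none = 0,
  generic = 1u << 0,
  spmd = 1u << 1,
  trip_count = 1u << 2,
  no_loop = 1u << 3,
};

constexpr std::uint32_t raw(RegionFlags f) {
  return static_cast<std::uint32_t>(f);
}

constexpr RegionFlags operator|(RegionFlags a, RegionFlags b) {
  return static_cast<RegionFlags>(raw(a) | raw(b));
}

// True if every bit in `bits` is set in `value`.
constexpr bool containsAll(RegionFlags value, RegionFlags bits) {
  return (raw(value) & raw(bits)) == raw(bits);
}

// True if at least one bit in `bits` is set in `value`.
constexpr bool containsAny(RegionFlags value, RegionFlags bits) {
  return (raw(value) & raw(bits)) != 0;
}

// A kernel is treated as no-loop only if it is SPMD, carries the no_loop
// bit, and is not also marked generic.
constexpr bool isNoLoopKernel(RegionFlags f) {
  return containsAll(f, RegionFlags::spmd | RegionFlags::no_loop) &&
         !containsAny(f, RegionFlags::generic);
}

// Usage example.
void demoFlags() {
  assert(isNoLoopKernel(RegionFlags::spmd | RegionFlags::trip_count |
                        RegionFlags::no_loop));
  assert(!isNoLoopKernel(RegionFlags::spmd | RegionFlags::trip_count));
}
```

Excluding the generic bit keeps a kernel that combines generic and spmd flags from being classified as no-loop even if the no_loop bit were set.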
.../OpenMP/OpenMPToLLVMIRTranslation.cpp | 21 ++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 04bdc11ed1c0f..6c975190d2846 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -2591,17 +2591,24 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder,
builder.SetInsertPoint(*regionBlock, (*regionBlock)->begin());
+ // Check if we can generate no-loop kernel
bool noLoopMode = false;
omp::TargetOp targetOp = wsloopOp->getParentOfType<mlir::omp::TargetOp>();
if (targetOp) {
Operation *targetCapturedOp = targetOp.getInnermostCapturedOmpOp();
- omp::TargetRegionFlags kernelFlags =
- targetOp.getKernelExecFlags(targetCapturedOp);
- if (omp::bitEnumContainsAll(kernelFlags,
- omp::TargetRegionFlags::spmd |
- omp::TargetRegionFlags::no_loop) &&
- !omp::bitEnumContainsAny(kernelFlags, omp::TargetRegionFlags::generic))
- noLoopMode = true;
+ // We need this check because, without it, noLoopMode would be set to true
+ // for every omp.wsloop nested inside a no-loop SPMD target region, even if
+ // that loop is not the top-level SPMD one.
+ if (loopOp == targetCapturedOp) {
+ omp::TargetRegionFlags kernelFlags =
+ targetOp.getKernelExecFlags(targetCapturedOp);
+ if (omp::bitEnumContainsAll(kernelFlags,
+ omp::TargetRegionFlags::spmd |
+ omp::TargetRegionFlags::no_loop) &&
+ !omp::bitEnumContainsAny(kernelFlags,
+ omp::TargetRegionFlags::generic))
+ noLoopMode = true;
+ }
}
llvm::OpenMPIRBuilder::InsertPointOrErrorTy wsloopIP =
>From e8a23ce5803f3a3cc561aec0b1e9861528f23d32 Mon Sep 17 00:00:00 2001
From: Dominik Adamski <dominik.adamski at amd.com>
Date: Wed, 10 Sep 2025 09:09:06 -0500
Subject: [PATCH 6/8] Applied remarks
---
clang/include/clang/Driver/Options.td | 25 +++++++++++--------
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | 3 ++-
.../mlir/Dialect/OpenMP/OpenMPEnums.td | 10 +++++++-
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | 2 ++
.../offloading/fortran/target-no-loop.f90 | 2 +-
5 files changed, 28 insertions(+), 14 deletions(-)
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 12256c161ad8b..1305c9df9ad02 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -3836,18 +3836,21 @@ let Group = f_Group in {
def fopenmp_target_debug_EQ : Joined<["-"], "fopenmp-target-debug=">;
def fopenmp_assume_teams_oversubscription : Flag<["-"], "fopenmp-assume-teams-oversubscription">,
- HelpText<"Allow enforcement to ensure there are enough teams to cover the "
- "loop iteration space. It may ignore environment variables. "
- "If the fopenmp-assume-teams-oversubscription and "
- "fopenmp-assume-threads-oversubscription flags are set, Flang may "
- "generate more optimized OpenMP kernels for target teams distribute "
- "parallel do pragmas.">;
+ HelpText<"Allow the optimizer to discretely increase the number of "
+ "teams. May cause environment variables that set "
+ "the number of teams to be ignored. The combination of "
+ "-fopenmp-assume-teams-oversubscription "
+ "and -fopenmp-assume-threads-oversubscription "
+ "may allow the conversion of loops into sequential code by "
+ "ensuring that each team/thread executes at most one iteration.">;
def fopenmp_assume_threads_oversubscription : Flag<["-"], "fopenmp-assume-threads-oversubscription">,
- HelpText<"Assume threads oversubscription. If the "
- "fopenmp-assume-teams-oversubscription and "
- "fopenmp-assume-threads-oversubscription flags are set, Flang may "
- "generate more optimized OpenMP kernels for target teams distribute "
- "parallel do pragmas.">;
+ HelpText<"Allow the optimizer to discretely increase the number of "
+ "threads. May cause environment variables that set "
+ "the number of threads to be ignored. The combination of "
+ "-fopenmp-assume-teams-oversubscription "
+ "and -fopenmp-assume-threads-oversubscription "
+ "may allow the conversion of loops into sequential code by "
+ "ensuring that each team/thread executes at most one iteration.">;
def fno_openmp_assume_teams_oversubscription : Flag<["-"], "fno-openmp-assume-teams-oversubscription">,
HelpText<"Do not assume teams oversubscription.">;
def fno_openmp_assume_threads_oversubscription : Flag<["-"], "fno-openmp-assume-threads-oversubscription">,
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index eb63acead6f64..e322d2f3be1f3 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -5004,8 +5004,9 @@ static void createTargetLoopWorkshareCall(OpenMPIRBuilder *OMPBuilder,
if (LoopType == WorksharingLoopType::DistributeForStaticLoop) {
RealArgs.push_back(ConstantInt::get(TripCountTy, 0));
RealArgs.push_back(ConstantInt::get(Builder.getInt8Ty(), NoLoop));
- } else
+ } else {
RealArgs.push_back(ConstantInt::get(Builder.getInt8Ty(), 0));
+ }
Builder.CreateCall(RTLFn, RealArgs);
}
diff --git a/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td b/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
index e0cd06805ab40..f1190e126d117 100644
--- a/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
+++ b/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
@@ -234,7 +234,15 @@ def TargetRegionFlagsNoLoop : I32BitEnumAttrCaseBit<"no_loop", 3>;
def TargetRegionFlags : OpenMP_BitEnumAttr<
"TargetRegionFlags",
- "target region property flags", [
+ "These flags describe properties of the target kernel. "
+ "TargetRegionFlagsGeneric - denotes generic kernel. "
+ "TargetRegionFlagsSpmd - denotes SPMD kernel. "
+ "TargetRegionFlagsNoLoop - denotes kernel where "
+ "num_teams * num_threads >= loop_trip_count. It allows the conversion "
+ "of loops into sequential code by ensuring that each team/thread "
+ "executes at most one iteration. "
+ "TargetRegionFlagsTripCount - checks if the loop trip count should be "
+ "calculated.", [
TargetRegionFlagsNone,
TargetRegionFlagsGeneric,
TargetRegionFlagsSpmd,
diff --git a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
index 6d19acd20dd9c..8cd07ea51aae1 100644
--- a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
+++ b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
@@ -2112,11 +2112,13 @@ static bool canPromoteToNoLoop(Operation *capturedOp, TeamsOp teamsOp,
// num_teams clause can break no-loop teams/threads assumption.
if (teamsOp.getNumTeamsUpper())
return false;
+
// Reduction kernels are slower in no-loop mode.
if (teamsOp.getNumReductionVars())
return false;
if (wsLoopOp->getNumReductionVars())
return false;
+
// Check if the user allows the promotion of kernels to no-loop mode.
OffloadModuleInterface offloadMod =
capturedOp->getParentOfType<omp::OffloadModuleInterface>();
diff --git a/offload/test/offloading/fortran/target-no-loop.f90 b/offload/test/offloading/fortran/target-no-loop.f90
index 6251aca471561..8e40e20e73e70 100644
--- a/offload/test/offloading/fortran/target-no-loop.f90
+++ b/offload/test/offloading/fortran/target-no-loop.f90
@@ -1,4 +1,4 @@
-! REQUIRES: flang, amdgpu
+! REQUIRES: flang
! RUN: %libomptarget-compile-fortran-generic -O3 -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription
! RUN: env LIBOMPTARGET_INFO=16 OMP_NUM_TEAMS=16 OMP_TEAMS_THREAD_LIMIT=16 %libomptarget-run-generic 2>&1 | %fcheck-generic
>From 31f87cd9bb102e40bc98699605d3cda3ef785f4b Mon Sep 17 00:00:00 2001
From: Dominik Adamski <dominik.adamski at amd.com>
Date: Fri, 26 Sep 2025 04:41:09 -0500
Subject: [PATCH 7/8] Applied remarks.
---
.../mlir/Dialect/OpenMP/OpenMPEnums.td | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td b/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
index f693a0737e0fc..7cc35d9cff916 100644
--- a/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
+++ b/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
@@ -234,15 +234,16 @@ def TargetRegionFlagsNoLoop : I32BitEnumAttrCaseBit<"no_loop", 3>;
def TargetRegionFlags : OpenMP_BitEnumAttr<
"TargetRegionFlags",
- "These flags describe properties of the target kernel. "
- "TargetRegionFlagsGeneric - denotes generic kernel. "
- "TargetRegionFlagsSpmd - denotes SPMD kernel. "
- "TargetRegionFlagsNoLoop - denotes kernel where "
- "num_teams * num_threads >= loop_trip_count. It allows the conversion "
- "of loops into sequential code by ensuring that each team/thread "
- "executes at most one iteration. "
- "TargetRegionFlagsTripCount - checks if the loop trip count should be "
- "calculated.", [
+ [{ These flags describe properties of the target kernel.
+
+ TargetRegionFlagsGeneric - denotes generic kernel.
+ TargetRegionFlagsSpmd - denotes SPMD kernel.
+ TargetRegionFlagsNoLoop - denotes kernel where
+ num_teams * num_threads >= loop_trip_count. It allows the conversion
+ of loops into sequential code by ensuring that each team/thread
+ executes at most one iteration.
+ TargetRegionFlagsTripCount - checks if the loop trip count should be
+ calculated.}], [
TargetRegionFlagsNone,
TargetRegionFlagsGeneric,
TargetRegionFlagsSpmd,
>From a15711d77a507156c4f4ebd901b675c6a3725865 Mon Sep 17 00:00:00 2001
From: Dominik Adamski <dominik.adamski at amd.com>
Date: Fri, 26 Sep 2025 05:29:54 -0500
Subject: [PATCH 8/8] Revert "Applied remarks."
This reverts commit 31f87cd9bb102e40bc98699605d3cda3ef785f4b.
---
.../mlir/Dialect/OpenMP/OpenMPEnums.td | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td b/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
index 7cc35d9cff916..f693a0737e0fc 100644
--- a/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
+++ b/mlir/include/mlir/Dialect/OpenMP/OpenMPEnums.td
@@ -234,16 +234,15 @@ def TargetRegionFlagsNoLoop : I32BitEnumAttrCaseBit<"no_loop", 3>;
def TargetRegionFlags : OpenMP_BitEnumAttr<
"TargetRegionFlags",
- [{ These flags describe properties of the target kernel.
-
- TargetRegionFlagsGeneric - denotes generic kernel.
- TargetRegionFlagsSpmd - denotes SPMD kernel.
- TargetRegionFlagsNoLoop - denotes kernel where
- num_teams * num_threads >= loop_trip_count. It allows the conversion
- of loops into sequential code by ensuring that each team/thread
- executes at most one iteration.
- TargetRegionFlagsTripCount - checks if the loop trip count should be
- calculated.}], [
+ "These flags describe properties of the target kernel. "
+ "TargetRegionFlagsGeneric - denotes generic kernel. "
+ "TargetRegionFlagsSpmd - denotes SPMD kernel. "
+ "TargetRegionFlagsNoLoop - denotes kernel where "
+ "num_teams * num_threads >= loop_trip_count. It allows the conversion "
+ "of loops into sequential code by ensuring that each team/thread "
+ "executes at most one iteration. "
+ "TargetRegionFlagsTripCount - checks if the loop trip count should be "
+ "calculated.", [
TargetRegionFlagsNone,
TargetRegionFlagsGeneric,
TargetRegionFlagsSpmd,