[Openmp-commits] [PATCH] D105699: [libomptarget][devicertl] Remove branches around setting parallelLevel

Jon Chesterfield via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Fri Jul 9 07:25:41 PDT 2021


JonChesterfield created this revision.
JonChesterfield added reviewers: jdoerfert, ABataev, grokos, tianshilei1992, ye-luo, ronlieb, carlo.bertolli, pdhaliwal, ggeorgakoudis, Meinersbur.
JonChesterfield requested review of this revision.
Herald added a project: OpenMP.
Herald added a subscriber: openmp-commits.

Simplifies control flow to allow store/load forwarding

Specifically, for SPMD kernels with sufficiently aggressive inlining, this
change allows the loads from parallelLevel in parallel_51 and global_thread_num
to constant fold, removing most of the handling for potentially nested
parallelism when there is none. At present, the parallelLevel array remains,
with a single store to it.

Two transforms here:

  int threadId = GetThreadIdInBlock();
  if (threadId == 0) {
    parallelLevel[0] = expr;
  } else if (GetLaneId() == 0) {
    parallelLevel[GetWarpId()] = expr;
  }
  // =>
  if (GetLaneId() == 0) {
    parallelLevel[GetWarpId()] = expr;
  }
  // because
  unsigned GetLaneId() { return GetThreadIdInBlock() & (WARPSIZE - 1);}
  // so whenever threadId == 0, GetLaneId() is also 0.

That replaces a store in two distinct basic blocks with a single store.
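
The equivalence can be checked exhaustively on the host. The following is a
minimal sketch, not the device runtime: WARPSIZE, MAX_THREADS, the Levels type
and the store*/firstTransform* helpers are hypothetical names chosen for
illustration, and the thread id is passed as a parameter instead of being read
from hardware.

```cpp
#include <array>
#include <cassert>

// Illustrative values only; the real runtime takes these from the target.
constexpr unsigned WARPSIZE = 32;
constexpr unsigned MAX_THREADS = 1024;
constexpr unsigned NUM_WARPS = MAX_THREADS / WARPSIZE;

// Host-side stand-ins for the device helpers quoted above.
constexpr unsigned GetLaneId(unsigned threadId) { return threadId & (WARPSIZE - 1); }
constexpr unsigned GetWarpId(unsigned threadId) { return threadId / WARPSIZE; }

using Levels = std::array<int, NUM_WARPS>;

// Original form: thread 0 and the other lane-0 threads take different branches.
inline void storeBranched(Levels &parallelLevel, unsigned threadId, int expr) {
  if (threadId == 0) {
    parallelLevel[0] = expr;
  } else if (GetLaneId(threadId) == 0) {
    parallelLevel[GetWarpId(threadId)] = expr;
  }
}

// After the first transform: one guarded store.
inline void storeLaneZero(Levels &parallelLevel, unsigned threadId, int expr) {
  if (GetLaneId(threadId) == 0) {
    parallelLevel[GetWarpId(threadId)] = expr;
  }
}

// For every thread id, both forms must leave the array in the same state.
inline bool firstTransformPreservesStores(int expr) {
  for (unsigned t = 0; t < MAX_THREADS; ++t) {
    Levels before, after;
    before.fill(-1); // sentinel marks "not written"
    after.fill(-1);
    storeBranched(before, t, expr);
    storeLaneZero(after, t, expr);
    if (before != after)
      return false;
  }
  return true;
}
```

Calling firstTransformPreservesStores with any value of expr should return
true, since thread 0 has lane id 0 and warp id 0 and therefore reaches the same
store in both forms.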

Second,

  if (GetLaneId() == 0) {
    parallelLevel[GetWarpId()] = expr;
  }
  // =>
  parallelLevel[GetWarpId()] = expr;
  // because
  unsigned GetWarpId() { return GetThreadIdInBlock() / WARPSIZE; }
  // so GetWarpId will index the same element for every thread in the warp
  // and, because expr is lane-invariant in this case, every lane stores the
  // same value to this unique address
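
The claim that the unguarded store leaves parallelLevel in the same state can
be sketched on the host as well. This is an illustration under stated
assumptions: WARPSIZE and MAX_THREADS are example values, runGuarded,
runUnguarded and secondTransformPreservesResult are hypothetical helpers, and a
serial loop stands in for the threads of a block, so it models the final memory
contents only and says nothing about power use or memory traffic.

```cpp
#include <array>
#include <cassert>

// Illustrative values only; the real runtime takes these from the target.
constexpr unsigned WARPSIZE = 32;
constexpr unsigned MAX_THREADS = 1024;
constexpr unsigned NUM_WARPS = MAX_THREADS / WARPSIZE;

// Host-side stand-ins for the device helpers quoted above.
constexpr unsigned GetLaneId(unsigned threadId) { return threadId & (WARPSIZE - 1); }
constexpr unsigned GetWarpId(unsigned threadId) { return threadId / WARPSIZE; }

using Levels = std::array<int, NUM_WARPS>;

// Guarded form: only lane 0 of each warp stores.
inline Levels runGuarded(unsigned numThreads, int expr) {
  Levels parallelLevel;
  parallelLevel.fill(-1); // sentinel marks "not written"
  for (unsigned t = 0; t < numThreads; ++t)
    if (GetLaneId(t) == 0)
      parallelLevel[GetWarpId(t)] = expr;
  return parallelLevel;
}

// Unguarded form: every lane stores the same lane-invariant value.
inline Levels runUnguarded(unsigned numThreads, int expr) {
  Levels parallelLevel;
  parallelLevel.fill(-1);
  for (unsigned t = 0; t < numThreads; ++t)
    parallelLevel[GetWarpId(t)] = expr;
  return parallelLevel;
}

// Compares the two forms for every block size from 1 to MAX_THREADS,
// including sizes that leave the last warp partially populated.
inline bool secondTransformPreservesResult(int expr) {
  for (unsigned n = 1; n <= MAX_THREADS; ++n)
    if (runGuarded(n, expr) != runUnguarded(n, expr))
      return false;
  return true;
}
```

The comparison holds because a warp contains a running thread exactly when its
lane 0 is running, so both forms write the same set of array elements. It
relies on expr being identical across lanes; a lane-dependent value would make
the unguarded form racy.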

The first transform is always an optimisation. The second is more debatable -
a GPU may use more power when every lane writes to a given address than when
all lanes but one are masked off, but equally there is a cost to the masking
and unmasking.

If the second transform is omitted, the CFG for an SPMD kernel has two entry
points into the parallel_51 call (on LaneId == 0). With it included, the
calls into parallel_51 and global_thread_num are in the same basic block
as the write to parallelLevel, and store/load forwarding removes the loads.
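
Why an unconditional store helps can be sketched with a compile-time
evaluation. This is an illustration only, not the runtime code: the value 128
for OMP_ACTIVE_PARALLEL_LEVEL, the WARPSIZE and NUM_WARPS constants and the
levelAfterInit helper are assumptions made for the example, and the constexpr
evaluator stands in for the optimiser's store/load forwarding.

```cpp
#include <cassert>

// Assumed values for illustration only.
constexpr unsigned WARPSIZE = 32;
constexpr unsigned NUM_WARPS = 32;
constexpr int OMP_ACTIVE_PARALLEL_LEVEL = 128;

// Host-side stand-in for the device helper quoted above.
constexpr unsigned GetWarpId(unsigned threadId) { return threadId / WARPSIZE; }

// Store followed by a load of the same element with no branch in between:
// the loaded value is known to be the stored value, whatever the thread id.
constexpr int levelAfterInit(unsigned threadId, unsigned numThreads) {
  int parallelLevel[NUM_WARPS] = {};
  parallelLevel[GetWarpId(threadId)] =
      1 + (numThreads > 1 ? OMP_ACTIVE_PARALLEL_LEVEL : 0);
  return parallelLevel[GetWarpId(threadId)];
}

// The load folds to a constant once the block size is known.
static_assert(levelAfterInit(5, 64) == 1 + OMP_ACTIVE_PARALLEL_LEVEL,
              "load is forwarded from the preceding store");
static_assert(levelAfterInit(0, 1) == 1,
              "a single-thread block is not an active parallel level");
```

With the lane-0 guard in place, a thread in any other lane would load a value
written by a different thread, so the compiler could not forward the store
within that thread.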


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D105699

Files:
  openmp/libomptarget/deviceRTLs/common/src/omptarget.cu


Index: openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
===================================================================
--- openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
+++ openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
@@ -93,12 +93,11 @@
   int threadId = GetThreadIdInBlock();
   if (threadId == 0) {
     usedSlotIdx = __kmpc_impl_smid() % MAX_SM;
-    parallelLevel[0] =
-        1 + (GetNumberOfThreadsInBlock() > 1 ? OMP_ACTIVE_PARALLEL_LEVEL : 0);
-  } else if (GetLaneId() == 0) {
-    parallelLevel[GetWarpId()] =
-        1 + (GetNumberOfThreadsInBlock() > 1 ? OMP_ACTIVE_PARALLEL_LEVEL : 0);
   }
+
+  parallelLevel[GetWarpId()] =
+      1 + (GetNumberOfThreadsInBlock() > 1 ? OMP_ACTIVE_PARALLEL_LEVEL : 0);
+
   __kmpc_data_sharing_init_stack();
   if (!RequiresOMPRuntime) {
     // Runtime is not required - exit.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: D105699.357502.patch
Type: text/x-patch
Size: 858 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/openmp-commits/attachments/20210709/8dfa0df2/attachment.bin>

