[Openmp-commits] [PATCH] D45326: [OpenMP] [CUDA plugin] Add support for teams reduction via scratchpad

George Rokos via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Thu Apr 5 11:05:59 PDT 2018

grokos added inline comments.

Comment at: libomptarget/plugins/cuda/src/rtl.cpp:75-76
+  int8_t ExecutionMode;
+  int32_t NumReductionVars;
+  int32_t ReductionVarsSize;
ABataev wrote:
> Why do you need all that data before starting the outlined function? Can we allocate the memory during execution of the outlined function by some runtime function call?
> Like this:
> ```
> __omp_offloading....
> <master>
> %Scratchpad = call i8 *__kmpc_allocate_scratchpad(<Size_of_the_reductions>);
> ....
> __kmpc_deallocate_scratchpad(i8 *%Scratchpad);
> <end_master>
> ```
We can go down that route if you prefer. I haven't been able to find official documentation about which type of memory allocation is faster (`cuadMalloc` on the host vs `malloc` on the device), so I assume they perform equally fast.

Any thoughts on that?

Comment at: libomptarget/plugins/cuda/src/rtl.cpp:638-641
+    if (KernelInfo->ExecutionMode == GENERIC) {
+      // Leave room for the master warp which will be added below.
+      cudaThreadsPerBlock -= DeviceInfo.WarpSize[device_id];
+    }
Hahnfeld wrote:
> I think this shouldn't be in this patch?

Comment at: libomptarget/plugins/cuda/src/rtl.cpp:709-712
+    if (Scratchpad == NULL)
+      DP("Failed to allocate reduction scratchpad\n");
Hahnfeld wrote:
> Maybe error out completely?
Correct, if allocating the scratchpad fails the kernel cannot be executed, so we'll return `OFFLOAD_FAIL`.

  rOMP OpenMP


More information about the Openmp-commits mailing list