[Openmp-commits] [PATCH] D45326: [OpenMP] [CUDA plugin] Add support for teams reduction via scratchpad

Thu Apr 5 11:05:59 PDT 2018

grokos added inline comments.

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:75-76
+  int8_t ExecutionMode;
+  int32_t NumReductionVars;
+  int32_t ReductionVarsSize;
+
----------------
ABataev wrote:
> Why do you need all that data before starting the outlined function? Can we allocate the memory during execution of the outlined function by some runtime function call?
> Like this:
> ```
> __omp_offloading....
> <master>
> %Scratchpad = call i8 *__kmpc_allocate_scratchpad(<Size_of_the_reductions>);
> ....
> __kmpc_deallocate_scratchpad(i8 *%Scratchpad);
> <end_master>
> ```
> 
We can go down that route if you prefer. I haven't been able to find official documentation on which kind of allocation is faster (`cudaMalloc` on the host vs. `malloc` on the device), so for now I'm assuming they perform about the same.

Any thoughts on that?

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:638-641
+    if (KernelInfo->ExecutionMode == GENERIC) {
+      // Leave room for the master warp which will be added below.
+      cudaThreadsPerBlock -= DeviceInfo.WarpSize[device_id];
+    }
----------------
Hahnfeld wrote:
> I think this shouldn't be in this patch?
Removed.

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:709-712
+#ifdef OMPTARGET_DEBUG
+    if (Scratchpad == NULL)
+      DP("Failed to allocate reduction scratchpad\n");
+#endif
----------------
Hahnfeld wrote:
> Maybe error out completely?
Correct. If allocating the scratchpad fails, the kernel cannot be executed, so we'll return `OFFLOAD_FAIL` instead of only printing a debug message.
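
For illustration, here is a minimal, host-only sketch of the fail-hard behavior described above. It is not the code from the patch: `allocateScratchpad`, `ScratchpadAllocFn`, `failingAlloc` and `hostAlloc` are hypothetical names, the two return-code constants are local stand-ins for the libomptarget ones, and the allocator is passed in as a function pointer so the sketch can run without a CUDA device. Unlike the hunk above, the sketch prints the message unconditionally rather than under `OMPTARGET_DEBUG`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Local stand-ins for the libomptarget return codes (assumed values).
constexpr int OFFLOAD_SUCCESS = 0;
constexpr int OFFLOAD_FAIL = ~0;

// Hypothetical allocator signature. In the plugin this role would be played
// by the device allocation call; here it is injected so the sketch is testable.
using ScratchpadAllocFn = void *(*)(std::size_t);

// Hypothetical helper: allocate the reduction scratchpad and report failure
// to the caller instead of only logging it in debug builds.
int allocateScratchpad(ScratchpadAllocFn Alloc, std::size_t Size,
                       void **Scratchpad) {
  *Scratchpad = nullptr;
  if (Size == 0)
    return OFFLOAD_SUCCESS; // No reduction variables, nothing to allocate.

  *Scratchpad = Alloc(Size);
  if (*Scratchpad == nullptr) {
    // Without a scratchpad the kernel cannot run, so bail out.
    std::fprintf(stderr, "Failed to allocate reduction scratchpad\n");
    return OFFLOAD_FAIL;
  }
  return OFFLOAD_SUCCESS;
}

// Two toy allocators used to exercise both paths.
inline void *failingAlloc(std::size_t) { return nullptr; }
inline void *hostAlloc(std::size_t Size) { return std::malloc(Size); }
```

A caller would check the return value before launching the kernel and propagate `OFFLOAD_FAIL` upward. On the success path, the scratchpad still has to be released after the kernel finishes.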

Repository:
  rOMP OpenMP

https://reviews.llvm.org/D45326