[Openmp-commits] [PATCH] D45326: [OpenMP] [CUDA plugin] Add support for teams reduction via scratchpad

Alexey Bataev via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Thu Apr 5 11:27:50 PDT 2018

ABataev added a comment.

In https://reviews.llvm.org/D45326#1058730, @grokos wrote:

> One caveat regarding Alexey's proposal: According to the CUDA programming guide <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations>, `malloc` on the device allocates space from a fixed-size heap. The default size of this heap is 8MB. If we run into a scenario where more than 8MB will be required for the reduction scratchpad, allocating the scratchpad from the device will fail. The heap size can be user-defined from the host, but for that to happen the host must know how large the scratchpad needs to be, which defeats the purpose of moving scratchpad allocation from the plugin to the nvptx runtime.

But you can change the limit from the host using `cudaThreadSetLimit` (or its non-deprecated replacement, `cudaDeviceSetLimit`).
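As a concrete illustration of that point, here is a minimal host-side sketch that raises the device `malloc` heap limit via `cudaDeviceSetLimit` (the current name for the deprecated `cudaThreadSetLimit`). The 16 MB figure is just an example value; the limit must be set before any kernel that uses device-side `malloc` is launched.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Example: raise the device malloc heap from the 8 MB default to 16 MB.
  size_t heapBytes = 16 * 1024 * 1024;
  cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Read the limit back; the driver may round the request up.
  size_t actual = 0;
  cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
  printf("device malloc heap size: %zu bytes\n", actual);
  return 0;
}
```

Note that this still requires the host to know an upper bound on the scratchpad size, which is exactly the caveat raised above.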

Comment at: libomptarget/plugins/cuda/src/rtl.cpp:75-76
+  int8_t ExecutionMode;
+  int32_t NumReductionVars;
+  int32_t ReductionVarsSize;
grokos wrote:
> ABataev wrote:
> > Why do you need all that data before starting the outlined function? Can we allocate the memory during execution of the outlined function by some runtime function call?
> > Like this:
> > ```
> > __omp_offloading....
> > <master>
> > %Scratchpad = call i8 *__kmpc_allocate_scratchpad(<Size_of_the_reductions>);
> > ....
> > __kmpc_deallocate_scratchpad(i8 *%Scratchpad);
> > <end_master>
> > ```
> > 
> We can go down that route if you prefer. I haven't been able to find official documentation about which type of memory allocation is faster (`cudaMalloc` on the host vs `malloc` on the device), so I assume they perform equally fast.
> Any thoughts on that?
I'd prefer this solution rather than the original one.
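If the device-side route is taken, the two entry points named in the comment above could be backed by device-side `malloc`/`free`. The function names `__kmpc_allocate_scratchpad` and `__kmpc_deallocate_scratchpad` come from the review comment; the bodies below are only an assumed sketch, not the actual nvptx runtime implementation:

```cuda
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: allocate the teams-reduction scratchpad on the
// device. Device-side malloc draws from the fixed-size heap discussed
// above, so a request larger than the heap returns nullptr.
extern "C" __device__ int8_t *__kmpc_allocate_scratchpad(size_t Size) {
  return static_cast<int8_t *>(malloc(Size));
}

// Hypothetical sketch: release the scratchpad at the end of the region.
extern "C" __device__ void __kmpc_deallocate_scratchpad(int8_t *Scratchpad) {
  free(Scratchpad);
}
```

Under this scheme the plugin no longer needs `NumReductionVars`/`ReductionVarsSize` before launch, at the cost of being bounded by the device heap limit.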

Repository:
  rOMP OpenMP
