[Openmp-dev] nested parallelism in libomptarget-nvptx

Hal Finkel via Openmp-dev openmp-dev at lists.llvm.org
Fri Sep 7 09:35:26 PDT 2018


Hi, Doru,

What do you think we should do, upstream, for nested parallelism? Would
it be desirable to have a clang-ykt-like scheme? Something else?

Thanks again,

Hal


On 09/07/2018 10:59 AM, Gheorghe-Teod Bercea via Openmp-dev wrote:
> Hi Jonas,
>
> The second level of parallelism in clang-ykt uses a scheme where all
> the threads in each warp cooperate to execute the workload of the 1st
> thread in the warp, then the 2nd, and so on, until the workload of each
> of the 32 threads in the warp has been completed. The workload of each
> thread is always executed by the full warp.
> You are correct that in trunk the additional memory this scheme uses
> is not required. For now we would like to keep this functionality in
> place, so it would be good if you could hide it behind a flag. This
> will allow us to easily drop it in the future.
>
> Thanks a lot,
>
> --Doru
>
> From:        Jonas Hahnfeld <hahnjo at hahnjo.de>
> To:        openmp-dev at lists.llvm.org
> Cc:        Alexey Bataev <alexey.bataev at ibm.com>, Doru Bercea
> <gheorghe-teod.bercea at ibm.com>, Kelvin Li <kli at ca.ibm.com>
> Date:        09/07/2018 11:31 AM
> Subject:        nested parallelism in libomptarget-nvptx
> ------------------------------------------------------------------------
>
>
>
> Hi all,
>
> I've started some cleanups in libomptarget-nvptx, the OpenMP runtime
> implementation on Nvidia GPUs. The ultimate motivation is reducing the
> memory overhead: at the moment the runtime statically allocates ~660MiB
> of global memory, which then can't be used by applications. That might
> not sound like much, but wasting precious memory isn't wise.
> I found that 448MiB of this comes from buffers for data sharing. In
> particular, they appear to be so large because the code is prepared to
> handle nested parallelism, where every thread would be in a position to
> share data with its nested worker threads.
> From what I've seen so far, this doesn't seem to be necessary for Clang
> trunk: nested parallel regions are serialized, so only the initial
> thread needs to share data with one set of worker threads. That's in
> line with comments saying that there is no support for nested
> parallelism.
>
> However, I found that my test applications compiled with clang-ykt
> support two levels of parallelism. My guess would be that this is
> related to "convergent parallelism": parallel.cu explains that this is
> meant for a "team of threads in a warp only". And indeed, each nested
> parallel region seems to be executed by 32 threads.
> I'm not really sure how this works because I seem to get one OpenMP
> thread per CUDA thread in the outer parallel region. So where are the
> nested worker threads coming from?
>
> In any case: If my analysis is correct, I'd like to propose adding a
> CMake flag which disables this (seemingly) legacy support [1]. That
> would avoid the memory overhead for users of Clang trunk and enable
> future optimizations (I think).
> Thoughts, opinions?
>
> Cheers,
> Jonas
>
>
> 1: Provided that IBM still wants to keep the code and we can't just go
> ahead and drop it. I guess that this can happen at some point in time,
> but I'm not sure if we are in that position right now.
>
> _______________________________________________
> Openmp-dev mailing list
> Openmp-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
