[Openmp-dev] nested parallelism in libomptarget-nvptx

Hal Finkel via Openmp-dev openmp-dev at lists.llvm.org
Fri Sep 7 15:15:31 PDT 2018



On 09/07/2018 03:03 PM, Gheorghe-Teod Bercea wrote:
> Hi Hal,
>
> As far as we are aware, the number of use cases for a nested
> parallelism scheme is quite small. Most uses of OpenMP on GPUs have a
> single level of parallelism, typically SPMD-like, to achieve as much
> performance as possible. That said, there is some merit to having a
> nested parallelism scheme because, when it is helpful, it is
> typically very helpful.
>
> As a departure from ykt-clang, I would suggest that whichever scheme
> (or schemes) we decide to use should be applied only at the request
> of the user. This is because we can do a better code generation job
> for more OpenMP patterns under the existing schemes (generic and
> SPMD) if we know at compile time that no second level of parallelism
> will be used. This is due to some implementation changes in trunk
> compared to ykt-clang.
>
> Regarding which scheme to use, two have been floated based on
> discussions with users: (1) the current scheme in ykt-clang, which
> enables the code in both the inner and the outer parallel loops to be
> executed in parallel, and (2) a scheme where the outer loop code is
> executed by one thread and the innermost loop is executed by all
> threads (this was requested by users at one point; I assume this is
> still the case).
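>
> For concreteness, a minimal sketch of the doubly nested pattern in
> question (illustrative only, not taken from one of our benchmarks):
>
>   void outer_product(int n, int m, const float *a, const float *b,
>                      float *c) {
>     #pragma omp target teams distribute parallel for \
>         map(to: a[0:n], b[0:m]) map(from: c[0:n*m])
>     for (int i = 0; i < n; ++i) {    /* outer parallel loop */
>       #pragma omp parallel for       /* second-level parallel loop */
>       for (int j = 0; j < m; ++j)
>         c[i * m + j] = a[i] * b[j];
>     }
>   }
>
> Under scheme (1) both loops execute in parallel; under scheme (2) one
> thread would run the outer loop body while all threads execute the
> inner loop.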
>
> Since ykt-clang only supports the first scheme, that is the one we
> used when running performance tests comparing nested parallelism
> against no nested parallelism. We saw anywhere from a 4x slowdown to
> a 32x speedup depending on the ratio of outer to inner iterations,
> the work size in the innermost loop, reductions, atomics, and memory
> coalescing. About 80% of the cases we tried showed speed-ups, some of
> them significant.
> I would very much be in favour of having at least this scheme
> supported, since it looks like it could be useful.
>
> In terms of timing, we are still tied up with upstreaming at the
> moment, so we won't be attempting a new code generation scheme until
> we are feature-complete on the current ones.


Hi, Doru,

Thanks for explaining. I think that your suggestion of putting this
behind a flag makes a lot of sense. It sounds as though, later, we might
want different user-selectable schemes (although we might want to use
pragmas instead of command-line flags at that point?).

 -Hal
>
> Thanks,
>
> --Doru
>
>
>
>
> From:        Hal Finkel <hfinkel at anl.gov>
> To:        Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea at ibm.com>, Jonas
> Hahnfeld <hahnjo at hahnjo.de>
> Cc:        Alexey Bataev <alexey.bataev at ibm.com>,
> <openmp-dev at lists.llvm.org>
> Date:        09/07/2018 12:35 PM
> Subject:        Re: [Openmp-dev] nested parallelism in libomptarget-nvptx
> ------------------------------------------------------------------------
>
>
>
> Hi, Doru,
>
> What do you think we should do, upstream, for nested parallelism?
> Would it be desirable to have a clang-ykt-like scheme? Something else?
>
> Thanks again,
>
> Hal
>
>
> On 09/07/2018 10:59 AM, Gheorghe-Teod Bercea via Openmp-dev wrote:
> Hi Jonas,
>
> The second level of parallelism in clang-ykt uses a scheme where all
> the threads in each warp cooperate to execute the workload of the 1st
> thread in the warp, then the 2nd, and so on, until the workload of
> each of the 32 threads in the warp has been completed. The workload
> of each thread is always executed by the full warp.
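>
> Roughly, in pseudo-code (the names below are invented for
> illustration; this is not the actual runtime source):
>
>   /* each warp serializes over its 32 lanes: for lane k, all 32
>      lanes cooperatively execute lane k's nested region */
>   void warp_execute_nested(struct frame warp_frames[32], int my_lane) {
>     for (int k = 0; k < 32; ++k) {
>       struct frame *f = &warp_frames[k];  /* lane k's shared state */
>       run_nested_region(f, my_lane);      /* run by the whole warp */
>     }
>   }
>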
> You are correct that in trunk the additional memory this scheme uses
> is not required. For now we would like to keep this functionality in
> place, so it would be good if you could hide it behind a flag. This
> will allow us to easily drop it in the future.
>
> Thanks a lot,
>
> --Doru
>
>
>
>
>
> From:        Jonas Hahnfeld <hahnjo at hahnjo.de>
> To:        openmp-dev at lists.llvm.org
> Cc:        Alexey Bataev <alexey.bataev at ibm.com>, Doru Bercea
> <gheorghe-teod.bercea at ibm.com>, Kelvin Li <kli at ca.ibm.com>
> Date:        09/07/2018 11:31 AM
> Subject:        nested parallelism in libomptarget-nvptx
>
> ------------------------------------------------------------------------
>
>
>
> Hi all,
>
> I've started some cleanups in libomptarget-nvptx, the OpenMP runtime
> implementation for Nvidia GPUs. The ultimate motivation is reducing
> the memory overhead: at the moment the runtime statically allocates
> ~660MiB of global memory, which applications then cannot use. This
> might not sound like much, but wasting precious memory doesn't seem
> wise.
> I found that 448MiB of this comes from buffers for data sharing. In
> particular, they appear to be so large because the code is prepared
> to handle nested parallelism, where every thread would be in a
> position to share data with its nested worker threads.
> From what I've seen so far, this doesn't seem to be necessary for
> Clang trunk: nested parallel regions are serialized, so only the
> initial thread needs to share data with one set of worker threads.
> That's in line with comments saying that there is no support for
> nested parallelism.
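>
> As a minimal illustration of what that serialization means (a
> sketch, not an actual test case):
>
>   void nested(void) {
>     #pragma omp target parallel
>     {
>       /* inner region: serialized on trunk, so it executes with a
>          team of one thread per outer OpenMP thread */
>       #pragma omp parallel
>       {
>         /* omp_get_num_threads() == 1 here */
>       }
>     }
>   }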
>
> However, I found that my test applications compiled with clang-ykt
> support two levels of parallelism. My guess is that this is related
> to "convergent parallelism": parallel.cu explains that this is meant
> for a "team of threads in a warp only". And indeed, each nested
> parallel region seems to be executed by 32 threads.
> I'm not really sure how this works, because I seem to get one OpenMP
> thread per CUDA thread in the outer parallel region. So where are the
> nested worker threads coming from?
>
> In any case: if my analysis is correct, I'd like to propose adding a
> CMake flag that disables this (seemingly) legacy support [1]. That
> would avoid the memory overhead for users of Clang trunk and enable
> future optimizations (I think).
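>
> A rough sketch of what I have in mind (the option name is just a
> placeholder, not an existing flag):
>
>   # hypothetical CMake option guarding the data-sharing buffers
>   option(LIBOMPTARGET_NVPTX_ENABLE_NESTED_DATA_SHARING
>          "Reserve data-sharing buffers for nested parallelism" ON)
>
> This would translate into a preprocessor define that guards the
> static buffer allocation.
>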
> Thoughts, opinions?
>
> Cheers,
> Jonas
>
>
> 1: Provided that IBM still wants to keep the code and we can't just go
> ahead and drop it. I guess that this can happen at some point in time,
> but I'm not sure if we are in that position right now.
>
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
