<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p><br>

    </p>

    <br>

    <div class="moz-cite-prefix">On 09/07/2018 03:03 PM, Gheorghe-Teod

      Bercea wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:OF6EB1DB83.D5814D73-ON00258301.005D2280-85258301.006E391D@notes.na.collabserv.com">

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

      <font size="2" face="sans-serif">Hi Hal,</font><br>

      <br>

      <font size="2" face="sans-serif">At least as far as we are aware,

        the

        number of use cases where the nested parallel scheme would be

        used is quite

        small. Most of the use cases of OpenMP on GPUs have a single

        level of parallelism

        which is typically SPMD-like to achieve as much performance as

        possible.

        That said, there is some merit to having a nested parallelism
        scheme because,
        when it helps, it typically helps a lot.</font><br>

      <br>

      <font size="2" face="sans-serif">As a departure from ykt-clang, I
        would suggest that whichever scheme (or schemes) we decide to use
        be applied only at the request of the user. With the existing
        schemes (generic and SPMD), we can generate better code for more
        OpenMP patterns if we know at compile time that no second level
        of parallelism will be in use. This follows from implementation
        changes in trunk compared to ykt-clang.</font><br>

      <br>

      <font size="2" face="sans-serif">Regarding which scheme to use,
        two were floated in discussions with users: (1) the current
        scheme in ykt-clang, which lets the code in both inner and outer
        parallel loops be executed in parallel, and (2) a scheme where
        the outer loop code is executed by one thread and the innermost
        loop is executed by all threads (users requested this at one
        point; I assume this is still the case).</font><br>
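      <br>
      <font size="2" face="sans-serif">For concreteness, the kind of
        code both schemes target can be sketched in C++ as below. This is
        only an illustration under stated assumptions: the function name
        row_sums and the row-sum workload are made up for this example
        and do not come from ykt-clang or trunk, and how the two parallel
        levels map onto GPU threads depends entirely on the compiler and
        runtime in use. When OpenMP is not enabled, the pragmas are
        ignored and the loops simply run serially.</font><br>
      <pre><code>#include &lt;cstddef&gt;
#include &lt;vector&gt;

// Hypothetical example workload: sums each row of a row-major matrix.
// The outer loop is the first level of parallelism, the inner loop the second.
std::vector&lt;double&gt; row_sums(const std::vector&lt;double&gt;&amp; a,
                             std::size_t rows, std::size_t cols) {
  std::vector&lt;double&gt; out(rows, 0.0);
  const double* pa = a.data();
  double* po = out.data();

  // Offload the loop nest; the map clauses copy the input in and the result out.
  #pragma omp target map(to: pa[0:rows * cols]) map(from: po[0:rows])
  #pragma omp parallel for
  for (std::size_t i = 0; i &lt; rows; ++i) {
    double sum = 0.0;
    // Nested parallel region: this is the second level under discussion.
    #pragma omp parallel for reduction(+ : sum)
    for (std::size_t j = 0; j &lt; cols; ++j) {
      sum += pa[i * cols + j];
    }
    po[i] = sum;
  }
  return out;
}</code></pre>
      <font size="2" face="sans-serif">Calling row_sums on the 2x3
        matrix {1, 2, 3, 4, 5, 6} yields the row sums 6 and 15 however
        the loops are scheduled; the schemes differ only in which
        threads execute the outer and inner iterations.</font><br>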

      <br>

      <font size="2" face="sans-serif">Since ykt-clang only supports the
        first scheme, our performance tests compared nested parallelism
        against no nested parallelism. Results ranged from a 4x slowdown
        to a 32x speedup, depending on the ratio of outer to inner
        iterations, the work size in the innermost loop, reductions,
        atomics and memory coalescing. About 80% of the cases we tried
        showed speed-ups, some of them significant.</font><br>

      <font size="2" face="sans-serif">I would very much be in favour of

        having

        at least this scheme supported since it looks like it could be

        useful.</font><br>

      <br>

      <font size="2" face="sans-serif">In terms of timing, we are still

        tied

        up with upstreaming at the moment so we won't be attempting a

        new code

        generation scheme until we are feature complete on the current

        ones.</font><br>

    </blockquote>

    <br>

    <br>

    <font size="2">Hi, Doru,<br>

      <br>

      Thanks for explaining. I think that your suggestion of putting

      this behind a flag makes a lot of sense. It sounds as though,

      later, we might want different user-selectable schemes (although

      we might want to use pragmas instead of command-line flags at that

      point?).<br>

      <br>

       -Hal<br>

    </font>

    <blockquote type="cite"

cite="mid:OF6EB1DB83.D5814D73-ON00258301.005D2280-85258301.006E391D@notes.na.collabserv.com"><br>

      <font size="2" face="sans-serif">Thanks,</font><br>

      <br>

      <font size="2" face="sans-serif">--Doru</font><br>

      <br>

      <br>

      <br>

      <br>

      <font color="#5f5f5f" size="1" face="sans-serif">From:      

         </font><font size="1" face="sans-serif">Hal Finkel

        <a class="moz-txt-link-rfc2396E" href="mailto:hfinkel@anl.gov">&lt;hfinkel@anl.gov&gt;</a></font><br>

      <font color="#5f5f5f" size="1" face="sans-serif">To:      

         </font><font size="1" face="sans-serif">Gheorghe-Teod Bercea

        <a class="moz-txt-link-rfc2396E" href="mailto:Gheorghe-Teod.Bercea@ibm.com">&lt;Gheorghe-Teod.Bercea@ibm.com&gt;</a>, Jonas Hahnfeld
        <a class="moz-txt-link-rfc2396E" href="mailto:hahnjo@hahnjo.de">&lt;hahnjo@hahnjo.de&gt;</a></font><br>

      <font color="#5f5f5f" size="1" face="sans-serif">Cc:      

         </font><font size="1" face="sans-serif">Alexey Bataev

        <a class="moz-txt-link-rfc2396E" href="mailto:alexey.bataev@ibm.com">&lt;alexey.bataev@ibm.com&gt;</a>,
        <a class="moz-txt-link-rfc2396E" href="mailto:openmp-dev@lists.llvm.org">&lt;openmp-dev@lists.llvm.org&gt;</a></font><br>

      <font color="#5f5f5f" size="1" face="sans-serif">Date:      

         </font><font size="1" face="sans-serif">09/07/2018 12:35 PM</font><br>

      <font color="#5f5f5f" size="1" face="sans-serif">Subject:    

           </font><font size="1" face="sans-serif">Re: [Openmp-dev]

        nested parallelism in libomptarget-nvptx</font><br>

      <hr noshade="noshade"><br>

      <br>

      <br>

      <font size="3">Hi, Doru,</font>

      <p><font size="3">What do you think we should do, upstream, for

          nested parallelism?

          Would it be desirable to have a clang-ykt-like scheme?

          Something else?</font></p>

      <p><font size="3">Thanks again,</font></p>

      <p><font size="3">Hal</font></p>

      <p><br>

        <font size="3">On 09/07/2018 10:59 AM, Gheorghe-Teod Bercea via

          Openmp-dev

          wrote:</font><br>

        <font size="2" face="sans-serif">Hi Jonas,</font><font size="3"><br>

        </font><font size="2" face="sans-serif"><br>

          The second level of parallelism in clang-ykt uses a scheme
          where all the
          threads in each warp cooperate to execute the workload of the
          1st
          thread in the warp, then the 2nd, and so on until the workload
          of each of
          the 32 threads in the warp has been completed. The workload of
          each thread
          is always executed by the full warp.<br>
          You are correct: in trunk, the additional memory that this
          scheme uses is
          not required. For now, we would like to keep this functionality

          in place

          so it would be good if you could hide it behind a flag. This

          will allow

          us to easily drop it in the future.</font><font size="3"><br>
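        </font></p>
      <p><font size="2" face="sans-serif">The warp-level scheme
          described above can be modelled on the host with a short C++
          sketch. This is not the clang-ykt implementation: the names
          simulate_warp and kWarpSize are hypothetical, every owner is
          given the same inner trip count for simplicity, and the cyclic
          distribution of inner iterations across lanes is an assumption
          made for illustration. On a GPU the lanes would run
          concurrently; here the lane loop stands in for them.</font></p>
      <pre><code>#include &lt;cstddef&gt;
#include &lt;vector&gt;

constexpr std::size_t kWarpSize = 32;  // threads (lanes) per warp

// Returns ran_by[owner][iter]: the lane that executed inner iteration `iter`
// of the workload that belongs to lane `owner`.
std::vector&lt;std::vector&lt;std::size_t&gt;&gt; simulate_warp(std::size_t inner_trip_count) {
  std::vector&lt;std::vector&lt;std::size_t&gt;&gt; ran_by(
      kWarpSize, std::vector&lt;std::size_t&gt;(inner_trip_count, 0));

  // Workloads are handled one owner at a time: 1st thread, then 2nd, and so on.
  for (std::size_t owner = 0; owner &lt; kWarpSize; ++owner) {
    // The full warp cooperates on the current owner's inner loop.
    for (std::size_t lane = 0; lane &lt; kWarpSize; ++lane) {
      // Assumed cyclic split: lane l takes iterations l, l + 32, l + 64, ...
      for (std::size_t it = lane; it &lt; inner_trip_count; it += kWarpSize) {
        ran_by[owner][it] = lane;
      }
    }
  }
  return ran_by;
}</code></pre>
      <p><font size="2" face="sans-serif">In this model, with an inner
          trip count of 40, iteration 32 of every owner's workload wraps
          around to lane 0, so no workload is ever confined to the single
          thread that owns it.</font></p>
      <p><font size="3">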

        </font><font size="2" face="sans-serif"><br>

          Thanks a lot,</font><font size="3"><br>

        </font><font size="2" face="sans-serif"><br>

          --Doru<br>

        </font><font size="3"><br>

          <br>

          <br>

          <br>

        </font><font color="#5f5f5f" size="1" face="sans-serif"><br>

          From:        </font><font size="1" face="sans-serif">Jonas

          Hahnfeld </font><a href="mailto:hahnjo@hahnjo.de"

          moz-do-not-send="true"><font color="blue" size="1"

            face="sans-serif"><u>&lt;hahnjo@hahnjo.de&gt;</u></font></a><font

          color="#5f5f5f" size="1" face="sans-serif"><br>

          To:        </font><a href="mailto:openmp-dev@lists.llvm.org"

          moz-do-not-send="true"><font color="blue" size="1"

            face="sans-serif"><u>openmp-dev@lists.llvm.org</u></font></a><font

          color="#5f5f5f" size="1" face="sans-serif"><br>

          Cc:        </font><font size="1" face="sans-serif">Alexey

          Bataev </font><a href="mailto:alexey.bataev@ibm.com"

          moz-do-not-send="true"><font color="blue" size="1"

            face="sans-serif"><u>&lt;alexey.bataev@ibm.com&gt;</u></font></a><font

          size="1" face="sans-serif">,

          Doru Bercea </font><a

          href="mailto:gheorghe-teod.bercea@ibm.com"

          moz-do-not-send="true"><font color="blue" size="1"

            face="sans-serif"><u>&lt;gheorghe-teod.bercea@ibm.com&gt;</u></font></a><font

          size="1" face="sans-serif">,

          Kelvin Li </font><a href="mailto:kli@ca.ibm.com"

          moz-do-not-send="true"><font color="blue" size="1"

            face="sans-serif"><u>&lt;kli@ca.ibm.com&gt;</u></font></a><font

          color="#5f5f5f" size="1" face="sans-serif"><br>

          Date:        </font><font size="1" face="sans-serif">09/07/2018

          11:31 AM</font><font color="#5f5f5f" size="1"

          face="sans-serif"><br>

          Subject:        </font><font size="1" face="sans-serif">nested

          parallelism in libomptarget-nvptx</font><font size="3"><br>

        </font></p>

      <hr noshade="noshade"><font size="3"><br>

        <br>

      </font><tt><font size="2"><br>

          Hi all,<br>

          <br>

          I've started some cleanups in libomptarget-nvptx, the OpenMP

          runtime <br>

          implementation on Nvidia GPUs. The ultimate motivation is

          reducing the

          <br>

          memory overhead: At the moment the runtime statically

          allocates ~660MiB

          <br>

          of global memory. This amount can't be used by applications.

          This might

          <br>

          not sound like much, but wasting precious memory doesn't seem
          wise.<br>
          I found that 448MiB of it comes from buffers for data

          sharing. In

          <br>

          particular they appear to be so large because the code is

          prepared to <br>

          handle nested parallelism where every thread would be in the

          position to

          <br>

          share data with its nested worker threads.<br>

          From what I've seen so far this doesn't seem to be necessary

          for Clang

          <br>

          trunk: Nested parallel regions are serialized, so only the

          initial <br>

          thread needs to share data with one set of worker threads.

          That's in <br>

          line with comments saying that there is no support for nested

          <br>

          parallelism.<br>
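          <br>
          A rough sizing model shows why serializing nested regions
          shrinks the buffers. Everything in it is hypothetical: the <br>
          struct, the function names and the numbers are illustrative
          only and are not taken from libomptarget-nvptx, and the <br>
          assumption of one data-sharing slot per sharing thread is a
          simplification.<br>
        </font></tt>
      <pre><code>#include &lt;cstddef&gt;

// Hypothetical launch description; not a libomptarget-nvptx type.
struct LaunchShape {
  std::size_t teams;
  std::size_t threads_per_team;
  std::size_t bytes_per_slot;  // assumed size of one data-sharing slot
};

// Nested parallelism: every thread may share data with nested workers,
// so every thread needs its own slot.
constexpr std::size_t nested_bytes(const LaunchShape&amp; s) {
  return s.teams * s.threads_per_team * s.bytes_per_slot;
}

// Serialized nested regions: only the initial thread of each team shares
// data with one set of workers, so one slot per team is enough.
constexpr std::size_t serialized_bytes(const LaunchShape&amp; s) {
  return s.teams * s.bytes_per_slot;
}</code></pre>
      <tt><font size="2">
          Under these assumptions the saving is a factor of
          threads_per_team: with 4 teams, 128 threads per team and <br>
          256-byte slots, the model gives 131072 bytes with nesting and
          1024 bytes without.<br>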

          <br>

          However I found that my test applications compiled with

          clang-ykt <br>

          support two levels of parallelism. My guess would be that this

          is <br>

          related to "convergent parallelism": parallel.cu explains that

          this is <br>

          meant for a "team of threads in a warp only". And indeed, each

          nested <br>

          parallel region seems to be executed by 32 threads.<br>

          I'm not really sure how this works because I seem to get one

          OpenMP <br>

          thread per CUDA thread in the outer parallel region. So where

          are the <br>

          nested worker threads coming from?<br>

          <br>

          In any case: If my analysis is correct, I'd like to propose

          adding a <br>

          CMake flag which disables this (seemingly) legacy support [1].

          That <br>

          would avoid the memory overhead for users of Clang trunk and

          enable <br>

          future optimizations (I think).<br>

          Thoughts, opinions?<br>

          <br>

          Cheers,<br>

          Jonas<br>

          <br>

          <br>

          1: Provided that IBM still wants to keep the code and we can't

          just go

          <br>

          ahead and drop it. I guess that this can happen at some point

          in time,

          <br>

          but I'm not sure if we are in that position right now.<br>

        </font></tt><font size="3"><br>

        <br>

        <br>

        <br>

        <br>

      </font><br>


      <br>

      <tt><font size="3">-- <br>

          Hal Finkel<br>

          Lead, Compiler Technology and Programming Languages<br>

          Leadership Computing Facility<br>

          Argonne National Laboratory</font></tt><br>

      <br>

      <br>

    </blockquote>

    <br>

    <pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

  </body>

</html>