<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi, Doru,</p>
<p>What do you think we should do, upstream, for nested parallelism?
Would it be desirable to have a clang-ykt-like scheme? Something
else?</p>
<p>Thanks again,</p>
<p>Hal<br>
</p>
<br>
<div class="moz-cite-prefix">On 09/07/2018 10:59 AM, Gheorghe-Teod
Bercea via Openmp-dev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:OF7699DBA4.02E1D3DB-ON00258301.0056D7AE-85258301.0057E1CB@notes.na.collabserv.com">
<font size="2" face="sans-serif">Hi Jonas,</font><br>
<br>
<font size="2" face="sans-serif">The second level of parallelism
in clang-ykt uses a scheme where all the threads in each warp
cooperate to execute the workload of the 1st thread in the warp,
then the 2nd, and so on, until the workload of each of the 32
threads in the warp has been completed.
The workload of each thread is always executed by the full warp.</font><br>
<font size="2" face="sans-serif">You are correct: in trunk, the
additional
memory that this scheme uses is not required. For now we would
like to
keep this functionality in place, so it would be good if you
could hide
it behind a flag. This will allow us to easily drop it in the
future.</font><br>
<br>
<font size="2" face="sans-serif">Thanks a lot,</font><br>
<br>
<font size="2" face="sans-serif">--Doru</font><br>
<font size="2" face="sans-serif"><br>
</font><br>
<br>
<br>
<br>
<font color="#5f5f5f" size="1" face="sans-serif">From:
</font><font size="1" face="sans-serif">Jonas Hahnfeld
<a class="moz-txt-link-rfc2396E" href="mailto:hahnjo@hahnjo.de"><hahnjo@hahnjo.de></a></font><br>
<font color="#5f5f5f" size="1" face="sans-serif">To:
</font><font size="1" face="sans-serif"><a class="moz-txt-link-abbreviated" href="mailto:openmp-dev@lists.llvm.org">openmp-dev@lists.llvm.org</a></font><br>
<font color="#5f5f5f" size="1" face="sans-serif">Cc:
</font><font size="1" face="sans-serif">Alexey Bataev
<a class="moz-txt-link-rfc2396E" href="mailto:alexey.bataev@ibm.com"><alexey.bataev@ibm.com></a>,
Doru Bercea <a class="moz-txt-link-rfc2396E" href="mailto:gheorghe-teod.bercea@ibm.com"><gheorghe-teod.bercea@ibm.com></a>, Kelvin Li
<a class="moz-txt-link-rfc2396E" href="mailto:kli@ca.ibm.com"><kli@ca.ibm.com></a></font><br>
<font color="#5f5f5f" size="1" face="sans-serif">Date:
</font><font size="1" face="sans-serif">09/07/2018 11:31 AM</font><br>
<font color="#5f5f5f" size="1" face="sans-serif">Subject:
</font><font size="1" face="sans-serif">nested parallelism
in libomptarget-nvptx</font><br>
<hr noshade="noshade"><br>
<br>
<br>
<tt><font size="2">Hi all,<br>
<br>
I've started some cleanups in libomptarget-nvptx, the OpenMP
runtime <br>
implementation on Nvidia GPUs. The ultimate motivation is
reducing the
<br>
memory overhead: at the moment the runtime statically
allocates ~660MiB
<br>
of global memory, none of which can be used by applications.
This might
<br>
not sound like much, but wasting precious memory doesn't sound
wise.<br>
I found that a portion of 448MiB comes from buffers for data
sharing. In
<br>
particular, they appear to be so large because the code is
prepared to <br>
handle nested parallelism where every thread would be in a
position to
<br>
share data with its nested worker threads.<br>
From what I've seen so far, this doesn't seem to be necessary
for Clang
<br>
trunk: nested parallel regions are serialized, so only the
initial <br>
thread needs to share data with one set of worker threads.
That's in <br>
line with comments saying that there is no support for nested
<br>
parallelism.<br>
<br>
However, I found that my test applications compiled with
clang-ykt <br>
support two levels of parallelism. My guess would be that this
is <br>
related to "convergent parallelism": parallel.cu explains that
this is <br>
meant for a "team of threads in a warp only". And indeed, each
nested <br>
parallel region seems to be executed by 32 threads.<br>
I'm not really sure how this works, because I seem to get one
OpenMP <br>
thread per CUDA thread in the outer parallel region. So where
are the <br>
nested worker threads coming from?<br>
<br>
In any case: if my analysis is correct, I'd like to propose
adding a <br>
CMake flag that disables this (seemingly) legacy support [1].
That <br>
would avoid the memory overhead for users of Clang trunk and
enable <br>
future optimizations (I think).<br>
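Such a flag could look along these lines. This is only a sketch: the option name and the macro it defines are hypothetical, not existing libomptarget options:

```cmake
# Hypothetical sketch -- the option and macro names are illustrative only.
option(LIBOMPTARGET_NVPTX_NESTED_DATA_SHARING
  "Keep the data-sharing buffers needed for clang-ykt-style nested parallelism"
  OFF)

if (LIBOMPTARGET_NVPTX_NESTED_DATA_SHARING)
  add_definitions(-DOMPTARGET_NVPTX_NESTED_DATA_SHARING)
endif()
```

Defaulting it to OFF would give trunk users the memory savings while letting clang-ykt users opt back in.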
Thoughts, opinions?<br>
<br>
Cheers,<br>
Jonas<br>
<br>
<br>
1: Provided that IBM still wants to keep the code and we can't
just go
<br>
ahead and drop it. I guess that this can happen at some point
in time,
<br>
but I'm not sure if we are in that position right now.<br>
<br>
</font></tt><br>
<br>
<br>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Openmp-dev mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Openmp-dev@lists.llvm.org">Openmp-dev@lists.llvm.org</a>
<a class="moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a>
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>