<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>I'm not sure about the API; most probably only some internal work is

      required. It would be better to ask Alex Eichenberger, who knows more about

      this.<br>

    </p>

    <pre class="moz-signature" cols="72">-------------

Best regards,

Alexey Bataev</pre>

    <div class="moz-cite-prefix">31.10.2019 4:36 PM, Finkel, Hal J.

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:6a4feb16-65dc-a337-8eda-aeca5f46d6f0@anl.gov">


      <p><br>

      </p>

      <div class="moz-cite-prefix">On 10/31/19 3:06 PM, Alexey Bataev

        wrote:<br>

      </div>

      <blockquote type="cite"

cite="mid:VI1PR09MB39821ED7EB17C852556C67E296630@VI1PR09MB3982.eurprd09.prod.outlook.com">

        <p>Hope to send this message from the main dev e-mail this time

          :)</p>

        <p><br>

        </p>

        <p>Well, about the memory: it depends on the number of kernels

          you have. All the memory in the kernels that must be

          globalized is squashed into a union. With streams, we need to

          use a separate structure for each kernel. Plus,

          we can no longer use shared memory for this buffer, again

          because of possible conflicts. <br>

        </p>
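        <p>To make the memory trade-off above concrete, here is a minimal C++ sketch. It is only an illustration of the idea, not what Clang actually emits: the record types, their sizes, and all names (<samp>KernelAGlobals</samp>, <samp>ReusedBuffer</samp>, and so on) are hypothetical. It assumes two kernels whose globalized variables are collected into one record each.</p>

```cpp
// Hypothetical per-kernel records of variables that must be globalized.
struct KernelAGlobals { double partial_sums[64]; };
struct KernelBGlobals { float scratch[32]; };

// Serialized execution: only one kernel is live at a time, so a single
// buffer sized for the largest record can be reused by every kernel.
union ReusedBuffer {
  KernelAGlobals a;
  KernelBGlobals b;
};

// Concurrent execution (streams): kernels may overlap in time, so each
// kernel needs storage of its own and the footprints add up.
struct PerKernelBuffers {
  KernelAGlobals a;
  KernelBGlobals b;
};

// Footprint of the reused (union) layout in bytes.
constexpr unsigned long reused_bytes() { return sizeof(ReusedBuffer); }

// Footprint of the one-structure-per-kernel layout in bytes.
constexpr unsigned long per_kernel_bytes() { return sizeof(PerKernelBuffers); }
```

        <p>With an 8-byte double and a 4-byte float, the union needs 512 bytes while the per-kernel layout needs 640, and the gap grows with every additional kernel.</p>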

        <p><br>

        </p>

        <p>We can add a new compiler option so that only some files are compiled

          with streams support and use a unique memory buffer for the

          globalized variables. Plus, some work in libomptarget is

          required, of course.<br>

        </p>

      </blockquote>

      <p><br>

      </p>

      <p>Do we also need some kind of libomptarget API change in order

        to communicate the fact that it's allowed to run multiple target

        regions concurrently?</p>

      <p><br>

      </p>

      <p>Thanks again,</p>

      <p>Hal<br>

      </p>

      <p><br>

      </p>

      <blockquote type="cite"

cite="mid:VI1PR09MB39821ED7EB17C852556C67E296630@VI1PR09MB3982.eurprd09.prod.outlook.com">

        <p><br>

        </p>

        <pre class="moz-signature" cols="72">-------------

Best regards,

Alexey Bataev</pre>

        <div class="moz-cite-prefix">31.10.2019 3:58 PM, Finkel, Hal J.

          wrote:<br>

        </div>

        <blockquote type="cite"

          cite="mid:dbb670c9-376e-aae6-17af-c363afa52960@anl.gov">

          <p><br>

          </p>

          <div class="moz-cite-prefix">On 10/31/19 10:54 AM, Luo, Ye

            wrote:<br>

          </div>

          <blockquote type="cite"

cite="mid:DM6PR09MB3548BF5277EBEC67300B8F36A3630@DM6PR09MB3548.namprd09.prod.outlook.com">

            <style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>

            <div style="font-family: Calibri, Arial, Helvetica,

              sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

              Hi Hal,</div>

            <div style="font-family: Calibri, Arial, Helvetica,

              sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

              My experience of llvm/clang so far shows:</div>

            <div style="font-family: Calibri, Arial, Helvetica,

              sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

              1. All target offload is blocking and synchronous, using

              the default stream. nowait is not supported.</div>

            <div style="font-family: Calibri, Arial, Helvetica,

              sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

              2. All memory transfer calls invoke cudaMemcpy. There

              are no async calls.</div>

            <div style="font-family: Calibri, Arial, Helvetica,

              sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

              3. In the past I experimented with turning on <samp>CUDA_API_PER_THREAD_DEFAULT_STREAM</samp>

              in libomptarget.</div>

            <div style="font-family: Calibri, Arial, Helvetica,

              sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

              Then I used multiple host threads, each doing its own blocking

              synchronous offload. I got it sort of running and saw

              multiple streams, but the code crashed with memory

              corruption, probably caused by a data race in libomptarget.</div>
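            <p>The experiment in point 3 can be pictured with the following minimal sketch, assuming a libomptarget built with <samp>CUDA_API_PER_THREAD_DEFAULT_STREAM</samp> so that each host thread gets its own default stream. The function names are hypothetical, and this is not the code that was actually run.</p>

```cpp
// Blocking offload: the call returns only after the data transfers and
// the kernel for this target region have completed.
void scale_on_device(double *x, int n, double factor) {
  #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
  for (int i = 0; i < n; ++i) x[i] *= factor;
}

// Two host threads, each issuing its own blocking offload. With a
// per-thread default stream the two regions may end up on different
// CUDA streams; with the legacy default stream they share one.
void offload_from_two_host_threads(double *a, double *b, int n) {
  #pragma omp parallel sections num_threads(2)
  {
    #pragma omp section
    scale_on_device(a, n, 2.0);

    #pragma omp section
    scale_on_device(b, n, 3.0);
  }
}
```

            <p>The two arrays are independent, so the user code itself has no data race. Built without OpenMP, the pragmas are ignored and both calls simply run one after the other on the host.</p>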

          </blockquote>

          <p><br>

          </p>

          <p>Thanks, Ye. That's consistent with Alexey's comments.</p>

          <p><br>

          </p>

          <p>Is there already a bug open on this? If not, we should open

            one.</p>

          <p><br>

          </p>

          <p>Alexey, the buffer-reuse optimizations in Clang that you

            mentioned, how much memory/overhead do they save? Is it

            worth keeping them in some mode?</p>

          <p><br>

          </p>

          <p> -Hal<br>

          </p>

          <p><br>

          </p>

          <blockquote type="cite"

cite="mid:DM6PR09MB3548BF5277EBEC67300B8F36A3630@DM6PR09MB3548.namprd09.prod.outlook.com">

            <div style="font-family: Calibri, Arial, Helvetica,

              sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

              Best,</div>

            <div style="font-family: Calibri, Arial, Helvetica,

              sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

              Ye<br>

            </div>

            <div style="font-family: Calibri, Arial, Helvetica,

              sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

              <br>

            </div>

            <hr style="display:inline-block;width:98%" tabindex="-1">

            <div id="divRplyFwdMsg" dir="ltr"><font

                style="font-size:11pt" face="Calibri, sans-serif"

                color="#000000"><b>From:</b> Finkel, Hal J.

                <a class="moz-txt-link-rfc2396E"

                  href="mailto:hfinkel@anl.gov" moz-do-not-send="true">

                  <hfinkel@anl.gov></a><br>

                <b>Sent:</b> Wednesday, October 30, 2019 1:40 PM<br>

                <b>To:</b> Alessandro Gabbana <a

                  class="moz-txt-link-rfc2396E"

                  href="mailto:gbblsn@unife.it" moz-do-not-send="true">

                  <gbblsn@unife.it></a>; <a

                  class="moz-txt-link-abbreviated"

                  href="mailto:cfe-dev@lists.llvm.org"

                  moz-do-not-send="true">

                  cfe-dev@lists.llvm.org</a> <a

                  class="moz-txt-link-rfc2396E"

                  href="mailto:cfe-dev@lists.llvm.org"

                  moz-do-not-send="true">

                  <cfe-dev@lists.llvm.org></a>; Luo, Ye <a

                  class="moz-txt-link-rfc2396E"

                  href="mailto:yeluo@anl.gov" moz-do-not-send="true">

                  <yeluo@anl.gov></a>; Doerfert, Johannes <a

                  class="moz-txt-link-rfc2396E"

                  href="mailto:jdoerfert@anl.gov" moz-do-not-send="true">

                  <jdoerfert@anl.gov></a><br>

                <b>Subject:</b> Re: [cfe-dev] openmp 4.5 and cuda

                streams</font>

              <div> </div>

            </div>

            <div class="BodyFragment"><font size="2"><span

                  style="font-size:11pt;">

                  <div class="PlainText">[+Ye, Johannes]<br>

                    <br>

                    I recall that we've also observed this behavior. Ye,

                    Johannes, we had a <br>

                    work-around and a patch, correct?<br>

                    <br>

                      -Hal<br>

                    <br>

                    On 10/30/19 12:28 PM, Alessandro Gabbana via cfe-dev

                    wrote:<br>

                    > Dear All,<br>

                    ><br>

                    > I'm using clang 9.0.0 to compile code which

                    offloads sections of the <br>

                    > code to a GPU using the OpenMP target

                    construct.<br>

                    > I also use the nowait clause to overlap the

                    execution of certain <br>

                    > kernels and/or host<->device memory

                    transfers.<br>

                    > However, using the NVIDIA profiler I've noticed

                    that when I compile <br>

                    > the code with clang only one CUDA stream is

                    active,<br>

                    > and therefore the execution gets serialized. On

                    the other hand, when <br>

                    > compiling with XLC I see that kernels are

                    executed<br>

                    > on different streams. I could not understand if

                    this is the expected <br>

                    > behavior (e.g. the nowait clause is currently

                    not supported),<br>

                    > or if I'm missing something. I'm using a NVIDIA

                    Tesla P100 GPU and <br>

                    > compiling with the following options:<br>

                    ><br>

                    > -target x86_64-pc-linux-gnu -fopenmp <br>

                    > -fopenmp-targets=nvptx64-nvidia-cuda <br>

                    > -Xopenmp-target=nvptx64-nvidia-cuda

                    -march=sm_60<br>

                    ><br>

                    > best wishes<br>

                    ><br>

                    > Alessandro<br>

                    ><br>


                    <br>

                    -- <br>

                    Hal Finkel<br>

                    Lead, Compiler Technology and Programming Languages<br>

                    Leadership Computing Facility<br>

                    Argonne National Laboratory<br>

                    <br>

                  </div>

                </span></font></div>
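            <p>For reference, the usage pattern described in the original question, independent target regions marked nowait so that they are allowed to overlap, might look like the minimal sketch below. The function name and the arithmetic are hypothetical. Whether the two regions actually run concurrently depends on the compiler and runtime.</p>

```cpp
// Two independent device computations issued as deferred target tasks.
// nowait lets the host thread continue without waiting for each region;
// taskwait then blocks until both regions have finished.
void update_both(double *a, double *b, int n) {
  #pragma omp target teams distribute parallel for nowait map(tofrom: a[0:n])
  for (int i = 0; i < n; ++i) a[i] *= 2.0;

  #pragma omp target teams distribute parallel for nowait map(tofrom: b[0:n])
  for (int i = 0; i < n; ++i) b[i] += 1.0;

  #pragma omp taskwait
}
```

            <p>Because the two regions touch different arrays, no depend clause is needed. A runtime that maps each deferred target task to its own stream could overlap them, while a runtime that uses a single stream will serialize them.</p>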

          </blockquote>


        </blockquote>

      </blockquote>


    </blockquote>

  </body>

</html>