<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Dec 8, 2017 at 1:13 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF"><span class="gmail-">
    <p><br>
    </p>
    <div class="gmail-m_-2288114881520531127moz-cite-prefix">On 12/07/2017 11:35 AM, Jeff Hammond
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div><br>
        <div class="gmail_quote">
          <div dir="auto">On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>>
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div bgcolor="#FFFFFF">
              <p><br>
              </p>
              <div class="gmail-m_-2288114881520531127m_8513274869410520852moz-cite-prefix">On
                12/06/2017 10:23 PM, Jeff Hammond wrote:<br>
              </div>
              <blockquote type="cite">
                <div><br>
                  <div class="gmail_quote">
                    <div dir="auto">On Wed, Dec 6, 2017 at 4:23 PM Hal
                      Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>>
                      wrote:<br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div bgcolor="#FFFFFF">
                        <p><br>
                        </p>
                        <div class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-cite-prefix">On
                          12/04/2017 10:48 PM, Serge Preis via cfe-dev
                          wrote:<br>
                        </div>
                        <blockquote type="cite">
                          <div>I agree that the guarantees provided by ICC
                            may be stronger than with other compilers,
                            so yes, under OpenMP terms vectorization is
                            permitted but cannot be assumed. However,
                            OpenMP clearly defines the semantics of
                            variables used within an OpenMP region: some
                            are shared (scalar), some private (vector),
                            and some are inductions. This goes far
                            beyond typical compiler-specific pragmas
                            about dependencies and cost modelling, and
                            makes vectorization a much simpler task with
                            more predictable and robust results if
                            properly implemented (admittedly, even the ICC
                            implementation is far from perfect). I hope
                            Intel's efforts to standardize something like
                            this in core C++ will eventually come to
                            fruition. Until then I, as a regular
                            application developer, would appreciate an
                            OpenMP-simd based execution policy (hoping
                            for good support for OpenMP SIMD in clang),
                            but it shouldn't necessarily be part of
                            libc++. Since the 'unordered' execution policy
                            is currently not part of the C++ standard </div>
                        </blockquote>
                        <br>
                      </div>
                      <div bgcolor="#FFFFFF">
                        std::execution::par_unseq is part of C++17, and
                        that essentially maps to '#pragma omp parallel
                        for simd'.</div>
                      <div bgcolor="#FFFFFF"><br>
                      </div>
                    </blockquote>
                    <div dir="auto"><br>
                    </div>
                    <div dir="auto">Do you expect par/par_unseq to nest?</div>
                  </div>
                </div>
              </blockquote>
              <br>
            </div>
            <div bgcolor="#FFFFFF"> Yes.</div>
            <div bgcolor="#FFFFFF"><br>
              <br>
              <blockquote type="cite">
                <div>
                  <div class="gmail_quote">
                    <div dir="auto"> Nesting omp-parallel is generally
                      regarded as a Bad Idea.</div>
                  </div>
                </div>
              </blockquote>
              <br>
            </div>
            <div bgcolor="#FFFFFF"> Agreed. I suspect
              we'll want the mapping to be more like '#pragma omp
              taskloop simd'.</div>
            <div bgcolor="#FFFFFF"><br>
            </div>
          </blockquote>
          <div dir="auto"><br>
          </div>
          <div dir="auto">That won’t run in parallel unless in an
            omp-parallel-master region. </div>
        </div>
      </div>
    </blockquote>
    <br></span>
    Yes.<span class="gmail-"><br>
    <br>
    <blockquote type="cite">
      <div>
        <div class="gmail_quote">
          <div dir="auto">That means OpenMP-based PSTL won’t be parallel
            unless the user knows to add back-end specific code about
            the PSTL.</div>
        </div>
      </div>
    </blockquote>
    <br></span>
    That obviously wouldn't be acceptable.<span class="gmail-"><br>
    <br>
    <blockquote type="cite">
      <div>
        <div class="gmail_quote">
          <div dir="auto"><br>
          </div>
          <div dir="auto">What I’m trying to say is that OpenMP is a
            poor target for PSTL in its current form. Nested parallel
            regions are the only thing that works, and they are likely to
            work poorly.</div>
        </div>
      </div>
    </blockquote>
    <br></span>
    I'm not sure that's true, but the technique may not be trivial. I
    believe that it is possible, however. For example, the mapping might
    be to something like:<br>
    <br>
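    if (omp_in_parallel()) {<br>
    #pragma omp taskloop simd<br>
    &nbsp; for (size_t i = 0; i < N; ++i)<br>
    &nbsp;&nbsp;&nbsp; F(X[i]);<br>
    } else {<br>
    #pragma omp parallel<br>
    #pragma omp single<br>
    &nbsp; {<br>
    #pragma omp taskloop simd<br>
    &nbsp;&nbsp;&nbsp;&nbsp; for (size_t i = 0; i < N; ++i)<br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; F(X[i]);<br>
    &nbsp; }<br>
    }<br>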
    <br>
    The fact that we'd need to use this kind of pattern is a bit
    unfortunate, but it can be easily abstracted into a template
    function, so it just becomes some implementation detail of the
    library.<br>
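    <br>
    A minimal sketch of how this pattern might be wrapped in a template function. The name omp_for_each_n is hypothetical (it is not part of libc++ or Intel PSTL), the serial fallback when OpenMP is disabled is an assumption, and the omp single construct is there so that only one thread of a newly created team generates the tasks:<br>
    <br>

```cpp
#include <cassert>
#include <cstddef>
#if defined(_OPENMP)
#include <omp.h>
#endif

// Hypothetical helper, for illustration only: applies f to x[0..n).
// With OpenMP enabled it uses the taskloop pattern; otherwise it falls
// back to a plain serial loop (an assumption of this sketch).
template <class T, class F>
void omp_for_each_n(T* x, std::size_t n, F f) {
  assert(x != nullptr || n == 0);
#if defined(_OPENMP)
  if (omp_in_parallel()) {
    // Already inside a parallel region: hand tasks to the existing team.
#pragma omp taskloop simd
    for (std::size_t i = 0; i < n; ++i)
      f(x[i]);
  } else {
    // No enclosing parallel region: start a team and let exactly one
    // thread generate the tasks, so the loop body runs once per element.
#pragma omp parallel
#pragma omp single
    {
#pragma omp taskloop simd
      for (std::size_t i = 0; i < n; ++i)
        f(x[i]);
    }
  }
#else
  for (std::size_t i = 0; i < n; ++i)
    f(x[i]);
#endif
}
```

    A call such as omp_for_each_n(data, count, [](double&amp; v) { v *= 2.0; }) would then look the same to the caller whether or not it is made from inside a parallel region.<br>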
    <br></div></blockquote><div><br></div><div>You are right and that is probably the best way to do it with OpenMP.  I am concerned about the absolute performance, based upon my observations of omp-taskloop vs omp-for and tbb::parallel_for in the PRK project, but at least it is sane from a semantic perspective.  Having motivating use cases like PSTL should lead to improvements in OpenMP runtime performance w.r.t. taskloop.</div><div><br></div><div><a href="https://i.stack.imgur.com/MVd5j.png">https://i.stack.imgur.com/MVd5j.png</a> is a snapshot of the performance of PRK stencil (<a href="https://github.com/ParRes/Kernels/tree/master/Cxx11">https://github.com/ParRes/Kernels/tree/master/Cxx11</a>), which shows taskloop loses to TBB-based PSTL, OpenMP for, and tbb::parallel_for (pure TBB beats TBB-based PSTL because I use tbb::blocked_range2d, which improves cache utilization).  I think those results tuned taskloop grainsize as well, so they may be an optimistic representation of taskloop in a general usage.<br></div><div><br></div><div>I'll see if I can prototype this in RAJA or Intel PSTL.  It's not hard to get results directly from the PRK tests, if the former attempts fail.</div><div><br></div><div>Best,</div><div><br></div><div>Jeff</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
    Thanks again,<br>
    Hal<div><div class="gmail-h5"><br>
    <br>
    <blockquote type="cite">
      <div>
        <div class="gmail_quote">
          <div dir="auto"><br>
          </div>
          <div dir="auto">Jeff</div>
          <div dir="auto"><br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div bgcolor="#FFFFFF"><br>
               -Hal</div>
            <div bgcolor="#FFFFFF"><br>
              <br>
              <blockquote type="cite">
                <div>
                  <div class="gmail_quote">
                    <div dir="auto"><br>
                    </div>
                    <div dir="auto">Jeff</div>
                    <div dir="auto"><br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div bgcolor="#FFFFFF"><br>
                        <blockquote type="cite">
                          <div>I don't care much about how it will be
                            implemented in libc++, if it is. I would just
                            like to ask the Intel guys and the community here to
                            make the implementation extensible, in the sense
                            that a custom OpenMP-SIMD-based execution
                            policy, along with algorithm implementations
                            (as specializations for the policy), can be
                            used with the libc++ library. Additionally, I
                            would like to ask the Intel guys to
                            provide a complete and compatible extension on
                            GitHub for developers like me to use.</div>
                        </blockquote>
                        <br>
                      </div>
                      <div bgcolor="#FFFFFF"> In the end,
                        I think we want the following:<br>
                        <br>
                         1. A design for libc++ that allows the
                        thread-level parallelism to be implemented in
                        terms of different underlying providers (i.e.,
                        OpenMP, GCD, Work Queues on Windows, whatever
                        else).<br>
                         2. To follow the same philosophy with respect
                        to standards as we do everywhere else: Use
                        standards where possible with
                        compiler/system-specific extensions as
                        necessary.<br>
                        <br>
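                        <div>One way the provider-based design in point 1 might be sketched: the algorithm is written once against a small backend interface, and the backend is selected at compile time. All names here (serial_backend, openmp_backend, default_backend, sketch_for_each_n) are hypothetical and do not come from libc++ or Intel PSTL, and the OpenMP backend deliberately ignores the nesting question for brevity.</div>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical backend interface: each provider exposes a static
// parallel_for(n, f) that calls f(i) for every i in [0, n).
struct serial_backend {
  template <class F>
  static void parallel_for(std::size_t n, F f) {
    for (std::size_t i = 0; i < n; ++i)
      f(i);
  }
};

#if defined(_OPENMP)
// Only available when the compiler has OpenMP enabled. A real backend
// would also have to deal with nested parallelism; this sketch does not.
struct openmp_backend {
  template <class F>
  static void parallel_for(std::size_t n, F f) {
#pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
      f(i);
  }
};
using default_backend = openmp_backend;
#else
using default_backend = serial_backend;
#endif

// The algorithm itself knows nothing about the threading provider.
template <class Backend = default_backend, class T, class F>
void sketch_for_each_n(T* x, std::size_t n, F f) {
  assert(x != nullptr || n == 0);
  Backend::parallel_for(n, [=](std::size_t i) { f(x[i]); });
}
```

                        <div>Other providers (GCD, Windows work queues, TBB) would be added as further backend types without touching sketch_for_each_n.</div>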
                         -Hal</div>
                      <div bgcolor="#FFFFFF"><br>
                        <br>
                        <blockquote type="cite">
                          <div> </div>
                          <div>Regards,</div>
                          <div>Serge.</div>
                          <div> </div>
                          <div> </div>
                          <div> </div>
                          <div>04.12.2017, 12:07, "Jeff Hammond" <a class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-txt-link-rfc2396E" href="mailto:jeff.science@gmail.com" target="_blank"><jeff.science@gmail.com></a>:</div>
                          <blockquote type="cite">
                            <div>
                              <div>ICC implements a very aggressive
                                interpretation of the OpenMP standard,
                                and this interpretation is not shared by
                                everyone in the OpenMP community.  ICC
                                is correct but other implementations may
                                be far less aggressive, so _Pragma("omp
                                simd") doesn't guarantee vectorization
                                unless the compiler documentation says
                                that is how it is implemented.  All the
                                standard says that it means is that
                                vectorization is _permitted_.</div>
                              <div> </div>
                              <div>Given that the practical meaning of
                                _Pragma("omp simd") isn't guaranteed to
                                be consistent across different
                                implementations, I don't really know how
                                to compare it to compiler-specific
                                pragmas unless we define everything
                                explicitly.</div>
                              <div> </div>
                              <div>In any case, my fundamental point
                                remains: do not use OpenMP pragmas here,
                                but instead use whatever the appropriate
                                compiler-specific pragma is, or create a
                                new one that meets the need.</div>
                              <div> </div>
                              <div>Best,</div>
                              <div> </div>
                              <div>Jeff</div>
                              <div> 
                                <div>On Sun, Dec 3, 2017 at 8:09 PM,
                                  Serge Preis <span><<a href="mailto:spreis@yandex-team.ru" target="_blank">spreis@yandex-team.ru</a>></span>
                                  wrote:
                                  <blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                    <div>Hello,</div>
                                    <div> </div>
                                    <div>_Pragma("omp simd") is
                                      semantically quite different from
                                      _Pragma("clang loop
                                      vectorize(assume_safety)"),
                                      _Pragma("GCC ivdep") and
                                      _Pragma("vector always"), so I am
                                      not sure all of the latter will work as
                                      expected in all cases. They
                                      definitely won't provide any
                                      vectorization guarantees, which
                                      slightly defeats the purpose of
                                      using the corresponding execution
                                      policy.</div>
                                    <div> </div>
                                    <div>I support the idea of keeping
                                      OpenMP orthogonal, and having
                                      -fopenmp enabled by default is
                                      definitely not an option. The Intel
                                      compiler has a separate -qopenmp-simd
                                      option which doesn't affect performance
                                      outside explicitly marked loops,
                                      but even this is not enabled by
                                      default. I would say that multiple
                                      implementations of the unordered
                                      policy might exist: initially, an
                                      OpenMP SIMD based implementation may
                                      be the more powerful one, with one
                                      based on other pragmas being the
                                      default but hinting at the existence
                                      of a faster option.
                                      Later on, someone may be brave enough
                                      to add a SIMD template library
                                      and implement the default unordered
                                      policy using it (such an
                                      implementation is possible even
                                      now using vector types, but it
                                      would be extremely complex if it
                                      attempted to target all the base data
                                      types, vector widths and target
                                      SIMD architectures clang supports.
                                      Even with the library this may be
                                      quite tedious).</div>
                                    <div> </div>
                                    <div>Without any standard way of
                                      expressing SIMD parallelism in
                                      pure C++, any implementer of a SIMD
                                      execution policy has to rely on the
                                      means available for the
                                      platform/compiler, so it is not
                                      totally unnatural to ask the user to
                                      enable OpenMP SIMD for efficient
                                      support of the corresponding execution
                                      policy.</div>
                                    <div> </div>
                                    <div>Regards,</div>
                                    <div>Serge Preis</div>
                                    <div> </div>
                                    <div>(Who was once part of the Intel
                                      Compiler Vectorizer team and
                                      drove OpenMP SIMD efforts within
                                      icc and beyond, if anyone is
                                      keeping track of
                                      conflicts-of-interest)</div>
                                    <div> </div>
                                    <div> </div>
                                    <div>04.12.2017, 08:46, "Jeff
                                      Hammond via cfe-dev" <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>:</div>
                                    <blockquote type="cite">
                                      <div>
                                        <div>
                                          <div>It would be nice to keep
                                            PSTL and OpenMP orthogonal,
                                            even if _Pragma("omp simd")
                                            does not require runtime
                                            support.  It should be
                                            trivial to use
                                            _Pragma("clang loop
                                            vectorize(assume_safety)")
                                            instead, by wrapping all of
                                            the different compiler
                                            vectorization pragmas in
                                            preprocessor logic.  I
                                            similarly recommend
                                            _Pragma("GCC ivdep") for GCC
                                            and _Pragma("vector always")
                                            for ICC.  This
                                            requires O(n_compilers)
                                            effort instead of O(1), but
                                            orthogonality is worth it.
                                            <div> </div>
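                                            <div>A rough sketch of the preprocessor wrapping described here. The macro name PSTL_SKETCH_SIMD and the helper simd_for_each_n are hypothetical, the compiler-detection order is an assumption, and the three pragmas are hints with different semantics rather than exact equivalents of one another.</div>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro: pick one vectorization hint per compiler.
// Clang is tested before GCC because Clang also defines __GNUC__.
#if defined(__INTEL_COMPILER)
#  define PSTL_SKETCH_SIMD _Pragma("vector always")
#elif defined(__clang__)
#  define PSTL_SKETCH_SIMD _Pragma("clang loop vectorize(assume_safety)")
#elif defined(__GNUC__)
#  define PSTL_SKETCH_SIMD _Pragma("GCC ivdep")
#else
#  define PSTL_SKETCH_SIMD  /* unknown compiler: no hint */
#endif

// Hypothetical helper: applies f to x[0..n). The caller promises that
// iterations are independent, which is what the hints above rely on.
template <class T, class F>
void simd_for_each_n(T* x, std::size_t n, F f) {
  assert(x != nullptr || n == 0);
  PSTL_SKETCH_SIMD
  for (std::size_t i = 0; i < n; ++i)
    f(x[i]);
}
```

                                            <div>No OpenMP flag is needed to compile this, which is the orthogonality being asked for; whether the loop is actually vectorized remains up to each compiler.</div>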
                                            <div>While OpenMP is
                                              vendor/compiler-agnostic,
                                              users should not be
                                              required to use -fopenmp
                                              or similar to enable
                                              vectorization from PSTL,
                                              nor should the compiler
                                              enable any OpenMP pragma
                                              by default.  I know of
                                              cases where merely using
                                              the -fopenmp flag alters
                                              code generation in a
                                              performance-visible
                                              manner, and enabling the
                                              OpenMP "simd" pragma by
                                              default may surprise some
                                              users, particularly if no
                                              other OpenMP pragmas are
                                              enabled by default.
                                              <div><br>
                                                Best,</div>
                                              <div> </div>
                                              <div>Jeff</div>
                                              <div>(who works for Intel
                                                but not on any software
                                                products and has been a
                                                heavy user of Intel PSTL
                                                since it was released,
                                                if anyone is keeping
                                                track of
                                                conflicts-of-interest)<br>
                                                <br>
                                                On Wed, Nov 29, 2017 at
                                                4:21 AM, Kukanov, Alexey
                                                via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>
                                                wrote:<br>
                                                ><br>
                                                > Hello all,<br>
                                                ><br>
                                                > At Intel, we have
                                                developed an
                                                implementation of C++17
                                                execution policies<br>
                                                > for algorithms
                                                (often referred to as
                                                Parallel STL). We hope
                                                to contribute it<br>
                                                > to libc++/LLVM, so
                                                would like to ask the
                                                community for comments
                                                on this.<br>
                                                ><br>
                                                > The code is already
                                                published at GitHub (<a href="https://github.com/intel/parallelstl" target="_blank">https://github.com/intel/<wbr>parallelstl</a>).<br>
                                                > It supports the
                                                C++17 standard execution
                                                policies (seq, par,
                                                par_unseq) as well as<br>
                                                > the experimental
                                                unsequenced policy
                                                (unseq) for SIMD
                                                execution. At the
                                                moment,<br>
                                                > about half of the
                                                C++17 standard
                                                algorithms that must
                                                support execution
                                                policies<br>
                                                > are implemented; a
                                                few more will be ready
                                                soon, and the work
                                                continues.<br>
                                                > The tests that we
                                                use are also available
                                                at GitHub; needless to
                                                say we will<br>
                                                > contribute those as
                                                well.<br>
                                                ><br>
                                                > The implementation
                                                is not specific to
                                                Intel’s hardware. For
                                                thread-level parallelism<br>
                                                > it uses TBB* (<a href="https://www.threadingbuildingblocks.org/" target="_blank">https://www.<wbr>threadingbuildingblocks.org/</a>)
                                                but abstracts it with<br>
                                                > an internal API
                                                which can be implemented
                                                on top of other
                                                threading/parallel
                                                solutions –<br>
                                                > so it is for the
                                                community to decide
                                                which ones to use. For
                                                SIMD parallelism<br>
                                                > (unseq, par_unseq)
                                                we use #pragma omp simd
                                                directives; it is
                                                vendor-neutral and<br>
                                                > does not require
                                                any OpenMP runtime
                                                support.<br>
                                                ><br>
                                                > The current
                                                implementation meets the
                                                spirit but not always
                                                the letter of<br>
                                                > the standard,
                                                because it has to be
                                                separate from but also
                                                coexist with<br>
                                                > implementations of
                                                standard C++ libraries.
                                                While preparing the
                                                contribution,<br>
                                                > we will address
                                                inconsistencies, adjust
                                                the code to meet
                                                community standards,<br>
                                                > and better
                                                integrate it into the
                                                standard library code.<br>
                                                ><br>
                                                > We are also
                                                proposing that our
                                                implementation be
                                                included in
                                                libstdc++/GCC.<br>
                                                > Compatibility
                                                between the
                                                implementations seems
                                                useful as it can
                                                potentially<br>
                                                > reduce the amount
                                                of work for everyone. We
                                                hope to keep the code
                                                mostly identical,<br>
                                                > and would like to
                                                know if you think it’s
                                                too optimistic to
                                                expect.<br>
                                                ><br>
                                                > Obviously we plan
                                                to use appropriate open
                                                source licenses to meet
                                                the different<br>
                                                > projects’
                                                requirements.<br>
                                                ><br>
                                                > We expect to keep
                                                developing the code and
                                                will take the
                                                responsibility for<br>
                                                > maintaining it
                                                (with community
                                                contributions, of
                                                course). If there are
                                                other<br>
                                                > community efforts
                                                to implement parallel
                                                algorithms, we are
                                                willing to collaborate.<br>
                                                ><br>
                                                > We look forward to
                                                your feedback, both for
                                                the overall idea and –
                                                if supported –<br>
                                                > for the next steps
                                                we should take.<br>
                                                ><br>
                                                > Regards,<br>
                                                > - Alexey Kukanov<br>
                                                ><br>
                                                > * Note that TBB
                                                itself is highly
                                                portable (and has been ported by
                                                the community to Power and
                                                ARM<br>
                                                > architectures) and
                                                permissively licensed,
                                                so it could be the base for
                                                the threading<br>
                                                > infrastructure. But
                                                the Parallel STL
                                                implementation itself
                                                does not require TBB.<br>
                                                ><br>
                                                >
                                                ______________________________<wbr>_________________<br>
                                                > cfe-dev mailing
                                                list<br>
                                                > <a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
                                                > <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/cfe-dev</a><br>
                                                <br>
                                                <br>
                                                <br>
                                                <br>
                                                --<br>
                                                Jeff Hammond<br>
                                                <a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
                                                <a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a>
                                                <div> </div>
                                              </div>
                                            </div>
                                          </div>
                                        </div>
                                      </div>
                                    </blockquote>
                                  </blockquote>
                                </div>
                                 
                                <div> </div>
                                --
                                <div>Jeff Hammond<br>
                                  <a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
                                  <a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
                              </div>
                            </div>
                          </blockquote>
                        </blockquote>
                        <br>
                      </div>
                      <div bgcolor="#FFFFFF">
                        <pre class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
                      </div>
                    </blockquote>
                  </div>
                </div>
                <div>-- <br>
                </div>
                <div class="gmail-m_-2288114881520531127m_8513274869410520852gmail_signature">Jeff Hammond<br>
                  <a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
                  <a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
              </blockquote>
              <br>
              <pre class="gmail-m_-2288114881520531127m_8513274869410520852moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
            </div>
          </blockquote>
        </div>
      </div>
      <div dir="ltr">-- <br>
      </div>
      <div class="gmail-m_-2288114881520531127gmail_signature">Jeff
        Hammond<br>
        <a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
        <a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
    </blockquote>
    <br>
    <pre class="gmail-m_-2288114881520531127moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
  </div></div></div>

</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</div></div>