<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Dec 8, 2017 at 1:13 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"><span class="gmail-">
<p><br>
</p>
<div class="gmail-m_-2288114881520531127moz-cite-prefix">On 12/07/2017 11:35 AM, Jeff Hammond
wrote:<br>
</div>
<blockquote type="cite">
<div><br>
<div class="gmail_quote">
<div dir="auto">On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p><br>
</p>
<div class="gmail-m_-2288114881520531127m_8513274869410520852moz-cite-prefix">On
12/06/2017 10:23 PM, Jeff Hammond wrote:<br>
</div>
<blockquote type="cite">
<div><br>
<div class="gmail_quote">
<div dir="auto">On Wed, Dec 6, 2017 at 4:23 PM Hal
Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p><br>
</p>
<div class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-cite-prefix">On
12/04/2017 10:48 PM, Serge Preis via cfe-dev
wrote:<br>
</div>
<blockquote type="cite">
<div>I agree that the guarantees provided by ICC
may be stronger than with other compilers,
so yes, under OpenMP terms vectorization is
permitted and cannot be assumed. However,
OpenMP clearly defines the semantics of
variables used within an OpenMP region: some
are shared (scalar), some private (vector),
and some are inductions (see the sketch
below). This goes far beyond typical
compiler-specific pragmas about dependencies
and cost modelling, and it makes vectorization
a much simpler task with more predictable and
robust results if properly implemented
(admittedly, even the ICC implementation is
far from perfect). I hope Intel's efforts to
standardize something like this in core C++
will eventually come to fruition. Until then,
I, as a regular application developer, would
appreciate an OpenMP-SIMD-based execution
policy (hoping for good support for OpenMP
SIMD in clang), but it shouldn't necessarily
be part of libc++. Since the 'unordered'
execution policy is currently not part of the
C++ standard </div>
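<div> </div>
<div>A minimal sketch of what those clauses look like on a plain
SIMD loop (the function and variable names below are just
placeholders) might be:</div>
<pre>
#include <cstddef>

// sum is a reduction, t is private to each SIMD lane, and i is the
// linear induction variable.
float dot(const float *x, const float *y, std::size_t n) {
  float sum = 0.0f;
  float t;
  #pragma omp simd private(t) reduction(+:sum)
  for (std::size_t i = 0; i < n; ++i) {
    t = x[i] * y[i];
    sum += t;
  }
  return sum;
}
</pre>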
</blockquote>
<br>
</div>
<div bgcolor="#FFFFFF">
std::execution::par_unseq is part of C++17, and
that essentially maps to '#pragma omp parallel
for simd'.</div>
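<div bgcolor="#FFFFFF">For concreteness, a minimal sketch of that
correspondence (X, N, and the element operation F below are
placeholders) might be:</div>
<div bgcolor="#FFFFFF">
<pre>
#include <algorithm>
#include <cstddef>
#include <execution>

void scale(float *X, std::size_t N) {
  auto F = [](float &x) { x *= 2.0f; };

  // C++17: request parallelized and vectorized execution.
  std::for_each(std::execution::par_unseq, X, X + N, F);

  // Roughly the same request expressed directly in OpenMP:
  #pragma omp parallel for simd
  for (std::size_t i = 0; i < N; ++i)
    F(X[i]);
}
</pre>
</div>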
<div bgcolor="#FFFFFF"><br>
</div>
</blockquote>
<div dir="auto"><br>
</div>
<div dir="auto">Do you expect par/par_unseq to nest?</div>
</div>
</div>
</blockquote>
<br>
</div>
<div bgcolor="#FFFFFF"> Yes.</div>
<div bgcolor="#FFFFFF"><br>
<br>
<blockquote type="cite">
<div>
<div class="gmail_quote">
<div dir="auto"> Nesting omp-parallel is generally
regarded as a Bad Idea.</div>
</div>
</div>
</blockquote>
<br>
</div>
<div bgcolor="#FFFFFF"> Agreed. I suspect
we'll want the mapping to be more like '#pragma omp
taskloop simd'.</div>
<div bgcolor="#FFFFFF"><br>
</div>
</blockquote>
<div dir="auto"><br>
</div>
<div dir="auto">That won’t run in parallel unless in an
omp-parallel-master region. </div>
</div>
</div>
</blockquote>
<br></span>
Yes.<span class="gmail-"><br>
<br>
<blockquote type="cite">
<div>
<div class="gmail_quote">
<div dir="auto">That means OpenMP-based PSTL won’t be parallel
unless the user knows to add back-end specific code about
the PSTL.</div>
</div>
</div>
</blockquote>
<br></span>
That obviously wouldn't be acceptable.<span class="gmail-"><br>
<br>
<blockquote type="cite">
<div>
<div class="gmail_quote">
<div dir="auto"><br>
</div>
<div dir="auto">What I’m trying to say is that OpenMP is a
poor target for PSTL in its current form. Nested parallel
regions is the only thing that works and it is likely to
work poorly.</div>
</div>
</div>
</blockquote>
<br></span>
I'm not sure that's true, but the technique may not be trivial. I
believe that it is possible, however. For example, the mapping might
be to something like:<br>
<br>
if (omp_in_parallel()) {<br>
  #pragma omp taskloop simd<br>
  for (size_t i = 0; i < N; ++i)<br>
    F(X[i]);<br>
} else {<br>
  #pragma omp parallel<br>
  #pragma omp single  // one thread creates the tasks; the team executes them<br>
  {<br>
    #pragma omp taskloop simd<br>
    for (size_t i = 0; i < N; ++i)<br>
      F(X[i]);<br>
  }<br>
}<br>
<br>
The fact that we'd need to use this kind of pattern is a bit
unfortunate, but it can be easily abstracted into a template
function, so it just becomes some implementation detail of the
library.<br>
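<br>
A minimal sketch of such a template function (the name is hypothetical,
not actual libc++ internals) might be:<br>
<pre>
#include <cstddef>
#include <omp.h>

template <class F>
void parallel_simd_for(std::size_t N, F f) {
  if (omp_in_parallel()) {
    #pragma omp taskloop simd
    for (std::size_t i = 0; i < N; ++i)
      f(i);
  } else {
    #pragma omp parallel
    #pragma omp single  // one thread creates the tasks; the team executes them
    {
      #pragma omp taskloop simd
      for (std::size_t i = 0; i < N; ++i)
        f(i);
    }
  }
}

// An algorithm implementation could then call, e.g.:
//   parallel_simd_for(N, [&](std::size_t i) { F(X[i]); });
</pre>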
<br></div></blockquote><div><br></div><div>You are right and that is probably the best way to do it with OpenMP. I am concerned about the absolute performance, based upon my observations of omp-taskloop vs omp-for and tbb::parallel_for in the PRK project, but at least it is sane from a semantic perspective. Having motivating use cases like PSTL should lead to improvements in OpenMP runtime performance w.r.t. taskloop.</div><div><br></div><div><a href="https://i.stack.imgur.com/MVd5j.png">https://i.stack.imgur.com/MVd5j.png</a> is a snapshot of the performance of PRK stencil (<a href="https://github.com/ParRes/Kernels/tree/master/Cxx11">https://github.com/ParRes/Kernels/tree/master/Cxx11</a>), which shows taskloop loses to TBB-based PSTL, OpenMP for, and tbb::parallel_for (pure TBB beats TBB-based PSTL because I use tbb::blocked_range2d, which improves cache utilization). I think those results tuned taskloop grainsize as well, so they may be an optimistic representation of taskloop in general usage.<br></div><div><br></div><div>I'll see if I can prototype this in RAJA or Intel PSTL. It's not hard to get results directly from the PRK tests, if those attempts fail.</div><div><br></div><div>Best,</div><div><br></div><div>Jeff</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
Thanks again,<br>
Hal<div><div class="gmail-h5"><br>
<br>
<blockquote type="cite">
<div>
<div class="gmail_quote">
<div dir="auto"><br>
</div>
<div dir="auto">Jeff</div>
<div dir="auto"><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"><br>
-Hal</div>
<div bgcolor="#FFFFFF"><br>
<br>
<blockquote type="cite">
<div>
<div class="gmail_quote">
<div dir="auto"><br>
</div>
<div dir="auto">Jeff</div>
<div dir="auto"><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"><br>
<blockquote type="cite">
<div>I don't care much about how it will be
implemented in libc++, if it is. I would just
like to ask the Intel guys and the community
here to make the implementation extensible, in
the sense that a custom OpenMP-SIMD-based
execution policy, along with algorithm
implementations (as specializations for the
policy), can be used with the libc++ library.
And I would additionally like to ask the Intel
guys to provide a complete and compatible
extension on GitHub for developers like me to
use.</div>
</blockquote>
<br>
</div>
<div bgcolor="#FFFFFF"> In the end,
I think we want the following:<br>
<br>
1. A design for libc++ that allows the
thread-level parallelism to be implemented in
terms of different underlying providers (e.g.,
OpenMP, GCD, Work Queues on Windows, whatever
else).<br>
2. To follow the same philosophy with respect
to standards as we do everywhere else: Use
standards where possible with
compiler/system-specific extensions as
necessary.<br>
<br>
-Hal</div>
<div bgcolor="#FFFFFF"><br>
<br>
<blockquote type="cite">
<div> </div>
<div>Regards,</div>
<div>Serge.</div>
<div> </div>
<div> </div>
<div> </div>
<div>04.12.2017, 12:07, "Jeff Hammond" <a class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-txt-link-rfc2396E" href="mailto:jeff.science@gmail.com" target="_blank"><jeff.science@gmail.com></a>:</div>
<blockquote type="cite">
<div>
<div>ICC implements a very aggressive
interpretation of the OpenMP standard,
and this interpretation is not shared by
everyone in the OpenMP community. ICC
is correct, but other implementations may
be far less aggressive, so
_Pragma("omp simd") doesn't guarantee
vectorization unless the compiler
documentation says that is how it is
implemented. All the standard says is that
vectorization is _permitted_.</div>
<div> </div>
<div>Given that the practical meaning of
_Pragma("omp simd") isn't guaranteed to
be consistent across different
implementations, I don't really know how
to compare it to compiler-specific
pragmas unless we define everything
explicitly.</div>
<div> </div>
<div>In any case, my fundamental point
remains: do not use OpenMP pragmas here,
but instead use whatever the appropriate
compiler-specific pragma is, or create a
new one that meets the need.</div>
<div> </div>
<div>Best,</div>
<div> </div>
<div>Jeff</div>
<div title="Page 81">
<div>
<div> </div>
</div>
</div>
<div>
<div>On Sun, Dec 3, 2017 at 8:09 PM,
Serge Preis <span><<a href="mailto:spreis@yandex-team.ru" target="_blank">spreis@yandex-team.ru</a>></span>
wrote:
<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>Hello,</div>
<div> </div>
<div>_Pragma("omp simd") is
semantically quite different from
_Pragma("clang loop
vectorize(assume_safety)"),
_Pragma("GCC ivdep") and
_Pragma("vector always"), so I am
not sure all latter will work as
expected in all cases. They
definitely won't provide any
vectorization guarantees which
slightly defeat the purpose of
using corresponding execution
policy.</div>
<div> </div>
<div>I support the idea of keeping
OpenMP orthogonal, and having -fopenmp
enabled by default is definitely not an
option. The Intel compiler has a separate
-qopenmp-simd option which doesn't affect
performance outside explicitly marked loops,
but even this is not enabled by default.
I would say that multiple implementations
of the unordered policy might exist:
initially, an OpenMP SIMD based
implementation may be more powerful, with
one based on other pragmas being the
default but hinting at the existence of a
faster option. Later on, one may be brave
enough to add some SIMD template library
and implement the default unordered policy
with it (such an implementation is possible
even now using vector types, but it would
be extremely complex if it attempted to
target all the base data types, vector
widths, and target SIMD architectures clang
supports; even with the library this may be
quite tedious).</div>
<div> </div>
<div>Without any standard way of
expressing SIMD parallelism in
pure C++, any implementer of a SIMD
execution policy has to rely on the
means available for the
platform/compiler, so it is not
totally unnatural to ask the user to
enable OpenMP SIMD for efficient
support of the corresponding execution
policy.</div>
<div> </div>
<div>Regards,</div>
<div>Serge Preis</div>
<div> </div>
<div>(who was once part of the Intel
Compiler Vectorizer team and
drove OpenMP SIMD efforts within
icc and beyond, if anyone is
keeping track of
conflicts-of-interest)</div>
<div> </div>
<div> </div>
<div>04.12.2017, 08:46, "Jeff
Hammond via cfe-dev" <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>:</div>
<blockquote type="cite">
<div>
<div>
<div>It would be nice to keep
PSTL and OpenMP orthogonal,
even if _Pragma("omp simd")
does not require runtime
support. It should be
trivial to use
_Pragma("clang loop vectorize(assume_safety)")
instead, by wrapping all of
the different compiler
vectorization pragmas in
preprocessor logic (a sketch
follows below). I similarly recommend
_Pragma("GCC ivdep") for GCC
and _Pragma("vector always")
for ICC. This requires
O(n_compilers) effort
instead of O(1), but
orthogonality is worth it.
<div> </div>
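<div>A sketch of that preprocessor dispatch (the macro name below is
made up) could look like:</div>
<pre>
// ICC is tested first because it also defines __GNUC__.
#if defined(__INTEL_COMPILER)
#  define PSTL_SIMD_LOOP _Pragma("vector always")
#elif defined(__clang__)
#  define PSTL_SIMD_LOOP _Pragma("clang loop vectorize(assume_safety)")
#elif defined(__GNUC__)
#  define PSTL_SIMD_LOOP _Pragma("GCC ivdep")
#else
#  define PSTL_SIMD_LOOP
#endif

// Usage:
//   PSTL_SIMD_LOOP
//   for (size_t i = 0; i < n; ++i)
//     y[i] += a * x[i];
</pre>
<div> </div>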
<div>While OpenMP is
vendor/compiler-agnostic,
users should not be
required to use -fopenmp
or similar to enable
vectorization from PSTL,
nor should the compiler
enable any OpenMP pragma
by default. I know of
cases where merely using
the -fopenmp flag alters
code generation in a
performance-visible
manner, and enabling the
OpenMP "simd" pragma by
default may surprise some
users, particularly if no
other OpenMP pragmas are
enabled by default.
<div><br>
Best,</div>
<div> </div>
<div>Jeff</div>
<div>(who works for Intel
but not on any software
products and has been a
heavy user of Intel PSTL
since it was released,
if anyone is keeping
track of
conflicts-of-interest)<br>
<br>
On Wed, Nov 29, 2017 at
4:21 AM, Kukanov, Alexey
via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>
wrote:<br>
><br>
> Hello all,<br>
><br>
> At Intel, we have
developed an
implementation of C++17
execution policies<br>
> for algorithms
(often referred to as
Parallel STL). We hope
to contribute it<br>
> to libc++/LLVM, so
would like to ask the
community for comments
on this.<br>
><br>
> The code is already
published at GitHub (<a href="https://github.com/intel/parallelstl" target="_blank">https://github.com/intel/<wbr>parallelstl</a>).<br>
> It supports the
C++17 standard execution
policies (seq, par,
par_unseq) as well as<br>
> the experimental
unsequenced policy
(unseq) for SIMD
execution. At the
moment,<br>
> about half of the
C++17 standard
algorithms that must
support execution
policies<br>
> are implemented; a
few more will be ready
soon, and the work
continues.<br>
> The tests that we
use are also available
at GitHub; needless to
say we will<br>
> contribute those as
well.<br>
><br>
> The implementation
is not specific to
Intel’s hardware. For
thread-level parallelism<br>
> it uses TBB* (<a href="https://www.threadingbuildingblocks.org/" target="_blank">https://www.<wbr>threadingbuildingblocks.org/</a>)
but abstracts it with<br>
> an internal API
which can be implemented
on top of other
threading/parallel
solutions –<br>
> so it is for the
community to decide
which ones to use. For
SIMD parallelism<br>
> (unseq, par_unseq)
we use #pragma omp simd
directives; it is
vendor-neutral and<br>
> does not require
any OpenMP runtime
support.<br>
><br>
> The current
implementation meets the
spirit but not always
the letter of<br>
> the standard,
because it has to be
separate from but also
coexist with<br>
> implementations of
standard C++ libraries.
While preparing the
contribution,<br>
> we will address
inconsistencies, adjust
the code to meet
community standards,<br>
> and better
integrate it into the
standard library code.<br>
><br>
> We are also
proposing that our
implementation is
included into
libstdc++/GCC.<br>
> Compatibility
between the
implementations seems
useful as it can
potentially<br>
> reduce the amount
of work for everyone. We
hope to keep the code
mostly identical,<br>
> and would like to
know if you think it’s
too optimistic to
expect.<br>
><br>
> Obviously we plan
to use appropriate open
source licenses to meet
the different<br>
> projects’
requirements.<br>
><br>
> We expect to keep
developing the code and
will take the
responsibility for<br>
> maintaining it
(with community
contributions, of
course). If there are
other<br>
> community efforts
to implement parallel
algorithms, we are
willing to collaborate.<br>
><br>
> We look forward to
your feedback, both for
the overall idea and –
if supported –<br>
> for the next steps
we should take.<br>
><br>
> Regards,<br>
> - Alexey Kukanov<br>
><br>
> * Note that TBB
itself is highly
portable (and ported by
community to Power and
ARM<br>
> architectures) and
permissively licensed,
so could be the base for
the threading<br>
> infrastructure. But
the Parallel STL
implementation itself
does not require TBB.<br>
><br>
>
______________________________<wbr>_________________<br>
> cfe-dev mailing
list<br>
> <a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/cfe-dev</a><br>
<br>
<br>
<br>
<br>
--<br>
Jeff Hammond<br>
<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
<a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a>
<div> </div>
</div>
</div>
</div>
</div>
</div>
<p><span>______________________________<wbr>_________________<br>
cfe-dev mailing list<br>
<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/cfe-dev</a></span></p>
</blockquote>
</blockquote>
</div>
<div> </div>
--
<div>Jeff Hammond<br>
<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
<a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</div>
</div>
</blockquote>
<br>
<fieldset class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889mimeAttachmentHeader"></fieldset>
<br>
<pre>______________________________<wbr>_________________
cfe-dev mailing list
<a class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-txt-link-abbreviated" href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>
<a class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/cfe-dev</a>
</pre>
</blockquote>
<br>
</div>
<div bgcolor="#FFFFFF">
<pre class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div>
</blockquote>
</div>
</div>
<div>-- <br>
</div>
<div class="gmail-m_-2288114881520531127m_8513274869410520852gmail_signature">Jeff Hammond<br>
<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
<a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</blockquote>
<br>
<pre class="gmail-m_-2288114881520531127m_8513274869410520852moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div>
</blockquote>
</div>
</div>
<div dir="ltr">-- <br>
</div>
<div class="gmail-m_-2288114881520531127gmail_signature">Jeff
Hammond<br>
<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
<a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</blockquote>
<br>
<pre class="gmail-m_-2288114881520531127moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div></div></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</div></div>