[cfe-dev] [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Hal Finkel via cfe-dev cfe-dev at lists.llvm.org
Fri Dec 8 14:34:53 PST 2017


On 12/08/2017 03:55 PM, Jeff Hammond wrote:
>
>
> On Fri, Dec 8, 2017 at 1:13 PM, Hal Finkel <hfinkel at anl.gov 
> <mailto:hfinkel at anl.gov>> wrote:
>
>
>     On 12/07/2017 11:35 AM, Jeff Hammond wrote:
>>
>>     On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <hfinkel at anl.gov
>>     <mailto:hfinkel at anl.gov>> wrote:
>>
>>
>>         On 12/06/2017 10:23 PM, Jeff Hammond wrote:
>>>
>>>         On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <hfinkel at anl.gov
>>>         <mailto:hfinkel at anl.gov>> wrote:
>>>
>>>
>>>             On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
>>>>             I agree that guarantees provided by ICC may be stronger
>>>>             than with other compilers, so yes, under OpenMP terms
>>>>             vectorization is permitted but cannot be assumed.
>>>>             However, OpenMP clearly defines the semantics of
>>>>             variables used within an OpenMP region, some being
>>>>             shared (scalar), some private (vector), and some being
>>>>             inductions. This goes far beyond typical
>>>>             compiler-specific pragmas about dependencies and cost
>>>>             modelling, and makes vectorization a much simpler task
>>>>             with more predictable and robust results if properly
>>>>             implemented (admittedly, even the ICC implementation is
>>>>             far from perfect). I hope Intel's efforts to standardize
>>>>             something like this in core C++ will eventually come to
>>>>             fruition. Until then, I as a regular application
>>>>             developer would appreciate an OpenMP-SIMD-based
>>>>             execution policy (hoping for good support for OpenMP
>>>>             SIMD in clang), but it shouldn't necessarily be part of
>>>>             libc++. Since the 'unordered' execution policy is
>>>>             currently not part of the C++ standard,
>>>
>>>             std::execution::par_unseq is part of C++17, and that
>>>             essentially maps to '#pragma omp parallel for simd'.
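
For concreteness, a rough illustration (not necessarily how libc++ would
implement it) of that correspondence:

    #include <algorithm>
    #include <execution>
    #include <cstddef>

    void scale(float *x, std::size_t n) {
      // C++17: request parallel and vectorized execution.
      std::for_each(std::execution::par_unseq, x, x + n,
                    [](float &v) { v *= 2.0f; });

      // Roughly the freedom the policy grants, expressed with OpenMP:
      #pragma omp parallel for simd
      for (std::size_t i = 0; i < n; ++i)
        x[i] *= 2.0f;
    }
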
>>>
>>>
>>>         Do you expect par/par_unseq to nest?
>>
>>         Yes.
>>
>>
>>>         Nesting omp-parallel is generally regarded as a Bad Idea.
>>
>>         Agreed. I suspect we'll want the mapping to be more like
>>         '#pragma omp taskloop simd'.
>>
>>
>>     That won’t run in parallel unless in an omp-parallel-master region.
>
>     Yes.
>
>>     That means an OpenMP-based PSTL won’t be parallel unless the user
>>     knows to add back-end-specific code around the PSTL calls.
>
>     That obviously wouldn't be acceptable.
>
>>
>>     What I’m trying to say is that OpenMP is a poor target for PSTL
>>     in its current form. Nested parallel regions are the only thing
>>     that works, and they are likely to work poorly.
>
>     I'm not sure that's true, but the technique may not be trivial. I
>     believe that it is possible, however. For example, the mapping
>     might be to something like:
>
>     if (omp_in_parallel()) {
>     #pragma omp taskloop simd
>       for (size_t i = 0; i < N; ++i)
>         F(X[i]);
>     } else {
>     #pragma omp parallel
>       {
>     #pragma omp single // let one thread generate the tasks
>         {
>     #pragma omp taskloop simd
>           for (size_t i = 0; i < N; ++i)
>             F(X[i]);
>         }
>       }
>     }
>
>     The fact that we'd need to use this kind of pattern is a bit
>     unfortunate, but it can be easily abstracted into a template
>     function, so it just becomes some implementation detail of the
>     library.
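
As a minimal sketch of such a wrapper (the helper name here is invented
for illustration, not the proposed library's actual internals):

    #include <omp.h>

    // Run body(i) for i in [0, n) with task-based parallelism and SIMD,
    // creating a parallel region only if we are not already inside one.
    template <typename Index, typename Body>
    void parallel_simd_for(Index n, Body body) {
      if (omp_in_parallel()) {
        #pragma omp taskloop simd
        for (Index i = 0; i < n; ++i)
          body(i);
      } else {
        #pragma omp parallel
        #pragma omp single
        {
          #pragma omp taskloop simd
          for (Index i = 0; i < n; ++i)
            body(i);
        }
      }
    }

    // Usage in an algorithm implementation might then be:
    //   parallel_simd_for(N, [&](size_t i) { F(X[i]); });
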
>
>
> You are right and that is probably the best way to do it with OpenMP.  
> I am concerned about the absolute performance, based upon my 
> observations of omp-taskloop vs omp-for and tbb::parallel_for

Have you tried this recently? There was a recursive task-stealing 
strategy added to our OpenMP library in July of this year (r308338), 
which should have improved the performance of taskloop.

> in the PRK project, but at least it is sane from a semantic 
> perspective.  Having motivating use cases like PSTL should lead to 
> improvements in OpenMP runtime performance w.r.t. taskloop.

Indeed :-)

>
> https://i.stack.imgur.com/MVd5j.png is a snapshot of the performance 
> of PRK stencil (https://github.com/ParRes/Kernels/tree/master/Cxx11), 
> which shows that taskloop loses to TBB-based PSTL, OpenMP for, and 
> tbb::parallel_for (pure TBB beats TBB-based PSTL because I use 
> tbb::blocked_range2d, which improves cache utilization).  I think 
> those results used a tuned taskloop grainsize as well, so they may be an 
> optimistic representation of taskloop in general usage.

Interesting.

>
> I'll see if I can prototype this in RAJA or Intel PSTL.  It's not hard 
> to get results directly from the PRK tests, if the former attempts fail.

Thanks!

  -Hal

>
> Best,
>
> Jeff
>
>     Thanks again,
>     Hal
>
>
>>
>>     Jeff
>>
>>
>>          -Hal
>>
>>
>>>
>>>         Jeff
>>>
>>>
>>>>             I don't care much about how it will be implemented in
>>>>             libc++, if it is. I would just like to ask the Intel
>>>>             guys and the community here to make the implementation
>>>>             extensible, in the sense that a custom
>>>>             OpenMP-SIMD-based execution policy, along with
>>>>             algorithm implementations (as specializations for the
>>>>             policy), can be used with the libc++ library. And I
>>>>             would additionally like to ask the Intel guys to
>>>>             provide a complete and compatible extension on GitHub
>>>>             for developers like me to use.
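
A purely hypothetical sketch (all names invented here) of the kind of
extension point being requested, i.e. a user-defined execution policy
tag plus an algorithm overload that the library could dispatch to:

    #include <iterator>

    namespace myext {
      struct omp_simd_policy {};                    // custom policy tag
      inline constexpr omp_simd_policy omp_simd{};  // policy object
    }

    // Hypothetical hook: the library forwards to the overload selected
    // by the policy type, so users can supply their own policies and
    // specialized algorithm implementations.
    template <typename It, typename F>
    void for_each_impl(myext::omp_simd_policy, It first, It last, F f) {
      using diff_t = typename std::iterator_traits<It>::difference_type;
      const diff_t n = last - first;  // random-access iterators assumed
      #pragma omp simd
      for (diff_t i = 0; i < n; ++i)
        f(first[i]);
    }
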
>>>
>>>             In the end, I think we want the following:
>>>
>>>              1. A design for libc++ that allows the thread-level
>>>             parallelism to be implemented in terms of different
>>>             underlying providers (i.e., OpenMP, GCD, Work Queues on
>>>             Windows, whatever else).
>>>              2. To follow the same philosophy with respect to
>>>             standards as we do everywhere else: Use standards where
>>>             possible with compiler/system-specific extensions as
>>>             necessary.
>>>
>>>              -Hal
>>>
>>>
>>>>             Regards,
>>>>             Serge.
>>>>             04.12.2017, 12:07, "Jeff Hammond"
>>>>             <jeff.science at gmail.com> <mailto:jeff.science at gmail.com>:
>>>>>             ICC implements a very aggressive interpretation of the
>>>>>             OpenMP standard, and this interpretation is not shared
>>>>>             by everyone in the OpenMP community.  ICC is correct
>>>>>             but other implementations may be far less aggressive,
>>>>>             so _Pragma("omp simd") doesn't guarantee vectorization
>>>>>             unless the compiler documentation says that is how it
>>>>>             is implemented.  All the standard says is that
>>>>>             vectorization is _permitted_.
>>>>>             Given that the practical meaning of _Pragma("omp
>>>>>             simd") isn't guaranteed to be consistent across
>>>>>             different implementations, I don't really know how to
>>>>>             compare it to compiler-specific pragmas unless we
>>>>>             define everything explicitly.
>>>>>             In any case, my fundamental point remains: do not use
>>>>>             OpenMP pragmas here, but instead use whatever the
>>>>>             appropriate compiler-specific pragma is, or create a
>>>>>             new one that meets the need.
>>>>>             Best,
>>>>>             Jeff
>>>>>             On Sun, Dec 3, 2017 at 8:09 PM, Serge Preis
>>>>>             <spreis at yandex-team.ru <mailto:spreis at yandex-team.ru>>
>>>>>             wrote:
>>>>>
>>>>>                 Hello,
>>>>>                 _Pragma("omp simd") is semantically quite
>>>>>                 different from _Pragma("clang loop
>>>>>                 vectorize(assume_safety)"), _Pragma("GCC ivdep")
>>>>>                 and _Pragma("vector always"), so I am not sure all
>>>>>                 of the latter will work as expected in all cases.
>>>>>                 They definitely won't provide any vectorization
>>>>>                 guarantees, which slightly defeats the purpose of
>>>>>                 using the corresponding execution policy.
>>>>>                 I support the idea of keeping OpenMP orthogonal,
>>>>>                 and definitely having -fopenmp enabled by default
>>>>>                 is not an option. The Intel compiler has a separate
>>>>>                 -qopenmp-simd option which doesn't affect
>>>>>                 performance outside explicitly marked loops, but
>>>>>                 even this is not enabled by default. I would say
>>>>>                 that there might exist multiple implementations of
>>>>>                 the unordered policy: an OpenMP-SIMD-based
>>>>>                 implementation may initially be more powerful, with
>>>>>                 one based on other pragmas being the default but
>>>>>                 hinting at the existence of a faster option. Later
>>>>>                 on, one may be brave enough to add some SIMD
>>>>>                 template library and implement the default
>>>>>                 unordered policy using it (such an implementation
>>>>>                 is possible even now using vector types, but it
>>>>>                 would be extremely complex if it attempted to
>>>>>                 target all the base data types, vector widths, and
>>>>>                 target SIMD architectures clang supports. Even
>>>>>                 with the library this may be quite tedious).
>>>>>                 Without any standard way of expressing SIMD
>>>>>                 parallelism in pure C++, any implementer of a SIMD
>>>>>                 execution policy has to rely on the means available
>>>>>                 for the platform/compiler, and so it is not totally
>>>>>                 unnatural to ask the user to enable OpenMP SIMD for
>>>>>                 efficient support of the corresponding execution
>>>>>                 policy.
>>>>>                 Regards,
>>>>>                 Serge Preis
>>>>>                 (Who was once part of the Intel Compiler vectorizer
>>>>>                 team and drove the OpenMP SIMD efforts within icc
>>>>>                 and beyond, if anyone is keeping track of
>>>>>                 conflicts-of-interest)
>>>>>                 04.12.2017, 08:46, "Jeff Hammond via cfe-dev"
>>>>>                 <cfe-dev at lists.llvm.org
>>>>>                 <mailto:cfe-dev at lists.llvm.org>>:
>>>>>>                 It would be nice to keep PSTL and OpenMP
>>>>>>                 orthogonal, even if _Pragma("omp simd") does not
>>>>>>                 require runtime support.  It should be trivial to
>>>>>>                 use _Pragma("clang loop
>>>>>>                 vectorize(assume_safety)") instead, by wrapping
>>>>>>                 all of the different compiler vectorization
>>>>>>                 pragmas in preprocessor logic.  I similarly
>>>>>>                 recommend _Pragma("GCC ivdep") for GCC and
>>>>>>                 _Pragma("vector always") for ICC.  While this
>>>>>>                 requires O(n_compilers) effort instead of O(1),
>>>>>>                 orthogonality is worth it.
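
A minimal sketch of that preprocessor dispatch (the macro name is
invented here) might look like:

    #if defined(__INTEL_COMPILER)
    #  define PSTL_VECTORIZE _Pragma("vector always")
    #elif defined(__clang__)
    #  define PSTL_VECTORIZE _Pragma("clang loop vectorize(assume_safety)")
    #elif defined(__GNUC__)
    #  define PSTL_VECTORIZE _Pragma("GCC ivdep")
    #else
    #  define PSTL_VECTORIZE
    #endif

    // Vectorization hint applied without any dependence on -fopenmp.
    template <typename T, typename F>
    void unseq_for(T *x, int n, F f) {
      PSTL_VECTORIZE
      for (int i = 0; i < n; ++i)
        f(x[i]);
    }

The __INTEL_COMPILER check comes first because ICC also defines
__GNUC__, and __clang__ is checked before __GNUC__ for the same reason.
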
>>>>>>                 While OpenMP is vendor/compiler-agnostic, users
>>>>>>                 should not be required to use -fopenmp or similar
>>>>>>                 to enable vectorization from PSTL, nor should the
>>>>>>                 compiler enable any OpenMP pragma by default.  I
>>>>>>                 know of cases where merely using the -fopenmp
>>>>>>                 flag alters code generation in a
>>>>>>                 performance-visible manner, and enabling the
>>>>>>                 OpenMP "simd" pragma by default may surprise some
>>>>>>                 users, particularly if no other OpenMP pragmas
>>>>>>                 are enabled by default.
>>>>>>
>>>>>>                 Best,
>>>>>>                 Jeff
>>>>>>                 (who works for Intel but not on any software
>>>>>>                 products and has been a heavy user of Intel PSTL
>>>>>>                 since it was released, if anyone is keeping track
>>>>>>                 of conflicts-of-interest)
>>>>>>
>>>>>>                 On Wed, Nov 29, 2017 at 4:21 AM, Kukanov, Alexey
>>>>>>                 via cfe-dev <cfe-dev at lists.llvm.org
>>>>>>                 <mailto:cfe-dev at lists.llvm.org>> wrote:
>>>>>>                 >
>>>>>>                 > Hello all,
>>>>>>                 >
>>>>>>                 > At Intel, we have developed an implementation
>>>>>>                 of C++17 execution policies
>>>>>>                 > for algorithms (often referred to as Parallel
>>>>>>                 STL). We hope to contribute it
>>>>>>                 > to libc++/LLVM, so would like to ask the
>>>>>>                 community for comments on this.
>>>>>>                 >
>>>>>>                 > The code is already published at GitHub
>>>>>>                 (https://github.com/intel/parallelstl
>>>>>>                 <https://github.com/intel/parallelstl>).
>>>>>>                 > It supports the C++17 standard execution
>>>>>>                 policies (seq, par, par_unseq) as well as
>>>>>>                 > the experimental unsequenced policy (unseq) for
>>>>>>                 SIMD execution. At the moment,
>>>>>>                 > about half of the C++17 standard algorithms
>>>>>>                 that must support execution policies
>>>>>>                 > are implemented; a few more will be ready soon,
>>>>>>                 and the work continues.
>>>>>>                 > The tests that we use are also available at
>>>>>>                 GitHub; needless to say we will
>>>>>>                 > contribute those as well.
>>>>>>                 >
>>>>>>                 > The implementation is not specific to Intel’s
>>>>>>                 hardware. For thread-level parallelism
>>>>>>                 > it uses TBB*
>>>>>>                 (https://www.threadingbuildingblocks.org/
>>>>>>                 <https://www.threadingbuildingblocks.org/>) but
>>>>>>                 abstracts it with
>>>>>>                 > an internal API which can be implemented on top
>>>>>>                 of other threading/parallel solutions –
>>>>>>                 > so it is for the community to decide which ones
>>>>>>                 to use. For SIMD parallelism
>>>>>>                 > (unseq, par_unseq) we use #pragma omp simd
>>>>>>                 directives; it is vendor-neutral and
>>>>>>                 > does not require any OpenMP runtime support.
>>>>>>                 >
>>>>>>                 > The current implementation meets the spirit but
>>>>>>                 not always the letter of
>>>>>>                 > the standard, because it has to be separate
>>>>>>                 from but also coexist with
>>>>>>                 > implementations of standard C++ libraries.
>>>>>>                 While preparing the contribution,
>>>>>>                 > we will address inconsistencies, adjust the
>>>>>>                 code to meet community standards,
>>>>>>                 > and better integrate it into the standard
>>>>>>                 library code.
>>>>>>                 >
>>>>>>                 > We are also proposing that our implementation
>>>>>>                 is included into libstdc++/GCC.
>>>>>>                 > Compatibility between the implementations seems
>>>>>>                 useful as it can potentially
>>>>>>                 > reduce the amount of work for everyone. We hope
>>>>>>                 to keep the code mostly identical,
>>>>>>                 > and would like to know if you think it’s too
>>>>>>                 optimistic to expect.
>>>>>>                 >
>>>>>>                 > Obviously we plan to use appropriate open
>>>>>>                 source licenses to meet the different
>>>>>>                 > projects’ requirements.
>>>>>>                 >
>>>>>>                 > We expect to keep developing the code and will
>>>>>>                 take the responsibility for
>>>>>>                 > maintaining it (with community contributions,
>>>>>>                 of course). If there are other
>>>>>>                 > community efforts to implement parallel
>>>>>>                 algorithms, we are willing to collaborate.
>>>>>>                 >
>>>>>>                 > We look forward to your feedback, both for the
>>>>>>                 overall idea and – if supported –
>>>>>>                 > for the next steps we should take.
>>>>>>                 >
>>>>>>                 > Regards,
>>>>>>                 > - Alexey Kukanov
>>>>>>                 >
>>>>>>                 > * Note that TBB itself is highly portable (and
>>>>>>                 ported by community to Power and ARM
>>>>>>                 > architectures) and permissively licensed, so
>>>>>>                 could be the base for the threading
>>>>>>                 > infrastructure. But the Parallel STL
>>>>>>                 implementation itself does not require TBB.
>>>>>>                 >
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>                 --
>>>>>>                 Jeff Hammond
>>>>>>                 jeff.science at gmail.com
>>>>>>                 <mailto:jeff.science at gmail.com>
>>>>>>                 http://jeffhammond.github.io/
>>>>>>
>>>>>>
>>>>>             -- 
>>>>>             Jeff Hammond
>>>>>             jeff.science at gmail.com <mailto:jeff.science at gmail.com>
>>>>>             http://jeffhammond.github.io/
>>>>
>>>>
>>>
>>>             -- 
>>>             Hal Finkel
>>>             Lead, Compiler Technology and Programming Languages
>>>             Leadership Computing Facility
>>>             Argonne National Laboratory
>>>
>>>         -- 
>>>         Jeff Hammond jeff.science at gmail.com
>>>         <mailto:jeff.science at gmail.com> http://jeffhammond.github.io/
>>
>>         -- 
>>         Hal Finkel
>>         Lead, Compiler Technology and Programming Languages
>>         Leadership Computing Facility
>>         Argonne National Laboratory
>>
>>     -- 
>>     Jeff Hammond jeff.science at gmail.com
>>     <mailto:jeff.science at gmail.com> http://jeffhammond.github.io/
>
>     -- 
>     Hal Finkel
>     Lead, Compiler Technology and Programming Languages
>     Leadership Computing Facility
>     Argonne National Laboratory
>
> -- 
> Jeff Hammond jeff.science at gmail.com <mailto:jeff.science at gmail.com> 
> http://jeffhammond.github.io/
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory