[cfe-dev] [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Hal Finkel via cfe-dev cfe-dev at lists.llvm.org
Fri Dec 8 13:13:49 PST 2017


On 12/07/2017 11:35 AM, Jeff Hammond wrote:
>
> On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <hfinkel at anl.gov 
> <mailto:hfinkel at anl.gov>> wrote:
>
>
>     On 12/06/2017 10:23 PM, Jeff Hammond wrote:
>>
>>     On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <hfinkel at anl.gov
>>     <mailto:hfinkel at anl.gov>> wrote:
>>
>>
>>         On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
>>>         I agree that the guarantees provided by ICC may be stronger
>>>         than with other compilers, so yes, under OpenMP terms
>>>         vectorization is permitted and cannot be assumed. However,
>>>         OpenMP clearly defines the semantics of variables used within
>>>         an OpenMP region, some being shared (scalar), some private
>>>         (vector), and some being inductions. This goes far beyond
>>>         typical compiler-specific pragmas about dependencies and cost
>>>         modelling, and it makes vectorization a much simpler task
>>>         with more predictable and robust results if properly
>>>         implemented (admittedly, even the ICC implementation is far
>>>         from perfect). I hope Intel's efforts to standardize
>>>         something like this in core C++ will eventually come to
>>>         fruition. Until then, I as a regular application developer
>>>         would appreciate an OpenMP-SIMD-based execution policy
>>>         (hoping for good support for OpenMP SIMD in clang), but it
>>>         shouldn't necessarily be part of libc++. Since the
>>>         'unordered' execution policy is currently not part of the
>>>         C++ standard
>>
>>         std::execution::par_unseq is part of C++17, and that
>>         essentially maps to '#pragma omp parallel for simd'.
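>>
>>         As a sketch of that mapping (illustrative only, not how any
>>         particular implementation spells it), a par_unseq for_each
>>         over a contiguous range corresponds to roughly:
>>
>>         // std::for_each(std::execution::par_unseq, X, X + N, F);
>>         #pragma omp parallel for simd
>>         for (size_t i = 0; i < N; ++i)
>>           F(X[i]);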
>>
>>
>>     Do you expect par/par_unseq to nest?
>
>     Yes.
>
>
>>     Nesting omp-parallel is generally regarded as a Bad Idea.
>
>     Agreed. I suspect we'll want the mapping to be more like '#pragma
>     omp taskloop simd'.
>
>
> That won’t run in parallel unless in an omp-parallel-master region.

Yes.

> That means OpenMP-based PSTL won’t be parallel unless the user knows 
> to add back-end specific code about the PSTL.

That obviously wouldn't be acceptable.

>
> What I’m trying to say is that OpenMP is a poor target for PSTL in its 
> current form. Nested parallel regions are the only thing that works, and 
> they are likely to work poorly.

I'm not sure that's true, but the technique may not be trivial. I 
believe that it is possible, however. For example, the mapping might be 
to something like:

if (omp_in_parallel()) {
   // Already inside a parallel region: just generate the tasks.
#pragma omp taskloop simd
   for (size_t i = 0; i < N; ++i)
     F(X[i]);
} else {
#pragma omp parallel
   {
      // Only one thread generates the tasks; the team executes them.
#pragma omp single
#pragma omp taskloop simd
      for (size_t i = 0; i < N; ++i)
        F(X[i]);
   }
}

The fact that we'd need to use this kind of pattern is a bit 
unfortunate, but it can be easily abstracted into a template function, 
so it just becomes some implementation detail of the library.
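
A minimal sketch of what that template function might look like (the
helper name and its interface are purely illustrative, not a proposal
for the actual libc++ internals):

#include <cstddef>
#include <omp.h>

// Hypothetical internal helper: run body(i) for i in [0, n) with
// 'taskloop simd', creating a parallel region only when needed.
template <typename Index, typename Body>
void __parallel_for_simd(Index n, Body body) {
  if (omp_in_parallel()) {
    // Already inside a parallel region: just generate the tasks.
#pragma omp taskloop simd
    for (Index i = 0; i < n; ++i)
      body(i);
  } else {
#pragma omp parallel
#pragma omp single  // one thread generates the tasks; the team runs them
    {
#pragma omp taskloop simd
      for (Index i = 0; i < n; ++i)
        body(i);
    }
  }
}

// Illustrative use from a par_unseq for_each over a contiguous range:
//   __parallel_for_simd(N, [&](std::size_t i) { F(X[i]); });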

Thanks again,
Hal

>
> Jeff
>
>
>      -Hal
>
>
>>
>>     Jeff
>>
>>
>>>         I don't care much about how it will be implemented in libc++,
>>>         if it is. I would just like to ask the Intel guys and the
>>>         community here to make the implementation extensible, in the
>>>         sense that a custom OpenMP-SIMD-based execution policy, along
>>>         with algorithm implementations (as specializations for the
>>>         policy), can be used with the libc++ library. And I would
>>>         additionally like to ask the Intel guys to provide a complete
>>>         and compatible extension on GitHub for developers like me to
>>>         use.
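>>>
>>>         Purely as a sketch of the extension point I have in mind (the
>>>         policy name and the overload are hypothetical, not an
>>>         existing interface):
>>>
>>>         // Hypothetical user-provided execution policy tag...
>>>         struct omp_simd_policy {};
>>>         inline constexpr omp_simd_policy omp_simd{};
>>>
>>>         // ...and an algorithm overload selected for that tag
>>>         // (random-access iterators assumed).
>>>         template <typename It, typename F>
>>>         void for_each(omp_simd_policy, It first, It last, F f) {
>>>           auto n = last - first;
>>>         #pragma omp simd
>>>           for (decltype(n) i = 0; i < n; ++i)
>>>             f(first[i]);
>>>         }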
>>
>>         In the end, I think we want the following:
>>
>>          1. A design for libc++ that allows the thread-level
>>         parallelism to be implemented in terms of different
>>         underlying providers (e.g., OpenMP, GCD, Work Queues on
>>         Windows, whatever else).
>>          2. To follow the same philosophy with respect to standards
>>         as we do everywhere else: Use standards where possible with
>>         compiler/system-specific extensions as necessary.
>>
>>          -Hal
>>
>>
>>>         Regards,
>>>         Serge.
>>>         04.12.2017, 12:07, "Jeff Hammond" <jeff.science at gmail.com>
>>>         <mailto:jeff.science at gmail.com>:
>>>>         ICC implements a very aggressive interpretation of the
>>>>         OpenMP standard, and this interpretation is not shared by
>>>>         everyone in the OpenMP community.  ICC is correct but other
>>>>         implementations may be far less aggressive, so _Pragma("omp
>>>>         simd") doesn't guarantee vectorization unless the compiler
>>>>         documentation says that is how it is implemented.  All the
>>>>         standard says is that vectorization is _permitted_.
>>>>         Given that the practical meaning of _Pragma("omp simd")
>>>>         isn't guaranteed to be consistent across different
>>>>         implementations, I don't really know how to compare it to
>>>>         compiler-specific pragmas unless we define everything
>>>>         explicitly.
>>>>         In any case, my fundamental point remains: do not use
>>>>         OpenMP pragmas here, but instead use whatever the
>>>>         appropriate compiler-specific pragma is, or create a new
>>>>         one that meets the need.
>>>>         Best,
>>>>         Jeff
>>>>         On Sun, Dec 3, 2017 at 8:09 PM, Serge Preis
>>>>         <spreis at yandex-team.ru <mailto:spreis at yandex-team.ru>> wrote:
>>>>
>>>>             Hello,
>>>>             _Pragma("omp simd") is semantically quite different
>>>>             from _Pragma("clang loop vectorize(assume_safety)"),
>>>>             _Pragma("GCC ivdep") and _Pragma("vector always"), so I
>>>>             am not sure all of the latter will work as expected in
>>>>             all cases. They definitely won't provide any
>>>>             vectorization guarantees, which slightly defeats the
>>>>             purpose of using the corresponding execution policy.
>>>>             I support the idea of keeping OpenMP orthogonal, and
>>>>             definitely, having -fopenmp enabled by default is not an
>>>>             option. The Intel compiler has a separate -qopenmp-simd
>>>>             option which doesn't affect performance outside
>>>>             explicitly marked loops, but even this is not enabled by
>>>>             default. I would say that multiple implementations of
>>>>             the unordered policy might exist: an OpenMP SIMD based
>>>>             implementation may originally be the more powerful one,
>>>>             with one based on other pragmas being the default but
>>>>             hinting at the existence of a faster option. Later on,
>>>>             one may be brave enough to add some SIMD template
>>>>             library and implement the default unordered policy using
>>>>             it (such an implementation is possible even now using
>>>>             vector types, but it would be extremely complex to
>>>>             target all the base data types, vector widths, and
>>>>             target SIMD architectures clang supports. Even with the
>>>>             library this may be quite tedious).
>>>>             Without any standard way of expressing SIMD parallelism
>>>>             in pure C++, any implementer of a SIMD execution policy
>>>>             has to rely on the means available for the
>>>>             platform/compiler, so it is not totally unnatural to ask
>>>>             the user to enable OpenMP SIMD for efficient support of
>>>>             the corresponding execution policy.
>>>>             Regards,
>>>>             Serge Preis
>>>>             (Who was once part of the Intel Compiler vectorizer team
>>>>             and drove the OpenMP SIMD efforts within icc and beyond,
>>>>             if anyone is keeping track of conflicts-of-interest)
>>>>             04.12.2017, 08:46, "Jeff Hammond via cfe-dev"
>>>>             <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>>:
>>>>>             It would be nice to keep PSTL and OpenMP orthogonal,
>>>>>             even if _Pragma("omp simd") does not require runtime
>>>>>             support.  It should be trivial to use _Pragma("clang
>>>>>             loop vectorize(assume_safety)") instead, by wrapping
>>>>>             all of the different compiler vectorization pragmas in
>>>>>             preprocessor logic.  I similarly recommend
>>>>>             _Pragma("GCC ivdep") for GCC and _Pragma("vector
>>>>>             always") for ICC.  While this requires O(n_compilers)
>>>>>             effort instead of O(1), orthogonality is worth it (a
>>>>>             sketch of the wrapping follows below).
>>>>>             While OpenMP is vendor/compiler-agnostic, users should
>>>>>             not be required to use -fopenmp or similar to enable
>>>>>             vectorization from PSTL, nor should the compiler
>>>>>             enable any OpenMP pragma by default.  I know of cases
>>>>>             where merely using the -fopenmp flag alters code
>>>>>             generation in a performance-visible manner, and
>>>>>             enabling the OpenMP "simd" pragma by default may
>>>>>             surprise some users, particularly if no other OpenMP
>>>>>             pragmas are enabled by default.
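>>>>>
>>>>>             Concretely, the preprocessor wrapping might look
>>>>>             something like this (just a sketch; the macro name
>>>>>             PSTL_PRAGMA_SIMD is made up for illustration):
>>>>>
>>>>>             /* Pick the vectorization hint the current compiler
>>>>>                understands; check ICC first since it also defines
>>>>>                __GNUC__. */
>>>>>             #if defined(__INTEL_COMPILER)
>>>>>             #  define PSTL_PRAGMA_SIMD _Pragma("vector always")
>>>>>             #elif defined(__clang__)
>>>>>             #  define PSTL_PRAGMA_SIMD _Pragma("clang loop vectorize(assume_safety)")
>>>>>             #elif defined(__GNUC__)
>>>>>             #  define PSTL_PRAGMA_SIMD _Pragma("GCC ivdep")
>>>>>             #else
>>>>>             #  define PSTL_PRAGMA_SIMD
>>>>>             #endif
>>>>>
>>>>>             /* Used immediately before the loop in the unseq back
>>>>>                end, e.g.:
>>>>>                  PSTL_PRAGMA_SIMD
>>>>>                  for (size_t i = 0; i < n; ++i) f(x[i]);  */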
>>>>>
>>>>>             Best,
>>>>>             Jeff
>>>>>             (who works for Intel but not on any software products
>>>>>             and has been a heavy user of Intel PSTL since it was
>>>>>             released, if anyone is keeping track of
>>>>>             conflicts-of-interest)
>>>>>
>>>>>             On Wed, Nov 29, 2017 at 4:21 AM, Kukanov, Alexey via
>>>>>             cfe-dev <cfe-dev at lists.llvm.org
>>>>>             <mailto:cfe-dev at lists.llvm.org>> wrote:
>>>>>             >
>>>>>             > Hello all,
>>>>>             >
>>>>>             > At Intel, we have developed an implementation of
>>>>>             C++17 execution policies
>>>>>             > for algorithms (often referred to as Parallel STL).
>>>>>             We hope to contribute it
>>>>>             > to libc++/LLVM, so would like to ask the community
>>>>>             for comments on this.
>>>>>             >
>>>>>             > The code is already published at GitHub
>>>>>             (https://github.com/intel/parallelstl).
>>>>>             > It supports the C++17 standard execution policies
>>>>>             (seq, par, par_unseq) as well as
>>>>>             > the experimental unsequenced policy (unseq) for SIMD
>>>>>             execution. At the moment,
>>>>>             > about half of the C++17 standard algorithms that
>>>>>             must support execution policies
>>>>>             > are implemented; a few more will be ready soon, and
>>>>>             the work continues.
>>>>>             > The tests that we use are also available at GitHub;
>>>>>             needless to say we will
>>>>>             > contribute those as well.
>>>>>             >
>>>>>             > The implementation is not specific to Intel’s
>>>>>             hardware. For thread-level parallelism
>>>>>             > it uses TBB*
>>>>>             (https://www.threadingbuildingblocks.org/) but
>>>>>             abstracts it with
>>>>>             > an internal API which can be implemented on top of
>>>>>             other threading/parallel solutions –
>>>>>             > so it is for the community to decide which ones to
>>>>>             use. For SIMD parallelism
>>>>>             > (unseq, par_unseq) we use #pragma omp simd
>>>>>             directives; it is vendor-neutral and
>>>>>             > does not require any OpenMP runtime support.
>>>>>             >
>>>>>             > The current implementation meets the spirit but not
>>>>>             always the letter of
>>>>>             > the standard, because it has to be separate from but
>>>>>             also coexist with
>>>>>             > implementations of standard C++ libraries. While
>>>>>             preparing the contribution,
>>>>>             > we will address inconsistencies, adjust the code to
>>>>>             meet community standards,
>>>>>             > and better integrate it into the standard library code.
>>>>>             >
>>>>>             > We are also proposing that our implementation is
>>>>>             included into libstdc++/GCC.
>>>>>             > Compatibility between the implementations seems
>>>>>             useful as it can potentially
>>>>>             > reduce the amount of work for everyone. We hope to
>>>>>             keep the code mostly identical,
>>>>>             > and would like to know if you think it’s too
>>>>>             optimistic to expect.
>>>>>             >
>>>>>             > Obviously we plan to use appropriate open source
>>>>>             licenses to meet the different
>>>>>             > projects’ requirements.
>>>>>             >
>>>>>             > We expect to keep developing the code and will take
>>>>>             the responsibility for
>>>>>             > maintaining it (with community contributions, of
>>>>>             course). If there are other
>>>>>             > community efforts to implement parallel algorithms,
>>>>>             we are willing to collaborate.
>>>>>             >
>>>>>             > We look forward to your feedback, both for the
>>>>>             overall idea and – if supported –
>>>>>             > for the next steps we should take.
>>>>>             >
>>>>>             > Regards,
>>>>>             > - Alexey Kukanov
>>>>>             >
>>>>>             > * Note that TBB itself is highly portable (and
>>>>>             ported by community to Power and ARM
>>>>>             > architectures) and permissively licensed, so could
>>>>>             be the base for the threading
>>>>>             > infrastructure. But the Parallel STL implementation
>>>>>             itself does not require TBB.
>>>>>             >
>>>>>             > _______________________________________________
>>>>>             > cfe-dev mailing list
>>>>>             > cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>>>>             > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>             --
>>>>>             Jeff Hammond
>>>>>             jeff.science at gmail.com <mailto:jeff.science at gmail.com>
>>>>>             http://jeffhammond.github.io/
>>>>>
>>>>>             _______________________________________________
>>>>>             cfe-dev mailing list
>>>>>             cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>>>>             http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>>
>>>>         -- 
>>>>         Jeff Hammond
>>>>         jeff.science at gmail.com <mailto:jeff.science at gmail.com>
>>>>         http://jeffhammond.github.io/
>>>
>>>
>>>         _______________________________________________
>>>         cfe-dev mailing list
>>>         cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>>         http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>>         -- 
>>         Hal Finkel
>>         Lead, Compiler Technology and Programming Languages
>>         Leadership Computing Facility
>>         Argonne National Laboratory
>>
>>     -- 
>>     Jeff Hammond
>>     jeff.science at gmail.com <mailto:jeff.science at gmail.com>
>>     http://jeffhammond.github.io/
>
>     -- 
>     Hal Finkel
>     Lead, Compiler Technology and Programming Languages
>     Leadership Computing Facility
>     Argonne National Laboratory
>
> -- 
> Jeff Hammond
> jeff.science at gmail.com <mailto:jeff.science at gmail.com>
> http://jeffhammond.github.io/

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
