[cfe-dev] [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms
Hal Finkel via cfe-dev
cfe-dev at lists.llvm.org
Fri Dec 8 14:34:53 PST 2017
On 12/08/2017 03:55 PM, Jeff Hammond wrote:
>
>
> On Fri, Dec 8, 2017 at 1:13 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
> On 12/07/2017 11:35 AM, Jeff Hammond wrote:
>>
>> On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>
>> On 12/06/2017 10:23 PM, Jeff Hammond wrote:
>>>
>>> On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>>
>>> On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
>>>> I agree that the guarantees provided by ICC may be stronger
>>>> than with other compilers, so yes, in OpenMP terms
>>>> vectorization is permitted but cannot be assumed.
>>>> However, OpenMP clearly defines the semantics of variables
>>>> used within an OpenMP region: some are shared (scalar),
>>>> some private (vector), and some are inductions. This
>>>> goes far beyond typical compiler-specific pragmas about
>>>> dependencies and cost modelling, and it makes vectorization
>>>> a much simpler task with more predictable and robust
>>>> results if properly implemented (admittedly, even the ICC
>>>> implementation is far from perfect). I hope Intel's
>>>> efforts to standardize something like this in core C++
>>>> will eventually come to fruition. Until then, I as a
>>>> regular application developer would appreciate an
>>>> OpenMP-SIMD-based execution policy (hoping for good
>>>> support for OpenMP SIMD in clang), but it shouldn't
>>>> necessarily be part of libc++. Since the 'unordered'
>>>> execution policy is currently not part of the C++ standard,
>>>
>>> std::execution::par_unseq is part of C++17, and that
>>> essentially maps to '#pragma omp parallel for simd'.
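>>>
>>> (For illustration only, a rough sketch of that correspondence; the
>>> container, functor, and function names below are made up, and neither
>>> standard requires this exact lowering:
>>>
>>>   #include <algorithm>
>>>   #include <execution>
>>>   #include <vector>
>>>
>>>   void apply_all(std::vector<float> &X) {
>>>     auto F = [](float &x) { x *= 2.0f; };
>>>
>>>     // C++17 parallel algorithm with the par_unseq policy:
>>>     std::for_each(std::execution::par_unseq, X.begin(), X.end(), F);
>>>
>>>     // ...which an implementation is permitted (not required) to
>>>     // lower roughly as if it were:
>>>     //   #pragma omp parallel for simd
>>>     //   for (size_t i = 0; i < X.size(); ++i)
>>>     //     F(X[i]);
>>>   }
>>>
>>> par_unseq grants the implementation both thread-level and
>>> vector-level freedom, which is why the combined OpenMP construct is
>>> the natural analogue.)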
>>>
>>>
>>> Do you expect par/par_unseq to nest?
>>
>> Yes.
>>
>>
>>> Nesting omp-parallel is generally regarded as a Bad Idea.
>>
>> Agreed. I suspect we'll want the mapping to be more like
>> '#pragma omp taskloop simd'.
>>
>>
>> That won’t run in parallel unless in an omp-parallel-master region.
>
> Yes.
>
>> That means an OpenMP-based PSTL won’t be parallel unless the user
>> knows to add back-end-specific code around the PSTL calls.
>
> That obviously wouldn't be acceptable.
>
>>
>> What I’m trying to say is that OpenMP is a poor target for PSTL
>> in its current form. Nested parallel regions are the only thing
>> that works, and they are likely to work poorly.
>
> I'm not sure that's true, but the technique may not be trivial. I
> believe that it is possible, however. For example, the mapping
> might be to something like:
>
> if (omp_in_parallel()) {
>   #pragma omp taskloop simd
>   for (size_t i = 0; i < N; ++i)
>     F(X[i]);
> } else {
>   #pragma omp parallel
>   {
>     // Without 'single', every thread in the team would generate its
>     // own copy of the taskloop's tasks and duplicate the work.
>     #pragma omp single
>     {
>       #pragma omp taskloop simd
>       for (size_t i = 0; i < N; ++i)
>         F(X[i]);
>     }
>   }
> }
>
> The fact that we'd need to use this kind of pattern is a bit
> unfortunate, but it can be easily abstracted into a template
> function, so it just becomes some implementation detail of the
> library.
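>
> (A rough sketch of that abstraction, only to illustrate the idea; the
> helper name and interface are made up and not part of any existing
> library:
>
>   #include <omp.h>
>
>   // Run a loop body as 'taskloop simd', creating a parallel region
>   // only if we are not already inside one.
>   template <typename Index, typename Body>
>   void parallel_taskloop_simd(Index n, const Body &body) {
>     if (omp_in_parallel()) {
>       #pragma omp taskloop simd
>       for (Index i = 0; i < n; ++i)
>         body(i);
>     } else {
>       #pragma omp parallel
>       #pragma omp single
>       {
>         #pragma omp taskloop simd
>         for (Index i = 0; i < n; ++i)
>           body(i);
>       }
>     }
>   }
>
> A par/par_unseq algorithm could then forward its per-element work to
> something like parallel_taskloop_simd(last - first, ...) without the
> caller ever seeing the OpenMP details.)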
>
>
> You are right and that is probably the best way to do it with OpenMP.
> I am concerned about the absolute performance, based upon my
> observations of omp-taskloop vs omp-for and tbb::parallel_for
Have you tried this recently? There was a recursive task-stealing
strategy added to our OpenMP library in July of this year (r308338)
which should have made the performance of taskloop better.
> in the PRK project, but at least it is sane from a semantic
> perspective. Having motivating use cases like PSTL should lead to
> improvements in OpenMP runtime performance w.r.t. taskloop.
Indeed :-)
>
> https://i.stack.imgur.com/MVd5j.png is a snapshot of the performance
> of PRK stencil (https://github.com/ParRes/Kernels/tree/master/Cxx11),
> which shows taskloop loses to TBB-based PSTL, OpenMP for, and
> tbb::parallel_for (pure TBB beats TBB-based PSTL because I use
> tbb::blocked_range2d, which improves cache utilization). I think
> those results tuned the taskloop grainsize as well, so they may be an
> optimistic representation of taskloop in general usage.
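>
> (For reference, the kind of 2D blocking I mean looks roughly like the
> following; this is a sketch rather than the actual PRK code, and the
> array layout and function name are placeholders:
>
>   #include <tbb/blocked_range2d.h>
>   #include <tbb/parallel_for.h>
>
>   // 5-point stencil over an n x n grid; blocking in both dimensions
>   // gives each task a cache-friendly 2D tile.
>   void stencil(int n, const double *in, double *out) {
>     tbb::parallel_for(
>         tbb::blocked_range2d<int>(1, n - 1, 1, n - 1),
>         [=](const tbb::blocked_range2d<int> &r) {
>           for (int i = r.rows().begin(); i != r.rows().end(); ++i)
>             for (int j = r.cols().begin(); j != r.cols().end(); ++j)
>               out[i * n + j] =
>                   0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
>                           in[i * n + j - 1] + in[i * n + j + 1]);
>         });
>   }
>
> A 1D parallel_for, and likewise the PSTL algorithms, only partition a
> single dimension, so they cannot exploit this kind of tiling.)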
Interesting.
>
> I'll see if I can prototype this in RAJA or Intel PSTL. It's not hard
> to get results directly from the PRK tests, if the former attempts fail.
Thanks!
-Hal
>
> Best,
>
> Jeff
>
> Thanks again,
> Hal
>
>
>>
>> Jeff
>>
>>
>> -Hal
>>
>>
>>>
>>> Jeff
>>>
>>>
>>>> I don't care much about how it will be implemented in
>>>> libc++, if it is. I would just like to ask the Intel folks
>>>> and the community here to make the implementation extensible,
>>>> in the sense that a custom OpenMP-SIMD-based execution policy,
>>>> along with algorithm implementations (as specializations for
>>>> that policy), can be used with the libc++ library. And I would
>>>> additionally like to ask the Intel folks to provide a complete
>>>> and compatible extension on GitHub for developers like me to use.
>>>
>>> In the end, I think we want the following:
>>>
>>> 1. A design for libc++ that allows the thread-level
>>> parallelism to be implemented in terms of different
>>> underlying providers (e.g., OpenMP, GCD, work queues on
>>> Windows, whatever else); a rough sketch of such an
>>> interface follows below.
>>> 2. To follow the same philosophy with respect to
>>> standards as we do everywhere else: use standards where
>>> possible, with compiler/system-specific extensions as
>>> necessary.
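>>>
>>> (To make point 1 concrete, here is a minimal sketch of what such a
>>> provider interface might look like; none of these names exist in
>>> libc++ today, and a real design would need more entry points:
>>>
>>>   // Hypothetical libc++-internal backend interface; each provider
>>>   // (OpenMP, GCD, a Windows thread pool, a serial fallback, ...)
>>>   // supplies its own definition of these entry points.
>>>   namespace __pstl_backend {
>>>
>>>   // Run f(0), ..., f(n - 1), possibly in parallel. This is the
>>>   // serial fallback provider; an OpenMP provider would distribute
>>>   // the iterations across threads instead.
>>>   template <class _Index, class _Func>
>>>   void __parallel_for(_Index __n, _Func __f) {
>>>     for (_Index __i = 0; __i < __n; ++__i)
>>>       __f(__i);
>>>   }
>>>
>>>   } // namespace __pstl_backend
>>>
>>> The algorithms themselves would then be written once against this
>>> interface, and the provider would be selected at configuration time.)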
>>>
>>> -Hal
>>>
>>>
>>>> Regards,
>>>> Serge.
>>>> 04.12.2017, 12:07, "Jeff Hammond"
>>>> <jeff.science at gmail.com>:
>>>>> ICC implements a very aggressive interpretation of the
>>>>> OpenMP standard, and this interpretation is not shared
>>>>> by everyone in the OpenMP community. ICC is correct
>>>>> but other implementations may be far less aggressive,
>>>>> so _Pragma("omp simd") doesn't guarentee vectorization
>>>>> unless the compiler documentation says that is how it
>>>>> is implemented. All the standard says that it means
>>>>> is that vectorization is _permitted_.
>>>>> Given that the practical meaning of _Pragma("omp
>>>>> simd") isn't guaranteed to be consistent across
>>>>> different implementations, I don't really know how to
>>>>> compare it to compiler-specific pragmas unless we
>>>>> define everything explicitly.
>>>>> In any case, my fundamental point remains: do not use
>>>>> OpenMP pragmas here, but instead use whatever the
>>>>> appropriate compiler-specific pragma is, or create a
>>>>> new one that meets the need.
>>>>> Best,
>>>>> Jeff
>>>>> On Sun, Dec 3, 2017 at 8:09 PM, Serge Preis
>>>>> <spreis at yandex-team.ru> wrote:
>>>>>
>>>>> Hello,
>>>>> _Pragma("omp simd") is semantically quite
>>>>> different from _Pragma("clang loop
>>>>> vectorize(assume_safety)"), _Pragma("GCC ivdep")
>>>>> and _Pragma("vector always"), so I am not sure all
>>>>> latter will work as expected in all cases. They
>>>>> definitely won't provide any vectorization
>>>>> guarantees which slightly defeat the purpose of
>>>>> using corresponding execution policy.
>>>>> I support the idea of keeping OpenMP orthogonal, and
>>>>> definitely having -fopenmp enabled by default is
>>>>> not an option. The Intel compiler has a separate
>>>>> -qopenmp-simd option which doesn't affect
>>>>> performance outside explicitly marked loops, but
>>>>> even this is not enabled by default. I would say
>>>>> that there might exist multiple implementations of
>>>>> the unordered policy: initially an OpenMP-SIMD-based
>>>>> implementation may be more powerful, with one based
>>>>> on other pragmas being the default but hinting at the
>>>>> existence of a faster option. Later on, someone may be
>>>>> brave enough to add a SIMD template library and
>>>>> implement the default unordered policy using it (such
>>>>> an implementation is possible even now using vector
>>>>> types, but it would be extremely complex if it attempted
>>>>> to target all the base data types, vector widths, and
>>>>> SIMD architectures clang supports; even
>>>>> with such a library this may be quite tedious).
>>>>> Without any standard way of expressing SIMD
>>>>> parallelism in pure C++, any implementer of a SIMD
>>>>> execution policy has to rely on the means available for
>>>>> the platform/compiler, so it is not totally unnatural
>>>>> to ask the user to enable OpenMP SIMD for efficient
>>>>> support of the corresponding execution policy.
>>>>> Regards,
>>>>> Serge Preis
>>>>> (who was once part of the Intel Compiler vectorizer
>>>>> team and drove the OpenMP SIMD efforts within icc and
>>>>> beyond, if anyone is keeping track of
>>>>> conflicts-of-interest)
>>>>> 04.12.2017, 08:46, "Jeff Hammond via cfe-dev"
>>>>> <cfe-dev at lists.llvm.org>:
>>>>>> It would be nice to keep PSTL and OpenMP
>>>>>> orthogonal, even if _Pragma("omp simd") does not
>>>>>> require runtime support. It should be trivial to
>>>>>> use _Pragma("clang loop
>>>>>> vectorize(assume_safety)") instead, by wrapping
>>>>>> all of the different compiler vectorization
>>>>>> pragmas in preprocessor logic. I similarly
>>>>>> recommend _Pragma("GCC ivdep") for GCC and
>>>>>> _Pragma("vector always") for ICC. While this
>>>>>> requires O(n_compilers) effort instead of O(1),
>>>>>> but orthogonality is worth it.
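>>>>>>
>>>>>> (Something along these lines is what I have in mind; the macro
>>>>>> name is made up:
>>>>>>
>>>>>>   /* Check __INTEL_COMPILER and __clang__ before __GNUC__,
>>>>>>      since both of those compilers also define __GNUC__. */
>>>>>>   #if defined(__INTEL_COMPILER)
>>>>>>   #  define PSTL_VECTORIZE _Pragma("vector always")
>>>>>>   #elif defined(__clang__)
>>>>>>   #  define PSTL_VECTORIZE _Pragma("clang loop vectorize(assume_safety)")
>>>>>>   #elif defined(__GNUC__)
>>>>>>   #  define PSTL_VECTORIZE _Pragma("GCC ivdep")
>>>>>>   #else
>>>>>>   #  define PSTL_VECTORIZE
>>>>>>   #endif
>>>>>>
>>>>>>   /* Usage, immediately before the loop to be vectorized: */
>>>>>>   void scale(int n, float *a, float s) {
>>>>>>     PSTL_VECTORIZE
>>>>>>     for (int i = 0; i < n; ++i)
>>>>>>       a[i] *= s;
>>>>>>   }
>>>>>>
>>>>>> Unknown compilers simply get no hint, which is safe.)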
>>>>>> While OpenMP is vendor/compiler-agnostic, users
>>>>>> should not be required to use -fopenmp or similar
>>>>>> to enable vectorization from PSTL, nor should the
>>>>>> compiler enable any OpenMP pragma by default. I
>>>>>> know of cases where merely using the -fopenmp
>>>>>> flag alters code generation in a
>>>>>> performance-visible manner, and enabling the
>>>>>> OpenMP "simd" pragma by default may surprise some
>>>>>> users, particularly if no other OpenMP pragmas
>>>>>> are enabled by default.
>>>>>>
>>>>>> Best,
>>>>>> Jeff
>>>>>> (who works for Intel but not on any software
>>>>>> products and has been a heavy user of Intel PSTL
>>>>>> since it was released, if anyone is keeping track
>>>>>> of conflicts-of-interest)
>>>>>>
>>>>>> On Wed, Nov 29, 2017 at 4:21 AM, Kukanov, Alexey
>>>>>> via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>>>>>> >
>>>>>> > Hello all,
>>>>>> >
>>>>>> > At Intel, we have developed an implementation
>>>>>> of C++17 execution policies
>>>>>> > for algorithms (often referred to as Parallel
>>>>>> STL). We hope to contribute it
>>>>>> > to libc++/LLVM, so would like to ask the
>>>>>> community for comments on this.
>>>>>> >
>>>>>> > The code is already published at GitHub
>>>>>> (https://github.com/intel/parallelstl).
>>>>>> > It supports the C++17 standard execution
>>>>>> policies (seq, par, par_unseq) as well as
>>>>>> > the experimental unsequenced policy (unseq) for
>>>>>> SIMD execution. At the moment,
>>>>>> > about half of the C++17 standard algorithms
>>>>>> that must support execution policies
>>>>>> > are implemented; a few more will be ready soon,
>>>>>> and the work continues.
>>>>>> > The tests that we use are also available at
>>>>>> GitHub; needless to say we will
>>>>>> > contribute those as well.
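>>>>>> >
>>>>>> > (For readers unfamiliar with the feature, this is the kind of
>>>>>> > user code the proposal covers; the snippet uses the standard
>>>>>> > C++17 spelling, and the exact spellings in the standalone
>>>>>> > GitHub release may differ, so see its documentation:
>>>>>> >
>>>>>> >   #include <algorithm>
>>>>>> >   #include <execution>
>>>>>> >   #include <vector>
>>>>>> >
>>>>>> >   int main() {
>>>>>> >     std::vector<double> v(1 << 20, 1.0);
>>>>>> >
>>>>>> >     // seq / par / par_unseq are the C++17 policies; unseq is
>>>>>> >     // the experimental SIMD-only policy mentioned above.
>>>>>> >     std::sort(std::execution::par, v.begin(), v.end());
>>>>>> >     std::for_each(std::execution::par_unseq, v.begin(), v.end(),
>>>>>> >                   [](double &x) { x *= 2.0; });
>>>>>> >   }
>>>>>> >
>>>>>> > The same calls without a policy argument remain the ordinary
>>>>>> > sequential algorithms.)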
>>>>>> >
>>>>>> > The implementation is not specific to Intel’s
>>>>>> hardware. For thread-level parallelism
>>>>>> > it uses TBB*
>>>>>> (https://www.threadingbuildingblocks.org/) but
>>>>>> abstracts it with
>>>>>> > an internal API which can be implemented on top
>>>>>> of other threading/parallel solutions –
>>>>>> > so it is for the community to decide which ones
>>>>>> to use. For SIMD parallelism
>>>>>> > (unseq, par_unseq) we use #pragma omp simd
>>>>>> directives; it is vendor-neutral and
>>>>>> > does not require any OpenMP runtime support.
>>>>>> >
>>>>>> > The current implementation meets the spirit but
>>>>>> not always the letter of
>>>>>> > the standard, because it has to be separate
>>>>>> from but also coexist with
>>>>>> > implementations of standard C++ libraries.
>>>>>> While preparing the contribution,
>>>>>> > we will address inconsistencies, adjust the
>>>>>> code to meet community standards,
>>>>>> > and better integrate it into the standard
>>>>>> library code.
>>>>>> >
>>>>>> > We are also proposing that our implementation
>>>>>> be included into libstdc++/GCC.
>>>>>> > Compatibility between the implementations seems
>>>>>> useful as it can potentially
>>>>>> > reduce the amount of work for everyone. We hope
>>>>>> to keep the code mostly identical,
>>>>>> > and would like to know if you think that is too
>>>>>> optimistic to expect.
>>>>>> >
>>>>>> > Obviously we plan to use appropriate open
>>>>>> source licenses to meet the different
>>>>>> > projects’ requirements.
>>>>>> >
>>>>>> > We expect to keep developing the code and will
>>>>>> take the responsibility for
>>>>>> > maintaining it (with community contributions,
>>>>>> of course). If there are other
>>>>>> > community efforts to implement parallel
>>>>>> algorithms, we are willing to collaborate.
>>>>>> >
>>>>>> > We look forward to your feedback, both for the
>>>>>> overall idea and – if supported –
>>>>>> > for the next steps we should take.
>>>>>> >
>>>>>> > Regards,
>>>>>> > - Alexey Kukanov
>>>>>> >
>>>>>> > * Note that TBB itself is highly portable (and
>>>>>> ported by the community to Power and ARM
>>>>>> > architectures) and permissively licensed, so it
>>>>>> could be the base for the threading
>>>>>> > infrastructure. But the Parallel STL
>>>>>> implementation itself does not require TBB.
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jeff Hammond
>>>>>> jeff.science at gmail.com
>>>>>> http://jeffhammond.github.io/
>>>>>>
>>>>> --
>>>>> Jeff Hammond
>>>>> jeff.science at gmail.com
>>>>> http://jeffhammond.github.io/
>>>>
>>>>
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>> --
>>> Jeff Hammond
>>> jeff.science at gmail.com
>>> http://jeffhammond.github.io/
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> http://jeffhammond.github.io/
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory