[cfe-dev] [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Fri Dec 8 17:35:54 PST 2017

On Fri, Dec 8, 2017 at 2:34 PM, Hal Finkel <hfinkel at anl.gov> wrote:

>
> On 12/08/2017 03:55 PM, Jeff Hammond wrote:
>
>
>
> On Fri, Dec 8, 2017 at 1:13 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> On 12/07/2017 11:35 AM, Jeff Hammond wrote:
>>
>>
>> On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>>
>>> On 12/06/2017 10:23 PM, Jeff Hammond wrote:
>>>
>>>
>>> On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>>>
>>>> On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
>>>>
>>>> I agree that guarantees provided by ICC may be stronger than with other
>>>> compilers, so yes, under OpenMP terms vectorization is permitted and cannot
>>>> be assumed. However OpenMP clearly defines semantics of variables used
>>>> within OpenMP region some being shared(scalar), some private(vector) and
>>>> some being inductions. This goes far beyond typical compiler specific
>>>> pragmas about dependencies and cost modelling and makes vectorization much
>>>> simpler task with more predictable and robust results if properly
>>>> implemented (admittedly, even ICC implementation is far from perfect). I
>>>> hope Intel's efforts to standardize something like this in core C++ will
>>>> eventually come to fruition. Until then I as a regular application developer
>>>> would appreciate OpenMP-simd based execution policy (hoping for good
>>>> support for OpenMP SIMD in clang), but it shouldn't necessarily be part of
>>>> libc++. Since 'unordered' execution policy is currently not part of C++
>>>> standard
>>>>
>>>>
>>>> std::execution::par_unseq is part of C++17, and that essentially maps
>>>> to '#pragma omp parallel for simd'.
>>>>
>>>>
>>> Do you expect par/par_unseq to nest?
>>>
>>>
>>> Yes.
>>>
>>>
>>> Nesting omp-parallel is generally regarded as a Bad Idea.
>>>
>>>
>>> Agreed. I suspect we'll want the mapping to be more like '#pragma omp
>>> taskloop simd'.
>>>
>>>
>> That won’t run in parallel unless in an omp-parallel-master region.
>>
>>
>> Yes.
>>
>> That means OpenMP-based PSTL won’t be parallel unless the user knows to
>> add back-end specific code about the PSTL.
>>
>>
>> That obviously wouldn't be acceptable.
>>
>>
>> What I’m trying to say is that OpenMP is a poor target for PSTL in its
>> current form. Nested parallel regions are the only thing that works and it
>> is likely to work poorly.
>>
>>
>> I'm not sure that's true, but the technique may not be trivial. I believe
>> that it is possible, however. For example, the mapping might be to
>> something like:
>>
>> if (omp_in_parallel()) {
>> #pragma omp taskloop simd
>>   for (size_t i = 0; i < N; ++i)
>>     F(X[i]);
>> } else {
>> #pragma omp parallel
>>   {
>> #pragma omp taskloop simd
>>      for (size_t i = 0; i < N; ++i)
>>        F(X[i]);
>>   }
>> }
>>
>> The fact that we'd need to use this kind of pattern is a bit unfortunate,
>> but it can be easily abstracted into a template function, so it just
>> becomes some implementation detail of the library.
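A minimal sketch of how the pattern above might be folded into a template function. The name `taskloop_simd_for_each` and the demo function are hypothetical and not part of any library. One assumption is added beyond the snippet above: when a new team is created, a single thread generates the taskloop, since otherwise every thread in the team would run the whole loop. Without OpenMP enabled the sketch falls back to a plain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not from any library): applies f to x[0..n).
template <class T, class F>
void taskloop_simd_for_each(T* x, std::size_t n, F f) {
#ifdef _OPENMP
  if (omp_in_parallel()) {
    // Already inside a team: the calling thread only generates tasks.
#pragma omp taskloop simd
    for (std::size_t i = 0; i < n; ++i) f(x[i]);
  } else {
    // No team yet: create one and let a single thread generate the tasks
    // (assumption added here; the other threads pick the tasks up).
#pragma omp parallel
    {
#pragma omp single
      {
#pragma omp taskloop simd
        for (std::size_t i = 0; i < n; ++i) f(x[i]);
      }
    }
  }
#else
  // OpenMP disabled: plain sequential loop.
  for (std::size_t i = 0; i < n; ++i) f(x[i]);
#endif
}

// Small self-check: doubles 1000 ones and returns their sum.
inline long doubled_sum_demo() {
  std::vector<long> v(1000, 1);
  taskloop_simd_for_each(v.data(), v.size(), [](long& e) { e *= 2; });
  long sum = 0;
  for (long e : v) sum += e;
  assert(sum == 2000);
  return sum;
}
```

Callers see one function regardless of whether they are already inside a parallel region, which is the sense in which the branch becomes an implementation detail.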
>>
>>
> You are right and that is probably the best way to do it with OpenMP.  I
> am concerned about the absolute performance, based upon my observations of
> omp-taskloop vs omp-for and tbb::parallel_for
>
>
> Have you tried this recently? There was a recursive task-stealing strategy
> added to our OpenMP library in July of this year (r308338) which should
> have made the performance of taskloop better.
>
>
I ran those benchmarks this summer with Intel 18 beta.  Tom from LLNL
mentioned that a stealing-based implementation of OpenMP taskloop was
feasible but I didn't investigate whether it was used.  Obviously, I know
some people who can help me answer questions about the LLVM OpenMP runtime
;-)

> in the PRK project, but at least it is sane from a semantic perspective.
> Having motivating use cases like PSTL should lead to improvements in OpenMP
> runtime performance w.r.t. taskloop.
>
>
> Indeed :-)
>
>
> https://i.stack.imgur.com/MVd5j.png is a snapshot of the performance of
> PRK stencil (https://github.com/ParRes/Kernels/tree/master/Cxx11), which
> shows taskloop loses to TBB-based PSTL, OpenMP for, and tbb::parallel_for
> (pure TBB beats TBB-based PSTL because I use tbb::blocked_range2d, which
> improves cache utilization).  I think those results tuned taskloop
> grainsize as well, so they may be an optimistic representation of taskloop
> in general usage.
>
>
> Interesting.
>

I should try to figure out how to recreate what TBB does with PSTL since
it's clearly beneficial, at least on KNL.  Obviously, I can block loops
manually as I do with raw OpenMP code, but I'm sure there's a nicer way.
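The manual loop blocking mentioned above can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: the function name `tiled_stencil`, the 5-point average, and the `tile` parameter are hypothetical, are not taken from the PRK code, and do not claim to reproduce how tbb::blocked_range2d partitions work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Applies a 5-point average to the interior of an n x n row-major grid,
// visiting the interior tile by tile so each tile's rows stay cache-resident.
// `tile` is a hypothetical tuning parameter and must be greater than zero.
inline void tiled_stencil(const std::vector<double>& in,
                          std::vector<double>& out,
                          std::size_t n, std::size_t tile) {
  for (std::size_t it = 1; it + 1 < n; it += tile) {
    for (std::size_t jt = 1; jt + 1 < n; jt += tile) {
      // Clamp the tile so it never touches the boundary row or column.
      const std::size_t iend = std::min(it + tile, n - 1);
      const std::size_t jend = std::min(jt + tile, n - 1);
      for (std::size_t i = it; i < iend; ++i) {
        for (std::size_t j = jt; j < jend; ++j) {
          out[i * n + j] = 0.2 * (in[i * n + j] + in[(i - 1) * n + j] +
                                  in[(i + 1) * n + j] + in[i * n + j - 1] +
                                  in[i * n + j + 1]);
        }
      }
    }
  }
}
```

Each tile writes a disjoint part of `out`, so the two outer loops are the natural unit to hand to a parallel loop or task construct.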

>
> I'll see if I can prototype this in RAJA or Intel PSTL.  It's not hard to
> get results directly from the PRK tests, if the former attempts fail.
>
> Correct: I'll see if I can prototype for_each.  The rest will be left as
an exercise for the reader :-D

Jeff

> Thanks!
>
>  -Hal
>
>
>
> Best,
>
> Jeff
>
>
>> Thanks again,
>> Hal
>>
>>
>>
>> Jeff
>>
>>
>>>  -Hal
>>>
>>>
>>>
>>> Jeff
>>>
>>>
>>>> I don't care much about how it will be implemented in libc++ if it is. I
>>>> just would like to ask Intel guys and community here to make implementation
>>>> extensible in a sense that custom OpenMP-SIMD-based execution policy along
>>>> with algorithms implementations (as specializations for the policy) can be
>>>> used with the libc++ library. And I additionally would like to ask Intel
>>>> guys to provide complete and compatible extension on github for developers
>>>> like me to use.
>>>>
>>>>
>>>> In the end, I think we want the following:
>>>>
>>>>  1. A design for libc++ that allows the thread-level parallelism to be
>>>> implemented in terms of different underlying providers (i.e., OpenMP, GCD,
>>>> Work Queues on Windows, whatever else).
>>>>  2. To follow the same philosophy with respect to standards as we do
>>>> everywhere else: Use standards where possible with compiler/system-specific
>>>> extensions as necessary.
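A minimal sketch of what point 1 could look like, assuming a design in which each provider is a type exposing a static `parallel_for`. All names here (`serial_backend`, `openmp_backend`, `default_backend`, `for_each_index`) are hypothetical and are not taken from libc++ or from the Intel implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical provider: runs the body sequentially.
struct serial_backend {
  template <class Body>
  static void parallel_for(std::size_t first, std::size_t last, Body body) {
    for (std::size_t i = first; i < last; ++i) body(i);
  }
};

#ifdef _OPENMP
// Hypothetical provider: only available when OpenMP is enabled.
struct openmp_backend {
  template <class Body>
  static void parallel_for(std::size_t first, std::size_t last, Body body) {
#pragma omp parallel for
    for (std::size_t i = first; i < last; ++i) body(i);
  }
};
using default_backend = openmp_backend;
#else
using default_backend = serial_backend;
#endif

// The algorithm is written once against the provider interface.
template <class Backend = default_backend, class T, class F>
void for_each_index(std::vector<T>& v, F f) {
  Backend::parallel_for(0, v.size(), [&](std::size_t i) { f(v[i]); });
}
```

A GCD or Windows thread-pool provider would be another type with the same static member, selected at build time or passed explicitly as `for_each_index<serial_backend>(v, f)`.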
>>>>
>>>>  -Hal
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Serge.
>>>>
>>>>
>>>>
>>>> 04.12.2017, 12:07, "Jeff Hammond" <jeff.science at gmail.com>:
>>>>
>>>> ICC implements a very aggressive interpretation of the OpenMP standard,
>>>> and this interpretation is not shared by everyone in the OpenMP community.
>>>> ICC is correct but other implementations may be far less aggressive, so
>>>> _Pragma("omp simd") doesn't guarantee vectorization unless the compiler
>>>> documentation says that is how it is implemented.  All the standard says
>>>> that it means is that vectorization is _permitted_.
>>>>
>>>> Given that the practical meaning of _Pragma("omp simd") isn't
>>>> guaranteed to be consistent across different implementations, I don't
>>>> really know how to compare it to compiler-specific pragmas unless we define
>>>> everything explicitly.
>>>>
>>>> In any case, my fundamental point remains: do not use OpenMP pragmas
>>>> here, but instead use whatever the appropriate compiler-specific pragma is,
>>>> or create a new one that meets the need.
>>>>
>>>> Best,
>>>>
>>>> Jeff
>>>>
>>>>
>>>> On Sun, Dec 3, 2017 at 8:09 PM, Serge Preis <spreis at yandex-team.ru>
>>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> _Pragma("omp simd") is semantically quite different from _Pragma("clang
>>>> loop vectorize(assume_safety)"), _Pragma("GCC ivdep") and _Pragma("vector
>>>> always"), so I am not sure the latter will all work as expected in all cases.
>>>> They definitely won't provide any vectorization guarantees, which slightly
>>>> defeats the purpose of using the corresponding execution policy.
>>>>
>>>> I support the idea of having OpenMP orthogonal and definitely having
>>>> -fopenmp enabled by default is not an option. Intel compiler has separate
>>>> -qopenmp-simd option which doesn't affect performance outside explicitly
>>>> marked loops, but even this is not enabled by default. I would say that
>>>> multiple implementations of the unordered policy might exist: initially,
>>>> the OpenMP SIMD based implementation may be the more powerful one, with
>>>> an implementation based on other pragmas being the default but hinting
>>>> at the existence of a faster option. Later on, someone may be brave
>>>> enough to add a SIMD template library and implement the default
>>>> unordered policy using it (such an implementation is possible even now
>>>> using vector types, but it will be extremely complex if it attempts to
>>>> target all base data types, vector widths and target SIMD architectures
>>>> clang supports. Even with the library this may be quite tedious).
>>>>
>>>> Without any standard way of expressing SIMD parallelism in pure C++, any
>>>> implementer of a SIMD execution policy has to rely on the means available
>>>> for the platform/compiler, so it is not totally unnatural to ask the user
>>>> to enable OpenMP SIMD for efficient support of the corresponding
>>>> execution policy.
>>>>
>>>> Regards,
>>>> Serge Preis
>>>>
>>>> (Who once was part of Intel Compiler Vectorizer team and drove OpenMP
>>>> SIMD efforts within icc and beyond, if anyone is keeping track of
>>>> conflicts-of-interest)
>>>>
>>>>
>>>> 04.12.2017, 08:46, "Jeff Hammond via cfe-dev" <cfe-dev at lists.llvm.org>:
>>>>
>>>> It would be nice to keep PSTL and OpenMP orthogonal, even if
>>>> _Pragma("omp simd") does not require runtime support.  It should be trivial
>>>> to use _Pragma("clang loop vectorize(assume_safety)") instead, by wrapping
>>>> all of the different compiler vectorization pragmas in preprocessor logic.
>>>> I similarly recommend _Pragma("GCC ivdep") for GCC and _Pragma("vector
>>>> always") for ICC.  This requires O(n_compilers) effort instead of
>>>> O(1), but orthogonality is worth it.
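The preprocessor wrapping described above might look like the following sketch. The macro name `SKETCH_PRAGMA_SIMD` and the `axpy` function are hypothetical; the three pragmas are the ones named in the message. The Intel check comes first and the Clang check second because both of those compilers also define `__GNUC__`. As discussed elsewhere in the thread, none of these hints guarantees vectorization.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro: expands to the vectorization hint of the compiler in use.
#if defined(__INTEL_COMPILER)
#define SKETCH_PRAGMA_SIMD _Pragma("vector always")
#elif defined(__clang__)
#define SKETCH_PRAGMA_SIMD _Pragma("clang loop vectorize(assume_safety)")
#elif defined(__GNUC__)
#define SKETCH_PRAGMA_SIMD _Pragma("GCC ivdep")
#else
#define SKETCH_PRAGMA_SIMD  // unknown compiler: no hint
#endif

// y[i] += a * x[i]; the hint must sit directly in front of the loop.
inline void axpy(double a, const double* x, double* y, std::size_t n) {
  SKETCH_PRAGMA_SIMD
  for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
}
```

No OpenMP flag is involved, so the loop compiles the same way with or without -fopenmp.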
>>>>
>>>> While OpenMP is vendor/compiler-agnostic, users should not be required
>>>> to use -fopenmp or similar to enable vectorization from PSTL, nor should
>>>> the compiler enable any OpenMP pragma by default.  I know of cases where
>>>> merely using the -fopenmp flag alters code generation in a
>>>> performance-visible manner, and enabling the OpenMP "simd" pragma by
>>>> default may surprise some users, particularly if no other OpenMP pragmas
>>>> are enabled by default.
>>>>
>>>> Best,
>>>>
>>>> Jeff
>>>> (who works for Intel but not on any software products and has been a
>>>> heavy user of Intel PSTL since it was released, if anyone is keeping track
>>>> of conflicts-of-interest)
>>>>
>>>> On Wed, Nov 29, 2017 at 4:21 AM, Kukanov, Alexey via cfe-dev <
>>>> cfe-dev at lists.llvm.org> wrote:
>>>> >
>>>> > Hello all,
>>>> >
>>>> > At Intel, we have developed an implementation of C++17 execution
>>>> > policies for algorithms (often referred to as Parallel STL). We hope
>>>> > to contribute it to libc++/LLVM, so would like to ask the community
>>>> > for comments on this.
>>>> >
>>>> > The code is already published at GitHub
>>>> > (https://github.com/intel/parallelstl). It supports the C++17
>>>> > standard execution policies (seq, par, par_unseq) as well as the
>>>> > experimental unsequenced policy (unseq) for SIMD execution. At the
>>>> > moment, about half of the C++17 standard algorithms that must support
>>>> > execution policies are implemented; a few more will be ready soon,
>>>> > and the work continues. The tests that we use are also available at
>>>> > GitHub; needless to say we will contribute those as well.
>>>> >
>>>> > The implementation is not specific to Intel’s hardware. For
>>>> > thread-level parallelism it uses TBB*
>>>> > (https://www.threadingbuildingblocks.org/) but abstracts it with an
>>>> > internal API which can be implemented on top of other
>>>> > threading/parallel solutions – so it is for the community to decide
>>>> > which ones to use. For SIMD parallelism (unseq, par_unseq) we use
>>>> > #pragma omp simd directives; it is vendor-neutral and does not
>>>> > require any OpenMP runtime support.
>>>> >
>>>> > The current implementation meets the spirit but not always the letter
>>>> > of the standard, because it has to be separate from but also coexist
>>>> > with implementations of standard C++ libraries. While preparing the
>>>> > contribution, we will address inconsistencies, adjust the code to
>>>> > meet community standards, and better integrate it into the standard
>>>> > library code.
>>>> >
>>>> > We are also proposing that our implementation is included into
>>>> > libstdc++/GCC. Compatibility between the implementations seems useful
>>>> > as it can potentially reduce the amount of work for everyone. We hope
>>>> > to keep the code mostly identical, and would like to know if you
>>>> > think it’s too optimistic to expect.
>>>> >
>>>> > Obviously we plan to use appropriate open source licenses to meet the
>>>> > different projects’ requirements.
>>>> >
>>>> > We expect to keep developing the code and will take the
>>>> > responsibility for maintaining it (with community contributions, of
>>>> > course). If there are other community efforts to implement parallel
>>>> > algorithms, we are willing to collaborate.
>>>> >
>>>> > We look forward to your feedback, both for the overall idea and – if
>>>> > supported – for the next steps we should take.
>>>> >
>>>> > Regards,
>>>> > - Alexey Kukanov
>>>> >
>>>> > * Note that TBB itself is highly portable (and ported by community to
>>>> > Power and ARM architectures) and permissively licensed, so could be
>>>> > the base for the threading infrastructure. But the Parallel STL
>>>> > implementation itself does not require TBB.
>>>> >
>>>> > _______________________________________________
>>>> > cfe-dev mailing list
>>>> > cfe-dev at lists.llvm.org
>>>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> jeff.science at gmail.com
>>>> http://jeffhammond.github.io/
>>>>
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>>

-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/