On 12/06/2017 10:23 PM, Jeff Hammond wrote:
On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel wrote: 

On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
>>     I agree that guarantees provided by ICC may be stronger than with
>>     other compilers, so yes, under OpenMP terms vectorization is
>>     permitted and cannot be assumed. However, OpenMP clearly defines
>>     the semantics of variables used within an OpenMP region: some are
>>     shared (scalar), some private (vector), and some are inductions.
>>     This goes far beyond typical compiler-specific pragmas about
>>     dependencies and cost modelling, and makes vectorization a much
>>     simpler task with more predictable and robust results if properly
>>     implemented (admittedly, even the ICC implementation is far from
>>     perfect). I hope Intel's efforts to standardize something like
>>     this in core C++ will eventually come to fruition. Until then I, as
>>     a regular application developer, would appreciate an OpenMP-simd
>>     based execution policy (hoping for good support for OpenMP SIMD
>>     in clang), but it shouldn't necessarily be part of libc++. Since
>>     'unordered' execution policy is currently not part of C++ standard
>     std::execution::par_unseq is part of C++17, and that essentially
>     maps to '#pragma omp parallel for simd'.
> Do you expect par/par_unseq to nest?


> Nesting omp-parallel is generally regarded as a Bad Idea.

Agreed. I suspect we'll want the mapping to be more like '#pragma omp 
taskloop simd'.
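
To make the suggested mapping concrete, here is a minimal sketch of a loop annotated with 'omp taskloop simd'. The names SKETCH_PAR_UNSEQ_LOOP and sketch_add are hypothetical and are not part of libc++ or the Intel PSTL. The sketch assumes taskloop needs OpenMP 4.5 (_OPENMP >= 201511) and falls back to a plain serial loop otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// _OPENMP is only defined when OpenMP is enabled (e.g. with -fopenmp).
// taskloop was introduced in OpenMP 4.5, hence the version check.
#if defined(_OPENMP) && _OPENMP >= 201511
#  define SKETCH_PAR_UNSEQ_LOOP _Pragma("omp taskloop simd")
#else
#  define SKETCH_PAR_UNSEQ_LOOP /* no OpenMP: plain serial loop */
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for all i.
void sketch_add(const std::vector<double>& a,
                const std::vector<double>& b,
                std::vector<double>& out) {
    assert(a.size() == b.size());
    out.resize(a.size());
    const double* pa = a.data();
    const double* pb = b.data();
    double* po = out.data();
    const std::size_t n = a.size();

    // The loop is split into tasks, and each task's chunk may be vectorized.
    // The tasks run on whatever thread team already exists, so no new
    // parallel region is opened here.
    SKETCH_PAR_UNSEQ_LOOP
    for (std::size_t i = 0; i < n; ++i) {
        po[i] = pa[i] + pb[i];
    }
}
```

Because taskloop only generates tasks for the enclosing team, calling sketch_add from inside an existing parallel region does not nest another 'omp parallel'. Called outside any parallel region, the tasks simply run on the calling thread.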



>>     I don't care much about how it will be implemented in libc++, if it
>>     is. I would just like to ask the Intel guys and the community here to
>>     make the implementation extensible, in the sense that a custom
>>     OpenMP-SIMD-based execution policy, along with algorithm
>>     implementations (as specializations for the policy), can be used
>>     with the libc++ library. I would additionally like to ask the
>>     Intel guys to provide a complete and compatible extension on GitHub
>>     for developers like me to use.
>     In the end, I think we want the following:
>      1. A design for libc++ that allows the thread-level parallelism
>     to be implemented in terms of different underlying providers
>     (e.g., OpenMP, GCD, Work Queues on Windows, whatever else).
>      2. To follow the same philosophy with respect to standards as we
>     do everywhere else: Use standards where possible with
>     compiler/system-specific extensions as necessary.
-Hal
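
The first goal above, pluggable providers for thread-level parallelism, can be sketched roughly as follows. This is only an illustration under assumptions: serial_backend, openmp_backend, default_backend and sketch_transform are hypothetical names, not an actual libc++ or PSTL interface. The OpenMP provider is compiled only when the compiler defines _OPENMP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fallback provider: runs the body serially on the calling thread.
struct serial_backend {
    template <class Body>
    static void parallel_for(std::size_t first, std::size_t last, Body body) {
        assert(first <= last);
        for (std::size_t i = first; i < last; ++i) body(i);
    }
};

#if defined(_OPENMP)
// OpenMP provider: only available when OpenMP is enabled (e.g. -fopenmp).
struct openmp_backend {
    template <class Body>
    static void parallel_for(std::size_t first, std::size_t last, Body body) {
        assert(first <= last);
        #pragma omp parallel for
        for (std::size_t i = first; i < last; ++i) body(i);
    }
};
using default_backend = openmp_backend;
#else
using default_backend = serial_backend;
#endif

// The algorithm is written once against the provider interface; a GCD or
// Windows thread-pool provider would only need its own parallel_for.
template <class Backend = default_backend, class T, class Op>
void sketch_transform(const std::vector<T>& in, std::vector<T>& out, Op op) {
    out.resize(in.size());
    Backend::parallel_for(0, in.size(),
                          [&](std::size_t i) { out[i] = op(in[i]); });
}
```

A caller can pick a provider explicitly, for example sketch_transform<serial_backend>(in, out, op), or rely on the default chosen at compile time.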

Regards,
Serge.
04.12.2017, 12:07, "Jeff Hammond" wrote:

>>>     ICC implements a very aggressive interpretation of the OpenMP
>>>     standard, and this interpretation is not shared by everyone in
>>>     the OpenMP community. ICC is correct but other implementations
>>>     may be far less aggressive, so _Pragma("omp simd") doesn't
>>>     guarantee vectorization unless the compiler documentation says
>>>     that is how it is implemented.  All the standard says is that
>>>     vectorization is _permitted_.
>>>     Given that the practical meaning of _Pragma("omp simd") isn't
>>>     guaranteed to be consistent across different implementations, I
>>>     don't really know how to compare it to compiler-specific pragmas
>>>     unless we define everything explicitly.
>>>     In any case, my fundamental point remains: do not use OpenMP
>>>     pragmas here, but instead use whatever the appropriate
>>>     compiler-specific pragma is, or create a new one that meets the
>>>     need.
Best,
Jeff

On Sun, Dec 3, 2017 at 8:09 PM, Serge Preis wrote:

Hello,
>>>         _Pragma("omp simd") is semantically quite different from
>>>         _Pragma("clang loop vectorize(assume_safety)"), _Pragma("GCC
>>>         ivdep") and _Pragma("vector always"), so I am not sure all of
>>>         the latter will work as expected in all cases. They definitely
>>>         won't provide any vectorization guarantees, which slightly
>>>         defeats the purpose of using the corresponding execution policy.
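
One way to see the semantic difference is a reduction loop, where 'omp simd' lets the programmer state how each variable behaves instead of only asserting that the loop is safe. The following is a minimal sketch. SKETCH_OMP, SKETCH_USE_OMP_SIMD and sketch_dot are hypothetical names, and the sketch assumes that -fopenmp-simd alone does not define _OPENMP, so the user would define SKETCH_USE_OMP_SIMD by hand in that case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#define SKETCH_STR(x) #x
#if defined(_OPENMP) || defined(SKETCH_USE_OMP_SIMD)
// Expands to e.g. _Pragma("omp simd reduction(+ : sum)").
#  define SKETCH_OMP(directive) _Pragma(SKETCH_STR(omp directive))
#else
#  define SKETCH_OMP(directive) /* OpenMP SIMD not enabled: plain loop */
#endif

// Hypothetical helper: dot product of two equally sized vectors.
double sketch_dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* px = x.data();   // read-only inputs, shared by all lanes
    const double* py = y.data();
    const std::size_t n = x.size();
    double sum = 0.0;              // reduction variable

    // reduction(+ : sum) gives each SIMD lane a private partial sum and
    // combines the partial sums after the loop; i is the induction variable.
    SKETCH_OMP(simd reduction(+ : sum))
    for (std::size_t i = 0; i < n; ++i) {
        sum += px[i] * py[i];
    }
    return sum;
}
```

The dependency-style pragmas listed above have no counterpart to the reduction clause, so for this loop they leave it to the compiler to recognize the reduction by itself.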
>>>         I support the idea of keeping OpenMP orthogonal, and
>>>         having -fopenmp enabled by default is definitely not an
>>>         option. The Intel compiler has a separate -qopenmp-simd option
>>>         which doesn't affect performance outside explicitly marked
>>>         loops, but even this is not enabled by default. I would say
>>>         that multiple implementations of the unordered policy might
>>>         exist: initially, an OpenMP SIMD based implementation may be
>>>         the more powerful one, with one based on other pragmas being
>>>         the default but hinting at the existence of a faster option.
>>>         Later on, someone may be brave enough to add a SIMD template
>>>         library and implement the default unordered policy using it
>>>         (such an implementation is possible even now using vector
>>>         types, but it will be extremely complex if it attempts to
>>>         target all the base data types, vector widths and target SIMD
>>>         architectures clang supports. Even with the library this may
>>>         be quite tedious).
>>>         Without any standard way of expressing SIMD parallelism in
>>>         pure C++, any implementer of a SIMD execution policy has to rely
>>>         on the means available for the platform/compiler, so it is not
>>>         totally unnatural to ask the user to enable OpenMP SIMD for
>>>         efficient support of the corresponding execution policy.
>>>         Regards,
>>>         Serge Preis
>>>         (Who was once part of the Intel Compiler Vectorizer team and
>>>         drove OpenMP SIMD efforts within icc and beyond, if anyone
>>>         is keeping track of conflicts-of-interest)
04.12.2017, 08:46, "Jeff Hammond via cfe-dev" wrote:

>>>>         It would be nice to keep PSTL and OpenMP orthogonal, even
>>>>         if _Pragma("omp simd") does not require runtime support. 
>>>>         It should be trivial to use _Pragma("clang loop
>>>>         vectorize(assume_safety)") instead, by wrapping all of the
>>>>         different compiler vectorization pragmas in preprocessor
>>>>         logic.  I similarly recommend _Pragma("GCC ivdep") for GCC
>>>>         and _Pragma("vector always") for ICC. This requires
>>>>         O(n_compilers) effort instead of O(1), but orthogonality is
>>>>         worth it.
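
The preprocessor wrapping described above could look roughly like this. It is a sketch, not the PSTL code: the macro name SKETCH_SIMD_HINT and the function sketch_saxpy are hypothetical, while the three pragmas are the ones named in the message. ICC is tested first because it also defines __GNUC__, and clang is tested before GCC for the same reason.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Select one vectorization hint per compiler; no OpenMP flag is needed.
#if defined(__INTEL_COMPILER)
#  define SKETCH_SIMD_HINT _Pragma("vector always")
#elif defined(__clang__)
#  define SKETCH_SIMD_HINT _Pragma("clang loop vectorize(assume_safety)")
#elif defined(__GNUC__)
#  define SKETCH_SIMD_HINT _Pragma("GCC ivdep")
#else
#  define SKETCH_SIMD_HINT /* unknown compiler: no hint */
#endif

// Hypothetical helper: y[i] = a * x[i] + y[i] for all i.
void sketch_saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* px = x.data();
    float* py = y.data();
    const std::size_t n = x.size();

    // The hint must sit directly in front of the loop it applies to.
    SKETCH_SIMD_HINT
    for (std::size_t i = 0; i < n; ++i) {
        py[i] = a * px[i] + py[i];
    }
}
```

Each hint only tells its compiler that vectorizing the loop is safe or desired. None of them guarantees that the loop is actually vectorized, which is the trade-off discussed in the replies.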
>>>>         While OpenMP is vendor/compiler-agnostic, users should not
>>>>         be required to use -fopenmp or similar to enable
>>>>         vectorization from PSTL, nor should the compiler enable any
>>>>         OpenMP pragma by default.  I know of cases where merely
>>>>         using the -fopenmp flag alters code generation in a
>>>>         performance-visible manner, and enabling the OpenMP "simd"
>>>>         pragma by default may surprise some users, particularly if
>>>>         no other OpenMP pragmas are enabled by default.
>>>>         Best,
>>>>         Jeff
>>>>         (who works for Intel but not on any software products and
>>>>         has been a heavy user of Intel PSTL since it was released,
>>>>         if anyone is keeping track of conflicts-of-interest)
On Wed, Nov 29, 2017 at 4:21 AM, Kukanov, Alexey via cfe-dev wrote:



Hello all,
>>>>         >
>>>>         > At Intel, we have developed an implementation of C++17
>>>>         > execution policies for algorithms (often referred to as
>>>>         > Parallel STL). We hope to contribute it to libc++/LLVM, so
>>>>         > would like to ask the community for comments on this.
>>>>         >
>>>>         > The code is already published at GitHub
>>>>         > (https://github.com/intel/parallelstl). It supports the
>>>>         > C++17 standard execution policies (seq, par, par_unseq) as
>>>>         > well as the experimental unsequenced policy (unseq) for
>>>>         > SIMD execution. At the moment, about half of the C++17
>>>>         > standard algorithms that must support execution policies
>>>>         > are implemented; a few more will be ready soon, and the
>>>>         > work continues. The tests that we use are also available
>>>>         > at GitHub; needless to say we will contribute those as
>>>>         > well.
>>>>         >
>>>>         > The implementation is not specific to Intel’s hardware.
>>>>         > For thread-level parallelism it uses TBB*
>>>>         > (https://www.threadingbuildingblocks.org/) but abstracts
>>>>         > it with an internal API which can be implemented on top of
>>>>         > other threading/parallel solutions – so it is for the
>>>>         > community to decide which ones to use. For SIMD
>>>>         > parallelism (unseq, par_unseq) we use #pragma omp simd
>>>>         > directives; it is vendor-neutral and does not require any
>>>>         > OpenMP runtime support.
>>>>         >
>>>>         > The current implementation meets the spirit but not always
>>>>         > the letter of the standard, because it has to be separate
>>>>         > from but also coexist with implementations of standard C++
>>>>         > libraries. While preparing the contribution, we will
>>>>         > address inconsistencies, adjust the code to meet community
>>>>         > standards, and better integrate it into the standard
>>>>         > library code.
>>>>         >
>>>>         > We are also proposing that our implementation is included
>>>>         > into libstdc++/GCC. Compatibility between the
>>>>         > implementations seems useful as it can potentially reduce
>>>>         > the amount of work for everyone. We hope to keep the code
>>>>         > mostly identical, and would like to know if you think it’s
>>>>         > too optimistic to expect.
>>>>         >
>>>>         > Obviously we plan to use appropriate open source licenses
>>>>         > to meet the different projects’ requirements.
>>>>         >
>>>>         > We expect to keep developing the code and will take the
>>>>         > responsibility for maintaining it (with community
>>>>         > contributions, of course). If there are other community
>>>>         > efforts to implement parallel algorithms, we are willing
>>>>         > to collaborate.
>>>>         >
>>>>         > We look forward to your feedback, both for the overall
>>>>         > idea and – if supported – for the next steps we should
>>>>         > take.
>>>>         >
Regards,
- Alexey Kukanov
>>>>         >
>>>>         > * Note that TBB itself is highly portable (and ported by
>>>>         > community to Power and ARM architectures) and permissively
>>>>         > licensed, so could be the base for the threading
>>>>         > infrastructure. But the Parallel STL implementation itself
>>>>         > does not require TBB.
>>>>         >
