<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Dec 8, 2017 at 1:13 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"><span class="gmail-">
<p><br>
</p>
<div class="gmail-m_-2288114881520531127moz-cite-prefix">On 12/07/2017 11:35 AM, Jeff Hammond
wrote:<br>
</div>
<blockquote type="cite">
<div><br>
<div class="gmail_quote">
<div dir="auto">On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p><br>
</p>
<div class="gmail-m_-2288114881520531127m_8513274869410520852moz-cite-prefix">On
12/06/2017 10:23 PM, Jeff Hammond wrote:<br>
</div>
<blockquote type="cite">
<div><br>
<div class="gmail_quote">
<div dir="auto">On Wed, Dec 6, 2017 at 4:23 PM Hal
Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p><br>
</p>
<div class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-cite-prefix">On
12/04/2017 10:48 PM, Serge Preis via cfe-dev
wrote:<br>
</div>
<blockquote type="cite">
<div>I agree that the guarantees provided by ICC
may be stronger than with other compilers,
so yes, under OpenMP terms vectorization is
permitted and cannot be assumed. However,
OpenMP clearly defines the semantics of
variables used within an OpenMP region: some
are shared (scalar), some private (vector),
and some are inductions (see the sketch
below). This goes far beyond typical
compiler-specific pragmas about dependencies
and cost modelling, and it makes vectorization
a much simpler task with more predictable and
robust results if properly implemented
(admittedly, even the ICC implementation is
far from perfect). I hope Intel's efforts to
standardize something like this in core C++
will eventually come to fruition. Until then,
I, as a regular application developer, would
appreciate an OpenMP-SIMD-based execution
policy (hoping for good support for OpenMP
SIMD in clang), but it shouldn't necessarily
be part of libc++. Since the 'unordered'
execution policy is currently not part of the
C++ standard </div>
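<div> </div>
<div>A minimal sketch of what those clauses look like on a plain
SIMD loop (the function and variable names below are just
placeholders) might be:</div>
<pre>
#include <cstddef>

// sum is a reduction, t is private to each SIMD lane, and i is the
// linear induction variable.
float dot(const float *x, const float *y, std::size_t n) {
  float sum = 0.0f;
  float t;
  #pragma omp simd private(t) reduction(+:sum)
  for (std::size_t i = 0; i < n; ++i) {
    t = x[i] * y[i];
    sum += t;
  }
  return sum;
}
</pre>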
</blockquote>
<br>
</div>
<div bgcolor="#FFFFFF">
std::execution::par_unseq is part of C++17, and
that essentially maps to '#pragma omp parallel
for simd'.</div>
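<div bgcolor="#FFFFFF">For concreteness, a minimal sketch of that
correspondence (X, N, and the element operation F below are
placeholders) might be:</div>
<div bgcolor="#FFFFFF">
<pre>
#include <algorithm>
#include <cstddef>
#include <execution>

void scale(float *X, std::size_t N) {
  auto F = [](float &x) { x *= 2.0f; };

  // C++17: request parallelized and vectorized execution.
  std::for_each(std::execution::par_unseq, X, X + N, F);

  // Roughly the same request expressed directly in OpenMP:
  #pragma omp parallel for simd
  for (std::size_t i = 0; i < N; ++i)
    F(X[i]);
}
</pre>
</div>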
<div bgcolor="#FFFFFF"><br>
</div>
</blockquote>
<div dir="auto"><br>
</div>
<div dir="auto">Do you expect par/par_unseq to nest?</div>
</div>
</div>
</blockquote>
<br>
</div>
<div bgcolor="#FFFFFF"> Yes.</div>
<div bgcolor="#FFFFFF"><br>
<br>
<blockquote type="cite">
<div>
<div class="gmail_quote">
<div dir="auto"> Nesting omp-parallel is generally
regarded as a Bad Idea.</div>
</div>
</div>
</blockquote>
<br>
</div>
<div bgcolor="#FFFFFF"> Agreed. I suspect
we'll want the mapping to be more like '#pragma omp
taskloop simd'.</div>
<div bgcolor="#FFFFFF"><br>
</div>
</blockquote>
<div dir="auto"><br>
</div>
<div dir="auto">That won’t run in parallel unless in an
omp-parallel-master region. </div>
</div>
</div>
</blockquote>
<br></span>
Yes.<span class="gmail-"><br>
<br>
<blockquote type="cite">
<div>
<div class="gmail_quote">
<div dir="auto">That means OpenMP-based PSTL won’t be parallel
unless the user knows to add back-end specific code about
the PSTL.</div>
</div>
</div>
</blockquote>
<br></span>
That obviously wouldn't be acceptable.<span class="gmail-"><br>
<br>
<blockquote type="cite">
<div>
<div class="gmail_quote">
<div dir="auto"><br>
</div>
<div dir="auto">What I’m trying to say is that OpenMP is a
poor target for PSTL in its current form. Nested parallel
regions is the only thing that works and it is likely to
work poorly.</div>
</div>
</div>
</blockquote>
<br></span>
I'm not sure that's true, but the technique may not be trivial. I
believe that it is possible, however. For example, the mapping might
be to something like:<br>
<br>
if (omp_in_parallel()) {<br>
  #pragma omp taskloop simd<br>
  for (size_t i = 0; i < N; ++i)<br>
    F(X[i]);<br>
} else {<br>
  #pragma omp parallel<br>
  #pragma omp single  // one thread creates the tasks; the team executes them<br>
  {<br>
    #pragma omp taskloop simd<br>
    for (size_t i = 0; i < N; ++i)<br>
      F(X[i]);<br>
  }<br>
}<br>
<br>
The fact that we'd need to use this kind of pattern is a bit
unfortunate, but it can be easily abstracted into a template
function, so it just becomes some implementation detail of the
library.<br>
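<br>
A minimal sketch of such a template function (the name is hypothetical,
not actual libc++ internals) might be:<br>
<pre>
#include <cstddef>
#include <omp.h>

template <class F>
void parallel_simd_for(std::size_t N, F f) {
  if (omp_in_parallel()) {
    #pragma omp taskloop simd
    for (std::size_t i = 0; i < N; ++i)
      f(i);
  } else {
    #pragma omp parallel
    #pragma omp single  // one thread creates the tasks; the team executes them
    {
      #pragma omp taskloop simd
      for (std::size_t i = 0; i < N; ++i)
        f(i);
    }
  }
}

// An algorithm implementation could then call, e.g.:
//   parallel_simd_for(N, [&](std::size_t i) { F(X[i]); });
</pre>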
<br></div></blockquote><div><br></div><div>You are right and that is probably the best way to do it with OpenMP. I am concerned about the absolute performance, based upon my observations of omp-taskloop vs omp-for and tbb::parallel_for in the PRK project, but at least it is sane from a semantic perspective. Having motivating use cases like PSTL should lead to improvements in OpenMP runtime performance w.r.t. taskloop.</div><div><br></div><div><a href="https://i.stack.imgur.com/MVd5j.png">https://i.stack.imgur.com/MVd5j.png</a> is a snapshot of the performance of PRK stencil (<a href="https://github.com/ParRes/Kernels/tree/master/Cxx11">https://github.com/ParRes/Kernels/tree/master/Cxx11</a>), which shows taskloop loses to TBB-based PSTL, OpenMP for, and tbb::parallel_for (pure TBB beats TBB-based PSTL because I use tbb::blocked_range2d, which improves cache utilization). I think those results tuned taskloop grainsize as well, so they may be an optimistic representation of taskloop in general usage.<br></div><div><br></div><div>I'll see if I can prototype this in RAJA or Intel PSTL. It's not hard to get results directly from the PRK tests, if those attempts fail.</div><div><br></div><div>Best,</div><div><br></div><div>Jeff</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">
Thanks again,<br>
Hal<div><div class="gmail-h5"><br>
<br>
<blockquote type="cite">
<div>
<div class="gmail_quote">
<div dir="auto"><br>
</div>
<div dir="auto">Jeff</div>
<div dir="auto"><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"><br>
-Hal</div>
<div bgcolor="#FFFFFF"><br>
<br>
<blockquote type="cite">
<div>
<div class="gmail_quote">
<div dir="auto"><br>
</div>
<div dir="auto">Jeff</div>
<div dir="auto"><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"><br>
<blockquote type="cite">
<div>I don't care much about how it will be
implemented in libc++, if it is. I would just
like to ask the Intel guys and the community
here to make the implementation extensible, in
the sense that a custom OpenMP-SIMD-based
execution policy, along with algorithm
implementations (as specializations for the
policy), can be used with the libc++ library.
And I would additionally like to ask the Intel
guys to provide a complete and compatible
extension on GitHub for developers like me to
use.</div>
</blockquote>
<br>
</div>
<div bgcolor="#FFFFFF"> In the end,
I think we want the following:<br>
<br>
1. A design for libc++ that allows the
thread-level parallelism to be implemented in
terms of different underlying providers (e.g.,
OpenMP, GCD, Work Queues on Windows, whatever
else).<br>
2. To follow the same philosophy with respect
to standards as we do everywhere else: Use
standards where possible with
compiler/system-specific extensions as
necessary.<br>
<br>
-Hal</div>
<div bgcolor="#FFFFFF"><br>
<br>
<blockquote type="cite">
<div> </div>
<div>Regards,</div>
<div>Serge.</div>
<div> </div>
<div> </div>
<div> </div>
<div>04.12.2017, 12:07, "Jeff Hammond" <a class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-txt-link-rfc2396E" href="mailto:jeff.science@gmail.com" target="_blank"><jeff.science@gmail.com></a>:</div>
<blockquote type="cite">
<div>
<div>ICC implements a very aggressive
interpretation of the OpenMP standard,
and this interpretation is not shared by
everyone in the OpenMP community. ICC
is correct, but other implementations may
be far less aggressive, so
_Pragma("omp simd") doesn't guarantee
vectorization unless the compiler
documentation says that is how it is
implemented. All the standard says is that
vectorization is _permitted_.</div>
<div> </div>
<div>Given that the practical meaning of
_Pragma("omp simd") isn't guaranteed to
be consistent across different
implementations, I don't really know how
to compare it to compiler-specific
pragmas unless we define everything
explicitly.</div>
<div> </div>
<div>In any case, my fundamental point
remains: do not use OpenMP pragmas here,
but instead use whatever the appropriate
compiler-specific pragma is, or create a
new one that meets the need.</div>
<div> </div>
<div>Best,</div>
<div> </div>
<div>Jeff</div>
<div title="Page 81">
<div>
<div> </div>
</div>
</div>
<div>
<div>On Sun, Dec 3, 2017 at 8:09 PM,
Serge Preis <span><<a href="mailto:spreis@yandex-team.ru" target="_blank">spreis@yandex-team.ru</a>></span>
wrote:
<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>Hello,</div>
<div> </div>
<div>_Pragma("omp simd") is
semantically quite different from
_Pragma("clang loop
vectorize(assume_safety)"),
_Pragma("GCC ivdep") and
_Pragma("vector always"), so I am
not sure all latter will work as
expected in all cases. They
definitely won't provide any
vectorization guarantees which
slightly defeat the purpose of
using corresponding execution
policy.</div>
<div> </div>
<div>I support the idea of keeping
OpenMP orthogonal, and having -fopenmp
enabled by default is definitely not an
option. The Intel compiler has a separate
-qopenmp-simd option which doesn't affect
performance outside explicitly marked loops,
but even this is not enabled by default.
I would say that multiple implementations
of the unordered policy might exist:
initially, an OpenMP SIMD based
implementation may be more powerful, with
one based on other pragmas being the
default but hinting at the existence of a
faster option. Later on, one may be brave
enough to add some SIMD template library
and implement the default unordered policy
with it (such an implementation is possible
even now using vector types, but it would
be extremely complex if it attempted to
target all the base data types, vector
widths, and target SIMD architectures clang
supports; even with the library this may be
quite tedious).</div>
<div> </div>
<div>Without any standard way of
expressing SIMD parallelism in
pure C++, any implementer of a SIMD
execution policy has to rely on the
means available for the
platform/compiler, so it is not
totally unnatural to ask the user to
enable OpenMP SIMD for efficient
support of the corresponding execution
policy.</div>
<div> </div>
<div>Regards,</div>
<div>Serge Preis</div>
<div> </div>
<div>(who was once part of the Intel
Compiler Vectorizer team and
drove OpenMP SIMD efforts within
icc and beyond, if anyone is
keeping track of
conflicts-of-interest)</div>
<div> </div>
<div> </div>
<div>04.12.2017, 08:46, "Jeff
Hammond via cfe-dev" <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>:</div>
<blockquote type="cite">
<div>
<div>
<div>It would be nice to keep
PSTL and OpenMP orthogonal,
even if _Pragma("omp simd")
does not require runtime
support. It should be
trivial to use
_Pragma("clang loop vectorize(assume_safety)")
instead, by wrapping all of
the different compiler
vectorization pragmas in
preprocessor logic (a sketch
follows below). I similarly recommend
_Pragma("GCC ivdep") for GCC
and _Pragma("vector always")
for ICC. This requires
O(n_compilers) effort
instead of O(1), but
orthogonality is worth it.
<div> </div>
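<div>A sketch of that preprocessor dispatch (the macro name below is
made up) could look like:</div>
<pre>
// ICC is tested first because it also defines __GNUC__.
#if defined(__INTEL_COMPILER)
#  define PSTL_SIMD_LOOP _Pragma("vector always")
#elif defined(__clang__)
#  define PSTL_SIMD_LOOP _Pragma("clang loop vectorize(assume_safety)")
#elif defined(__GNUC__)
#  define PSTL_SIMD_LOOP _Pragma("GCC ivdep")
#else
#  define PSTL_SIMD_LOOP
#endif

// Usage:
//   PSTL_SIMD_LOOP
//   for (size_t i = 0; i < n; ++i)
//     y[i] += a * x[i];
</pre>
<div> </div>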
<div>While OpenMP is
vendor/compiler-agnostic,
users should not be
required to use -fopenmp
or similar to enable
vectorization from PSTL,
nor should the compiler
enable any OpenMP pragma
by default. I know of
cases where merely using
the -fopenmp flag alters
code generation in a
performance-visible
manner, and enabling the
OpenMP "simd" pragma by
default may surprise some
users, particularly if no
other OpenMP pragmas are
enabled by default.
<div><br>
Best,</div>
<div> </div>
<div>Jeff</div>
<div>(who works for Intel
but not on any software
products and has been a
heavy user of Intel PSTL
since it was released,
if anyone is keeping
track of
conflicts-of-interest)<br>
<br>
On Wed, Nov 29, 2017 at
4:21 AM, Kukanov, Alexey
via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>
wrote:<br>
><br>
> Hello all,<br>
><br>
> At Intel, we have
developed an
implementation of C++17
execution policies<br>
> for algorithms
(often referred to as
Parallel STL). We hope
to contribute it<br>
> to libc++/LLVM, so
would like to ask the
community for comments
on this.<br>
><br>
> The code is already
published at GitHub (<a href="https://github.com/intel/parallelstl" target="_blank">https://github.com/intel/<wbr>parallelstl</a>).<br>
> It supports the
C++17 standard execution
policies (seq, par,
par_unseq) as well as<br>
> the experimental
unsequenced policy
(unseq) for SIMD
execution. At the
moment,<br>
> about half of the
C++17 standard
algorithms that must
support execution
policies<br>
> are implemented; a
few more will be ready
soon, and the work
continues.<br>
> The tests that we
use are also available
at GitHub; needless to
say we will<br>
> contribute those as
well.<br>
><br>
> The implementation
is not specific to
Intel’s hardware. For
thread-level parallelism<br>
> it uses TBB* (<a href="https://www.threadingbuildingblocks.org/" target="_blank">https://www.<wbr>threadingbuildingblocks.org/</a>)
but abstracts it with<br>
> an internal API
which can be implemented
on top of other
threading/parallel
solutions –<br>
> so it is for the
community to decide
which ones to use. For
SIMD parallelism<br>
> (unseq, par_unseq)
we use #pragma omp simd
directives; it is
vendor-neutral and<br>
> does not require
any OpenMP runtime
support.<br>
><br>
> The current
implementation meets the
spirit but not always
the letter of<br>
> the standard,
because it has to be
separate from but also
coexist with<br>
> implementations of
standard C++ libraries.
While preparing the
contribution,<br>
> we will address
inconsistencies, adjust
the code to meet
community standards,<br>
> and better
integrate it into the
standard library code.<br>
><br>
> We are also
proposing that our
implementation is
included into
libstdc++/GCC.<br>
> Compatibility
between the
implementations seems
useful as it can
potentially<br>
> reduce the amount
of work for everyone. We
hope to keep the code
mostly identical,<br>
> and would like to
know if you think it’s
too optimistic to
expect.<br>
><br>
> Obviously we plan
to use appropriate open
source licenses to meet
the different<br>
> projects’
requirements.<br>
><br>
> We expect to keep
developing the code and
will take the
responsibility for<br>
> maintaining it
(with community
contributions, of
course). If there are
other<br>
> community efforts
to implement parallel
algorithms, we are
willing to collaborate.<br>
><br>
> We look forward to
your feedback, both for
the overall idea and –
if supported –<br>
> for the next steps
we should take.<br>
><br>
> Regards,<br>
> - Alexey Kukanov<br>
><br>
> * Note that TBB
itself is highly
portable (and ported by
community to Power and
ARM<br>
> architectures) and
permissively licensed,
so could be the base for
the threading<br>
> infrastructure. But
the Parallel STL
implementation itself
does not require TBB.<br>
><br>
>
______________________________<wbr>_________________<br>
> cfe-dev mailing
list<br>
> <a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/cfe-dev</a><br>
<br>
<br>
<br>
<br>
--<br>
Jeff Hammond<br>
<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
<a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a>
<div> </div>
</div>
</div>
</div>
</div>
</div>
<p><span>______________________________<wbr>_________________<br>
cfe-dev mailing list<br>
<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/cfe-dev</a></span></p>
</blockquote>
</blockquote>
</div>
<div> </div>
--
<div>Jeff Hammond<br>
<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
<a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</div>
</div>
</blockquote>
<br>
<fieldset class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889mimeAttachmentHeader"></fieldset>
<br>
<pre>______________________________<wbr>_________________
cfe-dev mailing list
<a class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-txt-link-abbreviated" href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>
<a class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/cfe-dev</a>
</pre>
</blockquote>
<br>
</div>
<div bgcolor="#FFFFFF">
<pre class="gmail-m_-2288114881520531127m_8513274869410520852m_2065056468622040889moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div>
</blockquote>
</div>
</div>
<div>-- <br>
</div>
<div class="gmail-m_-2288114881520531127m_8513274869410520852gmail_signature">Jeff Hammond<br>
<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
<a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</blockquote>
<br>
<pre class="gmail-m_-2288114881520531127m_8513274869410520852moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div>
</blockquote>
</div>
</div>
<div dir="ltr">-- <br>
</div>
<div class="gmail-m_-2288114881520531127gmail_signature">Jeff
Hammond<br>
<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br>
<a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</blockquote>
<br>
<pre class="gmail-m_-2288114881520531127moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div></div></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</div></div>