[cfe-dev] [libc++] Working on the parallel STL algorithms

Jeff Hammond via cfe-dev cfe-dev at lists.llvm.org
Fri Jun 30 09:56:59 PDT 2017


Hi Serge,

The code I'm using for these evaluations is
https://github.com/ParRes/Kernels.  These tests are oriented towards
numerical simulation and only use std::for_each and std::transform.
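For context, those two algorithms with C++17 execution policies are used like this.  This is a minimal sketch of the pattern, not code lifted from the PRK repository:

#include <algorithm>
#include <execution>
#include <vector>

int main() {
  std::vector<double> in(1000000, 1.0), out(1000000);
  // Parallel, vectorizable transform: out[i] = 2 * in[i]
  std::transform(std::execution::par_unseq, in.begin(), in.end(),
                 out.begin(), [](double x) { return 2.0 * x; });
  // Parallel for_each mutating each element in place
  std::for_each(std::execution::par, out.begin(), out.end(),
                [](double& x) { x += 1.0; });
  return 0;
}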

I gathered data last week, but I need to run new tests since I added and
changed quite a bit this week.  I've included documentation at the bottom
of this message describing how others can perform experiments, since I will
not have time to produce new results until next week.

A few caveats:
- The PRK project is designed to be language-agnostic and to contain
high-quality reference implementations.  The C++ implementations are
closely aligned with the implementations in Python, Fortran 2008, etc.,
which means they are less idiomatic than they could be.
- The code is not tuned for any particular architecture, although heavy
testing on Intel Haswell processors may mean unconscious bias towards that
architecture.  In most cases, we thread the outer loop and SIMD-ize the
inner loop (see the sketch after this list), because the results with
_Pragma("omp for simd collapse(2)") are usually worse and trigger more
compiler bugs.
- Loop blocking helps stencil and transpose, but is not implemented
uniformly.  I am working to fix this.
- The p2p kernel is difficult to implement in some models because it lacks
traditional data parallelism.  I discourage drawing conclusions from those
implementations for now.
- I am not an expert C++ programmer.  However, my colleagues who are
experts in TBB and PSTL reviewed that code and I've implemented their
suggestions.
- Most of the PRK code has been written by Intel employees, but it is a
research project.  The PRKs are not intended to be hardware benchmarks nor
does the presence of implementations of various models constitute an
endorsement of their use (e.g. CUDA).
- This is all a work-in-progress.  All of the C++ code is new in the last
month and will continue to evolve rapidly for at least another month.
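To make the threading/SIMD caveat concrete, here is a minimal sketch of
the pattern we favor.  It is illustrative only, not the PRK source; the
function and names are mine:

// Outer loop threaded across cores, inner loop vectorized --
// the pattern used by most of the PRK C++ kernels.
void scale(double* out, const double* in, int n) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    #pragma omp simd
    for (int j = 0; j < n; ++j) {
      out[i * n + j] = 2.0 * in[i * n + j];
    }
  }
}

// The alternative -- a single "omp parallel for simd collapse(2)" on the
// nested loop -- is usually slower for us and hits more compiler bugs.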

Please create GitHub issues or email me privately to suggest improvements.
Contributions via pull request are also welcome.

Jeff

# Grab the code via git or wget
git clone https://github.com/ParRes/Kernels.git
wget https://github.com/ParRes/Kernels/archive/master.zip

# We do not use a fancy build system.  The critical file is
# ./common/make.defs, for which we provide examples for all supported
# toolchains.
# Assuming you want to test Clang, start with the LLVM example and modify
# appropriately.  The Intel and GCC examples are up-to-date, but others may
# be stale.  Many example options are from my Mac+Homebrew setup.
cd common && cp make.defs.llvm make.defs
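# For orientation, make.defs is a plain Makefile fragment that sets the
# compilers and flags.  A rough sketch along these lines (the variable
# names here are my illustration -- treat the shipped examples as
# authoritative):
#
#   CC=clang
#   CXX=clang++
#   DEFAULT_OPT_FLAGS=-O3
#   OPENMPFLAG=-fopenmp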

# We have scripts to build all of the dependencies for Travis in ./travis,
# which you can adapt as appropriate.
# If you want to test GCC parallel STL, you need 7.2+ because of
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81221.
# I use
# https://github.com/jeffhammond/HPCInfo/blob/master/buildscripts/gcc-git.sh
# to build GCC on Linux.
# If you want to test the Intel 18 PSTL beta, it's freely available from
# https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2018-beta

# The code of interest to you is in ./Cxx11.  Some of the tests do not build
# because of compiler bugs, so ignore errors for simplicity.
cd ./Cxx11 && make -k

# All of the programs try to be self-documenting when provided no
# arguments.  Below is an example.
# When the documentation isn't perfect, the source should be clear.

Cxx11 $ ./stencil-vector-pstl
Parallel Research Kernels version 2.16
C++17/Parallel STL Stencil execution on 2D grid
Usage: <# iterations> <array dimension> [<star/grid> <radius>]

# 10 iterations and a matrix of rank 1000 is a good starting point, but
# this is too few iterations on a noisy system, and matrices that fit in
# cache give artificially good results.
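# Back-of-the-envelope arithmetic (mine, not from the PRK docs): a
# 1000x1000 grid of doubles is 8 MB per array, so the stencil's two
# arrays (~16 MB) can sit largely in a big server L3 cache.  Use a rank
# of 4000 or more if you want to measure memory-bound behavior.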

Cxx11 $ ./stencil-vector-pstl 10 1000
Parallel Research Kernels version 2.16
C++17/Parallel STL Stencil execution on 2D grid
Number of iterations = 10
Grid size            = 1000
Type of stencil      = star
Radius of stencil    = 2
Solution validates
Rate (MFlops/s): 3147.89 Avg time (s): 0.0059876

# Programs will not print out performance data if the results are not
# correct.  Our correctness testing isn't perfect, but we are able to find
# many compiler/runtime bugs.

# This is a more interesting test, showing the additional options.

$ ./stencil-vector-tbb 30 8000 200 grid 4
Parallel Research Kernels version 2.16
C++11/TBB Stencil execution on 2D grid
Number of iterations = 30
Grid size            = 8000
Tile size            = 200
Type of stencil      = grid
Radius of stencil    = 4
Solution validates
Rate (MFlops/s): 11218 Avg time (s): 0.928072

On Thu, Jun 29, 2017 at 10:43 PM, Serge Preis <spreis at yandex-team.ru> wrote:
>
> Hello Jeff,
>
> Would you mind sharing results, please.
>
> Thank you,
> Serge.
>
>
>
>
> 30.06.2017, 03:14, "Jeff Hammond via cfe-dev" <cfe-dev at lists.llvm.org>:
>
> (I apologize for not including the thread history properly - I was not on
> the list until recently.)
>
> >> Once llvm OpenMP can do things like handle nested parallelism and a
> >> few more advanced things properly all this might be fun (We can go down a
> >> big list if anyone wants to digress)
> > This is why I said we might consider using taskloop ;) -- There are
> > other ways of handling nesting as well (colleagues of mine work on one:
> > http://www.bolt-omp.org/), but we should probably have a separate thread on
> > OpenMP and nesting to discuss this aspect of things.
>
> OpenMP tasks are the recommended way to solve the OpenMP
> nesting/composition problem.  Implementations are not ready for this yet
> though.  For example, Clang 4.0 generates incorrect code for one of my
> simple tests of taskloop.  However, if PSTL relies on taskloop, that should
> have the nice side effect of ensuring the OpenMP implementation supports
> this well.
>
> For what it's worth, Intel 18 implements PSTL using TBB instead of OpenMP
> to solve the composition problem (the compiler uses the OpenMP simd pragma
> for vectorization).  I have apples-to-apples performance tests with PSTL,
> TBB, OpenMP, etc. if anyone is interested (it's probably appropriate for
> out-of-band discussion).
>
> Jeff
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
>




--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/