<div dir="ltr">Hi Serge,<br><br>The code I'm using for these evaluations is <a href="https://github.com/ParRes/Kernels">https://github.com/ParRes/Kernels</a>. These tests are oriented towards numerical simulation and only use std::for_each and std::transform.<div><br></div><div>I gathered data last week but I need to run new tests since I added and changed quite a bit this week. I've included documentation at the bottom is this message describing how others can perform experiments, since I will not have time to produce them until next week.<div><br></div><div>A few caveats:</div><div>- The PRK project is designed to be language-agnostic and contain high-quality reference implementations. The C++ implementations are closely aligned with the implementations in Python, Fortran 2008, etc., which mean they are less idiomatic than they could be.</div><div>- The code is not tuned for any particular architecture, although heavy testing on Intel Haswell processors may mean unconscious bias towards that architecture. In most cases, we thread the outer loop and SIMD-ize the inner loop, because the results with _Pragma("omp for simd collapse(2)") are usually worse and encounter more compiler bugs.</div><div>- Loop blocking helps stencil and transpose, but is not implemented uniformly. I am working to fix this.</div><div>- The p2p kernel is difficult to implement in some models, because it lacks traditional data parallelism. I discourage drawing conclusions from the implementations for now.</div><div>- I am not an expert C++ programmer. However, my colleagues who are experts in TBB and PSTL reviewed that code and I've implemented their suggestions.</div><div>- Most of the PRK code has been written by Intel employees, but it is a research project. The PRKs are not intended to be hardware benchmarks nor does the presence of implementations of various models constitute an endorsement of their use (e.g. CUDA).</div><div>- This is all a work-in-progress. All of the C++ code is new in the last month and will continue to evolve rapidly for at least another month.</div><div><br></div><div>Please create GitHub issues or email me privately to suggest improvements. Contributions via pull request are also welcome.<br></div><div><br>Jeff<br><br># Grab the code via git or wget<br>git clone <a href="https://github.com/ParRes/Kernels.git">https://github.com/ParRes/Kernels.git</a><br>wget <a href="https://github.com/ParRes/Kernels/archive/master.zip">https://github.com/ParRes/Kernels/archive/master.zip</a><br><br># We do not use a fancy build system. The critical file is ./common/make.defs, for which we provide examples for all supported toolchains.<br># Assuming you want to test Clang, start with the LLVM example and modify appropriately. The Intel and GCC examples are up-to-date, but others may be stale. 
<div><br></div><div>Please create GitHub issues or email me privately to suggest improvements. Contributions via pull request are also welcome.</div><div><br>Jeff<br><br># Grab the code via git, or download a zip of master with wget<br>git clone <a href="https://github.com/ParRes/Kernels.git">https://github.com/ParRes/Kernels.git</a><br>wget <a href="https://github.com/ParRes/Kernels/archive/master.zip">https://github.com/ParRes/Kernels/archive/master.zip</a><br><br># We do not use a fancy build system. The critical file is ./common/make.defs, for which we provide examples for all supported toolchains.<br># Assuming you want to test Clang, start with the LLVM example and modify appropriately. The Intel and GCC examples are up-to-date, but others may be stale. Many of the example options are from my Mac+Homebrew setup.<br>cd common && cp make.defs.llvm make.defs<br><br># The scripts in ./travis build all of the dependencies for Travis CI; you can adapt them as appropriate.<br># If you want to test the GCC parallel STL, you need GCC 7.2+ because of <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81221">https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81221</a>.<br># I use <a href="https://github.com/jeffhammond/HPCInfo/blob/master/buildscripts/gcc-git.sh">https://github.com/jeffhammond/HPCInfo/blob/master/buildscripts/gcc-git.sh</a> to build GCC on Linux.<br># If you want to test the Intel 18 PSTL in beta, it is freely available from <a href="https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2018-beta">https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2018-beta</a>.<br><br># The code of interest to you is in ./Cxx11. Some of the tests do not build because of compiler bugs, so ignore errors for simplicity.<br>cd ./Cxx11 && make -k</div>
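<div><br></div><div># As noted at the top, the kernels use only std::for_each and std::transform. For reference, here is a minimal sketch of that parallel STL pattern, assuming a conforming C++17 <execution> header (the beta PSTL may spell the headers and policy namespace differently). It is illustrative, not the actual PRK source:<br><br>#include <algorithm><br>#include <cstddef><br>#include <execution><br>#include <vector><br><br>int main(void)<br>{<br>    const std::size_t n = 1000000;<br>    std::vector<double> in(n, 1.0), out(n, 0.0);<br>    // par_unseq permits both threading and vectorization of the element-wise operation.<br>    std::transform(std::execution::par_unseq,<br>                   in.begin(), in.end(), out.begin(),<br>                   [](double x) { return 2.0 * x; });<br>    return 0;<br>}<br></div>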
<div><br></div><div># All of the programs try to be self-documenting when provided no arguments. Below is an example.<br># When the documentation isn't perfect, the source should be clear.<br><br>Cxx11 $ ./stencil-vector-pstl<br>Parallel Research Kernels version 2.16<br>C++17/Parallel STL Stencil execution on 2D grid<br>Usage: <# iterations> <array dimension> [<star/grid> <radius>]<br><br># 10 iterations on a grid of dimension 1000 is a good starting point, but 10 iterations is too few on a noisy system, and grids that fit in cache give artificially good results.</div><div><br>Cxx11 $ ./stencil-vector-pstl 10 1000<br>Parallel Research Kernels version 2.16<br>C++17/Parallel STL Stencil execution on 2D grid<br>Number of iterations = 10<br>Grid size = 1000<br>Type of stencil = star<br>Radius of stencil = 2<br>Solution validates<br>Rate (MFlops/s): 3147.89 Avg time (s): 0.0059876</div><div><br></div><div># Programs will not print performance data if the results are not correct. Our correctness testing isn't perfect, but it catches many compiler/runtime bugs.</div><div><br></div><div># This is a more interesting test, showing the additional options.</div><div><br>$ ./stencil-vector-tbb 30 8000 200 grid 4<br>Parallel Research Kernels version 2.16<br>C++11/TBB Stencil execution on 2D grid<br>Number of iterations = 30<br>Grid size = 8000<br>Tile size = 200<br>Type of stencil = grid<br>Radius of stencil = 4<br>Solution validates<br>Rate (MFlops/s): 11218 Avg time (s): 0.928072</div>
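<div><br></div><div># The tile size above is the loop blocking mentioned in the caveats. Here is a minimal sketch of a blocked transpose, with illustrative names (not the actual PRK source):<br><br>#include <algorithm><br>#include <vector><br><br>void transpose(std::vector<double>& A, const std::vector<double>& B, int n, int tile)<br>{<br>    // Process the matrices tile by tile: A is written with unit stride,<br>    // and blocking keeps the strided reads of B within a cache-resident tile.<br>    for (int it = 0; it < n; it += tile) {<br>        for (int jt = 0; jt < n; jt += tile) {<br>            for (int i = it; i < std::min(n, it + tile); ++i) {<br>                for (int j = jt; j < std::min(n, jt + tile); ++j) {<br>                    A[i*n+j] = B[j*n+i];<br>                }<br>            }<br>        }<br>    }<br>}<br><br>int main(void)<br>{<br>    const int n = 1000, tile = 100;<br>    std::vector<double> A(n*n, 0.0), B(n*n, 1.0);<br>    transpose(A, B, n, tile);<br>    return 0;<br>}<br></div>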
<div><br>On Thu, Jun 29, 2017 at 10:43 PM, Serge Preis <<a href="mailto:spreis@yandex-team.ru">spreis@yandex-team.ru</a>> wrote:<br>><br>> Hello Jeff,<br>> <br>> Would you mind sharing the results, please?<br>> <br>> Thank you,<br>> Serge.<br>> <br>> 30.06.2017, 03:14, "Jeff Hammond via cfe-dev" <<a href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a>>:<br>><br>> (I apologize for not including the thread history properly - I was not on the list until recently.)<br>><br>> >> Once llvm OpenMP can do things like handle nested parallelism and a few more advanced things properly, all this might be fun. (We can go down a big list if anyone wants to digress.)<br>> > This is why I said we might consider using taskloop ;) -- There are other ways of handling nesting as well (colleagues of mine work on one: <a href="http://www.bolt-omp.org/">http://www.bolt-omp.org/</a>), but we should probably have a separate thread on OpenMP and nesting to discuss this aspect of things.<br>><br>> OpenMP tasks are the recommended way to solve the OpenMP nesting/composition problem, but implementations are not ready for this yet. For example, Clang 4.0 generates incorrect code for one of my simple tests of taskloop. However, if PSTL relies on taskloop, that should have the nice side effect of ensuring that the OpenMP implementation supports it well.<br>> <br>> For what it's worth, Intel 18 implements PSTL using TBB instead of OpenMP to solve the composition problem (the compiler uses the OpenMP simd pragma for vectorization). I have apples-to-apples performance tests with PSTL, TBB, OpenMP, etc. if anyone is interested (this is probably appropriate for out-of-band discussion).<br>> <br>> Jeff<br>><br>> --<br>> Jeff Hammond<br>> <a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br>> <a href="http://jeffhammond.github.io/">http://jeffhammond.github.io/</a><br>><br>> _______________________________________________<br>> cfe-dev mailing list<br>> <a href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a><br>> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br><br>--<br>Jeff Hammond<br><a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/">http://jeffhammond.github.io/</a></div></div>