[Openmp-dev] Performance slowdown

Bliss, Brian E via Openmp-dev openmp-dev at lists.llvm.org
Wed Aug 19 09:27:52 PDT 2015


OMP_WAIT_POLICY=ACTIVE is equivalent to KMP_LIBRARY=turnaround.
If KMP_LIBRARY=throughput, then each thread pauses / releases its timeslice in spin-wait loops.
If KMP_LIBRARY=turnaround, the threads only pause if they know that the machine is oversubscribed.
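
To compare the two modes on the same program, something like the following could be used. This is only a sketch for a POSIX shell: with_kmp_library is a hypothetical helper name, and only the KMP_LIBRARY values come from the description above.

```shell
# Hypothetical helper (not part of the runtime): run a command with
# KMP_LIBRARY set to the given mode, leaving the caller's environment alone.
with_kmp_library() {
    mode=$1
    shift
    env KMP_LIBRARY="$mode" "$@"
}

# Usage (the benchmark path is illustrative):
#   with_kmp_library turnaround ./dpotrf_taskdep -n 8192 -b 64 -i 1 -c
#   with_kmp_library throughput ./dpotrf_taskdep -n 8192 -b 64 -i 1 -c
```

Going through env keeps the setting local to one run, so the two modes can be timed back to back without one leaking into the other.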

KMP_BLOCKTIME controls the blocking, not the pausing. The default value is 200 ms. If a thread spins for longer than 200 ms (in practice, somewhere between 200 and 400 ms), it goes to sleep.
If KMP_BLOCKTIME=0, then the thread does a single check to see if it can proceed, then immediately goes to sleep if it cannot.
If KMP_BLOCKTIME=infinite (which is implemented as 2^31-1), then the threads will never block at a barrier.
KMP_BLOCKTIME=infinite also disables a lot of checks (and the corresponding cache misses) in the barrier spin-wait loop, and can result in a 2x-3x speedup over KMP_BLOCKTIME=2^31-2 on EPCC parallel, on a machine with many processors.
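
A quick way to see which regime a program is in is to run it under each of the three settings described above. The sketch below assumes a POSIX shell and that a bare number is read as milliseconds; sweep_blocktime is a hypothetical helper name, not something the runtime provides.

```shell
# Hypothetical helper: run the given command once per KMP_BLOCKTIME setting.
#   0        -> check once, then sleep immediately
#   200      -> the default: spin for a while, then sleep
#   infinite -> never block at a barrier
sweep_blocktime() {
    for bt in 0 200 infinite; do
        echo "KMP_BLOCKTIME=$bt"
        env KMP_BLOCKTIME="$bt" "$@"
    done
}

# Usage (the benchmark path is illustrative):
#   sweep_blocktime ./dpotrf_taskdep -n 8192 -b 64 -i 1 -c
```

If the benchmark reports its own run time, the three runs can be compared directly; otherwise the command can be wrapped in whatever timing tool is at hand.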

FYI

Maybe we should have mapped OMP_WAIT_POLICY=ACTIVE to KMP_BLOCKTIME=infinite, and not KMP_LIBRARY=turnaround.  I don’t know…
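
Until then, the mapping can be approximated on the user side. A minimal sketch for a POSIX shell, where run_active is a hypothetical wrapper and not anything the runtime does itself:

```shell
# Hypothetical wrapper: if OMP_WAIT_POLICY is "active" (any letter case),
# also force KMP_BLOCKTIME=infinite for the command; otherwise run it unchanged.
# Note that this overrides any KMP_BLOCKTIME the caller had already set.
run_active() {
    case "${OMP_WAIT_POLICY:-}" in
        [Aa][Cc][Tt][Ii][Vv][Ee])
            env KMP_BLOCKTIME=infinite "$@" ;;
        *)
            "$@" ;;
    esac
}

# Usage (the benchmark path is illustrative):
#   OMP_WAIT_POLICY=ACTIVE run_active ./dpotrf_taskdep -n 8192 -b 64 -i 1 -c
```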

-bb

From: Openmp-dev [mailto:openmp-dev-bounces at lists.llvm.org] On Behalf Of John Mellor-Crummey via Openmp-dev
Sent: Tuesday, August 18, 2015 2:18 PM
To: César
Cc: openmp-dev at lists.llvm.org
Subject: Re: [Openmp-dev] Performance slowdown

My guess is that you are blocking rather than spinning. Using OMP_WAIT_POLICY=active doesn't seem to be enough with Intel's runtime to turn off all blocking. As I recall, there is a KMP_xxx flag that applies as well. You can grep the barrier code in the runtime or hope that someone from Intel responds to your inquiry.

--
John Mellor-Crummey

(sent from my phone)

On Aug 18, 2015, at 1:14 PM, César via Openmp-dev <openmp-dev at lists.llvm.org> wrote:
Hello,

I don't know if this is the correct list to talk about this; I did not find a better place.

I am doing performance experiments with a few OpenMP implementations (IOMP, GOMP and our private impl.) and I am seeing a severe slowdown when I use IOMP (GOMP and others are performing well).

The benchmarks I am using are these ones: http://kastors.gforge.inria.fr/#!index.md

Really, the slowdown is huge. For one of the programs (plasma/dpotrf_taskdep -n 8192 -b 64 -i 1 -c) the serial version executes in ~28s and the parallel one executes in ~110s. I did some profiling and found that most of the time is being spent on synchronization barriers and dependence tracking (see attached image). Before digging deeper I would like to hear back from you if I am doing something wrong here:

- I tested with the latest version of the repository: http://llvm.org/svn/llvm-project/openmp/trunk
- I am using Ubuntu 14.10.
- I have tested on more than one machine; the results above are from an Intel i7-3770
- The runtime itself is compiled using: make compiler=gcc os_omp=linux arch=32e
- The version of GCC that I am using is: 4.9.1
- The version of Clang that I am using to compile the benchmarks: 3.5.0


César.
<pic1.png>
<pic2.png>
<pic3.png>
_______________________________________________
Openmp-dev mailing list
Openmp-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
