[Openmp-dev] Sampling-based performance measurement of LLVM OpenMP runtime leads to deadlock!
John Mellor-Crummey via Openmp-dev
openmp-dev at lists.llvm.org
Fri May 11 17:19:35 PDT 2018
While using a sampling-based profiler (HPCToolkit) to measure the performance of an application using a dynamically-linked version of the LLVM OpenMP runtime, I encountered a deadlock on x86_64. Although I haven’t considered other architectures in detail, I believe that they may be similarly affected.
Here’s what I believe I have observed: there is a subtle race condition between TLS setup in an OpenMP runtime and a profiler that inspects that runtime through the OMPT interface.
A thread executing __kmp_launch_worker in the LLVM OpenMP runtime library acquired the lock controlling access to TLS state (__tls_get_addr calls tls_get_addr_tail, which calls pthread_mutex_lock) to set up the TLS it needs to access its thread-local variable __kmp_gtid, in frame 24 of the call stack shown below. Immediately after acquiring the TLS lock by setting its __lock field with a CMPXCHG, but before recording the lock owner or finishing TLS setup, the thread was interrupted by our profiler. As a normal part of recording a sample, our profiler uses the OMPT tools API to check whether the thread is an OpenMP thread by inspecting the thread id maintained by the OpenMP runtime. A call to a runtime entry point through the OMPT API led to an access to __kmp_gtid in frame 5 of the call stack. However, TLS for the OpenMP runtime shared library had still not been set up for this thread, so the access to __kmp_gtid went through the same protocol as before (__tls_get_addr calls tls_get_addr_tail, which calls pthread_mutex_lock). But the lock had already been acquired in frame 21, so it is unavailable for acquisition in frames 0-2, causing deadlock. The TLS lock is implemented as a recursive lock, but the profiler interrupted the lock acquisition in libpthread before the owner field of the recursive lock was set, so the inner call to pthread_mutex_lock cannot succeed.
That a profiler using the OMPT interface can cause a deadlock is a serious problem. We need a design for the OMPT interface and for OpenMP runtime implementations that makes this impossible.
After thinking about this for a while, I believe a profiler can arrange to receive the ompt_callback_thread_begin callback and then set a flag in its own thread-local state to note that the thread is an OpenMP thread. A profiler must not invoke any OMPT runtime entry point on a thread that has not announced itself as an OpenMP thread through a prior ompt_callback_thread_begin. For its part, an OpenMP runtime should ensure that its TLS is allocated before it invokes ompt_callback_thread_begin. Similarly, a profiler shouldn’t invoke any OMPT runtime entry point on a thread after receiving ompt_callback_thread_end for it.
If a profiler doesn’t use the OMPT interface to inspect a thread that hasn’t announced itself as an OpenMP thread, it won’t touch any TLS state that the OpenMP library maintains.
Does anyone care to comment or offer a vision of a different solution?
Below my signature block are some details of the thread state that I observed, in case you want to validate my assessment of the situation.
--
John Mellor-Crummey Professor
Dept of Computer Science Rice University
email: johnmc at rice.edu phone: 713-348-5179
In the 10 frames at the top of the stack, HPCToolkit is querying the OMPT interface to determine the thread’s identity. In frame 5, on behalf of our profiler, the runtime asks for the thread’s id by calling __kmp_get_global_thread_id. Below the call stack is a small snippet from the debugger that shows what is going on in frame 5: on line 132, the runtime is trying to return the value of __kmp_gtid, a thread-local variable. To resolve this variable reference, the runtime invokes __tls_get_addr, which ends up trying to acquire the dl_load_lock, as shown in frame 0.
The problem is that the thread had already acquired the lock in frame 21 before being interrupted by a sample. In frame 21, the thread had just completed a CMPXCHG to acquire the lock. I believe this succeeded: RIP is at the JNE instruction (address 0x7f515ad25cd9) immediately following the CMPXCHG at address 0x7f515ad25cd4, and the condition code ZF is set. As I read the machine state, the CAS succeeded in swapping in a 1 for a 0 to acquire the lock.
The __kmp_gtid_mode was observed in the runtime to be 3, and __kmp_gtid is declared with the __thread modifier.
#if KMP_OS_LINUX && defined(KMP_TDATA_GTID)
int __kmp_gtid_mode = 3; /* use __declspec(thread) TLS to store gtid */
int __kmp_adjust_gtid_mode = FALSE;
#ifdef KMP_TDATA_GTID
#if KMP_OS_WINDOWS
__declspec(thread) int __kmp_gtid = KMP_GTID_DNE;
#else
__thread int __kmp_gtid = KMP_GTID_DNE;
#endif /* KMP_OS_WINDOWS - workaround because Intel(R) Many Integrated Core \
compiler 20110316 doesn't accept __declspec */
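As an aside, one possible runtime-side hardening (a sketch only; I have not tested this) is to pin this variable to the initial-exec TLS model, so that reads compile to a direct %fs-relative load and never call __tls_get_addr. The tradeoff is that initial-exec TLS requires the library to be loaded at program startup rather than by a late dlopen:

```c
/* Sketch: keep __tls_get_addr off the access path entirely. */
__thread int __kmp_gtid __attribute__((tls_model("initial-exec"))) = KMP_GTID_DNE;
```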
The call stack
(gdb) where
#0 0x00007f515ad2a4cd in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f515ad25de6 in _L_lock_870 () from /lib64/libpthread.so.0
#2 0x00007f515ad25cdf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f515b7041cb in tls_get_addr_tail () from /lib64/ld-linux-x86-64.so.2
#4 0x00007f515b709af8 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#5 0x00007f515af7fd90 in __kmp_get_global_thread_id ()
at /home/johnmc/pkgs-src/openmp/runtime/src/kmp_runtime.cpp:132
#6 0x00007f515b0068e4 in ompt_get_thread ()
at /home/johnmc/pkgs-src/openmp/runtime/src/ompt-specific.h:63
#7 0x00007f515b003fd7 in __ompt_get_teaminfo (depth=0, size=0x0)
at /home/johnmc/pkgs-src/openmp/runtime/src/ompt-specific.cpp:47
#8 0x00007f515b004381 in __ompt_get_parallel_id_internal (depth=0)
at /home/johnmc/pkgs-src/openmp/runtime/src/ompt-specific.cpp:216
#9 0x00007f515b0062bc in ompt_get_parallel_id (ancestor_level=0)
at /home/johnmc/pkgs-src/openmp/runtime/src/ompt-general.cpp:348
#10 0x00007f515b4a5e6c in hpcrun_ompt_get_parallel_id (level=0)
at ../../../../src/tool/hpcrun/ompt/ompt-interface.c:618
#11 0x00007f515b4a4ef2 in need_defer_cntxt () at ../../../../src/tool/hpcrun/ompt/ompt-defer.c:148
#12 0x00007f515b4a492a in ompt_backtrace_finalize (bt=0x7f513dbf5af0, isSync=0)
at ../../../../src/tool/hpcrun/ompt/ompt-callstack.c:551
#13 0x00007f515b490dd0 in cct_backtrace_finalize (bt=0x7f513dbf5af0, isSync=0)
at ../../../../src/tool/hpcrun/cct_backtrace_finalize.c:51
#14 0x00007f515b490bd9 in help_hpcrun_backtrace2cct (bundle=0x7f513d2f5fa8, context=0x7f513dbf6040,
metricId=0, metricIncr=..., skipInner=0, isSync=0, data=0x7f513dbf5d40)
at ../../../../src/tool/hpcrun/cct_insert_backtrace.c:401
#15 0x00007f515b490651 in hpcrun_backtrace2cct (cct=0x7f513d2f5fa8, context=0x7f513dbf6040, metricId=0,
metricIncr=..., skipInner=0, isSync=0, data=0x7f513dbf5d40)
at ../../../../src/tool/hpcrun/cct_insert_backtrace.c:246
#16 0x00007f515b493f5f in hpcrun_sample_callpath (context=0x7f513dbf6040, metricId=0, metricIncr=...,
skipInner=0, isSync=0, data=0x7f513dbf5cc0) at ../../../../src/tool/hpcrun/sample_event.c:239
#17 0x00007f515b4a9020 in record_sample (current=0x7f513d2d9d68, mmap_data=0x7f513dbf5d40,
context=0x7f513dbf6040, sv=0x7f513dbf5f10)
at ../../../../src/tool/hpcrun/sample-sources/perf/linux_perf.c:567
#18 0x00007f515b4aa388 in perf_event_handler (sig=38, siginfo=0x7f513dbf6170, context=0x7f513dbf6040)
at ../../../../src/tool/hpcrun/sample-sources/perf/linux_perf.c:1130
#19 0x00007f515b25846d in monitor_signal_handler (sig=38, info=0x7f513dbf6170, context=0x7f513dbf6040)
at ../../libmonitor/src/signal.c:217
#20 <signal handler called>
#21 0x00007f515ad25cd9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#22 0x00007f515b7041cb in tls_get_addr_tail () from /lib64/ld-linux-x86-64.so.2
#23 0x00007f515b709af8 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#24 0x00007f515aff3478 in __kmp_launch_worker (thr=0x15e1600)
at /home/johnmc/pkgs-src/openmp/runtime/src/z_Linux_util.cpp:510
#25 0x00007f515b26112b in monitor_thread_fence2 () at ../../libmonitor/src/pthread.c:975
#26 0x00007f515ad23dd5 in start_thread () from /lib64/libpthread.so.0
#27 0x00007f515aa4db3d in clone () from /lib64/libc.so.6
(gdb) frame 24
#24 0x00007f515aff3478 in __kmp_launch_worker (thr=0x15e1600)
at /home/johnmc/pkgs-src/openmp/runtime/src/z_Linux_util.cpp:510
510 __kmp_gtid = gtid;
(gdb) frame 5
#5 0x00007f515af7fd90 in __kmp_get_global_thread_id ()
at /home/johnmc/pkgs-src/openmp/runtime/src/kmp_runtime.cpp:132
132 return __kmp_gtid;
Below is the state for frame 21, which shows that the PC is at the instruction following the cmpxchg and that the ZF bit is set, indicating the cmpxchg instruction succeeded, according to the instruction semantics below:
Semantics of the cmpxchg instruction
(* Accumulator = AL, AX, EAX, or RAX depending on whether a byte, word, doubleword, or quadword comparison is being performed *)
IF EAX = DEST THEN
ZF ← 1;
DEST ← SRC;
ELSE
ZF ← 0;
EAX ← DEST;
FI;
(gdb) frame 21
#21 0x00007f515ad25cd9 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) x/i 0x00007f515ad25cd9-5
0x7f515ad25cd4 <pthread_mutex_lock+164>: lock cmpxchg %edi,(%r8)
(gdb)
=> 0x7f515ad25cd9 <pthread_mutex_lock+169>: jne 0x7f515ad25dd7 <_L_lock_870>
(gdb) info reg
rax 0x0 0
rbx 0x7f515b911b18 139987405314840
rcx 0x43 67
rdx 0x0 0
rsi 0x0 0
rdi 0x1 1
rbp 0x7f513dbf6620 0x7f513dbf6620
rsp 0x7f513dbf65d8 0x7f513dbf65d8
r8 0x7f515b914908 139987405326600
r9 0x0 0
r10 0x22 34
r11 0x7f515aadc440 139987390415936
r12 0x7f515b241f28 139987398172456
r13 0x501980 5249408
r14 0x4 4
r15 0x7f515b46f840 139987400456256
rip 0x7f515ad25cd9 0x7f515ad25cd9 <pthread_mutex_lock+169>
eflags 0x246 [ PF ZF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
Below is the CMPXCHG instruction in the context of the code for pthread_mutex_lock. The annotated instruction at pthread_mutex_lock+164 is the compare-and-exchange that succeeded, and the => arrow at +169 marks where the profiler interrupted. The annotated instructions at +175 and +104 through +116 complete the acquisition of the recursive lock by setting the count, owner, and nusers fields.
0x7f515ad25c30 <pthread_mutex_lock>: mov 0x10(%rdi),%esi
0x7f515ad25c33 <pthread_mutex_lock+3>: mov %rdi,%r8
0x7f515ad25c36 <pthread_mutex_lock+6>: mov %esi,%eax
0x7f515ad25c38 <pthread_mutex_lock+8>: and $0x17f,%eax
0x7f515ad25c3d <pthread_mutex_lock+13>: nop
0x7f515ad25c3e <pthread_mutex_lock+14>: mov %esi,%edx
0x7f515ad25c40 <pthread_mutex_lock+16>: and $0x7c,%edx
0x7f515ad25c43 <pthread_mutex_lock+19>: jne 0x7f515ad25ce9 <pthread_mutex_lock+185>
0x7f515ad25c49 <pthread_mutex_lock+25>: test %eax,%eax
0x7f515ad25c4b <pthread_mutex_lock+27>: jne 0x7f515ad25cad <pthread_mutex_lock+125>
0x7f515ad25c4d <pthread_mutex_lock+29>: mov 0x21182d(%rip),%eax # 0x7f515af37480 <__pthread_force_elision>
0x7f515ad25c53 <pthread_mutex_lock+35>: test %eax,%eax
0x7f515ad25c55 <pthread_mutex_lock+37>: je 0x7f515ad25c80 <pthread_mutex_lock+80>
0x7f515ad25c57 <pthread_mutex_lock+39>: test $0x300,%esi
0x7f515ad25c5d <pthread_mutex_lock+45>: jne 0x7f515ad25c80 <pthread_mutex_lock+80>
0x7f515ad25c5f <pthread_mutex_lock+47>: or $0x100,%esi
0x7f515ad25c65 <pthread_mutex_lock+53>: mov %esi,0x10(%rdi)
0x7f515ad25c68 <pthread_mutex_lock+56>: mov %esi,%edx
0x7f515ad25c6a <pthread_mutex_lock+58>: lea 0x16(%r8),%rsi
0x7f515ad25c6e <pthread_mutex_lock+62>: mov %r8,%rdi
0x7f515ad25c71 <pthread_mutex_lock+65>: and $0x80,%edx
0x7f515ad25c77 <pthread_mutex_lock+71>: jmpq 0x7f515ad2c9a0 <__lll_lock_elision>
0x7f515ad25c7c <pthread_mutex_lock+76>: nopl 0x0(%rax)
0x7f515ad25c80 <pthread_mutex_lock+80>: and $0x80,%esi
0x7f515ad25c86 <pthread_mutex_lock+86>: mov $0x1,%edi
0x7f515ad25c8b <pthread_mutex_lock+91>: xor %eax,%eax
0x7f515ad25c8d <pthread_mutex_lock+93>: lock cmpxchg %edi,(%r8)
0x7f515ad25c92 <pthread_mutex_lock+98>: jne 0x7f515ad25dbc <_L_lock_812>
0x7f515ad25c98 <pthread_mutex_lock+104>: mov %fs:0x2d0,%eax // get owner
0x7f515ad25ca0 <pthread_mutex_lock+112>: mov %eax,0x8(%r8) // set mutex->__data.__owner
0x7f515ad25ca4 <pthread_mutex_lock+116>: addl $0x1,0xc(%r8) // increment mutex->__data.__nusers
0x7f515ad25ca9 <pthread_mutex_lock+121>: nop
0x7f515ad25caa <pthread_mutex_lock+122>: xor %eax,%eax
0x7f515ad25cac <pthread_mutex_lock+124>: retq
0x7f515ad25cad <pthread_mutex_lock+125>: cmp $0x100,%eax
0x7f515ad25cb2 <pthread_mutex_lock+130>: je 0x7f515ad25c68 <pthread_mutex_lock+56>
0x7f515ad25cb4 <pthread_mutex_lock+132>: mov %esi,%edi
0x7f515ad25cb6 <pthread_mutex_lock+134>: and $0x7f,%edi
0x7f515ad25cb9 <pthread_mutex_lock+137>: cmp $0x1,%edi
0x7f515ad25cbc <pthread_mutex_lock+140>: jne 0x7f515ad25d05 <pthread_mutex_lock+213>
0x7f515ad25cbe <pthread_mutex_lock+142>: mov %fs:0x2d0,%eax
0x7f515ad25cc6 <pthread_mutex_lock+150>: cmp 0x8(%r8),%eax
0x7f515ad25cca <pthread_mutex_lock+154>: je 0x7f515ad25cee <pthread_mutex_lock+190>
0x7f515ad25ccc <pthread_mutex_lock+156>: and $0x80,%esi
0x7f515ad25cd2 <pthread_mutex_lock+162>: mov %edx,%eax
0x7f515ad25cd4 <pthread_mutex_lock+164>: lock cmpxchg %edi,(%r8) // successful CAS of mutex->__data.__lock
=> 0x7f515ad25cd9 <pthread_mutex_lock+169>: jne 0x7f515ad25dd7 <_L_lock_870> <--profiler interrupts here
0x7f515ad25cdf <pthread_mutex_lock+175>: movl $0x1,0x4(%r8) // set mutex->__data.__count to 1
0x7f515ad25ce7 <pthread_mutex_lock+183>: jmp 0x7f515ad25c98 <pthread_mutex_lock+104>
0x7f515ad25ce9 <pthread_mutex_lock+185>: jmpq 0x7f515ad218b0 <__pthread_mutex_lock_full>
0x7f515ad25cee <pthread_mutex_lock+190>: mov 0x4(%r8),%eax
0x7f515ad25cf2 <pthread_mutex_lock+194>: cmp $0xffffffff,%eax
0x7f515ad25cf5 <pthread_mutex_lock+197>: je 0x7f515ad25d84 <pthread_mutex_lock+340>
0x7f515ad25cfb <pthread_mutex_lock+203>: add $0x1,%eax
0x7f515ad25cfe <pthread_mutex_lock+206>: mov %eax,0x4(%r8)
0x7f515ad25d02 <pthread_mutex_lock+210>: xor %eax,%eax
0x7f515ad25d04 <pthread_mutex_lock+212>: retq
0x7f515ad25d05 <pthread_mutex_lock+213>: cmp $0x3,%edi
0x7f515ad25d08 <pthread_mutex_lock+216>: jne 0x7f515ad25da4 <pthread_mutex_lock+372>
0x7f515ad25d0e <pthread_mutex_lock+222>: cmpl $0x0,0x2116cf(%rip) # 0x7f515af373e4 <__is_smp>
0x7f515ad25d15 <pthread_mutex_lock+229>: je 0x7f515ad25c80 <pthread_mutex_lock+80>
0x7f515ad25d1b <pthread_mutex_lock+235>: mov %edx,%eax
0x7f515ad25d1d <pthread_mutex_lock+237>: mov $0x1,%edx
0x7f515ad25d22 <pthread_mutex_lock+242>: lock cmpxchg %edx,(%r8)
0x7f515ad25d27 <pthread_mutex_lock+247>: test %eax,%eax
0x7f515ad25d29 <pthread_mutex_lock+249>: je 0x7f515ad25c98 <pthread_mutex_lock+104>
0x7f515ad25d2f <pthread_mutex_lock+255>: movswl 0x14(%r8),%eax
0x7f515ad25d34 <pthread_mutex_lock+260>: mov $0x1,%esi
0x7f515ad25d39 <pthread_mutex_lock+265>: lea 0xa(%rax,%rax,1),%ecx
0x7f515ad25d3d <pthread_mutex_lock+269>: mov $0x64,%eax
0x7f515ad25d42 <pthread_mutex_lock+274>: cmp $0x64,%ecx
0x7f515ad25d45 <pthread_mutex_lock+277>: cmovg %eax,%ecx
0x7f515ad25d48 <pthread_mutex_lock+280>: xor %dl,%dl
0x7f515ad25d4a <pthread_mutex_lock+282>: xor %dil,%dil
0x7f515ad25d4d <pthread_mutex_lock+285>: add $0x1,%edx
0x7f515ad25d50 <pthread_mutex_lock+288>: lea -0x1(%rdx),%eax
0x7f515ad25d53 <pthread_mutex_lock+291>: cmp %eax,%ecx
0x7f515ad25d55 <pthread_mutex_lock+293>: jle 0x7f515ad25d8a <pthread_mutex_lock+346>
0x7f515ad25d57 <pthread_mutex_lock+295>: pause
0x7f515ad25d59 <pthread_mutex_lock+297>: mov %edi,%eax
0x7f515ad25d5b <pthread_mutex_lock+299>: lock cmpxchg %esi,(%r8)
0x7f515ad25d60 <pthread_mutex_lock+304>: test %eax,%eax
0x7f515ad25d62 <pthread_mutex_lock+306>: jne 0x7f515ad25d4d <pthread_mutex_lock+285>
0x7f515ad25d64 <pthread_mutex_lock+308>: movzwl 0x14(%r8),%ecx
0x7f515ad25d69 <pthread_mutex_lock+313>: mov $0x8,%esi
0x7f515ad25d6e <pthread_mutex_lock+318>: movswl %cx,%eax
0x7f515ad25d71 <pthread_mutex_lock+321>: sub %eax,%edx
0x7f515ad25d73 <pthread_mutex_lock+323>: mov %edx,%eax
0x7f515ad25d75 <pthread_mutex_lock+325>: cltd
0x7f515ad25d76 <pthread_mutex_lock+326>: idiv %esi
0x7f515ad25d78 <pthread_mutex_lock+328>: add %ecx,%eax
0x7f515ad25d7a <pthread_mutex_lock+330>: mov %ax,0x14(%r8)
0x7f515ad25d7f <pthread_mutex_lock+335>: jmpq 0x7f515ad25c98 <pthread_mutex_lock+104>
0x7f515ad25d84 <pthread_mutex_lock+340>: mov $0xb,%eax
0x7f515ad25d89 <pthread_mutex_lock+345>: retq
0x7f515ad25d8a <pthread_mutex_lock+346>: mov 0x10(%r8),%esi
0x7f515ad25d8e <pthread_mutex_lock+350>: mov $0x1,%edi
0x7f515ad25d93 <pthread_mutex_lock+355>: xor %eax,%eax
0x7f515ad25d95 <pthread_mutex_lock+357>: and $0x80,%esi
0x7f515ad25d9b <pthread_mutex_lock+363>: lock cmpxchg %edi,(%r8)
0x7f515ad25da0 <pthread_mutex_lock+368>: jne 0x7f515ad25df2 <_L_lock_1022>
0x7f515ad25da2 <pthread_mutex_lock+370>: jmp 0x7f515ad25d64 <pthread_mutex_lock+308>
0x7f515ad25da4 <pthread_mutex_lock+372>: mov %fs:0x2d0,%eax
0x7f515ad25dac <pthread_mutex_lock+380>: cmp 0x8(%r8),%eax
0x7f515ad25db0 <pthread_mutex_lock+384>: jne 0x7f515ad25c80 <pthread_mutex_lock+80>
0x7f515ad25db6 <pthread_mutex_lock+390>: mov $0x23,%eax
0x7f515ad25dbb <pthread_mutex_lock+395>: retq