[Openmp-dev] Sampling-based performance measurement of LLVM OpenMP runtime leads to deadlock!

John Mellor-Crummey via Openmp-dev openmp-dev at lists.llvm.org
Fri May 11 17:19:35 PDT 2018


While using a sampling-based profiler (HPCToolkit) to measure the performance of an application using a dynamically-linked version of the LLVM OpenMP runtime, I encountered a deadlock on x86_64. Although I haven’t considered other architectures in detail, I believe that they may be similarly affected.

Here’s what I believe I have observed: there is a subtle race condition between TLS setup in an OpenMP runtime and a profiler that inspects that runtime through the OMPT interface.

A thread executing __kmp_launch_worker in the LLVM OpenMP runtime library acquired the lock controlling access to TLS state (__tls_get_addr calls tls_get_addr_tail, which calls pthread_mutex_lock) while setting up the TLS needed to access its thread-local variable __kmp_gtid, in frame 24 of the call stack shown below. Immediately after acquiring the TLS lock by setting its __lock field with a CMPXCHG, but before recording the lock owner or finishing TLS setup, the thread was interrupted by our profiler. As a normal part of recording a sample, our profiler uses the OMPT tools API to check whether the thread is an OpenMP thread by inspecting the thread id maintained by the OpenMP runtime. A call to a runtime entry point through the OMPT API led to an access to __kmp_gtid in frame 5 of the call stack. However, TLS for the OpenMP runtime shared library had still not been set up for this thread, so the access to __kmp_gtid went through the same protocol as before (__tls_get_addr calls tls_get_addr_tail, which calls pthread_mutex_lock). The lock had already been acquired in frame 21, so it was unavailable for acquisition in frames 0-2, causing deadlock. The TLS lock is implemented as a recursive lock, but the profiler interrupted the lock acquisition in libpthread before the owner field of the recursive lock was set, so the inner call to pthread_mutex_lock cannot succeed.

It is a serious problem that a profiler using the OMPT interface can cause a deadlock.

We need a design for the OMPT interface and OpenMP runtime implementations that makes this impossible.

After thinking about this for a while, I believe a profiler can arrange to receive the ompt_callback_thread_begin callback and then set a flag in its own thread-local state to note that the thread is an OpenMP thread. A profiler must not invoke any OMPT runtime entry point on a thread that has not announced itself as an OpenMP thread through a prior ompt_callback_thread_begin callback. An OpenMP runtime should ensure that its TLS is allocated before invoking ompt_callback_thread_begin. Similarly, a profiler shouldn’t invoke any OMPT runtime entry point on a thread after receiving ompt_callback_thread_end.

If a profiler thread doesn’t use the OMPT interface to inspect a thread that hasn’t announced itself as an OpenMP thread, it won’t access any TLS state that the OpenMP library may maintain.

Does anyone care to comment or offer a vision of a different solution?

Below my signature block are some details of the thread state that I observed, in case you want to validate my assessment of the situation.

--
John Mellor-Crummey		Professor
Dept of Computer Science	Rice University
email: johnmc at rice.edu		phone: 713-348-5179




In the 10 frames at the top of the stack, HPCToolkit is querying the OMPT interface to determine the thread’s identity. In frame 5, our profiler asks for the thread’s id by calling __kmp_get_global_thread_id. Below the call stack is a small snippet from the debugger that shows what is going on in frame 5: at line 132 of kmp_runtime.cpp, the runtime is trying to return the value of __kmp_gtid, a thread-local variable. To resolve this variable reference, the runtime invokes __tls_get_addr, which ends up trying to acquire the dl_load_lock, as shown in frame 0.

The problem is that the thread had already acquired the lock in frame 21 before being interrupted by a sample. In frame 21, the thread had just completed a CMPXCHG to acquire the lock. I believe this succeeded: RIP points at the JNE instruction (address 0x7f515ad25cd9) immediately following the CMPXCHG at address 0x7f515ad25cd4, and the ZF condition code is set. As I read the machine state, the CAS succeeded in swapping in a 1 for a 0 to acquire the lock.


In the runtime, __kmp_gtid_mode was observed to be 3, and __kmp_gtid is declared with the __thread modifier.

#if KMP_OS_LINUX && defined(KMP_TDATA_GTID)
int __kmp_gtid_mode = 3; /* use __declspec(thread) TLS to store gtid */
int __kmp_adjust_gtid_mode = FALSE;


#ifdef KMP_TDATA_GTID
#if KMP_OS_WINDOWS
__declspec(thread) int __kmp_gtid = KMP_GTID_DNE;
#else
__thread int __kmp_gtid = KMP_GTID_DNE;
#endif /* KMP_OS_WINDOWS - workaround because Intel(R) Many Integrated Core    \
          compiler 20110316 doesn't accept __declspec */
#endif /* KMP_TDATA_GTID */

The call stack 


(gdb) where
#0  0x00007f515ad2a4cd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f515ad25de6 in _L_lock_870 () from /lib64/libpthread.so.0
#2  0x00007f515ad25cdf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f515b7041cb in tls_get_addr_tail () from /lib64/ld-linux-x86-64.so.2
#4  0x00007f515b709af8 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#5  0x00007f515af7fd90 in __kmp_get_global_thread_id ()
    at /home/johnmc/pkgs-src/openmp/runtime/src/kmp_runtime.cpp:132
#6  0x00007f515b0068e4 in ompt_get_thread ()
    at /home/johnmc/pkgs-src/openmp/runtime/src/ompt-specific.h:63
#7  0x00007f515b003fd7 in __ompt_get_teaminfo (depth=0, size=0x0)
    at /home/johnmc/pkgs-src/openmp/runtime/src/ompt-specific.cpp:47
#8  0x00007f515b004381 in __ompt_get_parallel_id_internal (depth=0)
    at /home/johnmc/pkgs-src/openmp/runtime/src/ompt-specific.cpp:216
#9  0x00007f515b0062bc in ompt_get_parallel_id (ancestor_level=0)
    at /home/johnmc/pkgs-src/openmp/runtime/src/ompt-general.cpp:348
#10 0x00007f515b4a5e6c in hpcrun_ompt_get_parallel_id (level=0)
    at ../../../../src/tool/hpcrun/ompt/ompt-interface.c:618
#11 0x00007f515b4a4ef2 in need_defer_cntxt () at ../../../../src/tool/hpcrun/ompt/ompt-defer.c:148
#12 0x00007f515b4a492a in ompt_backtrace_finalize (bt=0x7f513dbf5af0, isSync=0)
    at ../../../../src/tool/hpcrun/ompt/ompt-callstack.c:551
#13 0x00007f515b490dd0 in cct_backtrace_finalize (bt=0x7f513dbf5af0, isSync=0)
    at ../../../../src/tool/hpcrun/cct_backtrace_finalize.c:51
#14 0x00007f515b490bd9 in help_hpcrun_backtrace2cct (bundle=0x7f513d2f5fa8, context=0x7f513dbf6040, 
    metricId=0, metricIncr=..., skipInner=0, isSync=0, data=0x7f513dbf5d40)
    at ../../../../src/tool/hpcrun/cct_insert_backtrace.c:401
#15 0x00007f515b490651 in hpcrun_backtrace2cct (cct=0x7f513d2f5fa8, context=0x7f513dbf6040, metricId=0, 
    metricIncr=..., skipInner=0, isSync=0, data=0x7f513dbf5d40)
    at ../../../../src/tool/hpcrun/cct_insert_backtrace.c:246
#16 0x00007f515b493f5f in hpcrun_sample_callpath (context=0x7f513dbf6040, metricId=0, metricIncr=..., 
    skipInner=0, isSync=0, data=0x7f513dbf5cc0) at ../../../../src/tool/hpcrun/sample_event.c:239
#17 0x00007f515b4a9020 in record_sample (current=0x7f513d2d9d68, mmap_data=0x7f513dbf5d40, 
    context=0x7f513dbf6040, sv=0x7f513dbf5f10)
    at ../../../../src/tool/hpcrun/sample-sources/perf/linux_perf.c:567
#18 0x00007f515b4aa388 in perf_event_handler (sig=38, siginfo=0x7f513dbf6170, context=0x7f513dbf6040)
    at ../../../../src/tool/hpcrun/sample-sources/perf/linux_perf.c:1130
#19 0x00007f515b25846d in monitor_signal_handler (sig=38, info=0x7f513dbf6170, context=0x7f513dbf6040)
    at ../../libmonitor/src/signal.c:217
#20 <signal handler called>
#21 0x00007f515ad25cd9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#22 0x00007f515b7041cb in tls_get_addr_tail () from /lib64/ld-linux-x86-64.so.2
#23 0x00007f515b709af8 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#24 0x00007f515aff3478 in __kmp_launch_worker (thr=0x15e1600)
    at /home/johnmc/pkgs-src/openmp/runtime/src/z_Linux_util.cpp:510
#25 0x00007f515b26112b in monitor_thread_fence2 () at ../../libmonitor/src/pthread.c:975
#26 0x00007f515ad23dd5 in start_thread () from /lib64/libpthread.so.0
#27 0x00007f515aa4db3d in clone () from /lib64/libc.so.6


(gdb) frame 24
#24 0x00007f515aff3478 in __kmp_launch_worker (thr=0x15e1600)
    at /home/johnmc/pkgs-src/openmp/runtime/src/z_Linux_util.cpp:510
510       __kmp_gtid = gtid;


(gdb) frame 5
#5  0x00007f515af7fd90 in __kmp_get_global_thread_id ()
    at /home/johnmc/pkgs-src/openmp/runtime/src/kmp_runtime.cpp:132
132         return __kmp_gtid;



Below is the state for frame 21, which shows that the PC is at the instruction following the cmpxchg and that the ZF bit is set, indicating the cmpxchg instruction succeeded, according to the instruction semantics below:

Semantics of the CMPXCHG instruction:

(* Accumulator = AL, AX, EAX, or RAX depending on whether a byte, word,
   doubleword, or quadword comparison is being performed *)

IF EAX = DEST
    THEN
        ZF ← 1;
        DEST ← SRC;
    ELSE
        ZF ← 0;
        EAX ← DEST;
FI;


(gdb) frame 21
#21 0x00007f515ad25cd9 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) x/i 0x00007f515ad25cd9-5
   0x7f515ad25cd4 <pthread_mutex_lock+164>:     lock cmpxchg %edi,(%r8)
(gdb) 
=> 0x7f515ad25cd9 <pthread_mutex_lock+169>:     jne    0x7f515ad25dd7 <_L_lock_870>
(gdb) info reg
rax            0x0      0
rbx            0x7f515b911b18   139987405314840
rcx            0x43     67
rdx            0x0      0
rsi            0x0      0
rdi            0x1      1
rbp            0x7f513dbf6620   0x7f513dbf6620
rsp            0x7f513dbf65d8   0x7f513dbf65d8
r8             0x7f515b914908   139987405326600
r9             0x0      0
r10            0x22     34
r11            0x7f515aadc440   139987390415936
r12            0x7f515b241f28   139987398172456
r13            0x501980 5249408
r14            0x4      4
r15            0x7f515b46f840   139987400456256
rip            0x7f515ad25cd9   0x7f515ad25cd9 <pthread_mutex_lock+169>
eflags         0x246    [ PF ZF IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0




Below is the CMPXCHG instruction in the context of the code for pthread_mutex_lock. The annotated CMPXCHG at offset +164 is the compare-and-exchange that succeeded, and the arrow at +169 marks where the profiler interrupted. After the CMPXCHG, control falls through +175 and jumps from +183 back to +104; the commented instructions along that path complete the acquisition of the recursive lock by setting the count, owner, and nusers fields.

  0x7f515ad25c30 <pthread_mutex_lock>: mov    0x10(%rdi),%esi
   0x7f515ad25c33 <pthread_mutex_lock+3>:       mov    %rdi,%r8
   0x7f515ad25c36 <pthread_mutex_lock+6>:       mov    %esi,%eax
   0x7f515ad25c38 <pthread_mutex_lock+8>:       and    $0x17f,%eax
   0x7f515ad25c3d <pthread_mutex_lock+13>:      nop
   0x7f515ad25c3e <pthread_mutex_lock+14>:      mov    %esi,%edx
   0x7f515ad25c40 <pthread_mutex_lock+16>:      and    $0x7c,%edx
   0x7f515ad25c43 <pthread_mutex_lock+19>:      jne    0x7f515ad25ce9 <pthread_mutex_lock+185>
   0x7f515ad25c49 <pthread_mutex_lock+25>:      test   %eax,%eax
   0x7f515ad25c4b <pthread_mutex_lock+27>:      jne    0x7f515ad25cad <pthread_mutex_lock+125>
   0x7f515ad25c4d <pthread_mutex_lock+29>:      mov    0x21182d(%rip),%eax        # 0x7f515af37480 <__pthread_force_elision>
   0x7f515ad25c53 <pthread_mutex_lock+35>:      test   %eax,%eax
   0x7f515ad25c55 <pthread_mutex_lock+37>:      je     0x7f515ad25c80 <pthread_mutex_lock+80>
   0x7f515ad25c57 <pthread_mutex_lock+39>:      test   $0x300,%esi
   0x7f515ad25c5d <pthread_mutex_lock+45>:      jne    0x7f515ad25c80 <pthread_mutex_lock+80>
   0x7f515ad25c5f <pthread_mutex_lock+47>:      or     $0x100,%esi
   0x7f515ad25c65 <pthread_mutex_lock+53>:      mov    %esi,0x10(%rdi)
   0x7f515ad25c68 <pthread_mutex_lock+56>:      mov    %esi,%edx
   0x7f515ad25c6a <pthread_mutex_lock+58>:      lea    0x16(%r8),%rsi
   0x7f515ad25c6e <pthread_mutex_lock+62>:      mov    %r8,%rdi
   0x7f515ad25c71 <pthread_mutex_lock+65>:      and    $0x80,%edx
   0x7f515ad25c77 <pthread_mutex_lock+71>:      jmpq   0x7f515ad2c9a0 <__lll_lock_elision>
   0x7f515ad25c7c <pthread_mutex_lock+76>:      nopl   0x0(%rax)
   0x7f515ad25c80 <pthread_mutex_lock+80>:      and    $0x80,%esi
   0x7f515ad25c86 <pthread_mutex_lock+86>:      mov    $0x1,%edi
   0x7f515ad25c8b <pthread_mutex_lock+91>:      xor    %eax,%eax
   0x7f515ad25c8d <pthread_mutex_lock+93>:      lock cmpxchg %edi,(%r8)
   0x7f515ad25c92 <pthread_mutex_lock+98>:      jne    0x7f515ad25dbc <_L_lock_812>
   0x7f515ad25c98 <pthread_mutex_lock+104>:     mov    %fs:0x2d0,%eax // get owner
   0x7f515ad25ca0 <pthread_mutex_lock+112>:     mov    %eax,0x8(%r8) // set mutex->__data.__owner
   0x7f515ad25ca4 <pthread_mutex_lock+116>:     addl   $0x1,0xc(%r8) // increment mutex->__data.__nusers
   0x7f515ad25ca9 <pthread_mutex_lock+121>:     nop
   0x7f515ad25caa <pthread_mutex_lock+122>:     xor    %eax,%eax
   0x7f515ad25cac <pthread_mutex_lock+124>:     retq   
   0x7f515ad25cad <pthread_mutex_lock+125>:     cmp    $0x100,%eax
   0x7f515ad25cb2 <pthread_mutex_lock+130>:     je     0x7f515ad25c68 <pthread_mutex_lock+56>
   0x7f515ad25cb4 <pthread_mutex_lock+132>:     mov    %esi,%edi
   0x7f515ad25cb6 <pthread_mutex_lock+134>:     and    $0x7f,%edi
   0x7f515ad25cb9 <pthread_mutex_lock+137>:     cmp    $0x1,%edi
   0x7f515ad25cbc <pthread_mutex_lock+140>:     jne    0x7f515ad25d05 <pthread_mutex_lock+213>
   0x7f515ad25cbe <pthread_mutex_lock+142>:     mov    %fs:0x2d0,%eax
   0x7f515ad25cc6 <pthread_mutex_lock+150>:     cmp    0x8(%r8),%eax
   0x7f515ad25cca <pthread_mutex_lock+154>:     je     0x7f515ad25cee <pthread_mutex_lock+190>
   0x7f515ad25ccc <pthread_mutex_lock+156>:     and    $0x80,%esi
   0x7f515ad25cd2 <pthread_mutex_lock+162>:     mov    %edx,%eax
   0x7f515ad25cd4 <pthread_mutex_lock+164>:     lock cmpxchg %edi,(%r8) // successful CAS of mutex->__data.__lock
=> 0x7f515ad25cd9 <pthread_mutex_lock+169>:     jne    0x7f515ad25dd7 <_L_lock_870>  <--profiler interrupts here
   0x7f515ad25cdf <pthread_mutex_lock+175>:     movl   $0x1,0x4(%r8)    // set mutex->__data.__count to 1
   0x7f515ad25ce7 <pthread_mutex_lock+183>:     jmp    0x7f515ad25c98 <pthread_mutex_lock+104>
   0x7f515ad25ce9 <pthread_mutex_lock+185>:     jmpq   0x7f515ad218b0 <__pthread_mutex_lock_full>
   0x7f515ad25cee <pthread_mutex_lock+190>:     mov    0x4(%r8),%eax
   0x7f515ad25cf2 <pthread_mutex_lock+194>:     cmp    $0xffffffff,%eax
   0x7f515ad25cf5 <pthread_mutex_lock+197>:     je     0x7f515ad25d84 <pthread_mutex_lock+340>
   0x7f515ad25cfb <pthread_mutex_lock+203>:     add    $0x1,%eax
   0x7f515ad25cfe <pthread_mutex_lock+206>:     mov    %eax,0x4(%r8)
   0x7f515ad25d02 <pthread_mutex_lock+210>:     xor    %eax,%eax
   0x7f515ad25d04 <pthread_mutex_lock+212>:     retq   
   0x7f515ad25d05 <pthread_mutex_lock+213>:     cmp    $0x3,%edi
   0x7f515ad25d08 <pthread_mutex_lock+216>:     jne    0x7f515ad25da4 <pthread_mutex_lock+372>
   0x7f515ad25d0e <pthread_mutex_lock+222>:     cmpl   $0x0,0x2116cf(%rip)        # 0x7f515af373e4 <__is_smp>
   0x7f515ad25d15 <pthread_mutex_lock+229>:     je     0x7f515ad25c80 <pthread_mutex_lock+80>
   0x7f515ad25d1b <pthread_mutex_lock+235>:     mov    %edx,%eax
   0x7f515ad25d1d <pthread_mutex_lock+237>:     mov    $0x1,%edx
   0x7f515ad25d22 <pthread_mutex_lock+242>:     lock cmpxchg %edx,(%r8)
   0x7f515ad25d27 <pthread_mutex_lock+247>:     test   %eax,%eax
   0x7f515ad25d29 <pthread_mutex_lock+249>:     je     0x7f515ad25c98 <pthread_mutex_lock+104>
   0x7f515ad25d2f <pthread_mutex_lock+255>:     movswl 0x14(%r8),%eax
   0x7f515ad25d34 <pthread_mutex_lock+260>:     mov    $0x1,%esi
   0x7f515ad25d39 <pthread_mutex_lock+265>:     lea    0xa(%rax,%rax,1),%ecx
   0x7f515ad25d3d <pthread_mutex_lock+269>:     mov    $0x64,%eax
   0x7f515ad25d42 <pthread_mutex_lock+274>:     cmp    $0x64,%ecx
   0x7f515ad25d45 <pthread_mutex_lock+277>:     cmovg  %eax,%ecx
   0x7f515ad25d48 <pthread_mutex_lock+280>:     xor    %dl,%dl
   0x7f515ad25d4a <pthread_mutex_lock+282>:     xor    %dil,%dil
   0x7f515ad25d4d <pthread_mutex_lock+285>:     add    $0x1,%edx
   0x7f515ad25d50 <pthread_mutex_lock+288>:     lea    -0x1(%rdx),%eax
   0x7f515ad25d53 <pthread_mutex_lock+291>:     cmp    %eax,%ecx
   0x7f515ad25d55 <pthread_mutex_lock+293>:     jle    0x7f515ad25d8a <pthread_mutex_lock+346>
   0x7f515ad25d57 <pthread_mutex_lock+295>:     pause  
   0x7f515ad25d59 <pthread_mutex_lock+297>:     mov    %edi,%eax
   0x7f515ad25d5b <pthread_mutex_lock+299>:     lock cmpxchg %esi,(%r8)
   0x7f515ad25d60 <pthread_mutex_lock+304>:     test   %eax,%eax
   0x7f515ad25d62 <pthread_mutex_lock+306>:     jne    0x7f515ad25d4d <pthread_mutex_lock+285>
   0x7f515ad25d64 <pthread_mutex_lock+308>:     movzwl 0x14(%r8),%ecx
   0x7f515ad25d69 <pthread_mutex_lock+313>:     mov    $0x8,%esi
   0x7f515ad25d6e <pthread_mutex_lock+318>:     movswl %cx,%eax
   0x7f515ad25d71 <pthread_mutex_lock+321>:     sub    %eax,%edx
   0x7f515ad25d73 <pthread_mutex_lock+323>:     mov    %edx,%eax
   0x7f515ad25d75 <pthread_mutex_lock+325>:     cltd   
   0x7f515ad25d76 <pthread_mutex_lock+326>:     idiv   %esi
   0x7f515ad25d78 <pthread_mutex_lock+328>:     add    %ecx,%eax
   0x7f515ad25d7a <pthread_mutex_lock+330>:     mov    %ax,0x14(%r8)
   0x7f515ad25d7f <pthread_mutex_lock+335>:     jmpq   0x7f515ad25c98 <pthread_mutex_lock+104>
   0x7f515ad25d84 <pthread_mutex_lock+340>:     mov    $0xb,%eax
   0x7f515ad25d89 <pthread_mutex_lock+345>:     retq   
   0x7f515ad25d8a <pthread_mutex_lock+346>:     mov    0x10(%r8),%esi
   0x7f515ad25d8e <pthread_mutex_lock+350>:     mov    $0x1,%edi
   0x7f515ad25d93 <pthread_mutex_lock+355>:     xor    %eax,%eax
   0x7f515ad25d95 <pthread_mutex_lock+357>:     and    $0x80,%esi
   0x7f515ad25d9b <pthread_mutex_lock+363>:     lock cmpxchg %edi,(%r8)
   0x7f515ad25da0 <pthread_mutex_lock+368>:     jne    0x7f515ad25df2 <_L_lock_1022>
   0x7f515ad25da2 <pthread_mutex_lock+370>:     jmp    0x7f515ad25d64 <pthread_mutex_lock+308>
   0x7f515ad25da4 <pthread_mutex_lock+372>:     mov    %fs:0x2d0,%eax
   0x7f515ad25dac <pthread_mutex_lock+380>:     cmp    0x8(%r8),%eax
   0x7f515ad25db0 <pthread_mutex_lock+384>:     jne    0x7f515ad25c80 <pthread_mutex_lock+80>
   0x7f515ad25db6 <pthread_mutex_lock+390>:     mov    $0x23,%eax
   0x7f515ad25dbb <pthread_mutex_lock+395>:     retq 