[Openmp-dev] OpenMP target region deadlock when using -O3 with 7.0.0 release branch

Mon Aug 6 14:27:58 PDT 2018

Hello all,

I encounter a deadlock in OpenMP target offload regions when compiling with
-O3 or -Ofast. The deadlock does not occur with -O2. I am using the
7.0.0 release branch, specifically commit e7966c0 from Aug 2.

The last 5 lines of the OpenMP debug output are
"""
Libomptarget --> Launching target execution __omp_offloading_34_3b44cbf_main_l67 with pointer 0x000000003d6a11a0 (index=0).
Target CUDA RTL --> Setting CUDA threads per block to default 128
Target CUDA RTL --> Using default number of teams 128
Target CUDA RTL --> Launch kernel with 128 blocks and 128 threads
Target CUDA RTL --> Launch of entry point at 0x000000003d6a11a0 successful!
"""

The stack trace shows that Thread 1 is waiting in cuCtxSynchronize. The gdb
backtrace is
"""
(gdb) thread apply all bt

Thread 3 (Thread 0x200007b0f180 (LWP 31472)):
#0  0x000020000039b31c in accept4 () from /lib64/libc.so.6
#1  0x0000200000aa469c in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x0000200000a8f13c in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x0000200000aa5d7c in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x0000200000238af4 in start_thread () from /lib64/libpthread.so.0
#5  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x20000821f180 (LWP 31473)):
#0  0x0000200000387ad8 in poll () from /lib64/libc.so.6
#1  0x0000200000aa2f30 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x0000200000b299f4 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x0000200000aa5d7c in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x0000200000238af4 in start_thread () from /lib64/libpthread.so.0
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1 (Thread 0x2000000463f0 (LWP 31459)):
#0  0x00002000002419d4 in do_futex_wait () from /lib64/libpthread.so.0
#1  0x0000200000241aec in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x0000200000aa39a0 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x000020000097cba0 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x0000200000b638b4 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#5  0x0000200000a38ffc in ?? () from /usr/lib64/nvidia/libcuda.so.1
#6  0x000020000094de18 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#7  0x0000200000adccec in cuCtxSynchronize () from /usr/lib64/nvidia/libcuda.so.1
#8  0x0000200000835a74 in __tgt_rtl_run_target_team_region (device_id=1031115376, tgt_entry_ptr=0x0, tgt_args=0x3cd8d688, tgt_offsets=0x3cd8d3f0, arg_num=0, team_num=-358104080, thread_limit=32767, loop_tripcount=1140852866)
    at /autofs/nccs-svm1_home1/cdaley/Repos/software_packages/llvm-openmp/llvm-openmp-build.htzoek/openmp/libomptarget/plugins/cuda/src/rtl.cpp:735
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
"""

I can reproduce the deadlock on two platforms: one with P9+V100 and another
with Intel-Haswell+V100. It does not occur when using Clang-ykt.

The deadlock can be reproduced with the test kernel at
https://bitbucket.org/crpl_cisc/sollve_vv/src/master/tests/application_kernels/mmm_target_parallel_for_simd.c
compiled with the following command:
clang -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda mmm_target_parallel_for_simd.c -o mmm_target_parallel_for_simd
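
For readers who cannot open the repository, here is a minimal sketch of the
kind of kernel involved: a matrix-matrix multiply inside a "target parallel
for simd" region. It is not the SOLLVE test itself. The function name
mmm_offload_check, the input values, and the placement of the directive on
the outer loop are assumptions made for illustration only, and I have not
confirmed that this reduced form triggers the same deadlock.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the test kernel (not the SOLLVE source).
 * Computes C = A * B for n x n matrices in an offloaded region and returns
 * the number of elements that differ from a host-side reference,
 * or -1 if allocation fails. */
int mmm_offload_check(int n)
{
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c = malloc(count * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1;
    }

    /* Small integer inputs keep every sum exact in double precision,
     * so the host and device results can be compared with ==. */
    for (size_t i = 0; i < count; ++i) {
        a[i] = (double)(i % 7);
        b[i] = (double)(i % 5);
        c[i] = 0.0;
    }

    /* Offloaded region: each iteration of the outer loop fills one row of C.
     * Assumption: the directive is applied to the outer loop. */
    #pragma omp target parallel for simd \
        map(to: a[0:count], b[0:count]) map(from: c[0:count])
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            c[(size_t)i * n + j] = sum;
        }
    }

    /* Host reference: recompute each element and count mismatches. */
    int errors = 0;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            if (c[(size_t)i * n + j] != sum)
                ++errors;
        }
    }

    free(a); free(b); free(c);
    return errors;
}
```

Calling mmm_offload_check from main and building with the clang command
above exercises the offload path. Without -fopenmp the pragma is ignored and
the loop runs on the host, which gives a baseline for comparison.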

Am I doing anything obviously wrong? Can you also reproduce the deadlock?

Thanks for any help,
Chris