<div dir="ltr"><div>It is still not clear what went wrong. I just installed the clang 11 release on my local cluster with sm_35, and no issues show up.</div><div>Is this issue exposed only by a complicated app, or does even a simple "omp target" region hang? Are you able to run any CUDA program at all?</div><div>The call stack indicates that cuDevicePrimaryCtxRetain tries to interact with the driver, but the driver doesn't respond and keeps the host side waiting.</div><div>It is also not clear whether 
/usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1 is consistent with the running Linux kernel driver.</div><div>Could you ask Jülich whether they have any clue about the settings on that machine?<br></div><div></div><div>Ye<br></div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>Computational Science Division & Leadership Computing Facility<br>
Argonne National Laboratory</div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Sep 28, 2020 at 6:22 PM Itaru Kitayama <<a href="mailto:itaru.kitayama@gmail.com">itaru.kitayama@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">$ which nvcc<br>
/usr/local/software/jureca/Stages/2019a/software/CUDA/10.1.105/bin/nvcc<br>
[kitayama1@jrc0004 kitayama1]$ nvcc --version<br>
nvcc: NVIDIA (R) Cuda compiler driver<br>
Copyright (c) 2005-2019 NVIDIA Corporation<br>
Built on Fri_Feb__8_19:08:17_PST_2019<br>
Cuda compilation tools, release 10.1, V10.1.105<br>
[kitayama1@jrc0004 kitayama1]$ ldd<br>
/p/project/cjzam11/kitayama1/opt/clang/current/lib/<a href="http://libomptarget.rtl.cuda.so" rel="noreferrer" target="_blank">libomptarget.rtl.cuda.so</a><br>
linux-vdso.so.1 =>  (0x00007ffc2a767000)<br>
libcuda.so.1 =><br>
/usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
(0x00002ac5418b9000)<br>
libelf.so.1 => /usr/lib64/libelf.so.1 (0x00002ac542aa1000)<br>
libc++.so.1 => /p/project/cjzam11/kitayama1/opt/clang/current/lib/libc++.so.1<br>
(0x00002ac5416d7000)<br>
libc++abi.so.1 =><br>
/p/project/cjzam11/kitayama1/opt/clang/current/lib/libc++abi.so.1<br>
(0x00002ac5417a0000)<br>
libm.so.6 => /usr/lib64/libm.so.6 (0x00002ac542cb9000)<br>
libgcc_s.so.1 =><br>
/usr/local/software/jureca/Stages/2019a/software/GCCcore/8.3.0/lib64/libgcc_s.so.1<br>
(0x00002ac5417e2000)<br>
libc.so.6 => /usr/lib64/libc.so.6 (0x00002ac542fbb000)<br>
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00002ac543389000)<br>
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00002ac54358d000)<br>
librt.so.1 => /usr/lib64/librt.so.1 (0x00002ac5437a9000)<br>
libz.so.1 => /usr/local/software/jureca/Stages/2019a/software/zlib/1.2.11-GCCcore-8.3.0/lib/libz.so.1<br>
(0x00002ac5417fd000)<br>
/lib64/ld-linux-x86-64.so.2 (0x00002ac541695000)<br>
libatomic.so.1 =><br>
/usr/local/software/jureca/Stages/2019a/software/GCCcore/8.3.0/lib64/libatomic.so.1<br>
(0x00002ac541816000)<br>
[kitayama1@jrc0004 kitayama1]$ nvidia-smi<br>
Tue Sep 29 01:21:23 2020<br>
+-----------------------------------------------------------------------------+<br>
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |<br>
|-------------------------------+----------------------+----------------------+<br>
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |<br>
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |<br>
|===============================+======================+======================|<br>
|   0  Tesla K80           On   | 00000000:06:00.0 Off |                    0 |<br>
| N/A   27C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |<br>
+-------------------------------+----------------------+----------------------+<br>
|   1  Tesla K80           On   | 00000000:07:00.0 Off |                    0 |<br>
| N/A   27C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |<br>
+-------------------------------+----------------------+----------------------+<br>
|   2  Tesla K80           On   | 00000000:86:00.0 Off |                    0 |<br>
| N/A   29C    P8    25W / 149W |      0MiB / 11441MiB |      0%      Default |<br>
+-------------------------------+----------------------+----------------------+<br>
|   3  Tesla K80           On   | 00000000:87:00.0 Off |                    0 |<br>
| N/A   27C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |<br>
+-------------------------------+----------------------+----------------------+<br>
<br>
+-----------------------------------------------------------------------------+<br>
| Processes:                                                       GPU Memory |<br>
|  GPU       PID   Type   Process name                             Usage      |<br>
|=============================================================================|<br>
|  No running processes found                                                 |<br>
+-----------------------------------------------------------------------------+<br>
<br>
On Tue, Sep 29, 2020 at 7:45 AM Ye Luo <<a href="mailto:xw111luoye@gmail.com" target="_blank">xw111luoye@gmail.com</a>> wrote:<br>
><br>
> Could you provide<br>
> `which nvcc`<br>
> `nvcc --version`<br>
> `ldd /p/project/cjzam11/kitayama1/opt/clang/current/lib/<a href="http://libomptarget.rtl.cuda.so" rel="noreferrer" target="_blank">libomptarget.rtl.cuda.so</a>`<br>
> and nvidia-smi output?<br>
> Ye<br>
><br>
> ===================<br>
> Ye Luo, Ph.D.<br>
> Computational Science Division & Leadership Computing Facility<br>
> Argonne National Laboratory<br>
><br>
><br>
> On Mon, Sep 28, 2020 at 5:11 PM Itaru Kitayama via Openmp-dev <<a href="mailto:openmp-dev@lists.llvm.org" target="_blank">openmp-dev@lists.llvm.org</a>> wrote:<br>
>><br>
>> This happens in an unpredictable way even though I launch the app the same way.<br>
>><br>
>> On Mon, Sep 28, 2020 at 7:34 AM Itaru Kitayama <<a href="mailto:itaru.kitayama@gmail.com" target="_blank">itaru.kitayama@gmail.com</a>> wrote:<br>
>> ><br>
>> > No, I take that back. Here's the backtrace:<br>
>> ><br>
>> > (gdb) where<br>
>> > #0  0x00002aaaaaacd6c2 in clock_gettime ()<br>
>> > #1  0x00002aaaabd167fd in clock_gettime () from /usr/lib64/libc.so.6<br>
>> > #2  0x00002aaaac97837e in ?? ()<br>
>> >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > #3  0x00002aaaaca3c4f7 in ?? ()<br>
>> >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > #4  0x00002aaaac87240a in ?? ()<br>
>> >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > #5  0x00002aaaac91bfbe in ?? ()<br>
>> >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > #6  0x00002aaaac91e0d7 in ?? ()<br>
>> >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > #7  0x00002aaaac848719 in ?? ()<br>
>> >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > #8  0x00002aaaac9ba15e in cuDevicePrimaryCtxRetain ()<br>
>> >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > #9  0x00002aaaac514757 in __tgt_rtl_init_device ()<br>
>> >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/<a href="http://libomptarget.rtl.cuda.so" rel="noreferrer" target="_blank">libomptarget.rtl.cuda.so</a><br>
>> > #10 0x00002aaaab9b88bb in DeviceTy::init() ()<br>
>> >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/libomptarget.so<br>
>> > #11 0x00002aaaac279348 in std::__1::__call_once(unsigned long<br>
>> > volatile&, void*, void (*)(void*)) ()<br>
>> >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/libc++.so.1<br>
>> > #12 0x00002aaaab9b8d88 in device_is_ready(int) ()<br>
>> >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/libomptarget.so<br>
>> > #13 0x00002aaaab9c5296 in CheckDeviceAndCtors(long) ()<br>
>> >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/libomptarget.so<br>
>> > #14 0x00002aaaab9bbead in __tgt_target_data_begin_mapper ()<br>
>> >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/libomptarget.so<br>
>> > #15 0x00002aaaaabfaa58 in nest::SimulationManager::initialize() (this=0x5d3290)<br>
>> >     at /p/project/cjzam11/kitayama1/projects/nest-simulator/nestkernel/simulation_manager.cpp:76<br>
>> > #16 0x00002aaaaabf2c69 in nest::KernelManager::initialize() (this=0x5d3190)<br>
>> >     at /p/project/cjzam11/kitayama1/projects/nest-simulator/nestkernel/kernel_manager.cpp:88<br>
>> > #17 0x0000000000405769 in neststartup(int*, char***, SLIInterpreter&) (<br>
>> >     argc=argc@entry=0x7fffffff0a84, argv=argv@entry=0x7fffffff0a88, engine=...)<br>
>> >     at /p/project/cjzam11/kitayama1/projects/nest-simulator/nest/neststartup.cpp:87<br>
>> > #18 0x0000000000405650 in main (argc=<optimized out>, argv=<optimized out>)<br>
>> >     at /p/project/cjzam11/kitayama1/projects/nest-simulator/nest/main.cpp:42<br>
>> ><br>
>> > On Mon, Sep 28, 2020 at 5:22 AM Itaru Kitayama <<a href="mailto:itaru.kitayama@gmail.com" target="_blank">itaru.kitayama@gmail.com</a>> wrote:<br>
>> > ><br>
>> > > I obtained the desired result (a crash) without a Spack environment.<br>
>> > ><br>
>> > > On Sun, Sep 27, 2020 at 1:13 PM Itaru Kitayama <<a href="mailto:itaru.kitayama@gmail.com" target="_blank">itaru.kitayama@gmail.com</a>> wrote:<br>
>> > > ><br>
>> > > > (gdb) where<br>
>> > > > #0  0x00002aaaaaacd6c2 in clock_gettime ()<br>
>> > > > #1  0x00002aaaabd347fd in clock_gettime () from /usr/lib64/libc.so.6<br>
>> > > > #2  0x00002aaaac98737e in ?? ()<br>
>> > > >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > > > #3  0x00002aaaaca4b4f7 in ?? ()<br>
>> > > >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > > > #4  0x00002aaaac88140a in ?? ()<br>
>> > > >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > > > #5  0x00002aaaac92afbe in ?? ()<br>
>> > > >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > > > #6  0x00002aaaac92d0d7 in ?? ()<br>
>> > > >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > > > #7  0x00002aaaac857719 in ?? ()<br>
>> > > >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > > > #8  0x00002aaaac9c915e in cuDevicePrimaryCtxRetain ()<br>
>> > > >    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1<br>
>> > > > #9  0x00002aaaac523757 in __tgt_rtl_init_device ()<br>
>> > > >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/<a href="http://libomptarget.rtl.cuda.so" rel="noreferrer" target="_blank">libomptarget.rtl.cuda.so</a><br>
>> > > > #10 0x00002aaaaaca28bb in DeviceTy::init() ()<br>
>> > > >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/libomptarget.so<br>
>> > > > #11 0x00002aaaac297348 in std::__1::__call_once(unsigned long<br>
>> > > > volatile&, void*, void (*)(void*)) ()<br>
>> > > >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/libc++.so.1<br>
>> > > > #12 0x00002aaaaaca2d88 in device_is_ready(int) ()<br>
>> > > >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/libomptarget.so<br>
>> > > > #13 0x00002aaaaacaf296 in CheckDeviceAndCtors(long) ()<br>
>> > > >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/libomptarget.so<br>
>> > > > #14 0x00002aaaaaca5ead in __tgt_target_data_begin_mapper ()<br>
>> > > >    from /p/project/cjzam11/kitayama1/opt/clang/current/lib/libomptarget.so<br>
>> > > > #15 0x00002aaaab3a4958 in nest::SimulationManager::initialize() (this=0x5d3480)<br>
>> > > >     at /p/project/cjzam11/kitayama1/projects/nest-simulator/nestkernel/simulation_manager.cpp:76<br>
>> > > > #16 0x00002aaaab39cbb9 in nest::KernelManager::initialize() (this=0x5d3380)<br>
>> > > >     at /p/project/cjzam11/kitayama1/projects/nest-simulator/nestkernel/kernel_manager.cpp:88<br>
>> > > > #17 0x0000000000405769 in neststartup(int*, char***, SLIInterpreter&) (<br>
>> > > >     argc=argc@entry=0x7ffffffee554, argv=argv@entry=0x7ffffffee558, engine=...)<br>
>> > > >     at /p/project/cjzam11/kitayama1/projects/nest-simulator/nest/neststartup.cpp:87<br>
>> > > > #18 0x0000000000405650 in main (argc=<optimized out>, argv=<optimized out>)<br>
>> > > >     at /p/project/cjzam11/kitayama1/projects/nest-simulator/nest/main.cpp:42<br>
>> > > ><br>
>> > > > On Sun, Sep 27, 2020 at 12:55 PM Itaru Kitayama<br>
>> > > > <<a href="mailto:itaru.kitayama@gmail.com" target="_blank">itaru.kitayama@gmail.com</a>> wrote:<br>
>> > > > ><br>
>> > > > >  and when this happens, no signal is caught immediately by the system.<br>
>> > > > ><br>
>> > > > > On Sun, Sep 27, 2020 at 12:52 PM Itaru Kitayama<br>
>> > > > > <<a href="mailto:itaru.kitayama@gmail.com" target="_blank">itaru.kitayama@gmail.com</a>> wrote:<br>
>> > > > > ><br>
>> > > > > > I see this often when executing my work-in-progress offloading app on x86<br>
>> > > > > > with an older NVIDIA GPU (sm_35). Can someone enlighten me on this so I<br>
>> > > > > > can solve it quickly?<br>
>> > > > > ><br>
>> > > > > > Thanks,<br>
>> _______________________________________________<br>
>> Openmp-dev mailing list<br>
>> <a href="mailto:Openmp-dev@lists.llvm.org" target="_blank">Openmp-dev@lists.llvm.org</a><br>
>> <a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a><br>
</blockquote></div>