[Openmp-dev] [EXTERNAL] Re: OpenMP offloading app becomes unresponsive

Ye Luo via Openmp-dev openmp-dev at lists.llvm.org
Wed Sep 23 19:48:33 PDT 2020


1. Please show the full call stack.
2. Are you able to run a very simple OpenMP program, e.g. just an empty
"omp target" region?
3. My current feeling is that when you built the libomptarget plugins, the
cuda.h used may not have been consistent with
/usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory


On Wed, Sep 23, 2020 at 9:22 PM Itaru Kitayama via Openmp-dev <
openmp-dev at lists.llvm.org> wrote:

> If I run it with CUDA-gdb I get:
>
> Target CUDA RTL --> Init requires flags to 1
> Target CUDA RTL --> Getting device 0
> Target CUDA RTL --> The primary context is inactive, set its flags to
> CU_CTX_SCHED_BLOCKING_SYNC
> [New Thread 0x2aaaae5e3700 (LWP 4154)]
> ^C
> Thread 1 "nest" received signal SIGINT, Interrupt.
> 0x00002aaaad2e5a1c in cuVDPAUCtxCreate ()
>    from
> /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1
>
> On Thu, Sep 24, 2020 at 8:56 AM Itaru Kitayama <itaru.kitayama at gmail.com>
> wrote:
> >
> > With trunk Clang and CUDA Toolkit 10.1.105 on JURECA at JSC, I started
> > seeing a hang:
> >
> > Libomptarget --> Call to omp_get_num_devices returning 1
> > Libomptarget --> Default TARGET OFFLOAD policy is now mandatory
> > (devices were found)
> > Libomptarget --> Entering data begin region for device -1 with 1 mappings
> > Libomptarget --> Use default device id 0
> > Libomptarget --> Checking whether device 0 is ready.
> > Libomptarget --> Is the device 0 (local ID 0) initialized? 0
> > Target CUDA RTL --> Init requires flags to 1
> > Target CUDA RTL --> Getting device 0
> > Target CUDA RTL --> The primary context is inactive, set its flags to
> > CU_CTX_SCHED_BLOCKING_SYNC
> >
> > Getting back to the prompt takes a long time, or I need to hit Ctrl+C
> > or Ctrl+Z many times.
> >
> > On Thu, Sep 24, 2020 at 7:31 AM Itaru Kitayama <itaru.kitayama at gmail.com>
> wrote:
> > >
> > > Should I back off from my ThunderX2 while a fix is being developed?
> > >
> > > On Thu, Sep 24, 2020 at 5:36 AM Johannes Doerfert
> > > <johannesdoerfert at gmail.com> wrote:
> > > >
> > > > This could be a side effect of something else, namely the runtime
> > > > unloading order.
> > > > @Jon @Shilei where are we with fixing those issues?
> > > >
> > > > On 9/23/20 2:58 AM, Itaru Kitayama wrote:
> > > > > I think I was running my offloading app with a CUDA Toolkit that
> > > > > I had loaded via Spack, but the app itself was built with Clang
> > > > > (plus the CUDA Toolkit the local admins provide via modules).
> > > > >
> > > > > Still, should the effect be this drastic? I mean, totally locking
> > > > > up a ThunderX2 node?
> > > > >
> > > > > On Mon, Sep 21, 2020 at 9:58 PM Huber, Joseph <huberjn at ornl.gov>
> wrote:
> > > > >> The runtime library just calls abort() immediately after printing
> > > > >> that last "Failure while offloading was mandatory" message. I'm not
> > > > >> sure what would be causing the process to hang after that if SIGABRT
> > > > >> isn't being caught.
> > > > >> ________________________________
> > > > >> From: Itaru Kitayama <itaru.kitayama at gmail.com>
> > > > >> Sent: Saturday, September 19, 2020 2:52 AM
> > > > >> To: Johannes Doerfert <johannesdoerfert at gmail.com>
> > > > >> Cc: openmp-dev <openmp-dev at lists.llvm.org>; Huber, Joseph <
> huberjn at ornl.gov>
> > > > >> Subject: [EXTERNAL] Re: [Openmp-dev] OpenMP offloading app becomes
> > > > >> unresponsive
> > > > >>
> > > > >> I mean: the kernel gets aborted and I see a session prompt on
> > > > >> JURECA at JSC.
> > > > >>
> > > > >> On Sat, Sep 19, 2020 at 3:38 PM Itaru Kitayama <
> itaru.kitayama at gmail.com> wrote:
> > > > >>> While it was observed on the ThunderX2 system with V100s, I don't
> > > > >>> see it on JURECA (with GPUs).
> > > > >>>
> > > > >>> On Sat, Sep 19, 2020 at 1:21 PM Johannes Doerfert
> > > > >>> <johannesdoerfert at gmail.com> wrote:
> > > > >>>> I don't think so.
> > > > >>>> The only thing that comes to mind is that we switched to `abort`
> > > > >>>> instead of `exit` after the fatal error message.
> > > > >>>> Though, I'm not sure why that would cause the program to hang,
> > > > >>>> except if SIGABRT is somehow caught.
> > > > >>>>
> > > > >>>> ~ Johannes
> > > > >>>>
> > > > >>>> On 9/18/20 9:35 PM, Itaru Kitayama via Openmp-dev wrote:
> > > > >>>>> [...]
> > > > >>>>> Libomptarget error: Failed to synchronize device.
> > > > >>>>> Libomptarget error: Call to targetDataEnd failed, abort target.
> > > > >>>>> Libomptarget error: Failed to process data after launching the kernel.
> > > > >>>>> Libomptarget error: run with env LIBOMPTARGET_INFO>1 to dump
> > > > >>>>> host-target pointer maps
> > > > >>>>> Libomptarget fatal error 1: failure of target construct while
> > > > >>>>> offloading is mandatory
> > > > >>>>>
> > > > >>>>> After this point, the process becomes unresponsive and doesn't
> > > > >>>>> receive signals from the user. Is this due to a new feature of
> > > > >>>>> LLVM?
> > > > >>>>> _______________________________________________
> > > > >>>>> Openmp-dev mailing list
> > > > >>>>> Openmp-dev at lists.llvm.org
> > > > >>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev