[Openmp-dev] [EXTERNAL] Re: OpenMP offloading app gets in unresponsive

Wed Sep 23 21:12:21 PDT 2020

Looks like it. When I run the app with the same set of modules that were
used for building Clang, it stops correctly after the assertion fires.
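
For reference, the empty "omp target" sanity check Ye suggested below could
look something like this. It is only a minimal sketch, not code from my app:
the function name target_ran_on_host is made up for illustration, and the
_OPENMP guards are there only so the file still compiles when OpenMP is
switched off.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: runs one nearly empty target region and reports
   where it executed. Returns 1 if the region ran on the host (or OpenMP
   is disabled), 0 if it ran on an offload device. */
int target_ran_on_host(void) {
    int on_host = 1;
#pragma omp target map(tofrom: on_host)
    {
#ifdef _OPENMP
        /* Standard OpenMP query: nonzero when executing on the host. */
        on_host = omp_is_initial_device();
#endif
    }
    return on_host;
}
```

Calling this from main and printing the result, with the app's usual
offloading flags (for Clang and an NVIDIA GPU, something like -fopenmp
-fopenmp-targets=nvptx64-nvidia-cuda), should show whether even a trivial
region reaches the device or hangs during device initialization.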

On Thu, Sep 24, 2020 at 12:33 PM Ye Luo <xw111luoye at gmail.com> wrote:
>
> I'm not aware of a way to find what header it was.
> I think it is worth trying to use the same CUDA toolkit for building clang+libomptarget and your app.
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
>
> On Wed, Sep 23, 2020 at 10:00 PM Itaru Kitayama <itaru.kitayama at gmail.com> wrote:
>>
>> Hi Ye,
>> How do I check the header consistency in libomptarget?
>>
>> On Thu, Sep 24, 2020 at 11:48 AM Ye Luo <xw111luoye at gmail.com> wrote:
>> >
>> > 1. Please show full call stack.
>> > 2. Are you able to run a very simple OpenMP code, such as just an empty "omp target"?
>> > 3. My current feeling is that when you build libomptarget plugins, the cuda.h may not be consistent with /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1
>> > Ye
>> > ===================
>> > Ye Luo, Ph.D.
>> > Computational Science Division & Leadership Computing Facility
>> > Argonne National Laboratory
>> >
>> >
>> > On Wed, Sep 23, 2020 at 9:22 PM Itaru Kitayama via Openmp-dev <openmp-dev at lists.llvm.org> wrote:
>> >>
>> >> If I run it with CUDA-gdb I get:
>> >>
>> >> Target CUDA RTL --> Init requires flags to 1
>> >> Target CUDA RTL --> Getting device 0
>> >> Target CUDA RTL --> The primary context is inactive, set its flags to
>> >> CU_CTX_SCHED_BLOCKING_SYNC
>> >> [New Thread 0x2aaaae5e3700 (LWP 4154)]
>> >> ^C
>> >> Thread 1 "nest" received signal SIGINT, Interrupt.
>> >> 0x00002aaaad2e5a1c in cuVDPAUCtxCreate ()
>> >>    from /usr/local/software/jureca/Stages/2019a/software/nvidia/driver/lib64/libcuda.so.1
>> >>
>> >> On Thu, Sep 24, 2020 at 8:56 AM Itaru Kitayama <itaru.kitayama at gmail.com> wrote:
>> >> >
>> >> > With trunk Clang running with CUDA Toolkit 10.1.105 on JURECA at
>> >> > JSC, I started seeing a hang:
>> >> >
>> >> > Libomptarget --> Call to omp_get_num_devices returning 1
>> >> > Libomptarget --> Default TARGET OFFLOAD policy is now mandatory
>> >> > (devices were found)
>> >> > Libomptarget --> Entering data begin region for device -1 with 1 mappings
>> >> > Libomptarget --> Use default device id 0
>> >> > Libomptarget --> Checking whether device 0 is ready.
>> >> > Libomptarget --> Is the device 0 (local ID 0) initialized? 0
>> >> > Target CUDA RTL --> Init requires flags to 1
>> >> > Target CUDA RTL --> Getting device 0
>> >> > Target CUDA RTL --> The primary context is inactive, set its flags to
>> >> > CU_CTX_SCHED_BLOCKING_SYNC
>> >> >
>> >> > Getting back to the prompt takes a long time, or I need to hit Ctrl+C
>> >> > or Ctrl+Z many times.
>> >> >
>> >> > On Thu, Sep 24, 2020 at 7:31 AM Itaru Kitayama <itaru.kitayama at gmail.com> wrote:
>> >> > >
>> >> > > Should I back off from my ThunderX2 while a fix is being developed?
>> >> > >
>> >> > > On Thu, Sep 24, 2020 at 5:36 AM Johannes Doerfert
>> >> > > <johannesdoerfert at gmail.com> wrote:
>> >> > > >
>> >> > > > This could be a side effect of something else, namely the runtime
>> >> > > > unloading order.
>> >> > > > @Jon @Shilei where are we with fixing those issues?
>> >> > > >
>> >> > > > On 9/23/20 2:58 AM, Itaru Kitayama wrote:
>> >> > > > > I think I was running my offloading app with the CUDA Toolkit that
>> >> > > > > I've loaded via Spack, but
>> >> > > > > the app itself is built with Clang (plus the CUDA Toolkit that the
>> >> > > > > local admin provides via modules).
>> >> > > > >
>> >> > > > > But is the effect really this drastic? I mean, totally locking up a
>> >> > > > > ThunderX2 node?
>> >> > > > >
>> >> > > > > On Mon, Sep 21, 2020 at 9:58 PM Huber, Joseph <huberjn at ornl.gov> wrote:
>> >> > > > >> The runtime library just calls abort() immediately after printing that last "Failure while offloading was mandatory" message. I'm not sure what would be causing the process to hang after that if SIGABRT isn't being caught.
>> >> > > > >> ________________________________
>> >> > > > >> From: Itaru Kitayama <itaru.kitayama at gmail.com>
>> >> > > > >> Sent: Saturday, September 19, 2020 2:52 AM
>> >> > > > >> To: Johannes Doerfert <johannesdoerfert at gmail.com>
>> >> > > > >> Cc: openmp-dev <openmp-dev at lists.llvm.org>; Huber, Joseph <huberjn at ornl.gov>
>> >> > > > >> Subject: [EXTERNAL] Re: [Openmp-dev] OpenMP offloading app gets in unresponsive
>> >> > > > >>
>> >> > > > >> I mean, the kernel gets aborted and I see a session prompt on JURECA at JSC.
>> >> > > > >>
>> >> > > > >> On Sat, Sep 19, 2020 at 3:38 PM Itaru Kitayama <itaru.kitayama at gmail.com> wrote:
>> >> > > > >>> While it was observed on ThunderX2 with V100 system, I don't see it on
>> >> > > > >>> JURECA (with GPUs).
>> >> > > > >>>
>> >> > > > >>> On Sat, Sep 19, 2020 at 1:21 PM Johannes Doerfert
>> >> > > > >>> <johannesdoerfert at gmail.com> wrote:
>> >> > > > >>>> I don't think so.
>> >> > > > >>>> The only thing that comes to mind is that we switched to `abort` instead
>> >> > > > >>>> of `exit` after the fatal error message.
>> >> > > > >>>> Though, I'm not sure why that would cause the program to hang, except if
>> >> > > > >>>> SIGABRT is somehow caught.
>> >> > > > >>>>
>> >> > > > >>>> ~ Johannes
>> >> > > > >>>>
>> >> > > > >>>> On 9/18/20 9:35 PM, Itaru Kitayama via Openmp-dev wrote:
>> >> > > > >>>>> [...]
>> >> > > > >>>>> Libomptarget error: Failed to synchronize device.
>> >> > > > >>>>> Libomptarget error: Call to targetDataEnd failed, abort target.
>> >> > > > >>>>> Libomptarget error: Failed to process data after launching the kernel.
>> >> > > > >>>>> Libomptarget error: run with env LIBOMPTARGET_INFO>1 to dump
>> >> > > > >>>>> host-targetpointer maps
>> >> > > > >>>>> Libomptarget fatal error 1: failure of target construct while
>> >> > > > >>>>> offloading is mandatory
>> >> > > > >>>>>
>> >> > > > >>>>> After this point, the process becomes unresponsive and
>> >> > > > >>>>> doesn't respond to signals from the user. Is this due to a new
>> >> > > > >>>>> feature of LLVM?
>> >> > > > >>>>> _______________________________________________
>> >> > > > >>>>> Openmp-dev mailing list
>> >> > > > >>>>> Openmp-dev at lists.llvm.org
>> >> > > > >>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev