[Openmp-dev] Libomptarget fatal error 1: '#pragma omp requires unified_shared_memory' not used consistently!

Itaru Kitayama via Openmp-dev openmp-dev at lists.llvm.org
Tue Mar 23 22:34:06 PDT 2021


deviceQuery result:

```

Device 0: "Quadro RTX 8000"
  CUDA Driver Version / Runtime Version          11.0 / 11.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 45556 MBytes
(47768928256 bytes)
  (72) Multiprocessors, ( 64) CUDA Cores/MP:     4608 CUDA Cores
  GPU Max Clock rate:                            1620 MHz (1.62 GHz)
  Memory Clock rate:                             6501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072),
2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

```
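
As a side note, here is an untested sketch of a more direct programmatic check
(my interpretation of the CUDA attributes; cudaDevAttrPageableMemoryAccess and
cudaDevAttrPageableMemoryAccessUsesHostPageTables are what I believe indicate
OS-backed access to ordinary host memory, beyond the managed-memory line in the
deviceQuery listing above):

```
// sketch: query CUDA device attributes related to managed memory and
// OS-backed (pageable) memory access on device 0
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int managed = 0, pageable = 0, hostTables = 0;
  cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, 0);
  cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0);
  cudaDeviceGetAttribute(&hostTables,
                         cudaDevAttrPageableMemoryAccessUsesHostPageTables, 0);
  printf("managed: %d, pageable access: %d, via host page tables: %d\n",
         managed, pageable, hostTables);
  return 0;
}
```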

On Wed, Mar 24, 2021 at 1:58 PM Itaru Kitayama <itaru.kitayama at gmail.com> wrote:
>
> Hi Ye,
> Do you happen to remember whether the RHEL deployed for the POWER9+V100
> system was version 7? From my reading, Red Hat provides HMM as a
> technical preview in RHEL 7.
>
> The JSC machine I'm using is JURECA, with A100 GPUs, backed by CentOS
> 8.2.2004 to be specific, but as a user I am not certain how to check
> whether HMM is enabled or not.
>
> Could you help me clarify things further? Thanks!
>
> Itaru.
>
> On Tue, Mar 2, 2021 at 11:53 PM Ye Luo <xw111luoye at gmail.com> wrote:
> >
> > Explicit (software) USM requires explicitly invoking allocators like CUDA managed memory.
> > Implicit (OS-backed) USM doesn't have such restrictions: memory allocated by malloc on the host can be accessed directly on the device. P9+V100 is the only architecture I have seen work in this category.
> >
> > In the test case, "int a[N]" is stack memory and is not explicitly allocated as CUDA managed memory. So the invalid memory access is a legitimate CUDA error indicating this situation on a machine without implicit USM support.
> >
> > "#pragma omp require unified_shared_memory" doesn't express which level of USM support is required. Then to me it means require USM up to the support on that machine and application developer needs to be aware of that. Beyond that is just undefined behavior.
> >
> > If Itaru wants code like the test case to work, he needs a machine with implicit USM support.
> > Otherwise, he has to avoid OS allocation and call CUDA managed malloc explicitly, or not use USM at all. I don't think stack allocation can be handled in the explicit USM scenario.
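> >
> > For illustration, a rough and untested sketch of that explicit route
> > (replacing the stack array with CUDA managed memory; how libomptarget
> > treats the managed pointer under the requires pragma is my assumption)
> > could look like this:
> >
> > // sketch: allocate through cudaMallocManaged so the same pointer is
> > // valid on both host and device (explicit/software USM)
> > #include <cstdio>
> > #include <cuda_runtime.h>
> >
> > #pragma omp requires unified_shared_memory
> > #define N 1024
> >
> > int main() {
> >   int *a = nullptr;
> >   cudaMallocManaged(&a, N * sizeof(int)); // managed memory, not stack/malloc
> > #pragma omp target                        // no map clause needed under USM
> >   for (int i = 0; i < N; i++)
> >     a[i] = 123;
> >   printf("a[0]=%d a[N-1]=%d\n", a[0], a[N - 1]);
> >   cudaFree(a);
> > }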
> >
> > There may be improvements that could be made in libomptarget to add more protection, but that doesn't help make the code work in this case.
> >
> > Ye
> > ===================
> > Ye Luo, Ph.D.
> > Computational Science Division & Leadership Computing Facility
> > Argonne National Laboratory
> >
> >
> > On Tue, Mar 2, 2021 at 1:47 AM Joachim Protze <protze.joachim at gmail.com> wrote:
> >>
> >> No!
> >>
> >> What you suggest (that Itaru check whether the system supports "whatever?!")
> >> would be necessary for a fictitious:
> >>
> >> #pragma omp assume unified_shared_memory
> >>
> >> but certainly not for
> >>
> >> #pragma omp requires unified_shared_memory
> >>
> >> The OpenMP implementation must support the required property or must
> >> state that it is not supported. Causing an error during the execution of
> >> a target region is not compliant behavior.
> >>
> >> As Johannes suggested, managed allocators might make it possible to provide
> >> software USM when hardware USM is not available. In any case, compiler
> >> and runtime together need to make sure that the requirements for
> >> supporting USM are met, or abort otherwise.
> >>
> >> BTW: will the managed allocators also change the behavior for stack
> >> allocations? This would be necessary for compliant USM support.
> >>
> >> - Joachim
> >>
> >> On 01.03.21 at 23:49, Ye Luo via Openmp-dev wrote:
> >> > Are you sure your machine/OS supports USM? Summit-like nodes (P9+V100) are
> >> > the only architecture with an NVIDIA GPU that supports USM that I'm aware of.
> >>
> >> As you can read on the page I linked earlier, Kepler already supports
> >> USM, just not in hardware.
> >>
> >> > Ye
> >> > ===================
> >> > Ye Luo, Ph.D.
> >> > Computational Science Division & Leadership Computing Facility
> >> > Argonne National Laboratory
> >> >
> >> >
> >> > On Mon, Mar 1, 2021 at 4:35 PM Itaru Kitayama via Openmp-dev <
> >> > openmp-dev at lists.llvm.org> wrote:
> >> >
> >> >> I’m on JURECA and some nodes are attached to A100 GPUs.
> >> >>
> >> >> On Tue, Mar 2, 2021 at 7:34 Itaru Kitayama <itaru.kitayama at gmail.com>
> >> >> wrote:
> >> >>
> >> >>> Hi all,
> >> >>> In the meantime, what do I do?
> >> >>>
> >> >>> On Tue, Mar 2, 2021 at 3:23 Johannes Doerfert <johannesdoerfert at gmail.com>
> >> >>> wrote:
> >> >>>
> >> >>>> I think that is it. I have heard of problems with our USM before.
> >> >>>> We need to use the managed allocators if USM is active; they are
> >> >>>> about to be upstreamed (I hope).
> >> >>>>
> >> >>>>
> >> >>>> On 3/1/21 12:15 PM, Alexey.Bataev wrote:
> >> >>>>> Looks like this example is for Explicit USM, and I assume that if you
> >> >>>>> allocate the memory for `a` in managed memory explicitly, the OpenMP
> >> >>>>> example should also work.
> >> >>>>>
> >> >>>>> There are other USM modes, though, where the memory is shared implicitly
> >> >>>>> between the host and the devices. It looks like the current LLVM
> >> >>>>> implementation relies on the system allocator support described here,
> >> >>>>> i.e. implicit USM:
> >> >>>>> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-system-allocator
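> >> >>>>>
> >> >>>>> As a hypothetical illustration of that system-allocator mode (untested
> >> >>>>> sketch; it is only expected to work where the OS and driver provide
> >> >>>>> implicit USM, e.g. P9+V100 or an HMM-enabled Linux kernel), a plain host
> >> >>>>> allocation would be usable directly on the device:
> >> >>>>>
> >> >>>>> // sketch: no map clause and no managed allocation; relies entirely on
> >> >>>>> // OS-backed (implicit) unified shared memory
> >> >>>>> #include <cstdio>
> >> >>>>> #include <cstdlib>
> >> >>>>>
> >> >>>>> #pragma omp requires unified_shared_memory
> >> >>>>>
> >> >>>>> int main() {
> >> >>>>>   int *data = (int *)malloc(1024 * sizeof(int)); // ordinary malloc
> >> >>>>> #pragma omp target
> >> >>>>>   for (int i = 0; i < 1024; i++)
> >> >>>>>     data[i] = i;
> >> >>>>>   printf("data[42]=%d\n", data[42]);
> >> >>>>>   free(data);
> >> >>>>> }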
> >> >>>>>
> >> >>>>> -------------
> >> >>>>> Best regards,
> >> >>>>> Alexey Bataev
> >> >>>>>
> >> >>>>> On 3/1/2021 1:04 PM, Joachim Protze wrote:
> >> >>>>>> Are the kernel/hardware requirements LLVM-specific?
> >> >>>>>>
> >> >>>>>> I can compile and execute the add_grid.cu example successfully:
> >> >>>>>> https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
> >> >>>>>>
> >> >>>>>> So, I would expect that an OpenMP program should also run successfully.
> >> >>>>>>
> >> >>>>>> - Joachim
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On 01.03.21 at 18:49, Alexey.Bataev wrote:
> >> >>>>>>> Hi, are you sure that your system supports Unified Shared Memory? As far
> >> >>>>>>> as I know, it requires a special Linux kernel and the hardware must
> >> >>>>>>> support it. If your system does not support it, the code will crash for
> >> >>>>>>> sure at runtime.
> >> >>>>>>>
> >> >>>>>>> In this mode, IIRC, we just ignore map clauses since the accelerator
> >> >>>>>>> devices can access the host memory directly without the need to
> >> >>>>>>> allocate device-specific memory.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> -------------
> >> >>>>>>> Best regards,
> >> >>>>>>> Alexey Bataev
> >> >>>>>>>
> >> >>>>>>> On 3/1/2021 12:41 PM, Joachim Protze wrote:
> >> >>>>>>>> Hi all,
> >> >>>>>>>>
> >> >>>>>>>> Even a simpler example segfaults when the requires directive is there:
> >> >>>>>>>>
> >> >>>>>>>> #include <iostream>
> >> >>>>>>>> #include <omp.h>
> >> >>>>>>>> #include <stdio.h>
> >> >>>>>>>>
> >> >>>>>>>> #pragma omp requires unified_shared_memory
> >> >>>>>>>> #define N 1024
> >> >>>>>>>>
> >> >>>>>>>> int main() {
> >> >>>>>>>>    int a[N];
> >> >>>>>>>>    printf("a=%p\n", a);
> >> >>>>>>>> #pragma omp target map(tofrom : a[0:N])
> >> >>>>>>>>    {
> >> >>>>>>>>      printf("a=%p\n", a);
> >> >>>>>>>>      for (int i = 0; i < 1024; i++) {
> >> >>>>>>>>        a[i] = 123;
> >> >>>>>>>>      }
> >> >>>>>>>>    }
> >> >>>>>>>>    printf("a[0]=%i, a[%i]=%i\n", a[0], N/2, a[N/2]);
> >> >>>>>>>> }
> >> >>>>>>>>
> >> >>>>>>>> The code runs successfully when the requires directive is removed,
> >> >>>>>>>> because the mapping of `a` is explicitly specified.
> >> >>>>>>>>
> >> >>>>>>>> For this code to run successfully, would it be necessary to allocate `a`
> >> >>>>>>>> specially as CUDA managed memory? I don't see any special treatment of
> >> >>>>>>>> `a` in the LLVM IR. As I understand the OpenMP spec, the requires
> >> >>>>>>>> directive should lead to a compile error if clang fails to generate such
> >> >>>>>>>> code.
> >> >>>>>>>>
> >> >>>>>>>> The requires example from the OpenMP Examples also fails with the same
> >> >>>>>>>> runtime error:
> >> >>>>>>>>
> >> >>>>>>>> https://github.com/OpenMP/Examples/blob/main/sources/Example_requires.1.cpp
> >> >>>>>>>>
> >> >>>>>>>> - Joachim
> >> >>>>>>>>
> >> >>>>>>>> On 28.02.21 at 11:12, Itaru Kitayama via Openmp-dev wrote:
> >> >>>>>>>>> This is the code:
> >> >>>>>>>>>
> >> >>>>>>>>> #include <iostream>
> >> >>>>>>>>> #include <omp.h>
> >> >>>>>>>>>
> >> >>>>>>>>> #pragma omp requires unified_shared_memory
> >> >>>>>>>>> #define N 1024
> >> >>>>>>>>>
> >> >>>>>>>>> int main() {
> >> >>>>>>>>>    int a[N] = {0};
> >> >>>>>>>>>    int *device_data =  new int[N];
> >> >>>>>>>>> #pragma omp target map(tofrom : device_data[0:N])
> >> >>>>>>>>>    {
> >> >>>>>>>>>      device_data = &a[0];
> >> >>>>>>>>>      for (int i = 0; i < 1024; i++) {
> >> >>>>>>>>>        device_data[i] = 123;
> >> >>>>>>>>>      }
> >> >>>>>>>>>    }
> >> >>>>>>>>>    std::cout << a[0] << std::endl;
> >> >>>>>>>>> }
> >> >>>>>>>>>
> >> >>>>>>>>> On Sun, Feb 28, 2021 at 1:34 PM Johannes Doerfert
> >> >>>>>>>>> <johannesdoerfert at gmail.com> wrote:
> >> >>>>>>>>>> You have an illegal memory access; some memory is not properly mapped.
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> On 2/27/21 7:47 PM, Itaru Kitayama wrote:
> >> >>>>>>>>>>> Removed the internal function, but I get:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> CUDA device 0 info: Device supports up to 65536 CUDA blocks and 1024 threads with a warp size of 32
> >> >>>>>>>>>>> CUDA device 0 info: Launching kernel __omp_offloading_34_8009dd23_main_l12 with 1 blocks and 33 threads in Generic mode
> >> >>>>>>>>>>> CUDA error: Error when synchronizing stream. stream = 0x0000000001d22ae0, async info ptr = 0x00007ffe73ea2728
> >> >>>>>>>>>>> CUDA error: an illegal memory access was encountered
> >> >>>>>>>>>>> Libomptarget error: Failed to synchronize device.
> >> >>>>>>>>>>> Libomptarget error: Call to targetDataEnd failed, abort target.
> >> >>>>>>>>>>> Libomptarget error: Failed to process data after launching the kernel.
> >> >>>>>>>>>>> Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
> >> >>>>>>>>>>> Libomptarget fatal error 1: failure of target construct while offloading is mandatory
> >> >>>>>>>>>>> /var/spool/parastation/jobs/8941317: line 23: 20812 Aborted (core dumped) ./a.out
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> On Sun, Feb 28, 2021 at 10:35 AM Alexey Bataev <a.bataev at hotmail.com> wrote:
> >> >>>>>>>>>>>> Do not call __tgt_register_requires directly; it is an internal function
> >> >>>>>>>>>>>> called by a global constructor, and its argument value depends on
> >> >>>>>>>>>>>> #pragma omp requires. Just use the pragma.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Best regards,
> >> >>>>>>>>>>>> Alexey Bataev
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>> On Feb 27, 2021, at 20:28, Itaru Kitayama via Openmp-dev <openmp-dev at lists.llvm.org> wrote:
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> I'm trying to build a test C++ code that uses part of
> >> >>>>>>>>>>>>> unified_shared_memory/shared_update.c
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> On Sun, Feb 28, 2021 at 10:25 AM Johannes Doerfert
> >> >>>>>>>>>>>>>> <johannesdoerfert at gmail.com> wrote:
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> I don't see this test, nor do I understand what you are trying to say.
> >> >>>>>>>>>>>>>> Is the test failing? If so, which test is this?
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> ~ Johannes
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> On 2/27/21 7:17 PM, Itaru Kitayama via Openmp-dev wrote:
> >> >>>>>>>>>>>>>>> The C++ code below builds, but the executable fails at runtime.
> >> >>>>>>>>>>>>>>> (It is taken from the C code under the libomptarget subdirectory's test
> >> >>>>>>>>>>>>>>> directory.)
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> #include <omp.h>
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> #pragma omp requires unified_shared_memory
> >> >>>>>>>>>>>>>>> #define N 1024
> >> >>>>>>>>>>>>>>> extern "C" void __tgt_register_requires(int64_t);
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> int main() {
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>     int a[N] = {0};
> >> >>>>>>>>>>>>>>>     int b[N] = {0};
> >> >>>>>>>>>>>>>>>     int *device_data;
> >> >>>>>>>>>>>>>>>     __tgt_register_requires(1);
> >> >>>>>>>>>>>>>>> #pragma omp target map(tofrom : device_data)
> >> >>>>>>>>>>>>>>>     {
> >> >>>>>>>>>>>>>>>       device_data = &a[0];
> >> >>>>>>>>>>>>>>>       for (int i = 0; i < 1024; i++) {
> >> >>>>>>>>>>>>>>>         a[i] += 1;
> >> >>>>>>>>>>>>>>>       }
> >> >>>>>>>>>>>>>>>     }
> >> >>>>>>>>>>>>>>> }
> >> >>>>>>>>>>>>>>> _______________________________________________
> >> >>>>>>>>>>>>>>> Openmp-dev mailing list
> >> >>>>>>>>>>>>>>> Openmp-dev at lists.llvm.org
> >> >>>>>>>>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
> >> >>>>>>>>>>>>> _______________________________________________
> >> >>>>>>>>>>>>> Openmp-dev mailing list
> >> >>>>>>>>>>>>> Openmp-dev at lists.llvm.org
> >> >>>>>>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
> >> >>>>>>>>> _______________________________________________
> >> >>>>>>>>> Openmp-dev mailing list
> >> >>>>>>>>> Openmp-dev at lists.llvm.org
> >> >>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
> >> >>>>>>>>>
> >> >>>>>
> >> >>>>
> >> >>> _______________________________________________
> >> >> Openmp-dev mailing list
> >> >> Openmp-dev at lists.llvm.org
> >> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
> >> >>
> >> >
> >> >
> >> > _______________________________________________
> >> > Openmp-dev mailing list
> >> > Openmp-dev at lists.llvm.org
> >> > https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
> >> >
> >>


More information about the Openmp-dev mailing list