[Openmp-dev] Libomptarget fatal error 1: '#pragma omp requires unified_shared_memory' not used consistently!

Ye Luo via Openmp-dev openmp-dev at lists.llvm.org
Tue Mar 2 07:05:09 PST 2021


Agreed. Such features can help provide a consistent way of using USM in OpenMP.
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory


On Tue, Mar 2, 2021 at 8:58 AM Alexey.Bataev <a.bataev at outlook.com> wrote:

> Ye, generally speaking, stack allocations can be replaced by managed memory
> allocations. We have very similar functionality for #pragma omp allocate,
> where a static alloca can be replaced by __kmpc_alloc calls. But, most
> probably, it would be good to extend #pragma omp allocate to add managed
> memory support; otherwise it would be hard to implement. Alternatively, it
> could be implemented in the backend by adding an analysis/transformation of
> static allocas into managed memory allocation calls if it is proven that the
> variable is supposed to be used in USM mode.
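>
> A rough sketch of what such an extension could look like at the source
> level (the omp_managed_mem_alloc allocator name here is hypothetical, not
> something the current implementation provides):
>
>   #include <omp.h>
>   #pragma omp requires unified_shared_memory
>   #define N 1024
>
>   int main() {
>     int a[N];
>     // Hypothetical: ask the compiler to place the "stack" array in managed
>     // memory instead of emitting a plain static alloca for it.
>   #pragma omp allocate(a) allocator(omp_managed_mem_alloc)
>   #pragma omp target
>     for (int i = 0; i < N; i++)
>       a[i] = 123;
>     return a[0] == 123 ? 0 : 1;
>   }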
>
> -------------
> Best regards,
> Alexey Bataev
>
> On 3/2/2021 9:53 AM, Ye Luo wrote:
>
> Explicit (software) USM requires explicitly invoking allocators such as CUDA
> managed memory.
> Implicit (OS-backed) USM doesn't have such restrictions, and memory
> allocated by malloc on the host can be directly accessed on the device.
> P9+V100 is the only architecture I have seen that works in this category.
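>
> To make the distinction concrete, a minimal CUDA sketch (illustration only,
> not taken from the test case):
>
>   #include <cuda_runtime.h>
>   #include <cstdlib>
>
>   int main() {
>     // Explicit (software) USM: only memory obtained from cudaMallocManaged
>     // is accessible from both host and device code.
>     int *p = nullptr;
>     cudaMallocManaged(&p, 1024 * sizeof(int));
>
>     // Implicit (OS-backed) USM, e.g. P9+V100 with ATS: an ordinary host
>     // allocation is device-accessible as well; without it, dereferencing
>     // this pointer in device code is an illegal memory access.
>     int *q = static_cast<int *>(std::malloc(1024 * sizeof(int)));
>
>     cudaFree(p);
>     std::free(q);
>     return 0;
>   }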
>
> In the test case, "int a[N]" is stack memory and is not explicitly allocated
> as CUDA managed memory. So an invalid memory access is a legitimate CUDA error
> indicating this situation on a machine without implicit USM support.
>
> "#pragma omp require unified_shared_memory" doesn't express which level of
> USM support is required. Then to me it means require USM up to the support
> on that machine and application developer needs to be aware of that. Beyond
> that is just undefined behavior.
>
> If Itaru wants code like the test case to work, he needs a machine with
> implicit USM support.
> Otherwise, he has to avoid OS allocation and call CUDA managed malloc
> explicitly, or not use USM at all. I don't think stack allocations can be
> handled in the explicit USM scenario.
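>
> For example, a minimal sketch of the explicit route (assuming the file is
> compiled as an OpenMP offload program and linked against the CUDA runtime):
>
>   #include <cuda_runtime.h>
>   #include <cstdio>
>
>   #pragma omp requires unified_shared_memory
>   #define N 1024
>
>   int main() {
>     int *a = nullptr;
>     cudaMallocManaged(&a, N * sizeof(int)); // managed memory instead of the stack array
>   #pragma omp target
>     for (int i = 0; i < N; i++)
>       a[i] = 123;
>     std::printf("a[0]=%d\n", a[0]);
>     cudaFree(a);
>     return 0;
>   }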
>
> There may be improvements that could be made in libomptarget to add more
> protection against this situation. But that doesn't help make the code work
> in this case.
>
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
>
> On Tue, Mar 2, 2021 at 1:47 AM Joachim Protze <protze.joachim at gmail.com>
> wrote:
>
>> No!
>>
>> What you suggest (that Itaru check whether the system supports "whatever?!")
>> would be necessary for a fictitious:
>>
>> #pragma omp assume unified_shared_memory
>>
>> but certainly not for
>>
>> #pragma omp requires unified_shared_memory
>>
>> The OpenMP implementation must support the required property or must
>> state that it is not supported. Causing an error during the execution of
>> a target region is not compliant behavior.
>>
>> As Johannes suggested, managed allocators might make it possible to provide
>> software USM when hardware USM is not available. In any case, the compiler
>> and runtime together need to make sure that the requirements for
>> supporting USM are met, or abort otherwise.
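>>
>> For instance, a runtime could perform a capability check along these lines
>> (a minimal CUDA sketch; where exactly libomptarget would hook this in is an
>> assumption on my side):
>>
>>   #include <cuda_runtime.h>
>>   #include <cstdio>
>>   #include <cstdlib>
>>
>>   // Abort early if the device can satisfy neither managed-memory (software)
>>   // USM nor pageable-memory-access (hardware) USM.
>>   static void check_usm_support(int dev) {
>>     int managed = 0, pageable = 0;
>>     cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, dev);
>>     cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, dev);
>>     if (!managed && !pageable) {
>>       std::fprintf(stderr, "device %d: unified_shared_memory not supported\n", dev);
>>       std::abort();
>>     }
>>   }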
>>
>> BTW: will the managed allocators also change the behavior for stack
>> allocations? This would be necessary for compliant USM support.
>>
>> - Joachim
>>
>> On 01.03.21 at 23:49, Ye Luo via Openmp-dev wrote:
>> > Are you sure your machine/OS supports USM? Summit-like nodes (P9+V100) are
>> > the only architecture with NVIDIA GPUs that supports USM that I'm aware of.
>>
>> As you can read on the page I linked earlier, Kepler already supports
>> USM, just not in hardware.
>>
>> > Ye
>> > ===================
>> > Ye Luo, Ph.D.
>> > Computational Science Division & Leadership Computing Facility
>> > Argonne National Laboratory
>> >
>> >
>> > On Mon, Mar 1, 2021 at 4:35 PM Itaru Kitayama via Openmp-dev <
>> > openmp-dev at lists.llvm.org> wrote:
>> >
>> >> I’m on JURECA and some nodes are attached to A100 GPUs.
>> >>
>> >> On Tue, Mar 2, 2021 at 7:34 Itaru Kitayama <itaru.kitayama at gmail.com>
>> >> wrote:
>> >>
>> >>> Hi all,
>> >>> In the meantime, what do I do?
>> >>>
>> >>> On Tue, Mar 2, 2021 at 3:23 Johannes Doerfert <johannesdoerfert at gmail.com>
>> >>> wrote:
>> >>>
>> >>>> I think that is it. I've heard of problems with our USM before.
>> >>>> We need to use the managed allocators if USM is active; they are
>> >>>> about to be upstreamed (I hope).
>> >>>>
>> >>>>
>> >>>> On 3/1/21 12:15 PM, Alexey.Bataev wrote:
>> >>>>> Looks like this example is for Explicit USM, and I assume that if you
>> >>>>> allocate the memory for `a` in managed memory explicitly, the OpenMP
>> >>>>> example should also work.
>> >>>>>
>> >>>>> There are other USM modes though, where the memory is shared implicitly
>> >>>>> between the host and the devices. It looks like the current LLVM
>> >>>>> implementation relies on this, where Implicit USM is supported:
>> >>>>>
>> >>>>> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-system-allocator
>> >>>>>
>> >>>>> -------------
>> >>>>> Best regards,
>> >>>>> Alexey Bataev
>> >>>>>
>> >>>>> On 3/1/2021 1:04 PM, Joachim Protze wrote:
>> >>>>>> Are the kernel/hardware requirements LLVM-specific?
>> >>>>>>
>> >>>>>> I can compile and execute the add_grid.cu example successfully:
>> >>>>>> https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
>> >>>>>>
>> >>>>>> So, I would expect that an OpenMP program should also run successfully.
>> >>>>>>
>> >>>>>> - Joachim
>> >>>>>>
>> >>>>>>
>> >>>>>> On 01.03.21 at 18:49, Alexey.Bataev wrote:
>> >>>>>>> Hi, are you sure that your system supports Unified Shared Memory? As
>> >>>>>>> far as I know, it requires a special Linux kernel, and the hardware
>> >>>>>>> must support it. If your system does not support it, the code will
>> >>>>>>> crash for sure at runtime.
>> >>>>>>>
>> >>>>>>> In this mode, IIRC, we just ignore map clauses since the accelerator
>> >>>>>>> devices can access the host memory directly without the need to
>> >>>>>>> allocate device-specific memory.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> -------------
>> >>>>>>> Best regards,
>> >>>>>>> Alexey Bataev
>> >>>>>>>
>> >>>>>>> On 3/1/2021 12:41 PM, Joachim Protze wrote:
>> >>>>>>>> Hi all,
>> >>>>>>>>
>> >>>>>>>> even a simpler example segfaults when the requires directive is there:
>> >>>>>>>>
>> >>>>>>>> #include <iostream>
>> >>>>>>>> #include <omp.h>
>> >>>>>>>> #include <stdio.h>
>> >>>>>>>>
>> >>>>>>>> #pragma omp requires unified_shared_memory
>> >>>>>>>> #define N 1024
>> >>>>>>>>
>> >>>>>>>> int main() {
>> >>>>>>>>    int a[N];
>> >>>>>>>>    printf("a=%p\n", (void *)a);
>> >>>>>>>> #pragma omp target map(tofrom : a[0:N])
>> >>>>>>>>    {
>> >>>>>>>>      printf("a=%p\n", (void *)a);
>> >>>>>>>>      for (int i = 0; i < 1024; i++) {
>> >>>>>>>>        a[i] = 123;
>> >>>>>>>>      }
>> >>>>>>>>    }
>> >>>>>>>>    printf("a[0]=%i, a[%i]=%i\n", a[0], N/2, a[N/2]);
>> >>>>>>>> }
>> >>>>>>>>
>> >>>>>>>> The code runs successfully when the requires directive is removed,
>> >>>>>>>> because the mapping of `a` is explicitly specified.
>> >>>>>>>>
>> >>>>>>>> For this code to run successfully, would it be necessary to allocate
>> >>>>>>>> `a` specially as CUDA managed memory? I don't see any special
>> >>>>>>>> treatment of `a` in the LLVM IR. As I understand the OpenMP spec, the
>> >>>>>>>> requires directive should lead to a compile error if clang fails to
>> >>>>>>>> generate such code.
>> >>>>>>>>
>> >>>>>>>> The requires example from the OpenMP Examples also fails with the
>> >>>>>>>> same runtime error:
>> >>>>>>>>
>> >>>>>>>> https://github.com/OpenMP/Examples/blob/main/sources/Example_requires.1.cpp
>> >>>>>>>>
>> >>>>>>>> - Joachim
>> >>>>>>>>
>> >>>>>>>> On 28.02.21 at 11:12, Itaru Kitayama via Openmp-dev wrote:
>> >>>>>>>>> This is the code:
>> >>>>>>>>>
>> >>>>>>>>> #include <iostream>
>> >>>>>>>>> #include <omp.h>
>> >>>>>>>>>
>> >>>>>>>>> #pragma omp requires unified_shared_memory
>> >>>>>>>>> #define N 1024
>> >>>>>>>>>
>> >>>>>>>>> int main() {
>> >>>>>>>>>    int a[N] = {0};
>> >>>>>>>>>    int *device_data =  new int[N];
>> >>>>>>>>> #pragma omp target map(tofrom : device_data[0:N])
>> >>>>>>>>>    {
>> >>>>>>>>>      device_data = &a[0];
>> >>>>>>>>>      for (int i = 0; i < 1024; i++) {
>> >>>>>>>>>        device_data[i] = 123;
>> >>>>>>>>>      }
>> >>>>>>>>>    }
>> >>>>>>>>>    std::cout << a[0] << std::endl;
>> >>>>>>>>> }
>> >>>>>>>>>
>> >>>>>>>>> On Sun, Feb 28, 2021 at 1:34 PM Johannes Doerfert
>> >>>>>>>>> <johannesdoerfert at gmail.com> wrote:
>> >>>>>>>>>> You have an illegal memory access; some memory is not properly
>> >>>>>>>>>> mapped.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On 2/27/21 7:47 PM, Itaru Kitayama wrote:
>> >>>>>>>>>>> Removed the internal function, but I get:
>> >>>>>>>>>>>
>> >>>>>>>>>>> CUDA device 0 info: Device supports up to 65536 CUDA blocks and
>> >>>>>>>>>>> 1024 threads with a warp size of 32
>> >>>>>>>>>>> CUDA device 0 info: Launching kernel
>> >>>>>>>>>>> __omp_offloading_34_8009dd23_main_l12 with 1 blocks and 33 threads
>> >>>>>>>>>>> in Generic mode
>> >>>>>>>>>>> CUDA error: Error when synchronizing stream. stream =
>> >>>>>>>>>>> 0x0000000001d22ae0, async info ptr = 0x00007ffe73ea2728
>> >>>>>>>>>>> CUDA error: an illegal memory access was encountered
>> >>>>>>>>>>> Libomptarget error: Failed to synchronize device.
>> >>>>>>>>>>> Libomptarget error: Call to targetDataEnd failed, abort target.
>> >>>>>>>>>>> Libomptarget error: Failed to process data after launching the kernel.
>> >>>>>>>>>>> Libomptarget error: Source location information not present.
>> >>>>>>>>>>> Compile with -g or -gline-tables-only.
>> >>>>>>>>>>> Libomptarget fatal error 1: failure of target construct while
>> >>>>>>>>>>> offloading is mandatory
>> >>>>>>>>>>> /var/spool/parastation/jobs/8941317: line 23: 20812 Aborted
>> >>>>>>>>>>>        (core dumped) ./a.out
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Sun, Feb 28, 2021 at 10:35 AM Alexey Bataev <a.bataev at hotmail.com> wrote:
>> >>>>>>>>>>>> Do not call __tgt_register_requires directly; this is an internal
>> >>>>>>>>>>>> function called by a global constructor, and its argument value
>> >>>>>>>>>>>> depends on #pragma omp requires. Just use the pragma.
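>> >>>>>>>>>>>>
>> >>>>>>>>>>>> A corrected sketch relying only on the pragma (it still needs a
>> >>>>>>>>>>>> system with implicit USM support to actually run):
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> #include <omp.h>
>> >>>>>>>>>>>> #pragma omp requires unified_shared_memory
>> >>>>>>>>>>>> #define N 1024
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> int main() {
>> >>>>>>>>>>>>   int *data = new int[N];  // heap memory, shared under USM
>> >>>>>>>>>>>> #pragma omp target         // no explicit map, no runtime calls
>> >>>>>>>>>>>>   for (int i = 0; i < N; i++)
>> >>>>>>>>>>>>     data[i] = 123;
>> >>>>>>>>>>>>   int ok = (data[0] == 123);
>> >>>>>>>>>>>>   delete[] data;
>> >>>>>>>>>>>>   return ok ? 0 : 1;
>> >>>>>>>>>>>> }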
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>> Alexey Bataev
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> On Feb 27, 2021, at 20:28, Itaru Kitayama via Openmp-dev <openmp-dev at lists.llvm.org> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I'm trying to build a test C++ code that uses part of
>> >>>>>>>>>>>>> unified_shared_memory/shared_update.c
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Sun, Feb 28, 2021 at 10:25 AM Johannes Doerfert
>> >>>>>>>>>>>>>> <johannesdoerfert at gmail.com> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I don't see this test, nor do I understand what you are trying
>> >>>>>>>>>>>>>> to say. Is the test failing? If so, which test is this?
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> ~ Johannes
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On 2/27/21 7:17 PM, Itaru Kitayama via Openmp-dev wrote:
>> >>>>>>>>>>>>>>> The below C++ code builds, but the executable fails at runtime.
>> >>>>>>>>>>>>>>> (It is taken from the C code under the libomptarget subdir's
>> >>>>>>>>>>>>>>> test directory.)
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> #include <omp.h>
>> >>>>>>>>>>>>>>> #include <cstdint>  // for int64_t
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> #pragma omp requires unified_shared_memory
>> >>>>>>>>>>>>>>> #define N 1024
>> >>>>>>>>>>>>>>> extern "C" void __tgt_register_requires(int64_t);
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> int main() {
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>     int a[N] = {0};
>> >>>>>>>>>>>>>>>     int b[N] = {0};
>> >>>>>>>>>>>>>>>     int *device_data;
>> >>>>>>>>>>>>>>>     __tgt_register_requires(1);
>> >>>>>>>>>>>>>>> #pragma omp target map(tofrom : device_data)
>> >>>>>>>>>>>>>>>     {
>> >>>>>>>>>>>>>>>       device_data = &a[0];
>> >>>>>>>>>>>>>>>       for (int i = 0; i < 1024; i++) {
>> >>>>>>>>>>>>>>>         a[i] += 1;
>> >>>>>>>>>>>>>>>       }
>> >>>>>>>>>>>>>>>     }
>> >>>>>>>>>>>>>>> }
>> >>>>>>>>>
>> >>>>>
>> >>>>
>> >>
>> >
>> >
>> >
>>
>>