[Openmp-dev] Libomptarget fatal error 1: '#pragma omp requires unified_shared_memory' not used consistently!

Mon Mar 1 10:23:29 PST 2021

I think that is it. I have heard of problems with our USM before.
We need to use the managed allocators if USM is active; they are
about to be upstreamed (I hope).

On 3/1/21 12:15 PM, Alexey.Bataev wrote:
> It looks like this example is for explicit USM, and I assume that if you
> allocate the memory for `a` in managed memory explicitly, the OpenMP
> example should also work.
>
> There are other USM modes, though, where memory is shared implicitly
> between the host and the devices. It looks like the LLVM implementation
> currently relies on the mechanism described at
> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-system-allocator
> where implicit USM is supported.
>
> -------------
> Best regards,
> Alexey Bataev
>
> On 3/1/2021 1:04 PM, Joachim Protze wrote:
>> Are the kernel/hardware requirements LLVM-specific?
>>
>> I can compile and execute the add_grid.cu example successfully:
>> https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
>>
>> So, I would expect that an OpenMP program should also run successfully.
>>
>> - Joachim
>>
>>
>> On 01.03.21 at 18:49, Alexey.Bataev wrote:
>>> Hi, are you sure that your system supports Unified Shared Memory? As far
>>> as I know, it requires a special Linux kernel, and the hardware must
>>> support it. If your system does not support it, the code will certainly
>>> crash at runtime.
>>>
>>> In this mode, IIRC, we just ignore map clauses since the accelerator
>>> devices can access the host memory directly without the need for
>>> allocating the device-specific memory.
>>>
>>>
>>> -------------
>>> Best regards,
>>> Alexey Bataev
>>>
>>> On 3/1/2021 12:41 PM, Joachim Protze wrote:
>>>> Hi all,
>>>>
>>>> an even simpler example segfaults when the requires directive is present:
>>>>
>>>> #include <iostream>
>>>> #include <omp.h>
>>>> #include <stdio.h>
>>>>
>>>> #pragma omp requires unified_shared_memory
>>>> #define N 1024
>>>>
>>>> int main() {
>>>>    int a[N];
>>>>    printf("a=%p\n", a);
>>>> #pragma omp target map(tofrom : a[0:N])
>>>>    {
>>>>      printf("a=%p\n", a);
>>>>      for (int i = 0; i < 1024; i++) {
>>>>        a[i] = 123;
>>>>      }
>>>>    }
>>>>    printf("a[0]=%i, a[%i]=%i\n", a[0], N/2, a[N/2]);
>>>> }
>>>>
>>>> The code runs successfully when the requires directive is removed, because
>>>> the mapping of `a` is explicitly specified.
>>>>
>>>> For this code to run successfully, would it be necessary to allocate `a`
>>>> specially as CUDA managed memory? I don't see any special treatment of
>>>> `a` in the LLVM IR. As I understand the OpenMP spec, the requires directive
>>>> should lead to a compile error if clang fails to generate such code.
>>>>
>>>> The requires example from the OpenMP Examples also fails with the same
>>>> runtime error:
>>>>
>>>> https://github.com/OpenMP/Examples/blob/main/sources/Example_requires.1.cpp
>>>>
>>>> - Joachim
>>>>
>>>> On 28.02.21 at 11:12, Itaru Kitayama via Openmp-dev wrote:
>>>>> This is the code:
>>>>>
>>>>> #include <iostream>
>>>>> #include <omp.h>
>>>>>
>>>>> #pragma omp requires unified_shared_memory
>>>>> #define N 1024
>>>>>
>>>>> int main() {
>>>>>    int a[N] = {0};
>>>>>    int *device_data =  new int[N];
>>>>> #pragma omp target map(tofrom : device_data[0:N])
>>>>>    {
>>>>>      device_data = &a[0];
>>>>>      for (int i = 0; i < 1024; i++) {
>>>>>        device_data[i] = 123;
>>>>>      }
>>>>>    }
>>>>>    std::cout << a[0] << std::endl;
>>>>> }
>>>>>
>>>>> On Sun, Feb 28, 2021 at 1:34 PM Johannes Doerfert
>>>>> <johannesdoerfert at gmail.com> wrote:
>>>>>> You have an illegal memory access, some memory is not properly
>>>>>> mapped.
>>>>>>
>>>>>>
>>>>>> On 2/27/21 7:47 PM, Itaru Kitayama wrote:
>>>>>>> Removed the internal function, but I get:
>>>>>>>
>>>>>>> CUDA device 0 info: Device supports up to 65536 CUDA blocks and 1024
>>>>>>> threads with a warp size of 32
>>>>>>> CUDA device 0 info: Launching kernel
>>>>>>> __omp_offloading_34_8009dd23_main_l12 with 1 blocks and 33 threads in
>>>>>>> Generic mode
>>>>>>> CUDA error: Error when synchronizing stream. stream =
>>>>>>> 0x0000000001d22ae0, async info ptr = 0x00007ffe73ea2728
>>>>>>> CUDA error: an illegal memory access was encountered
>>>>>>> Libomptarget error: Failed to synchronize device.
>>>>>>> Libomptarget error: Call to targetDataEnd failed, abort target.
>>>>>>> Libomptarget error: Failed to process data after launching the kernel.
>>>>>>> Libomptarget error: Source location information not present. Compile
>>>>>>> with -g or -gline-tables-only.
>>>>>>> Libomptarget fatal error 1: failure of target construct while
>>>>>>> offloading is mandatory
>>>>>>> /var/spool/parastation/jobs/8941317: line 23: 20812 Aborted
>>>>>>>        (core dumped) ./a.out
>>>>>>>
>>>>>>> On Sun, Feb 28, 2021 at 10:35 AM Alexey Bataev <a.bataev at hotmail.com> wrote:
>>>>>>>> Do not call __tgt_register_requires directly. It is an internal function called by a global constructor, and its argument value depends on #pragma omp requires. Just use the pragma.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Alexey Bataev
>>>>>>>>
>>>>>>>>> On Feb 27, 2021, at 20:28, Itaru Kitayama via Openmp-dev <openmp-dev at lists.llvm.org> wrote:
>>>>>>>>>
>>>>>>>>> I'm trying to build a test C++ code that uses part of
>>>>>>>>> unified_shared_memory/shared_update.c
>>>>>>>>>
>>>>>>>>>> On Sun, Feb 28, 2021 at 10:25 AM Johannes Doerfert
>>>>>>>>>> <johannesdoerfert at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I don't see this test, nor do I understand what you are trying to say.
>>>>>>>>>> Is the test failing? If so, which test is this?
>>>>>>>>>>
>>>>>>>>>> ~ Johannes
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 2/27/21 7:17 PM, Itaru Kitayama via Openmp-dev wrote:
>>>>>>>>>>> The below C++ code builds, but the executable fails at runtime.
>>>>>>>>>>> (It is taken from the C code under the libomptarget subdir's test directory)
>>>>>>>>>>>
>>>>>>>>>>> #include <omp.h>
>>>>>>>>>>>
>>>>>>>>>>> #pragma omp requires unified_shared_memory
>>>>>>>>>>> #define N 1024
>>>>>>>>>>> extern "C" void __tgt_register_requires(int64_t);
>>>>>>>>>>>
>>>>>>>>>>> int main() {
>>>>>>>>>>>
>>>>>>>>>>>     int a[N] = {0};
>>>>>>>>>>>     int b[N] = {0};
>>>>>>>>>>>     int *device_data;
>>>>>>>>>>>     __tgt_register_requires(1);
>>>>>>>>>>> #pragma omp target map(tofrom : device_data)
>>>>>>>>>>>     {
>>>>>>>>>>>       device_data = &a[0];
>>>>>>>>>>>       for (int i = 0; i < 1024; i++) {
>>>>>>>>>>>         a[i] += 1;
>>>>>>>>>>>       }
>>>>>>>>>>>     }
>>>>>>>>>>> }
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Openmp-dev mailing list
>>>>>>>>>>> Openmp-dev at lists.llvm.org
>>>>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
>