[Openmp-dev] Target CUDA RTL --> CUDA error is: an illegal memory access was encountered

Itaru Kitayama via Openmp-dev openmp-dev at lists.llvm.org
Sun Jan 19 17:20:16 PST 2020


Alexey,
It builds without a single warning, but fails at run time:

[kitayama1 at juronc15 pcp0151]$ ./a.out
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.ppc64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.ppc64.so supporting 4
devices!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.x86_64.so':
libomptarget.rtl.x86_64.so: cannot open shared object file: No such file or
directory!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 1
devices!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so':
libomptarget.rtl.aarch64.so: cannot open shared object file: No such file
or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000010001c90 is NOT compatible with RTL
libomptarget.rtl.ppc64.so!
Libomptarget --> Image 0x0000000010001c90 is compatible with RTL
libomptarget.rtl.cuda.so!
Libomptarget --> RTL 0x00000100153fd6d0 has index 0!
Libomptarget --> Registering image 0x0000000010001c90 with RTL
libomptarget.rtl.cuda.so!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices
were found)
Libomptarget --> Entering data begin region for device -1 with 3 mappings
Libomptarget --> Use default device id 0
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL --> Init requires flags to 1
Target CUDA RTL --> Getting device 0
Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard
team limit 65536, capping at the hard limit
Target CUDA RTL --> Using 1024 CUDA threads per block
Target CUDA RTL --> Max number of CUDA blocks 65536, threads 1024 & warp
size 32
Target CUDA RTL --> Default number of teams set according to library's
default 128
Target CUDA RTL --> Default number of threads set according to library's
default 128
Libomptarget --> Device 0 is ready to use.
Target CUDA RTL --> Load data from image 0x0000000010001c90
Target CUDA RTL --> CUDA module successfully loaded!
Target CUDA RTL --> Entry point 0x0000000000000000 maps to
__omp_offloading_30_804572b9_main_l32 (0x00001100003ad4a0)
Target CUDA RTL --> Sending global device environment data 4 bytes
Libomptarget --> Entry  0: Base=0x0000010015451260,
Begin=0x0000010015451268, Size=8, Type=0x20
Libomptarget --> Entry  1: Base=0x0000010015451260,
Begin=0x0000010015451260, Size=16, Type=0x1000000000001
Libomptarget --> Entry  2: Base=0x0000010015451268,
Begin=0x0000010015451280, Size=4, Type=0x1000000000011
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000010015451268,
Size=8)...
Libomptarget --> Creating new map entry: HstBase=0x0000010015451260,
HstBegin=0x0000010015451268, HstEnd=0x0000010015451270,
TgtBegin=0x0000110048600000
Libomptarget --> There are 8 bytes allocated at target address
0x0000110048600000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000010015451260,
Size=16)...
Libomptarget --> WARNING: Pointer is not mapped but section extends into
already mapped data
Libomptarget --> Explicit extension of mapping is not allowed.
Libomptarget --> Call to getOrAllocTgtPtr returned null pointer (device
failure or illegal mapping).
Libomptarget --> There are 16 bytes allocated at target address
0x0000000000000000 - is new
Libomptarget --> Moving 16 bytes (hst:0x0000010015451260) ->
(tgt:0x0000000000000000)
Target CUDA RTL --> Error when copying data from host to device. Pointers:
host = 0x0000010015451260, device = 0x0000000000000000, size = 16
Target CUDA RTL --> CUDA error is: invalid argument
Libomptarget --> Copying data to device failed.
Libomptarget fatal error 1: failure of target construct while offloading is
mandatory
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x0000000010001c90 is compatible with RTL
0x00000100153fd6d0!
Libomptarget --> Unregistered image 0x0000000010001c90 from RTL
0x00000100153fd6d0!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor
0x00000000100825f8
Libomptarget --> Done unregistering library!
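
The entries above show the failure mode: the 8-byte slot holding this->arr
(HstPtrBegin 0x0000010015451268) is mapped first, and the subsequent
16-byte map of this[0:1] (HstPtrBegin 0x0000010015451260) would have to
extend that existing entry, which libomptarget refuses. A minimal sketch of
a map_in() that sidesteps the overlap by mapping the enclosing object
before the pointee array (untested here; arr[0:100] is used so the whole
allocation is transferred rather than a single element):

virtual void map_in() {
    // Map the 16-byte object (vtable slot plus arr pointer) first...
#pragma omp target enter data map(to: this[0:1])
    // ...then attach the pointee; its 8-byte pointer slot now falls inside
    // an existing entry instead of preceding it.
#pragma omp target enter data map(to: this->arr[0:100])
}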

On Mon, Jan 20, 2020 at 10:16 AM Alexey Bataev <a.bataev at hotmail.com> wrote:

> Itaru, try #pragma omp target parallel for is_device_ptr(absBase)
>
> Best regards,
> Alexey Bataev
>
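> A sketch of how the suggested clause slots into main() below;
> is_device_ptr(absBase) asserts that absBase already holds a
> device-accessible address, so whether it fits this mapping scheme would
> need testing:
>
> #pragma omp target parallel for is_device_ptr(absBase)
>     for (int i = 0; i < 10; i++) {
>         // absBase is used as-is on the device; no implicit map is made.
>         Derived d1(*static_cast<Derived*>(absBase));
>         printf("arr[0] is %d\n", d1.f());
>     }
>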
> 19 Jan 2020, at 20:01, Itaru Kitayama <itaru.kitayama at gmail.com> wrote:
>
> I have updated the code, since a dynamically allocated array is much more
> important in practice:
> #include <stdio.h>
> #include <iostream>
>
> class AbsBase {
> public:
>     virtual int f() = 0;
>     virtual void map_in() = 0;
>     virtual void map_out() = 0;
> };
>
> class Derived : public AbsBase {
> private:
>     int *arr = new int[100];
>     //int arr[100];
> public:
>     int f() { return arr[0]; }
>     void fillarray() { arr[0] = 1234; }
>     virtual void map_in() {
> #pragma omp target enter data map(to: this[0:1], this->arr[0:1])
>     }
>     virtual void map_out() {
> #pragma omp target exit data map(from: this[0:1], this->arr[0:1])
>     }
> };
>
> int main() {
>     AbsBase *absBase = new Derived();
>     static_cast<Derived*>(absBase)->fillarray();
>     static_cast<Derived*>(absBase)->map_in();
> #pragma omp target parallel for
>     for (int i = 0; i < 10; i++) {
>         Derived d1(*static_cast<Derived*>(absBase));
>         printf("arr[0] is %d\n", d1.f());
>     }
>     static_cast<Derived*>(absBase)->map_out();
> }
>
> Here are the relevant errors:
> [...]
> Libomptarget --> Looking up mapping(HstPtrBegin=0x0000010023511268,
> Size=8)...
> Libomptarget --> Creating new map entry: HstBase=0x0000010023511260,
> HstBegin=0x0000010023511268, HstEnd=0x0000010023511270,
> TgtBegin=0x0000110048600000
> Libomptarget --> There are 8 bytes allocated at target address
> 0x0000110048600000 - is new
> Libomptarget --> Looking up mapping(HstPtrBegin=0x0000010023511260,
> Size=16)...
> Libomptarget --> WARNING: Pointer is not mapped but section extends into
> already mapped data
> Libomptarget --> Explicit extension of mapping is not allowed.
> Libomptarget --> Call to getOrAllocTgtPtr returned null pointer (device
> failure or illegal mapping).
> Libomptarget --> There are 16 bytes allocated at target address
> 0x0000000000000000 - is new
> Libomptarget --> Moving 16 bytes (hst:0x0000010023511260) ->
> (tgt:0x0000000000000000)
> Target CUDA RTL --> Error when copying data from host to device. Pointers:
> host = 0x0000010023511260, device = 0x0000000000000000, size = 16
> Target CUDA RTL --> CUDA error is: invalid argument
> Libomptarget --> Copying data to device failed.
> Libomptarget fatal error 1: failure of target construct while offloading
> is mandatory
> [...]
>
> On Sun, Jan 19, 2020 at 3:26 PM Itaru Kitayama <itaru.kitayama at gmail.com>
> wrote:
>
>> Tom,
>>
>> Below builds, but fails at run time:
>>
>> #include <stdio.h>
>>
>> class AbsBase {
>> public:
>>     virtual int f() = 0;
>>     virtual void map_in() = 0;
>>     virtual void map_out() = 0;
>> };
>>
>> class Derived : public AbsBase {
>> private:
>>     int *arr = new int[100];
>> public:
>>     int f() { return arr[0]; }
>>     void fillarray() { arr[0] = 1234; }
>>     virtual void map_in() {
>> #pragma omp target enter data map(to: this[0:1], this->arr[0:1])
>>     }
>>     virtual void map_out() {
>> #pragma omp target exit data map(from: this[0:1], this->arr[0:1])
>>     }
>> };
>>
>> int main() {
>>     AbsBase *absBase = new Derived();
>>     static_cast<Derived*>(absBase)->fillarray();
>>     static_cast<Derived*>(absBase)->map_in();
>>
>> #pragma omp target parallel for
>>     for (int i = 0; i < 10; i++) {
>>         Derived d1(*static_cast<Derived*>(absBase));
>>         printf("arr[0] is %d\n", d1.f());
>>     }
>>     absBase->map_out();
>> }
>>
>> What's wrong with my code?
>>
>> [kitayama1 at juronc11 pcp0151]$ ./a.out
>> Libomptarget --> Loading RTLs...
>> Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
>> Libomptarget --> Successfully loaded library 'libomptarget.rtl.ppc64.so'!
>> Libomptarget --> Registering RTL libomptarget.rtl.ppc64.so supporting 4
>> devices!
>> Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
>> Libomptarget --> Unable to load library 'libomptarget.rtl.x86_64.so':
>> libomptarget.rtl.x86_64.so: cannot open shared object file: No such file
>> or directory!
>> Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
>> Target CUDA RTL --> Start initializing CUDA
>> Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
>> Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 1
>> devices!
>> Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
>> Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so':
>> libomptarget.rtl.aarch64.so: cannot open shared object file: No such
>> file or directory!
>> Libomptarget --> RTLs loaded!
>> Libomptarget --> Image 0x0000000010001a90 is NOT compatible with RTL
>> libomptarget.rtl.ppc64.so!
>> Libomptarget --> Image 0x0000000010001a90 is compatible with RTL
>> libomptarget.rtl.cuda.so!
>> Libomptarget --> RTL 0x0000010000a6d6d0 has index 0!
>> Libomptarget --> Registering image 0x0000000010001a90 with RTL
>> libomptarget.rtl.cuda.so!
>> Libomptarget --> Done registering entries!
>> Libomptarget --> Call to omp_get_num_devices returning 1
>> Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices
>> were found)
>> Libomptarget --> Entering data begin region for device -1 with 3 mappings
>> Libomptarget --> Use default device id 0
>> Libomptarget --> Checking whether device 0 is ready.
>> Libomptarget --> Is the device 0 (local ID 0) initialized? 0
>> Target CUDA RTL --> Init requires flags to 1
>> Target CUDA RTL --> Getting device 0
>> Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard
>> team limit 65536, capping at the hard limit
>> Target CUDA RTL --> Using 1024 CUDA threads per block
>> Target CUDA RTL --> Max number of CUDA blocks 65536, threads 1024 & warp
>> size 32
>> Target CUDA RTL --> Default number of teams set according to library's
>> default 128
>> Target CUDA RTL --> Default number of threads set according to library's
>> default 128
>> Libomptarget --> Device 0 is ready to use.
>> Target CUDA RTL --> Load data from image 0x0000000010001a90
>> Target CUDA RTL --> CUDA module successfully loaded!
>> Target CUDA RTL --> Entry point 0x0000000000000000 maps to
>> __omp_offloading_30_809bddc5_main_l31 (0x00001100003a99d0)
>> Target CUDA RTL --> Sending global device environment data 4 bytes
>> Libomptarget --> Entry  0: Base=0x0000010000ac1260,
>> Begin=0x0000010000ac1268, Size=8, Type=0x20
>> Libomptarget --> Entry  1: Base=0x0000010000ac1260,
>> Begin=0x0000010000ac1260, Size=16, Type=0x1000000000001
>> Libomptarget --> Entry  2: Base=0x0000010000ac1268,
>> Begin=0x0000010000ac1280, Size=4, Type=0x1000000000011
>> Libomptarget --> Looking up mapping(HstPtrBegin=0x0000010000ac1268,
>> Size=8)...
>> Libomptarget --> Creating new map entry: HstBase=0x0000010000ac1260,
>> HstBegin=0x0000010000ac1268, HstEnd=0x0000010000ac1270,
>> TgtBegin=0x0000110048600000
>> Libomptarget --> There are 8 bytes allocated at target address
>> 0x0000110048600000 - is new
>> Libomptarget --> Looking up mapping(HstPtrBegin=0x0000010000ac1260,
>> Size=16)...
>> Libomptarget --> WARNING: Pointer is not mapped but section extends into
>> already mapped data
>> Libomptarget --> Explicit extension of mapping is not allowed.
>> Libomptarget --> Call to getOrAllocTgtPtr returned null pointer (device
>> failure or illegal mapping).
>> Libomptarget --> There are 16 bytes allocated at target address
>> 0x0000000000000000 - is new
>> Libomptarget --> Moving 16 bytes (hst:0x0000010000ac1260) ->
>> (tgt:0x0000000000000000)
>> Target CUDA RTL --> Error when copying data from host to device.
>> Pointers: host = 0x0000010000ac1260, device = 0x0000000000000000, size = 16
>> Target CUDA RTL --> CUDA error is: invalid argument
>> Libomptarget --> Copying data to device failed.
>> Libomptarget fatal error 1: failure of target construct while offloading
>> is mandatory
>> Libomptarget --> Unloading target library!
>> Libomptarget --> Image 0x0000000010001a90 is compatible with RTL
>> 0x0000010000a6d6d0!
>> Libomptarget --> Unregistered image 0x0000000010001a90 from RTL
>> 0x0000010000a6d6d0!
>> Libomptarget --> Done unregistering images!
>> Libomptarget --> Removing translation table for descriptor
>> 0x000000001007dff8
>> Libomptarget --> Done unregistering library!
>>
>> On Sat, Jan 18, 2020 at 11:49 AM Scogland, Tom <scogland1 at llnl.gov>
>> wrote:
>>
>>> That's the map of "*this" talking: mapping *this was made possible in a
>>> newer standard than is implemented here. If you change it over to mapping
>>> either this via an array section or a reference to *this, it should work.
>>> The general pattern is to use that kind of mapping to get the correct
>>> static type at the point of the map.
>>>
>>> -Tom
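>>>
>>> A sketch of the two suggested variants; the reference name "self" is
>>> introduced here only for illustration, and the variants are
>>> alternatives, not to be combined:
>>>
>>> virtual void map_in() {
>>>     // Variant 1: map the object through an array section on "this".
>>> #pragma omp target enter data map(to: this[0:1], this->arr[0:100])
>>>
>>>     // Variant 2: bind *this to a named reference and map through it.
>>>     // Derived &self = *this;
>>>     // #pragma omp target enter data map(to: self, self.arr[0:100])
>>> }
>>>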
>>> ------------------------------
>>> From: Itaru Kitayama <itaru.kitayama at gmail.com>
>>> Sent: Friday, January 17, 2020 5:42 PM
>>> To: Doerfert, Johannes <jdoerfert at anl.gov>
>>> Cc: Alexey Bataev <a.bataev at hotmail.com>; openmp-dev
>>> <openmp-dev at lists.llvm.org>; Scogland, Tom <scogland1 at llnl.gov>
>>> Subject: Re: [Openmp-dev] Target CUDA RTL --> CUDA error is: an
>>> illegal memory access was encountered
>>>
>>> Johannes,
>>> Tested with Trunk on POWER8:
>>>
>>> test5.cpp:20:40: error: expected expression containing only member
>>> accesses and/or array sections based on named variables
>>>                 #pragma omp target enter data map(to:*this,
>>> this->arr[0:100])
>>>                                                      ^~~~~
>>> test5.cpp:23:40: error: expected expression containing only member
>>> accesses and/or array sections based on named variables
>>>                 #pragma omp target enter data map(to:*this,
>>> this->arr[0:100])
>>>
>>>
>>> On Sat, Jan 18, 2020 at 4:50 AM Doerfert, Johannes <jdoerfert at anl.gov>
>>> wrote:
>>>
>>> Check the attached email from Tom for a way to make this work. Let me
>>> know what you think.
>>>
>>>
>>>
>>> ---------------------------------------
>>> Johannes Doerfert
>>> Researcher
>>>
>>> Argonne National Laboratory
>>> Lemont, IL 60439, USA
>>>
>>> jdoerfert at anl.gov
>>>
>>> ________________________________________
>>> From: Alexey Bataev <a.bataev at hotmail.com>
>>> Sent: Friday, January 17, 2020 03:33
>>> To: Itaru Kitayama
>>> Cc: Doerfert, Johannes; openmp-dev
>>> Subject: Re: [Openmp-dev] Target CUDA RTL --> CUDA error is: an illegal
>>> memory access was encountered
>>>
>>> The compiler does not work the way you want. It is not an issue of
>>> OpenMP but of C++ itself.
>>>
>>> Best regards,
>>> Alexey Bataev
>>>
>>> 16 Jan 2020, at 23:20, Itaru Kitayama <itaru.kitayama at gmail.com> wrote:
>>>
>>> #include <stdio.h>
>>>
>>> class AbsBase {
>>> public:
>>>     virtual int f() = 0;
>>> };
>>>
>>> class Derived : public AbsBase {
>>> private:
>>>     //int *arr = new int[100];
>>>     int arr[100];
>>> public:
>>>     int f() { return arr[0]; }
>>>     void fillarray() { arr[0] = 1234; }
>>> };
>>>
>>> int main() {
>>>     AbsBase *absBase = new Derived();
>>>     Derived *p = static_cast<Derived*>(absBase);
>>>     p->fillarray();
>>> #pragma omp target parallel for map(to: p[0:1])
>>>     for (int i = 0; i < 10; i++) {
>>>         //Derived d1(*static_cast<Derived*>(absBase));
>>>         Derived d1(*p);
>>>         printf("arr[0] is %d\n", d1.f());
>>>     }
>>> }
>>>
>>> The above gives me what I wanted to see, but I would like to avoid
>>> casting a pointer to an abstract base class; the instantiation, however,
>>> is done by the derived class.
>>>
>>>
>>> On Fri, Jan 17, 2020 at 6:07 PM Alexey Bataev <a.bataev at hotmail.com> wrote:
>>> AbsBase is a pointer to the base class? Then no: the compiler is not
>>> aware that it points to the derived class, so it copies the array as an
>>> array of the base class, not the derived one.
>>>
>>> Best regards,
>>> Alexey Bataev
>>>
>>> 16 Jan 2020, at 22:58, Itaru Kitayama <itaru.kitayama at gmail.com> wrote:
>>>
>>> If I change class Derived's private data to an array of 100 integers,
>>> it executes, but the first element of the array is reported as 0 instead
>>> of 1234. Shouldn't a bitwise copy work when mapping the pointer to the
>>> AbsBase class like an array of one?
>>>
>>> On Wed, Jan 15, 2020 at 2:32 PM Alexey Bataev <a.bataev at hotmail.com> wrote:
>>> And it should fail, since you're mapping a non-trivially-copyable type.
>>>
>>> Best regards,
>>> Alexey Bataev
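>>>
>>> A quick compile-time check of that claim (a sketch; assumes C++17 for
>>> the _v alias):
>>>
>>> #include <type_traits>
>>> // Derived declares virtual functions and owns a raw pointer member,
>>> // so it is not trivially copyable and a bitwise device map of it is
>>> // not well-defined.
>>> static_assert(!std::is_trivially_copyable_v<Derived>,
>>>               "Derived must not be mapped bitwise");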
>>>
>>> 14 Jan 2020, at 19:14, Itaru Kitayama <itaru.kitayama at gmail.com> wrote:
>>>
>>> This piece of C++ code fails at run time:
>>>
>>> #include <stdio.h>
>>>
>>> class AbsBase {
>>> public:
>>>     virtual int f() = 0;
>>> };
>>>
>>> class Derived : public AbsBase {
>>> private:
>>>     int *arr = new int[100];
>>> public:
>>>     int f() { return arr[0]; }
>>>     void fillarray() { arr[0] = 1234; }
>>> };
>>>
>>> int main() {
>>>     AbsBase *absBase = new Derived();
>>>     static_cast<Derived*>(absBase)->fillarray();
>>> #pragma omp target parallel for map(to: absBase[0:1])
>>>     for (int i = 0; i < 10; i++) {
>>>         Derived d1(*static_cast<Derived*>(absBase));
>>>         printf("arr[0] is %d\n", d1.f());
>>>     }
>>> }
>>>
>>> On Sun, Jan 12, 2020 at 1:45 PM Doerfert, Johannes <jdoerfert at anl.gov> wrote:
>>> On 01/12, Itaru Kitayama wrote:
>>> > Do you guys have a timeframe for implementing that feature?
>>>
>>> I do not and I don't know anyone who will drive this right now.
>>>
>>> As mentioned before, you should be able to "move/copy-create" the
>>> elements on the device in order to use virtual functions.
>>>
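>>> That is the pattern the loops earlier in this thread already use,
>>> annotated here as a sketch (not something the standard guaranteed at
>>> the time):
>>>
>>> #pragma omp target parallel for
>>>     for (int i = 0; i < 10; i++) {
>>>         // Copy-creating d1 on the device gives it a device-side
>>>         // vtable, so d1.f() is a legal virtual call even though a
>>>         // host-copied vtable pointer in *absBase would not be.
>>>         Derived d1(*static_cast<Derived*>(absBase));
>>>         printf("arr[0] is %d\n", d1.f());
>>>     }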
>>>
>>> > On Sun, Jan 12, 2020 at 12:51 Doerfert, Johannes <jdoerfert at anl.gov> wrote:
>>> >
>>> > > On 01/11, Alexey Bataev via Openmp-dev wrote:
>>> > > > Virtual functions are not supported.
>>> > >
>>> > > Not yet ;).
>>> > >
>>> > > We'll get it with 5.1, so we might actually implement it soon. Till
>>> > > then, you have to create the object on the device on which you call
>>> > > the virtual function.
>>> > >
>>>
>>> --
>>>
>>> Johannes Doerfert
>>> Researcher
>>>
>>> Argonne National Laboratory
>>> Lemont, IL 60439, USA
>>>
>>> jdoerfert at anl.gov
>>>
>>>