[Openmp-dev] Target architecture does not support unified addressing
Itaru Kitayama via Openmp-dev
openmp-dev at lists.llvm.org
Sat May 2 03:45:11 PDT 2020
Setting LIBOMPTARGET_DEBUG=1 on POWER8 with P100 GPUs, I get:
$ ./a.out
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.ppc64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.ppc64.so supporting 4
devices!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.x86_64.so':
libomptarget.rtl.x86_64.so: cannot open shared object file: No such file or
directory!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 1
devices!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so':
libomptarget.rtl.aarch64.so: cannot open shared object file: No such file
or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000010001300 is NOT compatible with RTL
libomptarget.rtl.ppc64.so!
Libomptarget --> Image 0x0000000010001300 is compatible with RTL
libomptarget.rtl.cuda.so!
Libomptarget --> RTL 0x0000010001b6d860 has index 0!
Libomptarget --> Registering image 0x0000000010001300 with RTL
libomptarget.rtl.cuda.so!
Libomptarget --> Done registering entries!
Libomptarget --> New requires flags 8 compatible with existing 8!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices
were found)
Libomptarget --> Entering target region with entry point 0x0000000010001110
and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL --> Init requires flags to 8
Target CUDA RTL --> Getting device 0
Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard
team limit 65536, capping at the hard limit
Target CUDA RTL --> Using 1024 CUDA threads per block
Target CUDA RTL --> Using warp size 32
Target CUDA RTL --> Max number of CUDA blocks 65536, threads 1024 & warp
size 32
Target CUDA RTL --> Default number of teams set according to library's
default 128
Target CUDA RTL --> Default number of threads set according to library's
default 128
Libomptarget --> Device 0 is ready to use.
Target CUDA RTL --> Load data from image 0x0000000010001300
Target CUDA RTL --> CUDA module successfully loaded!
Target CUDA RTL --> Entry point 0x0000000000000000 maps to
__omp_offloading_46_804afcb6_main_l41 (0x0000110000350fd0)
Target CUDA RTL --> Entry point 0x0000000000000001 maps to
__omp_offloading_46_804afcb6_main_l89 (0x0000110000361810)
Target CUDA RTL --> Sending global device environment data 4 bytes
Libomptarget --> Entry 0: Base=0x00003ffff55df0b0,
Begin=0x00003ffff55df0b0, Size=8, Type=0x23
Libomptarget --> Entry 1: Base=0x00003ffff55de0a8,
Begin=0x00003ffff55de0a8, Size=4096, Type=0x223
Libomptarget --> Entry 2: Base=0x00003ffff55df0c0,
Begin=0x00003ffff55df0c0, Size=8, Type=0x23
Libomptarget --> Entry 3: Base=0x0000010001bd1a80,
Begin=0x0000010001bd1a80, Size=0, Type=0x220
Libomptarget --> Looking up mapping(HstPtrBegin=0x00003ffff55df0b0,
Size=8)...
Libomptarget --> Return HstPtrBegin 0x00003ffff55df0b0 Size=8 RefCount=
updated
Libomptarget --> There are 8 bytes allocated at target address
0x00003ffff55df0b0 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00003ffff55de0a8,
Size=4096)...
Libomptarget --> Return HstPtrBegin 0x00003ffff55de0a8 Size=4096 RefCount=
updated
Libomptarget --> There are 4096 bytes allocated at target address
0x00003ffff55de0a8 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00003ffff55df0c0,
Size=8)...
Libomptarget --> Return HstPtrBegin 0x00003ffff55df0c0 Size=8 RefCount=
updated
Libomptarget --> There are 8 bytes allocated at target address
0x00003ffff55df0c0 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000010001bd1a80,
Size=0)...
Libomptarget --> There are 0 bytes allocated at target address
0x0000000000000000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00003ffff55df0b0,
Size=8)...
Libomptarget --> Get HstPtrBegin 0x00003ffff55df0b0 Size=8 RefCount=
Libomptarget --> Obtained target argument 0x00003ffff55df0b0 from host
pointer 0x00003ffff55df0b0
Libomptarget --> Looking up mapping(HstPtrBegin=0x00003ffff55de0a8,
Size=4096)...
Libomptarget --> Get HstPtrBegin 0x00003ffff55de0a8 Size=4096 RefCount=
Libomptarget --> Obtained target argument 0x00003ffff55de0a8 from host
pointer 0x00003ffff55de0a8
Libomptarget --> Looking up mapping(HstPtrBegin=0x00003ffff55df0c0,
Size=8)...
Libomptarget --> Get HstPtrBegin 0x00003ffff55df0c0 Size=8 RefCount=
Libomptarget --> Obtained target argument 0x00003ffff55df0c0 from host
pointer 0x00003ffff55df0c0
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000010001bd1a80,
Size=0)...
Libomptarget --> Get HstPtrBegin 0x0000010001bd1a80 Size=0 RefCount=
Libomptarget --> Obtained target argument 0x0000010001bd1a80 from host
pointer 0x0000010001bd1a80
Libomptarget --> Launching target execution
__omp_offloading_46_804afcb6_main_l41 with pointer 0x0000110000322840
(index=0).
Target CUDA RTL --> Setting CUDA threads per block to requested 1
Target CUDA RTL --> Adding master warp: +32 threads
Target CUDA RTL --> Using requested number of teams 1
Target CUDA RTL --> Launch kernel with 1 blocks and 33 threads
Target CUDA RTL --> Launch of entry point at 0x0000110000322840 successful!
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000010001bd1a80,
Size=0)...
Libomptarget --> Get HstPtrBegin 0x0000010001bd1a80 Size=0 RefCount= updated
Libomptarget --> There are 0 bytes allocated at target address
0x0000010001bd1a80 - is not last
Libomptarget --> Looking up mapping(HstPtrBegin=0x00003ffff55df0c0,
Size=8)...
Libomptarget --> Get HstPtrBegin 0x00003ffff55df0c0 Size=8 RefCount= updated
Libomptarget --> There are 8 bytes allocated at target address
0x00003ffff55df0c0 - is not last
Libomptarget --> Looking up mapping(HstPtrBegin=0x00003ffff55de0a8,
Size=4096)...
Libomptarget --> Get HstPtrBegin 0x00003ffff55de0a8 Size=4096 RefCount=
updated
Libomptarget --> There are 4096 bytes allocated at target address
0x00003ffff55de0a8 - is not last
Libomptarget --> Looking up mapping(HstPtrBegin=0x00003ffff55df0b0,
Size=8)...
Libomptarget --> Get HstPtrBegin 0x00003ffff55df0b0 Size=8 RefCount= updated
Libomptarget --> There are 8 bytes allocated at target address
0x00003ffff55df0b0 - is not last
Target CUDA RTL --> Error when synchronizing stream. stream =
0x00001100002cd7c0, async info ptr = 0x00003ffff55dddf8
Target CUDA RTL --> CUDA error is: an illegal memory access was encountered
Libomptarget fatal error 1: failure of target construct while offloading is
mandatory
Target CUDA RTL --> Error returned from cuStreamDestroy
Target CUDA RTL --> CUDA error is: an illegal memory access was encountered
[the two lines above repeat 32 times in total, once per stream in the pool]
Target CUDA RTL --> Error returned from cuModuleUnload
Target CUDA RTL --> CUDA error is: an illegal memory access was encountered
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x0000000010001300 is compatible with RTL
0x0000010001b6d860!
Libomptarget --> Unregistered image 0x0000000010001300 from RTL
0x0000010001b6d860!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor
0x00000000100193e8
Libomptarget --> Done unregistering library!
Libomptarget --> Deinit target library!
On Sat, May 2, 2020 at 12:55 PM Itaru Kitayama <itaru.kitayama at gmail.com>
wrote:
> deviceQuery returns:
>
> CUDA Device Query (Runtime API) version (CUDART static linking)
>
> Detected 1 CUDA Capable device(s)
>
> Device 0: "Tesla P100-SXM2-16GB"
> CUDA Driver Version / Runtime Version 10.1 / 8.0
> CUDA Capability Major/Minor version number: 6.0
> Total amount of global memory: 16281 MBytes (17071734784 bytes)
> (56) Multiprocessors, ( 64) CUDA Cores/MP: 3584 CUDA Cores
> GPU Max Clock rate: 1481 MHz (1.48 GHz)
> Memory Clock rate: 715 Mhz
> Memory Bus Width: 4096-bit
> L2 Cache Size: 4194304 bytes
> Maximum Texture Dimension Size (x,y,z): 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
> Maximum Layered 1D Texture Size, (num) layers: 1D=(32768), 2048 layers
> Maximum Layered 2D Texture Size, (num) layers: 2D=(32768, 32768), 2048 layers
> Total amount of constant memory: 65536 bytes
> Total amount of shared memory per block: 49152 bytes
> Total number of registers available per block: 65536
> Warp size: 32
> Maximum number of threads per multiprocessor: 2048
> Maximum number of threads per block: 1024
> Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
> Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
> Maximum memory pitch: 2147483647 bytes
> Texture alignment: 512 bytes
> Concurrent copy and kernel execution: Yes with 5 copy engine(s)
> Run time limit on kernels: No
> Integrated GPU sharing Host Memory: No
> Support host page-locked memory mapping: Yes
> Alignment requirement for Surfaces: Yes
> Device has ECC support: Enabled
> Device supports Unified Addressing (UVA): Yes
>
> On Sat, May 2, 2020 at 10:31 AM Itaru Kitayama <itaru.kitayama at gmail.com>
> wrote:
>
>> Executing shared_update.c on the P100 results in errors:
>>
>> ==130340== NVPROF is profiling process 130340, command: ./a.out
>> Libomptarget fatal error 1: failure of target construct while offloading
>> is mandatory
>> ==130340== Profiling application: ./a.out
>> ==130340== Warning: 1 records have invalid timestamps due to insufficient
>> device buffer space. You can configure the buffer space using the option
>> --device-buffer-size.
>> ==130340== Profiling result:
>> Type             Time(%)      Time  Calls       Avg       Min       Max  Name
>> GPU activities:   89.68%  40.950us      2  20.475us  18.103us  22.847us  [CUDA memcpy DtoH]
>>                   10.32%  4.7100us      1  4.7100us  4.7100us  4.7100us  [CUDA memcpy HtoD]
>> API calls:        69.95%  400.85ms      1  400.85ms  400.85ms  400.85ms  cuCtxCreate
>>                   15.17%  86.932ms      1  86.932ms  86.932ms  86.932ms  cuStreamSynchronize
>>                   12.11%  69.398ms      1  69.398ms  69.398ms  69.398ms  cuCtxDestroy
>>                    2.68%  15.375ms      1  15.375ms  15.375ms  15.375ms  cuModuleLoadDataEx
>>                    0.06%  363.13us     32  11.347us     754ns  171.53us  cuStreamCreate
>>                    0.01%  48.938us      2  24.469us  19.581us  29.357us  cuMemcpyDtoH
>>                    0.00%  22.184us      1  22.184us  22.184us  22.184us  cuLaunchKernel
>>                    0.00%  7.6760us      1  7.6760us  7.6760us  7.6760us  cuMemcpyHtoD
>>                    0.00%  4.7430us     32     148ns     113ns     520ns  cuStreamDestroy
>>                    0.00%  2.9060us      3     968ns     562ns  1.5750us  cuModuleGetGlobal
>>                    0.00%  2.8940us      2  1.4470us     336ns  2.5580us  cuModuleGetFunction
>>                    0.00%  2.8250us      3     941ns     181ns  2.2050us  cuDeviceGetCount
>>                    0.00%  2.6040us      2  1.3020us     965ns  1.6390us  cuDeviceGet
>>                    0.00%  2.4200us      5     484ns     137ns     882ns  cuCtxSetCurrent
>>                    0.00%  1.6450us      6     274ns     117ns     671ns  cuDeviceGetAttribute
>>                    0.00%     804ns      1     804ns     804ns     804ns  cuFuncGetAttribute
>>                    0.00%     296ns      1     296ns     296ns     296ns  cuModuleUnload
>> ======== Error: Application returned non-zero code 1
>>
>> On Sat, May 2, 2020 at 8:24 AM Itaru Kitayama <itaru.kitayama at gmail.com>
>> wrote:
>>
>>> Doru,
>>> What's the current way of enabling SM_60 CUDA architecture support for
>>> unified addressing?
>>> It has been modified since we last exchanged messages.
>>>
>>> On Thu, Nov 7, 2019 at 4:05 AM Gheorghe-Teod Bercea <
>>> Gheorghe-Teod.Bercea at ibm.com> wrote:
>>>
>>>> Hi Itaru,
>>>>
>>>> We did not test those features on an sm_60 machine like a Pascal GPU so
>>>> I can't guarantee it will work. I suggest you enable it locally and see how
>>>> it performs.
>>>> You only need to make a small change in "void
>>>> CGOpenMPRuntimeNVPTX::checkArchForUnifiedAddressing(const OMPRequiresDecl
>>>> *D)" to allow sm_60 to be accepted as a valid target.
>>>>
>>>> Thanks,
>>>>
>>>> --Doru
>>>>
>>>>
>>>>
>>>>
>>>> From: Itaru Kitayama via Openmp-dev <openmp-dev at lists.llvm.org>
>>>> To: Alexey Bataev <a.bataev at outlook.com>
>>>> Cc: openmp-dev <openmp-dev at lists.llvm.org>
>>>> Date: 11/05/2019 06:04 PM
>>>> Subject: [EXTERNAL] Re: [Openmp-dev] Target architecture does
>>>> not support unified addressing
>>>> Sent by: "Openmp-dev" <openmp-dev-bounces at lists.llvm.org>
>>>> ------------------------------
>>>>
>>>>
>>>>
>>>> Can you say briefly why SM60, while capable of handling unified
>>>> addressing, is not supported in Clang?
>>>>
>>>> On Wed, Nov 6, 2019 at 7:56 AM Alexey Bataev <a.bataev at outlook.com>
>>>> wrote:
>>>> Yes, it is enforced in clang.
>>>>
>>>> Best regards,
>>>> Alexey Bataev
>>>>
>>>> On Nov 5, 2019, at 17:38, Itaru Kitayama <itaru.kitayama at gmail.com>
>>>> wrote:
>>>>
>>>>
>>>> Thank you, Alexey. Now I am seeing:
>>>>
>>>> $ clang++ -fopenmp -fopenmp-targets=nvptx64 tmp.cpp
>>>> tmp.cpp:1:22: error: Target architecture sm_60 does not support unified
>>>> addressing
>>>> #pragma omp requires unified_shared_memory
>>>> ^
>>>> 1 error generated.
>>>>
>>>> The P100 is an SM60 device, but it supports unified memory. Is a
>>>> requirement of sm_70 or greater enforced in Clang?
>>>>
>>>> On Wed, Nov 6, 2019 at 5:07 AM Alexey Bataev <a.bataev at outlook.com>
>>>> wrote:
>>>> Most probably you are using the default architecture, i.e. sm_35. You
>>>> need to build clang with sm_35, sm_70, ... among the supported archs.
>>>> Plus, your system must support unified memory.
>>>> I updated the error message in the compiler; it now says which target
>>>> architecture you are using.
>>>> -------------
>>>> Best regards,
>>>> Alexey Bataev
>>>> On 05.11.2019 3:01 PM, Itaru Kitayama wrote:
>>>> I've been building trunk Clang locally, targeting the P100 device
>>>> attached to the host. Should I check the toolchain?
>>>>
>>>> On Tue, Nov 5, 2019 at 23:47 Alexey Bataev <a.bataev at outlook.com>
>>>> wrote:
>>>> You're building your code for an architecture that does not support
>>>> unified memory, say sm_35. Unified memory is only supported for
>>>> architectures >= sm_70.
>>>> -------------
>>>> Best regards,
>>>> Alexey Bataev
>>>> On 05.11.2019 3:16 AM, Itaru Kitayama via Openmp-dev wrote:
>>>> Hi,
>>>> Using a pragma like below:
>>>>
>>>> $ cat tmp.cpp
>>>> #pragma omp requires unified_shared_memory
>>>>
>>>> int main() {
>>>> }
>>>>
>>>> produces an error on a POWER8-based system with P100 devices (which
>>>> support unified memory).
>>>>
>>>> $ clang++ -fopenmp -fopenmp-targets=nvptx64 tmp.cpp
>>>> tmp.cpp:1:22: error: Target architecture does not support unified
>>>> addressing
>>>> #pragma omp requires unified_shared_memory
>>>> ^
>>>> 1 error generated.
>>>>
>>>> Clang was built locally and natively with the appropriate capability,
>>>> so what does this error mean?
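[Editorial aside: a natural first diagnostic here is to confirm which sm_* architecture the offload toolchain actually targets, and to request the P100's explicitly. The invocations below are a sketch using flag spellings commonly documented for clang of this era; adjust paths and flags for your own build:]

```shell
# Print the sub-commands clang would run without executing them; the
# nvptx64 step reveals the GPU architecture actually in use.
clang++ -fopenmp -fopenmp-targets=nvptx64 -### tmp.cpp 2>&1 \
  | grep -o 'sm_[0-9]*' | sort -u

# Request the P100's architecture (sm_60) explicitly for the offload target.
clang++ -fopenmp -fopenmp-targets=nvptx64 \
        -Xopenmp-target -march=sm_60 tmp.cpp
```

If the first command shows only sm_35, the error is about the default offload architecture, not the GPU actually installed.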
>>>>
>>>>
>>>> _______________________________________________
>>>> Openmp-dev mailing list
>>>> Openmp-dev at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev