[Openmp-dev] Libomptarget fatal error 1: failure of target construct while offloading is mandatory

Jonas Hahnfeld via Openmp-dev openmp-dev at lists.llvm.org
Mon Oct 1 06:57:26 PDT 2018


Nope, it's broken on trunk: "Warp Illegal Address".
I can confirm that the program works correctly when compiled with 
Clang 7.0.0! (@Siegmar: please ignore my previous statement about teams 
reductions. I didn't notice that you had attached the source code to 
your initial post, which only uses parallel reductions.)
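
For readers unfamiliar with the distinction: a parallel reduction combines
the partial results of the threads inside one team, while a teams reduction
also combines results across teams. The sketch below only illustrates the
two forms; it is not the attached source, and the function names
dot_parallel and dot_teams are made up for this example. The debug log
quoted further down reports "Using requested number of teams 1", which is
consistent with the first form.

```c
#include <stddef.h>

/* Parallel (threads) reduction: a single team on the device, whose
 * threads each accumulate a partial sum that is combined at the end. */
double dot_parallel(const double *a, const double *b, size_t n)
{
  double sum = 0.0;
  #pragma omp target map(to: a[0:n], b[0:n]) map(tofrom: sum)
  #pragma omp parallel for reduction(+:sum)
  for (size_t i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}

/* Teams reduction: the loop is distributed over several teams, and the
 * reduction additionally combines the per-team results. */
double dot_teams(const double *a, const double *b, size_t n)
{
  double sum = 0.0;
  #pragma omp target teams distribute parallel for reduction(+:sum) \
      map(to: a[0:n], b[0:n]) map(tofrom: sum)
  for (size_t i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}
```

Both functions return the same value for the same input; they differ only
in which reduction code path the compiler has to generate for the device.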

Interestingly, it works with Clang trunk when compiled with 
-fopenmp-cuda-force-full-runtime, so maybe the optimization that this 
flag turns off is not correct for reductions?
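
A rough reconstruction of the reproducer's shape, inferred from the debug
log quoted below (two target regions, two 800000000-byte arrays, and an
8-byte scalar mapped in the second region). This is a sketch under
assumptions, not the attached dot_prod_accelerator_OpenMP.c: the function
name, the heap allocation, and the fill values 2.0 and 3.0 are guesses,
chosen so that 10^8 elements would give the reported sum = 6.000000e+08.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the reproducer: fills two vectors on the
 * device, then computes their dot product with a parallel reduction.
 * Returns -1.0 if the host allocation fails. */
double dot_prod_two_regions(size_t n)
{
  double *a = malloc(n * sizeof *a);
  double *b = malloc(n * sizeof *b);
  double sum = 0.0;

  if (a == NULL || b == NULL) {
    free(a);
    free(b);
    return -1.0;
  }

  /* First target region: initialise on the device, copy both arrays back.
   * In the log, the corresponding kernel runs successfully. */
  #pragma omp target map(from: a[0:n], b[0:n])
  #pragma omp parallel for
  for (size_t i = 0; i < n; i++) {
    a[i] = 2.0; /* assumed value */
    b[i] = 3.0; /* assumed value */
  }

  /* Second target region: copy the arrays in and reduce into sum.
   * In the log, this is the kernel that hits the illegal memory access. */
  #pragma omp target map(to: a[0:n], b[0:n]) map(tofrom: sum)
  #pragma omp parallel for reduction(+:sum)
  for (size_t i = 0; i < n; i++)
    sum += a[i] * b[i];

  free(a);
  free(b);
  return sum;
}
```

Called from a main() and built with the command line from the thread
(clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda), a program of this
shape could be compared against a second build that adds
-fopenmp-cuda-force-full-runtime.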

Jonas

On 2018-10-01 15:35, George Rokos via Openmp-dev wrote:
> Hi Siegmar,
> 
>  Something happens during the execution of the second target region.
> The only thing I can suspect is the reduction. I don't have access to
> a workstation right now to test this, though...
> 
>  Can anyone confirm whether threads reduction is working correctly in
> the trunk version?
> 
>  George
> 
> -------------------------
> 
> FROM: Siegmar Gross <siegmar.gross at informatik.hs-fulda.de>
> SENT: 01 October 2018 15:59
> TO: George Rokos; llvm-openmp-dev
> SUBJECT: Re: [Openmp-dev] Libomptarget fatal error 1: failure of
> target construct while offloading is mandatory
> 
> Hi George,
> 
> thank you very much for your suggestions.
> 
>> Apparently your application fails to offload to the GPU. And because
>> offloading is mandatory (that's the default behavior) the library
>> terminates the application.
>> 
>> Can you compile libomptarget in debug mode and run the app with
>> LIBOMPTARGET_DEBUG=1 to see the debug output? That will help us
>> identify the problem.
> 
> loki introduction 115 clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda dot_prod_accelerator_OpenMP.c
> loki introduction 116 a.out
> Number of processors:     24
> Number of devices:        1
> Default device:           0
> Is initial device:        1
> Libomptarget fatal error 1: failure of target construct while offloading is mandatory
> 
> loki introduction 117 setenv LIBOMPTARGET_DEBUG 1
> loki introduction 118 a.out
> Libomptarget --> Loading RTLs...
> Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
> Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
> Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
> Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
> Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
> Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
> Target CUDA RTL --> Start initializing CUDA
> Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
> Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 1 devices!
> Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
> Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
> Libomptarget --> RTLs loaded!
> Libomptarget --> Image 0x0000000000602090 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
> Libomptarget --> Image 0x0000000000602090 is compatible with RTL libomptarget.rtl.cuda.so!
> Libomptarget --> RTL 0x00000000609f95d0 has index 0!
> Libomptarget --> Registering image 0x0000000000602090 with RTL libomptarget.rtl.cuda.so!
> Libomptarget --> Done registering entries!
> Libomptarget --> Call to omp_get_num_devices returning 1
> Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devicew were found)
> Libomptarget --> Entering target region with entry point 0x00000000004012d0 and device Id -1
> Libomptarget --> Checking whether device 0 is ready.
> Libomptarget --> Is the device 0 (local ID 0) initialized? 0
> Target CUDA RTL --> Getting device 0
> Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
> Target CUDA RTL --> Using 1024 CUDA threads per block
> Target CUDA RTL --> Max number of CUDA blocks 65536, threads 1024 & warp size 32
> Target CUDA RTL --> Default number of teams set according to library's default 128
> Target CUDA RTL --> Default number of threads set according to library's default 128
> Libomptarget --> Device 0 is ready to use.
> Target CUDA RTL --> Load data from image 0x0000000000602090
> Target CUDA RTL --> CUDA module successfully loaded!
> Target CUDA RTL --> Entry point 0x0000000000000000 maps to __omp_offloading_2b_1890d30_main_l48 (0x0000000060f23320)
> Target CUDA RTL --> Entry point 0x0000000000000001 maps to __omp_offloading_2b_1890d30_main_l67 (0x0000000060f27c70)
> Target CUDA RTL --> Sending global device environment data 4 bytes
> Libomptarget --> Entry  0: Base=0x0000000000613bf0, Begin=0x0000000000613bf0, Size=800000000, Type=0x22
> Libomptarget --> Entry  1: Base=0x00000000301043f0, Begin=0x00000000301043f0, Size=800000000, Type=0x22
> Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
> Libomptarget --> Creating new map entry: HstBase=0x0000000000613bf0, HstBegin=0x0000000000613bf0, HstEnd=0x00000000301043f0, TgtBegin=0x0000000b08c20000
> Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b08c20000 - is new
> Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
> Libomptarget --> Creating new map entry: HstBase=0x00000000301043f0, HstBegin=0x00000000301043f0, HstEnd=0x000000005fbf4bf0, TgtBegin=0x0000000b38720000
> Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b38720000 - is new
> Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
> Libomptarget --> Mapping exists with HstPtrBegin=0x0000000000613bf0, TgtPtrBegin=0x0000000b08c20000, Size=800000000, RefCount=1
> Libomptarget --> Obtained target argument 0x0000000b08c20000 from host pointer 0x0000000000613bf0
> Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
> Libomptarget --> Mapping exists with HstPtrBegin=0x00000000301043f0, TgtPtrBegin=0x0000000b38720000, Size=800000000, RefCount=1
> Libomptarget --> Obtained target argument 0x0000000b38720000 from host pointer 0x00000000301043f0
> Libomptarget --> Launching target execution __omp_offloading_2b_1890d30_main_l48 with pointer 0x0000000060ee2ee0 (index=0).
> Target CUDA RTL --> Setting CUDA threads per block to default 128
> Target CUDA RTL --> Using requested number of teams 1
> Target CUDA RTL --> Launch kernel with 1 blocks and 128 threads
> Target CUDA RTL --> Launch of entry point at 0x0000000060ee2ee0 successful!
> Target CUDA RTL --> Kernel execution at 0x0000000060ee2ee0 successful!
> Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
> Libomptarget --> Mapping exists with HstPtrBegin=0x00000000301043f0, TgtPtrBegin=0x0000000b38720000, Size=800000000, updated RefCount=1
> Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b38720000 - is last
> Libomptarget --> Moving 800000000 bytes (tgt:0x0000000b38720000) -> (hst:0x00000000301043f0)
> Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
> Libomptarget --> Deleting tgt data 0x0000000b38720000 of size 800000000
> Libomptarget --> Removing mapping with HstPtrBegin=0x00000000301043f0, TgtPtrBegin=0x0000000b38720000, Size=800000000
> Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
> Libomptarget --> Mapping exists with HstPtrBegin=0x0000000000613bf0, TgtPtrBegin=0x0000000b08c20000, Size=800000000, updated RefCount=1
> Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b08c20000 - is last
> Libomptarget --> Moving 800000000 bytes (tgt:0x0000000b08c20000) -> (hst:0x0000000000613bf0)
> Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
> Libomptarget --> Deleting tgt data 0x0000000b08c20000 of size 800000000
> Libomptarget --> Removing mapping with HstPtrBegin=0x0000000000613bf0, TgtPtrBegin=0x0000000b08c20000, Size=800000000
> Libomptarget --> Call to omp_get_num_devices returning 1
> Number of processors:     24
> Number of devices:        1
> Default device:           0
> Is initial device:        1
> Libomptarget --> Entering target region with entry point 0x00000000004012d1 and device Id -1
> Libomptarget --> Checking whether device 0 is ready.
> Libomptarget --> Is the device 0 (local ID 0) initialized? 1
> Libomptarget --> Device 0 is ready to use.
> Libomptarget --> Entry  0: Base=0x0000000000613bf0, Begin=0x0000000000613bf0, Size=800000000, Type=0x21
> Libomptarget --> Entry  1: Base=0x00000000301043f0, Begin=0x00000000301043f0, Size=800000000, Type=0x21
> Libomptarget --> Entry  2: Base=0x00007fff707a86e8, Begin=0x00007fff707a86e8, Size=8, Type=0x23
> Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
> Libomptarget --> Creating new map entry: HstBase=0x0000000000613bf0, HstBegin=0x0000000000613bf0, HstEnd=0x00000000301043f0, TgtBegin=0x0000000b08c20000
> Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b08c20000 - is new
> Libomptarget --> Moving 800000000 bytes (hst:0x0000000000613bf0) -> (tgt:0x0000000b08c20000)
> Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
> Libomptarget --> Creating new map entry: HstBase=0x00000000301043f0, HstBegin=0x00000000301043f0, HstEnd=0x000000005fbf4bf0, TgtBegin=0x0000000b38720000
> Libomptarget --> There are 800000000 bytes allocated at target address 0x0000000b38720000 - is new
> Libomptarget --> Moving 800000000 bytes (hst:0x00000000301043f0) -> (tgt:0x0000000b38720000)
> Libomptarget --> Looking up mapping(HstPtrBegin=0x00007fff707a86e8, Size=8)...
> Libomptarget --> Creating new map entry: HstBase=0x00007fff707a86e8, HstBegin=0x00007fff707a86e8, HstEnd=0x00007fff707a86f0, TgtBegin=0x0000000b68220000
> Libomptarget --> There are 8 bytes allocated at target address 0x0000000b68220000 - is new
> Libomptarget --> Moving 8 bytes (hst:0x00007fff707a86e8) -> (tgt:0x0000000b68220000)
> Libomptarget --> Looking up mapping(HstPtrBegin=0x0000000000613bf0, Size=800000000)...
> Libomptarget --> Mapping exists with HstPtrBegin=0x0000000000613bf0, TgtPtrBegin=0x0000000b08c20000, Size=800000000, RefCount=1
> Libomptarget --> Obtained target argument 0x0000000b08c20000 from host pointer 0x0000000000613bf0
> Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000301043f0, Size=800000000)...
> Libomptarget --> Mapping exists with HstPtrBegin=0x00000000301043f0, TgtPtrBegin=0x0000000b38720000, Size=800000000, RefCount=1
> Libomptarget --> Obtained target argument 0x0000000b38720000 from host pointer 0x00000000301043f0
> Libomptarget --> Looking up mapping(HstPtrBegin=0x00007fff707a86e8, Size=8)...
> Libomptarget --> Mapping exists with HstPtrBegin=0x00007fff707a86e8, TgtPtrBegin=0x0000000b68220000, Size=8, RefCount=1
> Libomptarget --> Obtained target argument 0x0000000b68220000 from host pointer 0x00007fff707a86e8
> Libomptarget --> Launching target execution __omp_offloading_2b_1890d30_main_l67 with pointer 0x0000000060ee2e70 (index=1).
> Target CUDA RTL --> Setting CUDA threads per block to default 128
> Target CUDA RTL --> Using requested number of teams 1
> Target CUDA RTL --> Launch kernel with 1 blocks and 128 threads
> Target CUDA RTL --> Launch of entry point at 0x0000000060ee2e70 successful!
> Target CUDA RTL --> Kernel execution error at 0x0000000060ee2e70!
> Target CUDA RTL --> CUDA error is: an illegal memory access was encountered
> Libomptarget --> Executing target region abort target.
> Libomptarget fatal error 1: failure of target construct while offloading is mandatory
> Libomptarget --> Unloading target library!
> Libomptarget --> Image 0x0000000000602090 is compatible with RTL 0x00000000609f95d0!
> Libomptarget --> Unregistered image 0x0000000000602090 from RTL 0x00000000609f95d0!
> Libomptarget --> Done unregistering images!
> Libomptarget --> Removing translation table for descriptor 0x0000000000613b90
> Libomptarget --> Done unregistering library!
> Target CUDA RTL --> Error when unloading CUDA module
> Target CUDA RTL --> CUDA error is: an illegal memory access was encountered
> loki introduction 119
> 
> Thank you very much for your help in advance.
> 
> Best regards
> 
> Siegmar
> 
>> 
>> George
>> 
>> 
>> --------------------------------------------------------------------------------
>> *From:* Openmp-dev <openmp-dev-bounces at lists.llvm.org> on behalf of Siegmar Gross via Openmp-dev <openmp-dev at lists.llvm.org>
>> *Sent:* 01 October 2018 13:26
>> *To:* llvm-openmp-dev
>> *Subject:* [Openmp-dev] Libomptarget fatal error 1: failure of target construct while offloading is mandatory
>> Hi,
>> 
>> today I've installed llvm-trunk. Unfortunately, I get an error for
>> one of my programs.
>> 
>> 
>> loki introduction 110 clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda dot_prod_accelerator_OpenMP.c
>> loki introduction 111 a.out
>> Number of processors:     24
>> Number of devices:        1
>> Default device:           0
>> Is initial device:        1
>> Libomptarget fatal error 1: failure of target construct while offloading is mandatory
>> 
>> loki introduction 112 setenv OMP_DEFAULT_DEVICE 1
>> loki introduction 113 a.out
>> Libomptarget fatal error 1: failure of target construct while offloading is mandatory
>> 
>> loki introduction 114 clang -v
>> clang version 8.0.0 (trunk 343447)
>> Target: x86_64-unknown-linux-gnu
>> Thread model: posix
>> InstalledDir: /usr/local/llvm-trunk/bin
>> Found candidate GCC installation: /usr/lib64/gcc/x86_64-suse-linux/4.8
>> Selected GCC installation: /usr/lib64/gcc/x86_64-suse-linux/4.8
>> Candidate multilib: .;@m64
>> Candidate multilib: 32;@m32
>> Selected multilib: .;@m64
>> Found CUDA installation: /usr/local/cuda-9.0, version 9.0
>> loki introduction 115
>> 
>> 
>> 
>> The program works fine with llvm-7.0.0.
>> 
>> loki introduction 125 clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda dot_prod_accelerator_OpenMP.c
>> loki introduction 126 a.out
>> Number of processors:     24
>> Number of devices:        1
>> Default device:           0
>> Is initial device:        1
>> sum = 6.000000e+08
>> 
>> loki introduction 127 setenv OMP_DEFAULT_DEVICE 1
>> loki introduction 128 a.out
>> Number of processors:     24
>> Number of devices:        1
>> Default device:           1
>> Is initial device:        1
>> sum = 6.000000e+08
>> 
>> loki introduction 129 clang -v
>> clang version 7.0.0 (tags/RELEASE_700/final)
>> Target: x86_64-unknown-linux-gnu
>> Thread model: posix
>> InstalledDir: /usr/local/llvm-7.0.0/bin
>> Found candidate GCC installation: /usr/lib64/gcc/x86_64-suse-linux/4.8
>> Selected GCC installation: /usr/lib64/gcc/x86_64-suse-linux/4.8
>> Candidate multilib: .;@m64
>> Candidate multilib: 32;@m32
>> Selected multilib: .;@m64
>> Found CUDA installation: /usr/local/cuda-9.0, version 9.0
>> loki introduction 130
>> 
>> 
>> Hopefully somebody can fix the problem. Do you need anything else to
>> locate the error? Thank you very much for any help in advance.
>> 
>> 
>> Kind regards
>> 
>> Siegmar

